首页 > 解决方案 > 用 html 中的问号返回带有请求和 BS4 汤内容的抓取网站

问题描述

我正在使用以下 url 和标题抓取一个网站:

网址:'https://tennistonic.com/tennis-news/'

标题:

{
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "Cache-Control": "no-cache",
    "content-length": "0",
    "content-type": "text/plain",
    "cookie": "IDE=AHWqTUl3YRZ8Od9MzGofphNI-OCOFESmxlN69Ekm4Sbh9tcBDXGJQ1LVwbDd2uX_; DSID=AAO-7r74ByYt6ieW2yasN78hFsOGY6mrhpN5pEOWQ1vGRnAOdolIlKv23JqCRf11OpFUGFdZ-yxB3Ii1VE6UjcK-jny-4mcJ5uO-_BaV3bEFbLvU7rJNBlc",
    "origin": "https//tennistonic.com",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Referer": "https://tennistonic.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.80 Safari/537.36",
    "x-client-data": "CI22yQEIprbJAQjBtskBCKmdygEIl6zKAQisx8oBCPXHygEI58jKAQjpyMoBCOLNygEI3NXKAQjB18oBCP2XywEIj5nLARiKwcoB"}

x 客户端数据之后有一个解码部分,我已将其省略但也尝试过。开发工具的完整请求如下所示:

:authority: stats.g.doubleclick.net
:method: POST
:path: /j/collect?t=dc&aip=1&_r=3&v=1&_v=j87&tid=UA-13059318-2&cid=1499412700.1601628730&jid=598376897&gjid=243704922&_gid=1691643639.1604317227&_u=QACAAEAAAAAAAC~&z=1736278164
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8
cache-control: no-cache
content-length: 0
content-type: text/plain
cookie: IDE=AHWqTUl3YRZ8Od9MzGofphNI-OCOFESmxlN69Ekm4Sbh9tcBDXGJQ1LVwbDd2uX_; DSID=AAO-7r74ByYt6ieW2yasN78hFsOGY6mrhpN5pEOWQ1vGRnAOdolIlKv23JqCRf11OpFUGFdZ-yxB3Ii1VE6UjcK-jny-4mcJ5uO-_BaV3bEFbLvU7rJNBlc
origin: https://tennistonic.com
pragma: no-cache
referer: https://tennistonic.com/
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.80 Safari/537.36
x-client-data: CI22yQEIprbJAQjBtskBCKmdygEIl6zKAQisx8oBCPXHygEI58jKAQjpyMoBCOLNygEI3NXKAQjB18oBCP2XywEIj5nLARiKwcoB
Decoded:
message ClientVariations {
  // Active client experiment variation IDs.
  repeated int32 variation_id = [3300109, 3300134, 3300161, 3313321, 3315223, 3318700, 3318773, 3318887, 3318889, 3319522, 3320540, 3320769, 3329021, 3329167];
  // Active client experiment variation IDs that trigger server-side behavior.
  repeated int32 trigger_variation_id = [3317898];
}

    r = requests.get(url2, headers=headers2)
    soup_cont = soup(r.content, 'html.parser')

我的回复中的汤内容如下:

汤的内容

这个网站是受保护的还是我发送了错误的请求?

标签: pythonbeautifulsouppython-requestsheaderscreen-scraping

解决方案


尝试使用selenium

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://tennistonic.com/tennis-news/')

time.sleep(3)

soup = BeautifulSoup(driver.page_source,'html5lib')

print(soup.prettify())

driver.close()

推荐阅读