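python - Scraping a website with requests and BS4 returns soup content with question marks in the HTML
Question
I am scraping a website using the following URL and headers:
URL: 'https://tennistonic.com/tennis-news/'
Headers: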
{
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Cache-Control": "no-cache",
"content-length": "0",
"content-type": "text/plain",
"cookie": "IDE=AHWqTUl3YRZ8Od9MzGofphNI-OCOFESmxlN69Ekm4Sbh9tcBDXGJQ1LVwbDd2uX_; DSID=AAO-7r74ByYt6ieW2yasN78hFsOGY6mrhpN5pEOWQ1vGRnAOdolIlKv23JqCRf11OpFUGFdZ-yxB3Ii1VE6UjcK-jny-4mcJ5uO-_BaV3bEFbLvU7rJNBlc",
"origin": "https://tennistonic.com",
"Connection": "keep-alive",
"Pragma": "no-cache",
"Referer": "https://tennistonic.com/",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "cross-site",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.80 Safari/537.36",
"x-client-data": "CI22yQEIprbJAQjBtskBCKmdygEIl6zKAQisx8oBCPXHygEI58jKAQjpyMoBCOLNygEI3NXKAQjB18oBCP2XywEIj5nLARiKwcoB"}
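The x-client-data header is followed by a decoded section, which I left out of the headers above but have also tried including. The full request from dev tools looks like this: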
:authority: stats.g.doubleclick.net
:method: POST
:path: /j/collect?t=dc&aip=1&_r=3&v=1&_v=j87&tid=UA-13059318-2&cid=1499412700.1601628730&jid=598376897&gjid=243704922&_gid=1691643639.1604317227&_u=QACAAEAAAAAAAC~&z=1736278164
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8
cache-control: no-cache
content-length: 0
content-type: text/plain
cookie: IDE=AHWqTUl3YRZ8Od9MzGofphNI-OCOFESmxlN69Ekm4Sbh9tcBDXGJQ1LVwbDd2uX_; DSID=AAO-7r74ByYt6ieW2yasN78hFsOGY6mrhpN5pEOWQ1vGRnAOdolIlKv23JqCRf11OpFUGFdZ-yxB3Ii1VE6UjcK-jny-4mcJ5uO-_BaV3bEFbLvU7rJNBlc
origin: https://tennistonic.com
pragma: no-cache
referer: https://tennistonic.com/
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.80 Safari/537.36
x-client-data: CI22yQEIprbJAQjBtskBCKmdygEIl6zKAQisx8oBCPXHygEI58jKAQjpyMoBCOLNygEI3NXKAQjB18oBCP2XywEIj5nLARiKwcoB
Decoded:
message ClientVariations {
// Active client experiment variation IDs.
repeated int32 variation_id = [3300109, 3300134, 3300161, 3313321, 3315223, 3318700, 3318773, 3318887, 3318889, 3319522, 3320540, 3320769, 3329021, 3329167];
// Active client experiment variation IDs that trigger server-side behavior.
repeated int32 trigger_variation_id = [3317898];
}
import requests
from bs4 import BeautifulSoup as soup  # alias assumed from the call below
r = requests.get(url2, headers=headers2)
soup_cont = soup(r.content, 'html.parser')
The soup content from my response looks like this:
Is this website protected, or am I sending the wrong request?
Solution
Try using selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://tennistonic.com/tennis-news/')
time.sleep(3)  # give the page's JavaScript time to render
soup = BeautifulSoup(driver.page_source, 'html5lib')  # requires the html5lib package
print(soup.prettify())
driver.quit()  # end the browser session and the driver process
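If launching a browser is too heavy, it may be worth re-checking the plain requests route first. The headers in the question were copied from a POST to stats.g.doubleclick.net (see the :authority line), not from the page request itself, and they advertise br (Brotli) in Accept-Encoding. requests only decodes Brotli when a Brotli package is installed; otherwise r.content is still-compressed bytes, which could plausibly show up as question marks. Whether that is the actual cause here is an assumption, not something the question confirms. The sketch below uses hypothetical helper names (ANALYTICS_ONLY, clean_headers, fetch_soup) to drop the analytics-only headers and the br encoding before retrying.

```python
# Headers that belonged to the analytics POST rather than the page request.
# This set is an assumption made for illustration.
ANALYTICS_ONLY = {
    "content-length", "content-type", "cookie", "origin",
    "sec-fetch-dest", "sec-fetch-mode", "sec-fetch-site", "x-client-data",
}


def clean_headers(headers):
    """Return a copy without analytics-only headers and without 'br' encoding."""
    cleaned = {}
    for name, value in headers.items():
        key = name.lower()
        if key in ANALYTICS_ONLY:
            continue
        if key == "accept-encoding":
            # Keep every encoding except Brotli, ignoring any ;q= weight.
            encodings = [e.strip() for e in value.split(",")]
            value = ", ".join(
                e for e in encodings if e and e.split(";")[0].strip() != "br"
            )
        cleaned[name] = value
    return cleaned


def fetch_soup(url, headers):
    """Fetch url with the cleaned headers and parse the body with BeautifulSoup."""
    # Imported here so clean_headers can be used without these packages.
    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url, headers=clean_headers(headers), timeout=10)
    r.raise_for_status()
    return BeautifulSoup(r.content, "html.parser")
```

Calling fetch_soup('https://tennistonic.com/tennis-news/', headers) with the dictionary from the question would then send only the remaining headers. If the result is readable HTML but the articles are missing, the page is probably rendered by JavaScript and the selenium approach above is the more likely fix.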