python - Scraping a web page with Python requests
Problem description
I want to scrape this page, but after the request I cannot find the table with BeautifulSoup.
Code
import requests
from bs4 import BeautifulSoup

headers = {
    "Referer": "https://www.atptour.com/en/scores/results-archive",
    "User-Agent": "my-user-agent",
}
url = "https://www.atptour.com/en/scores/results-archive?year=2016"
page = requests.get(url, headers=headers)
print(page)  # prints the HTTP status of the response

soup = BeautifulSoup(page.text, "html.parser")
# find() returns None when no matching <table> is present in the returned HTML
table = soup.find("table", class_="results-archive-table mega-table")
print(table)
Output:
<Response [403]>
None
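The `None` follows directly from the 403: the server refused the request, so the returned body is an error page that does not contain the table. A minimal sketch of checking the status before parsing is shown here. The helper name `body_if_ok` and the stand-in response objects are hypothetical and only mimic the `status_code` and `text` attributes of a requests response; with a real response, `page.raise_for_status()` serves a similar purpose.

```python
from types import SimpleNamespace


def body_if_ok(response):
    """Return response.text only when the server answered 200 (hypothetical helper)."""
    if response.status_code != 200:
        # Parsing an error page would only yield None from find(), so fail early.
        raise RuntimeError(f"request blocked or failed: HTTP {response.status_code}")
    return response.text


# Stand-in objects for illustration; they are not real server responses.
blocked = SimpleNamespace(status_code=403, text="<html>Access denied</html>")
ok = SimpleNamespace(
    status_code=200,
    text='<table class="results-archive-table mega-table"></table>',
)

html = body_if_ok(ok)  # returns the body unchanged
try:
    body_if_ok(blocked)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Failing early this way makes it obvious that the problem is the blocked request rather than the BeautifulSoup selector.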
Solution
I get Response [200] using scrapy-selenium with selenium-stealth.
Code:
from shutil import which

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth


class AtpSpider(scrapy.Spider):
    name = 'atptour'

    # Locate chromedriver on PATH and start a headless Chrome instance.
    chrome_path = which("chromedriver")
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # executable_path is the Selenium 3 style; Selenium 4 uses a Service object instead.
    driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)

    # Make the headless browser look like a regular desktop Chrome.
    stealth(driver,
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=False)

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.atptour.com/en/scores/results-archive?year=2016',
            wait_time=5,
            callback=self.parse,
        )

    def parse(self, response):
        pass
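The `parse` callback is left empty, so the spider fetches the page but extracts nothing. As a rough sketch of the extraction step, the following pulls the cell text out of a table with a given class using only the standard library. The `TableRows` class and the sample HTML are hypothetical and do not reflect the real page's markup or data; inside the spider, Scrapy selectors such as `response.css(...)` would be the more usual route.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every row inside the table with a given class."""

    def __init__(self, table_class):
        super().__init__()
        self.table_class = table_class
        self.depth = 0        # greater than 0 while inside the target table
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and (self.depth or self.table_class in classes):
            self.depth += 1
        elif self.depth and tag == "tr":
            self.in_cell = False
            self.rows.append([])
        elif self.depth and tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up sample markup; only the class name is taken from the question.
SAMPLE_HTML = """
<table class="results-archive-table mega-table">
  <tr><th>Date</th><th>Tournament</th></tr>
  <tr><td>2016-01-01</td><td>Example Open</td></tr>
</table>
"""

parser = TableRows("results-archive-table")
parser.feed(SAMPLE_HTML)
rows = parser.rows
```

In the spider, the same idea would be applied to `response.text` inside `parse`, yielding one item per row.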
Output:
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-07-31 10:25:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atptour.com/en/scores/results-archive> (referer: None)
2021-07-31 10:25:05 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:53662/session/039ca0bb0a64b7b9eb48ab26a0f464a0 {}
2021-07-31 10:25:05 [urllib3.connectionpool] DEBUG: http://127.0.0.1:53662 "DELETE /session/039ca0bb0a64b7b9eb48ab26a0f464a0 HTTP/1.1" 200 14
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-07-31 10:25:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 15142,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,