Able to scrape a static website but not a dynamic one

Problem description

I am trying to get the time of the next upcoming match from ESPN, which you can find at https://www.espn.com/ (right now it appears to be a football match between Juventus and AC Milan).

My web scraper uses the following Python code:

import requests
from lxml import html
from selenium import webdriver
import chromedriver_binary

driver = webdriver.Chrome()
driver.get('https://www.espn.com/')

tree = html.fromstring(driver.page_source)

time = tree.xpath('//*[@id="news-feed"]/section[1]/header/a/div[2]/span[2]/span')

print(time)

But it returns this error:

Traceback (most recent call last):
  File "c:\Users\akash\Coding\test\scrape.py", line 9, in <module>
    tree = html.fromstring(driver.page_source)
  File "C:\Users\akash\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 679, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "C:\Users\akash\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\akash\AppData\Local\Programs\Python\Python38-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=83.0.4103.97)

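I suspect the problem is that this is dynamic content on the ESPN site, because the same code (apart from the URL and XPath) successfully scrapes data from another website whose data is static. Can anyone help resolve this error?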

I have installed every Python library used in the code. (Note: I have already looked at "Scraping using python and xpath" and "Python Selenium Chrome Webdriver".)

Tags: python, selenium, selenium-webdriver, web-scraping

Solution


In my case, I used the binary downloaded from chromium.org. The code is as follows:

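from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--headless")  # optional: run Chrome without a visible window

# Selenium 4 style: the driver path goes through Service and the options through `options=`
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
try:
    driver.get('https://www.espn.com/')
    # Parse the rendered page while the browser session is still open
    tree = html.fromstring(driver.page_source)
finally:
    driver.quit()  # always close the browser, even if loading fails

# First game-time text inside the news feed
time = tree.xpath("//*[@id='news-feed']//span[@class='game-time']/text()")[0].strip()
print(time)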

The --headless argument passed to chrome_options is optional (it simply runs Chrome in its "headless mode").
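To see what the XPath in the code above selects without launching a browser, the lookup can be tried on a small hand-written fragment. This is only a sketch under stated assumptions: SAMPLE_PAGE is a hypothetical, simplified stand-in for the rendered page (the real ESPN markup is larger and may differ), extract_game_time is an illustrative helper name, and the standard library's xml.etree.ElementTree is used in place of lxml so that the snippet runs without extra packages. ElementTree supports only a subset of XPath and needs well-formed markup, so lxml remains the better fit for real pages.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, simplified stand-in for driver.page_source;
# the real ESPN markup is larger and may differ.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="news-feed">
      <section>
        <header>
          <a href="#">
            <span class="game-time">
              2:45 PM
            </span>
          </a>
        </header>
      </section>
    </div>
  </body>
</html>
"""


def extract_game_time(page_source: str) -> Optional[str]:
    """Return the first game-time text under #news-feed, or None if absent."""
    root = ET.fromstring(page_source)
    # Same idea as the XPath above: any element with id="news-feed",
    # then any descendant <span class="game-time">.
    spans = root.findall(".//*[@id='news-feed']//span[@class='game-time']")
    if not spans:
        return None  # avoids the IndexError that [0] raises when nothing matches
    return (spans[0].text or "").strip()


print(extract_game_time(SAMPLE_PAGE))
```

Returning None when nothing matches makes it easier to tell "the element has not been rendered yet" apart from a crash, which is useful on pages that fill in their content with JavaScript.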

