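python - Using selenium and BeautifulSoup to scrape a dynamic web page, but new pages keep popping up
Problem description
I am scraping content from a dynamic web page, https://www.nytimes.com/search?query=china+COVID-19, and I want to get the content of all the news articles (26,783 in total). I cannot iterate over pages because on this site you have to click "Show More" to load the next page.
So I used webdriver.ActionChains. The code does not show any error message, but a new window pops up every few seconds, and it looks like the same page every time. The process seemed endless, and I interrupted it after 2 hours. I added print(article), but nothing was printed. Can anyone help me solve this? Thank you very much for your help!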
import time
import requests
from bs4 import BeautifulSoup
import json
import string
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
    # Find button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # Click the button
    button.click()
    # Redefine variable 'soup' in case the button disappeared, so the 'while' loop will end
    soup = BeautifulSoup(driver.page_source, 'html.parser')

# Base URL prepended to the relative article links (assumed value)
base = 'https://www.nytimes.com'
# Collected article texts, one string per article
myarticle = []

search_results = soup.find('ol', {'data-testid': 'search-results'})
links = search_results.find_all('a')
for link in links:
    link_url = link['href']
    response = requests.get(base + link_url)
    soup_link = BeautifulSoup(response.text, 'html.parser')
    scripts = soup_link.find_all('script')
    for script in scripts:
        if 'window.__preloadedData = ' in script.text:
            # Cut the JSON object out of the inline script
            jsonStr = script.text
            jsonStr = jsonStr.split('window.__preloadedData = ')[-1]
            jsonStr = jsonStr.rsplit(';', 1)[0]
            jsonData = json.loads(jsonStr)
            article = []
            for k, v in jsonData['initialState'].items():
                try:
                    if v['__typename'] == 'TextInline':
                        article.append(v['text'])
                        #print (v['text'])
                except (KeyError, TypeError):
                    # Entry is not a dict or has no '__typename'/'text' key
                    continue
            article = [each.strip() for each in article]
            article = ''.join([('' if c in string.punctuation else ' ') + c for c in article]).strip()
            print(article)
            myarticle.append(article)

df = pd.DataFrame(myarticle, columns=['article'])
df.to_csv('NYtimes.csv')
print("Complete")
driver.quit()
Output
---------------------------------------------------------------------------
ElementClickInterceptedException Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
24 try:
---> 25 button.click()
26 break
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in click(self)
79 """Clicks the element."""
---> 80 self._execute(Command.CLICK_ELEMENT)
81
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
632 params['id'] = self._id
--> 633 return self._parent.execute(command, params)
634
~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
320 if response:
--> 321 self.error_handler.check_response(response)
322 response['value'] = self._unwrap_value(
~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
243
ElementClickInterceptedException: Message: element click intercepted: Element <button data-testid="search-show-more-button" type="button">...</button> is not clickable at point (509, 656). Other element would receive the click: <div class="css-1n5jm1v">...</div>
(Session info: chrome=83.0.4103.61)
During handling of the above exception, another exception occurred:
NameError Traceback (most recent call last)
<ipython-input-7-1515a65b3c60> in <module>
25 button.click()
26 break
---> 27 except ElementClickInterceptedException:
28 time.sleep(0.5)
29 # Redefine variable 'soup' in case if button dissapeared, so the 'while' loop will end
NameError: name 'ElementClickInterceptedException' is not defined
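Solution
The "new windows" pop up because you re-create the driver on every iteration of the loop.
Step by step. First, you create a driver here and open the page:
browser = webdriver.Chrome('C:/chromedriver.exe')
browser.get('https://www.nytimes.com/search?query=china+COVID-19')
Then, inside the loop, you create another driver on every iteration:
while True:
    try:
        driver = webdriver.Chrome('C:/chromedriver.exe')
        driver.get('https://www.nytimes.com/search?query=china+COVID-19')
That is why you see a new window each time.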
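The effect can be reproduced without a real browser. In the sketch below, FakeBrowser is a hypothetical stand-in for webdriver.Chrome (it is not part of Selenium) that only counts how many times it is constructed; each construction corresponds to one new window.

```python
class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome; counts opened windows."""
    windows_opened = 0

    def __init__(self):
        FakeBrowser.windows_opened += 1

    def get(self, url):
        # A real driver would load the page here; the stand-in only stores the URL.
        self.url = url


URL = 'https://www.nytimes.com/search?query=china+COVID-19'


def driver_inside_loop(iterations):
    """Mimics the problematic pattern: a new browser on every iteration."""
    FakeBrowser.windows_opened = 0
    for _ in range(iterations):
        driver = FakeBrowser()
        driver.get(URL)
    return FakeBrowser.windows_opened


def driver_outside_loop(iterations):
    """Creates the browser once and reuses it on every iteration."""
    FakeBrowser.windows_opened = 0
    driver = FakeBrowser()
    driver.get(URL)
    for _ in range(iterations):
        pass  # the "Show More" click on the same driver would go here
    return FakeBrowser.windows_opened


print(driver_inside_loop(5))   # 5
print(driver_outside_loop(5))  # 1
```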
To fix this, you can use the following code (it covers only the iteration part):
from selenium.common.exceptions import ElementClickInterceptedException
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Initialize webdriver.Chrome and webdriver.ActionChains only once
chromedriver_path = 'C:/chromedriver.exe'
driver = webdriver.Chrome(chromedriver_path)
action = webdriver.ActionChains(driver)

# Get to the page
driver.get('https://www.nytimes.com/search?query=china+COVID-19')
soup = BeautifulSoup(driver.page_source, 'html.parser')

# While button is present
while soup.find('button', {'data-testid': 'search-show-more-button'}) is not None:
    # Find button
    button = driver.find_element_by_xpath('//button[@type="button"][contains(.,"Show More")]')
    # Move to it to avoid false-clicking other elements
    action.move_to_element(button).perform()
    # The movement is not instant, so add a short wait to reduce the chance
    # of an ElementClickInterceptedException
    time.sleep(0.5)
    # A fixed sleep is not reliable if something unexpected happens and more
    # time is needed, so retry in an endless loop that breaks once 'click'
    # succeeds. According to your last error, the covering element was a 'div',
    # so even a false click will not trigger any action, which is why this
    # approach is safe.
    while True:
        try:
            button.click()
            break
        except ElementClickInterceptedException:
            time.sleep(0.5)
    # Redefine variable 'soup' in case the button disappeared, so the 'while' loop will end
    soup = BeautifulSoup(driver.page_source, 'html.parser')
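As far as I can tell, you did not ask anything about the second part, where you parse the search results, but if you have questions about it, feel free to ask.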
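If you want to check that second part on its own, the JSON extraction from the question can be pulled into two small functions and run against a hand-written string. This is only a sketch: the function names extract_preloaded_data and collect_text are made up for illustration, and the sample string is invented test data, not real output from the site, whose page structure is assumed to be as in the question.

```python
import json
import string

MARKER = 'window.__preloadedData = '


def extract_preloaded_data(script_text):
    """Return the parsed JSON assigned to window.__preloadedData, or None."""
    if MARKER not in script_text:
        return None
    json_str = script_text.split(MARKER)[-1]
    # Drop the trailing ';' that ends the JavaScript statement
    json_str = json_str.rsplit(';', 1)[0]
    return json.loads(json_str)


def collect_text(data):
    """Join the 'text' of every TextInline entry in data['initialState']."""
    fragments = []
    for value in data['initialState'].values():
        if (isinstance(value, dict)
                and value.get('__typename') == 'TextInline'
                and 'text' in value):
            fragments.append(value['text'].strip())
    # Same joining rule as in the question: no space before a fragment
    # that consists only of punctuation
    return ''.join(('' if f in string.punctuation else ' ') + f
                   for f in fragments).strip()


# Invented sample data, shaped like the structure the question relies on
sample = ('window.__preloadedData = {"initialState": {'
          '"a": {"__typename": "TextInline", "text": "Hello"}, '
          '"b": {"__typename": "Image", "url": "x"}, '
          '"c": {"__typename": "TextInline", "text": "world"}, '
          '"d": {"__typename": "TextInline", "text": "."}}};')

data = extract_preloaded_data(sample)
print(collect_text(data))  # Hello world.
```

Separating the two steps makes it easier to see whether an empty result comes from the script tag not being found or from the JSON not containing any TextInline entries.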
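UPD: Initializing ActionChains on every iteration is also pointless, so you can do it once, right after creating the webdriver. (I have changed the code sample, so you can simply copy it and read the comments for each step.)
UPD2: I added some extra protection against false clicks.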
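One caveat about the retry loop from UPD2: it never gives up, so if the button stays covered the script hangs. A bounded variant is sketched below. The helper click_with_retry and the classes Intercepted and FlakyButton are hypothetical names used only for this illustration; Intercepted stands in for Selenium's ElementClickInterceptedException so that the example runs without a browser.

```python
import time


def click_with_retry(click, retry_on, attempts=20, delay=0.5):
    """Call click() until it succeeds or the attempts run out.

    click    -- zero-argument callable, e.g. button.click
    retry_on -- exception class (or tuple of classes) that triggers a retry
    Returns True if a call succeeded, False if every attempt raised retry_on.
    """
    for _ in range(attempts):
        try:
            click()
            return True
        except retry_on:
            time.sleep(delay)
    return False


class Intercepted(Exception):
    """Stand-in for selenium's ElementClickInterceptedException."""


class FlakyButton:
    """Fake button whose click fails a given number of times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.clicks = 0

    def click(self):
        if self.failures > 0:
            self.failures -= 1
            raise Intercepted()
        self.clicks += 1


button = FlakyButton(failures=2)
print(click_with_retry(button.click, Intercepted, delay=0))  # True

stuck = FlakyButton(failures=99)
print(click_with_retry(stuck.click, Intercepted, attempts=3, delay=0))  # False
```

With Selenium, the call would presumably look like click_with_retry(button.click, ElementClickInterceptedException), and the outer while loop can stop when it returns False instead of waiting forever.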