python - Unable to use Selenium to collect information from two different depths at the same time
Problem description
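I've written a script in Python using selenium to fetch name and reputation from the landing page with the get_names() function, and then click the link of each post to reach its inner page and parse the title there with the get_additional_info() function.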
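All the information I'm trying to parse is available on both the landing page and the inner pages. None of it is dynamic, so selenium is definitely overkill. However, my intention is to use selenium to scrape information from two different depths at the same time.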
In the script below, if I comment out the name and rep lines, the script clicks the links on the landing page and parses the titles from the inner pages flawlessly.
However, when I run the script as it is, I get a selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document error, pointing at the name = item.find_element_by_css_selector() line.
How can I get rid of this error and make the script run while keeping the logic I've already implemented?
What I've tried so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

lead_url = 'https://stackoverflow.com/questions/tagged/web-scraping'

def get_names():
    driver.get(lead_url)
    for count, item in enumerate(wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".summary")))):
        usableList = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink")))
        name = item.find_element_by_css_selector(".user-details > a").text
        rep = item.find_element_by_css_selector("span.reputation-score").text
        driver.execute_script("arguments[0].click();", usableList[count])
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1 > a.question-hyperlink")))
        title = get_additional_info()
        print(name, rep, title)
        driver.back()
        wait.until(EC.staleness_of(usableList[count]))

def get_additional_info():
    title = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1 > a.question-hyperlink"))).text
    return title

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 5)
    get_names()
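Solution
Keeping broadly to your design pattern: don't work off item. After driver.back() the landing page is loaded again, so the item references collected before the click are no longer attached to the document. Instead, use count to index into a list of elements pulled from the current page_source, e.g.
driver.find_elements_by_css_selector(".user-details > a")[count].text
Python: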
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

lead_url = 'https://stackoverflow.com/questions/tagged/web-scraping'

def get_names():
    driver.get(lead_url)
    for count, item in enumerate(wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".summary")))):
        usableList = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".summary .question-hyperlink")))
        name = driver.find_elements_by_css_selector(".user-details > a")[count].text
        rep = driver.find_elements_by_css_selector("span.reputation-score")[count].text
        driver.execute_script("arguments[0].click();", usableList[count])
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1 > a.question-hyperlink")))
        title = get_additional_info()
        print(name, rep, title)
        driver.back()
        wait.until(EC.staleness_of(usableList[count]))

def get_additional_info():
    title = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1 > a.question-hyperlink"))).text
    return title

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 5)
    get_names()
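To see why indexing into a freshly fetched list helps, here is a minimal, Selenium-free sketch of the behaviour. Everything in it is a hypothetical stand-in invented for illustration (FakePage, FakeElement, StaleElementError, the sample names); it is not Selenium's API, only a rough model of how an element handle stops being usable once the page it came from has been reloaded.

```python
class StaleElementError(Exception):
    """Hypothetical stand-in for Selenium's StaleElementReferenceException."""


class FakeElement:
    """An element handle that is only valid for the page load that produced it."""

    def __init__(self, page, text):
        self._page = page
        self._load_id = page.load_id  # remember which page load we belong to
        self._text = text

    @property
    def text(self):
        # A handle from an earlier page load is treated as stale.
        if self._load_id != self._page.load_id:
            raise StaleElementError("element is not attached to the page document")
        return self._text


class FakePage:
    """A listing page whose DOM is rebuilt every time it is (re)loaded."""

    def __init__(self, names):
        self._names = list(names)
        self.load_id = 0

    def find_elements(self):
        # Always returns fresh handles bound to the current page load.
        return [FakeElement(self, name) for name in self._names]

    def click_through_and_back(self):
        # Navigating away and back rebuilds the DOM, invalidating old handles.
        self.load_id += 1


def scrape_with_cached_items(page):
    """Mirrors the question: iterate over handles fetched once, up front."""
    results = []
    for item in page.find_elements():
        results.append(item.text)  # raises on the second iteration
        page.click_through_and_back()
    return results


def scrape_with_index(page):
    """Mirrors the answer: re-query on every iteration and index with count."""
    results = []
    for count in range(len(page.find_elements())):
        results.append(page.find_elements()[count].text)
        page.click_through_and_back()
    return results


if __name__ == "__main__":
    try:
        scrape_with_cached_items(FakePage(["alice", "bob", "carol"]))
    except StaleElementError as exc:
        print("cached handles failed:", exc)
    print(scrape_with_index(FakePage(["alice", "bob", "carol"])))
```

In this model the first function fails as soon as it touches the second cached handle, while the second function succeeds because every lookup happens against the current page load. The same reasoning applies to usableList in the script above, which is re-fetched at the top of each loop iteration.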