Trouble scraping items from two different depths using Selenium

Problem description

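I've written a script in Python with Selenium to fetch the number of answers from the landing page and the name of the question asker from each question's inner page. I know it is easier to scrape the two items using the question links and the next-page links, but that is not what I intend to do here. The bottom line is that I'm trying to move between the different places using clicks only. However, when I run the script, it throws the following error on its second iteration, pointing at the line answer = WebDriverWait(item,10).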

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document

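Although the elements I'm after are available on both the landing page and the inner page, I need to scrape the two items from two different depths.

I know how to scrape them using requests, so I don't want to go that route either.

The script I'm trying with: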

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = 'https://stackoverflow.com/questions/tagged/web-scraping'

def get_content(link):
    driver.get(link)
    while True:
        for count,item in enumerate(WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".question-summary")))):
            # error thrown on the following line in its second iteration
            answer = WebDriverWait(item,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"[class$='answered'] > strong"))).text

            elem = driver.find_elements_by_css_selector(".summary a.question-hyperlink")[count]
            driver.execute_script("arguments[0].click();",elem)
            name = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"h1[itemprop='name'] > a"))).text
            print(answer,name)
            driver.back()

        try:
            next_page = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"a[rel='next']")))
            driver.execute_script("arguments[0].click();",next_page)
        except Exception:
            break

if __name__ == '__main__':
    with webdriver.Chrome() as driver:
        get_content(link)

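How can I scrape the two items from two different depths?

PS: If I kick out this line answer = WebDriverWait(item,10)----, the script runs like a charm, traversing the different depths and multiple pages.

Tags: python, python-3.x, selenium, selenium-webdriver, web-scraping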

Solution


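The StaleElementReferenceException is expected here: you navigate away from the listing page, so the references you hold to the .question-summary elements are lost. After the first click and driver.back(), the remaining items from the original list point into a document that no longer exists.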

Error description: Thrown when a reference to an element is now "stale".
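To make the cause concrete, here is a toy model that needs no browser. It is only a sketch: FakePage, FakeElement and StaleReference are hypothetical stand-ins, not Selenium classes, and the "generation" counter is an assumption used to imitate a document being replaced on navigation.

```python
class StaleReference(Exception):
    """Stand-in for Selenium's StaleElementReferenceException (toy model only)."""


class FakePage:
    """Hypothetical page model: every navigation replaces the document."""

    def __init__(self, texts):
        self._texts = list(texts)
        self.generation = 0  # bumped on every navigation

    def navigate(self):
        # Leaving the page or coming back to it yields a new document.
        self.generation += 1

    def find_all(self):
        # Handles are tied to the document that was current when they were found.
        return [FakeElement(self, i, self.generation) for i in range(len(self._texts))]


class FakeElement:
    def __init__(self, page, index, generation):
        self._page, self._index, self._generation = page, index, generation

    @property
    def text(self):
        if self._generation != self._page.generation:
            raise StaleReference("element is not attached to the page document")
        return self._page._texts[self._index]


def read_with_cached_handles(page):
    """Shape of the question's loop: locate once, navigate inside the loop."""
    out = []
    for item in page.find_all():
        out.append(item.text)  # raises on the second iteration
        page.navigate()        # click through to the inner page
        page.navigate()        # driver.back()
    return out


def read_with_fresh_handles(page):
    """Shape of the fix: keep only an index and re-locate on every iteration."""
    out = []
    for ix in range(len(page.find_all())):
        out.append(page.find_all()[ix].text)
        page.navigate()
        page.navigate()
    return out
```

Calling read_with_cached_handles on a page with more than one item raises StaleReference on the second item, while read_with_fresh_handles returns every text, which is the same difference as between the two get_content versions.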

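To do it the way you want, using clicks only, the code below will work. I changed the [class$='answered'] > strong selector to [class*='answered'] > strong, because otherwise you get an error whenever a question has an accepted answer. If you only want the questions without an accepted answer, modify the script as needed.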

def get_content(link):
    driver.get(link)
    while True:
        # Keep only the number of summaries; element references would go stale after navigation.
        count = len(WebDriverWait(driver, 10).until(
            EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".question-summary"))))
        for ix in range(count):
            # Re-locate the summary by index on every pass, because the listing is rebuilt after driver.back().
            question = driver.find_elements(By.CSS_SELECTOR, ".question-summary")[ix]
            answers_count = question.find_element(By.CSS_SELECTOR, "[class*='answered'] > strong").text

            # Open the question with a JavaScript click, then read the heading link text from the inner page.
            link_elem = question.find_element(By.CSS_SELECTOR, "a.question-hyperlink")
            driver.execute_script("arguments[0].click();", link_elem)
            name = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "h1[itemprop='name'] > a"))).text
            print(answers_count, name)
            driver.back()
        try:
            # Go to the next listing page; stop when no "next" link appears within the timeout.
            next_page = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "a[rel='next']")))
            driver.execute_script("arguments[0].click();", next_page)
        except Exception:
            break
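On the selector change: [class$='answered'] matches when the whole class attribute value ends with "answered", while [class*='answered'] matches when the value contains it anywhere. The sketch below models the two operators with plain string checks. The sample class values are assumptions made up for illustration, not copied from the page, and the helper names are hypothetical.

```python
def matches_suffix(class_attr, needle):
    """Rough model of [class$='needle']: the attribute value ends with needle."""
    return class_attr.endswith(needle)


def matches_substring(class_attr, needle):
    """Rough model of [class*='needle']: the attribute value contains needle."""
    return needle in class_attr


# Assumed class attribute values, for illustration only.
samples = ["status answered", "status unanswered", "status answered-accepted"]

for value in samples:
    print(value, matches_suffix(value, "answered"), matches_substring(value, "answered"))
```

With these assumed values, only the substring form matches a value that carries something after "answered", which is the case that made the original selector fail on accepted answers.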
