Scraping text from PubMed: find_element_by_css_selector vs. visibility_of_all_elements_located

Problem description

I am trying to get the abstract of an article from PubMed. If I go straight to the article link with the code below, I get the abstract I want.


driver = webdriver.Chrome(executable_path="../drivers/chromedriver.exe")
driver.get("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6268174/")
time.sleep(randint(1, 5))
abstract = driver.find_element_by_css_selector("div#ABS1 p").text

However, I have a list of more than a thousand articles whose abstracts I need, so I wrote the following automation script:

import time
from random import randint
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Define article name, i.e. Artificial intelligence in radiology
name = "Artificial intelligence in radiology"

# Invoke Chrome and go to PubMed website
driver = webdriver.Chrome(executable_path="../drivers/chromedriver.exe")
driver.get("https://pubmed.ncbi.nlm.nih.gov")
print("Accessing " + driver.title)
print(driver.current_url)

# Enter research article
time.sleep(randint(1, 5))
driver.find_element_by_css_selector("input[type='search']").send_keys(name)

# Click search
time.sleep(randint(1, 5))
driver.find_element_by_css_selector("span[class='usa-search-submit-text']").click()

# Click on the article link
time.sleep(randint(1, 5))
driver.find_element_by_css_selector("a[class='docsum-title']").click()

# Click to navigate to full text
time.sleep(randint(1, 5))
driver.find_element_by_css_selector("a[data-ga-action='PMC']").click()

# Get abstract
time.sleep(randint(1, 5))
abstract =  WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#ABS1 p")))

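I use the same selector, div#ABS1 p, as in the earlier code, but it does not work and raises a timeout exception. What causes this difference, and how can I fix it?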

Tags: python, selenium, web-scraping

Solution


It depends: which line raises the timeout exception?

In any case, replace the time.sleep lines with explicit waits:

# Each expected condition takes a single (By, selector) tuple as its locator.

# Enter research article: wait until the search box is present in the DOM
inputWait = EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='search']"))
WebDriverWait(driver, 10).until(inputWait)
driver.find_element_by_css_selector("input[type='search']").send_keys(name)

# Click search: wait until the submit button can be clicked
spanWait = EC.element_to_be_clickable((By.CSS_SELECTOR, "span[class='usa-search-submit-text']"))
WebDriverWait(driver, 10).until(spanWait)
driver.find_element_by_css_selector("span[class='usa-search-submit-text']").click()

# ...and so on for the remaining steps
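
To illustrate why an explicit wait tends to be more reliable than a fixed time.sleep, here is a simplified, Selenium-free sketch of the polling idea behind it. It is not Selenium's actual implementation: the names wait_until and WaitTimeout are hypothetical, and the "element" is simulated by a plain function that succeeds on its third call.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises WaitTimeout
    once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page element that only "appears" on the third check.
checks = {"count": 0}


def element_appears():
    checks["count"] += 1
    return "abstract text" if checks["count"] >= 3 else None


found = wait_until(element_appears, timeout=2.0, poll=0.01)
print(found, "after", checks["count"], "checks")
```

The wait returns as soon as the condition holds instead of always sleeping for a fixed interval, and it fails with a clear exception when the condition is never met, which also shows which step is the one timing out.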

In general, I prefer XPath over CSS selectors. Also, you could probably do this with requests and BeautifulSoup instead of Selenium.
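
As a rough sketch of the browser-free approach, the example below parses the abstract out of static HTML. To keep it free of third-party dependencies it uses the standard library's urllib and html.parser as stand-ins for requests and BeautifulSoup. It assumes the abstract sits in paragraphs inside div#ABS1, as in the question, and that the server returns that markup to a non-browser client; neither is verified here. AbstractParser, extract_abstract, fetch_abstract, and the User-Agent value are illustrative choices, not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class AbstractParser(HTMLParser):
    """Collect the text of <p> elements nested inside <div id="ABS1">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0          # > 0 while inside the target div
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the target
            elif dict(attrs).get("id") == "ABS1":
                self.div_depth = 1   # entering the target div
        elif tag == "p" and self.div_depth > 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.div_depth > 0:
            self.paragraphs[-1] += data


def extract_abstract(html):
    """Return the abstract paragraphs joined by newlines, or '' if none found."""
    parser = AbstractParser()
    parser.feed(html)
    parser.close()
    return "\n".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())


def fetch_abstract(url):
    """Download a page and extract its abstract (not called in this sketch)."""
    # Assumption: a browser-like User-Agent is sent, since some servers
    # reject requests that carry none.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_abstract(html)


# Offline demonstration on a small hand-written HTML fragment.
sample = """
<div id="ABS1"><h2>Abstract</h2>
  <div><p>First <i>paragraph</i>.</p></div>
  <p>Second paragraph.</p>
</div>
<div id="other"><p>Not part of the abstract.</p></div>
"""
print(extract_abstract(sample))
```

With requests and BeautifulSoup installed, the download would become a requests.get call and the parsing step a CSS query such as soup.select("div#ABS1 p"), which replaces the hand-written parser class.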

