Unable to download research articles from Sci-Hub using browser emulation with Selenium

Problem description

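I am trying to automatically download research articles from Sci-Hub (https://sci-hub.scihubtw.tw/) based on their titles. I use a library called scholarly (https://pypi.org/project/scholarly/) to fetch the URL and author information associated with a given article title, as shown in the code below.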

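I use the URL fetched above to emulate the download process on Sci-Hub. However, I cannot download directly because I cannot press the open button on the search page (https://sci-hub.scihubtw.tw/). Pressing Enter after filling in the query forwards me to another page that has an open button. For some reason I cannot fetch and press that open button: Selenium always returns a null element for it.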

However, I can run the following in the browser console and the file downloads successfully:

document.querySelector("#open-button").click()

Trying to get the same result from Selenium fails, though.

Please help me solve this problem.

## This part of code fetches url using scholarly library from google scholar
from scholarly import scholarly
search_query = scholarly.search_pubs('Hydrogen-hydrogen pair correlation function in liquid water')
search_query = [query for query in search_query][0]


## This part of code uses selenium to automate download process
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import time

download_dir = '/Users/cacsag4/Downloads'

# setup the browser
options = webdriver.ChromeOptions()

options.add_experimental_option('prefs', {
    "download.default_directory": download_dir, #Change default directory for downloads
    "download.prompt_for_download": False, #To auto download the file
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True #It will not show PDF directly in chrome
})

browser = webdriver.Chrome('./chromedriver', options=options)
browser.delete_all_cookies()

browser.get('https://sci-hub.scihubtw.tw/')

# Find the search element to send the url string to it
searchElem = browser.find_element(By.CSS_SELECTOR, 'input[type="textbox"]')
searchElem.send_keys(search_query.bib['url'])

# Emulate pressing enter two different ways, either by pressing return key or by executing JS
#searchElem.send_keys(Keys.ENTER) # This produces the same effect as the next line
browser.execute_script("javascript:document.forms[0].submit()")

# Wait for page to load
time.sleep(10)

# Try to press the open button, either via JS or by fetching the button by its ID

# This raises an error because the element with id "open-button" cannot be found
browser.execute_script('javascript:document.querySelector("#open-button").click()')

#openElem = browser.find_element(By.ID, "open-button") ## This also returns no element

Tags: javascript, python-3.x, selenium-chromedriver, cross-site

Solution


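OK, so I found the answer to this question. Sci-Hub embeds the PDF in an iframe, so you only need to read the iframe's src attribute after pressing Enter on the first page and navigate to it. The full script below does this.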
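Before the full script, here is a small standalone illustration of what "reading the iframe's src attribute" amounts to, using only the Python standard library. The HTML snippet, the example.org URLs, and the names IframeSrcParser and first_iframe_src are made up for this sketch; they are not Sci-Hub's real markup and not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_iframe_src(html, page_url):
    """Return the absolute URL of the first iframe's src, or None if there is none."""
    parser = IframeSrcParser()
    parser.feed(html)
    if not parser.sources:
        return None
    # urljoin resolves relative and protocol-relative ("//host/path") values
    return urljoin(page_url, parser.sources[0])


# Hypothetical markup, for illustration only
sample_html = (
    '<div id="article">'
    '<iframe id="pdf" src="//example.org/paper.pdf#view=FitH"></iframe>'
    '</div>'
)
pdf_url = first_iframe_src(sample_html, "https://example.org/search")
print(pdf_url)
```

In the Selenium version the same lookup is done on the live page with find_element and get_attribute('src'), which also sees iframes that are added by JavaScript after the page loads.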

from scholarly import scholarly
search_query = scholarly.search_pubs('Hydrogen-hydrogen pair correlation function in liquid water')
search_query = [query for query in search_query][0]
print(search_query.bib['url'])


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import time

download_dir = '/Users/cacsag4/Downloads'

# setup the browser
options = webdriver.ChromeOptions()

options.add_experimental_option('prefs', {
    "download.default_directory": download_dir, #Change default directory for downloads
    "download.prompt_for_download": False, #To auto download the file
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True #It will not show PDF directly in chrome
})

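# Note: Selenium 4.10+ no longer accepts the driver path as a positional argument;
# on those versions pass it through selenium.webdriver.chrome.service.Service instead
browser = webdriver.Chrome('./chromedriver', options=options)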
browser.delete_all_cookies()

browser.get('https://sci-hub.scihubtw.tw/')

# Find the search element to send the url string to it
searchElem = browser.find_element(By.CSS_SELECTOR, 'input[type="textbox"]')
searchElem.send_keys(search_query.bib['url'])
# Emulate pressing enter two different ways, either by pressing return key or by executing JS
#searchElem.send_keys(Keys.ENTER) # This produces the same effect as the next line
browser.execute_script("javascript:document.forms[0].submit()")

# Wait for page to load
time.sleep(2)

# Clicking the open button is not needed, so the earlier attempt is left commented out
#browser.execute_script('javascript:document.querySelector("#open-button").click()')

# The PDF is embedded in an iframe: read its src and navigate to it to trigger the download
openElem = browser.find_element(By.CSS_SELECTOR, "iframe")
browser.get(openElem.get_attribute('src'))
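One thing the script above does not cover is knowing when the download has finished; if the script exits or closes the browser right after browser.get, the file may be incomplete. A minimal sketch of one way to handle this, assuming Chrome's usual behavior of writing in-progress downloads with a ".crdownload" suffix, is to poll the download directory. The function name wait_for_new_file and its parameters are hypothetical, and the demo uses a temporary directory rather than a real browser.

```python
import os
import tempfile
import time


def wait_for_new_file(directory, known_files, timeout=60.0, poll_interval=0.5):
    """Wait for a finished file to appear in directory that is not in known_files.

    Returns the full path of the new file, or None if the timeout expires while
    no new file exists or a '.crdownload' partial file is still present.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = set(os.listdir(directory))
        in_progress = [name for name in current if name.endswith(".crdownload")]
        finished = sorted(
            name for name in current - set(known_files)
            if not name.endswith(".crdownload")
        )
        if finished and not in_progress:
            return os.path.join(directory, finished[0])
        time.sleep(poll_interval)
    return None


# Demo with a temporary directory standing in for download_dir
with tempfile.TemporaryDirectory() as demo_dir:
    before = set(os.listdir(demo_dir))      # snapshot taken before the download starts
    with open(os.path.join(demo_dir, "paper.pdf"), "wb") as handle:
        handle.write(b"%PDF-1.4")           # stand-in for the downloaded file
    result = wait_for_new_file(demo_dir, before, timeout=2.0, poll_interval=0.1)
    print(os.path.basename(result))
```

With the script above, the snapshot would be taken from download_dir just before the final browser.get call, and wait_for_new_file(download_dir, snapshot) called right after it.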
