javascript - Unable to download research articles from Sci-Hub using browser emulation with Selenium
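Problem description
I am trying to automatically download research articles from Sci-Hub ( https://sci-hub.scihubtw.tw/ ) based on their titles. I use a library called scholarly (https://pypi.org/project/scholarly/) to fetch the URL and author information for a given article title, as shown in the code below.
I then use the fetched URL to emulate the download process on Sci-Hub. However, I cannot download the file directly, because I cannot press the open button on the search page (https://sci-hub.scihubtw.tw/). Pressing enter after filling in the query forwards me to another page that has an open button. For some reason I cannot locate and press that button from Selenium; the lookup always returns an empty element.
However, I can run the following in the browser console and the file downloads successfully:
document.querySelector("#open-button").click()
Trying to get the same result from Selenium fails.
Please help me solve this.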
## This part of the code fetches the article URL from Google Scholar using the scholarly library
from scholarly import scholarly
search_query = scholarly.search_pubs('Hydrogen-hydrogen pair correlation function in liquid water')
search_query = [query for query in search_query][0]
## This part of the code uses Selenium to automate the download process
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
download_dir = '/Users/cacsag4/Downloads'
# Set up the browser
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    "download.default_directory": download_dir,  # Change the default download directory
    "download.prompt_for_download": False,  # Download without prompting
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # Download PDFs instead of showing them in Chrome
})
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service('./chromedriver'), options=options)
browser.delete_all_cookies()
browser.get('https://sci-hub.scihubtw.tw/')
# Find the search box and type the article URL into it
searchElem = browser.find_element(By.CSS_SELECTOR, 'input[type="textbox"]')
searchElem.send_keys(search_query.bib['url'])
# Submit the form, either by sending the return key or by executing JS
#searchElem.send_keys(Keys.ENTER)  # Same effect as the next line
browser.execute_script("document.forms[0].submit()")
# Wait for the result page to load
time.sleep(10)
# Try to press the open button, either through JS or by looking it up by its ID
# This raises an error because the #open-button element cannot be found
browser.execute_script('document.querySelector("#open-button").click()')
#openElem = browser.find_element(By.ID, "open-button")  # This also fails to find the element
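Solution
OK, so I found the answer to this question. Sci-Hub serves its PDF inside an iframe, so you only need to read the iframe's src attribute after pressing enter on the first page. The code below does the job.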
from scholarly import scholarly
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Fetch the URL of the first Google Scholar result for the article title
search_query = scholarly.search_pubs('Hydrogen-hydrogen pair correlation function in liquid water')
search_query = next(search_query)  # Only the first result is needed
print(search_query.bib['url'])

download_dir = '/Users/cacsag4/Downloads'
# Set up the browser
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    "download.default_directory": download_dir,  # Change the default download directory
    "download.prompt_for_download": False,  # Download without prompting
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # Download PDFs instead of showing them in Chrome
})
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service('./chromedriver'), options=options)
browser.delete_all_cookies()
browser.get('https://sci-hub.scihubtw.tw/')
# Find the search box and type the article URL into it
searchElem = browser.find_element(By.CSS_SELECTOR, 'input[type="textbox"]')
searchElem.send_keys(search_query.bib['url'])
# Submit the form, either by sending the return key or by executing JS
#searchElem.send_keys(Keys.ENTER)  # Same effect as the next line
browser.execute_script("document.forms[0].submit()")
# Wait for the result page to load
time.sleep(2)
# The PDF is embedded in an iframe, so read the iframe's src and open that URL directly;
# with the prefs above, Chrome then downloads the PDF instead of displaying it
openElem = browser.find_element(By.CSS_SELECTOR, "iframe")
browser.get(openElem.get_attribute('src'))
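The fixed time.sleep(2) may be too short on a slow connection and needlessly long on a fast one. Here is a minimal sketch of a polling alternative. It assumes only that the driver object exposes Selenium's find_elements(by, value) method; wait_for_iframe_src is a hypothetical helper name, not part of Selenium. Selenium's own WebDriverWait is the more usual tool for this. The sketch sticks to the standard library so the polling logic is easy to see.

```python
import time


def wait_for_iframe_src(driver, timeout=10.0, poll=0.5):
    """Poll the page until an <iframe> with a non-empty src exists, then return that src.

    `driver` is assumed to be a Selenium WebDriver (or any object with the same
    find_elements(by, value) method). Raises TimeoutError if nothing shows up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        # find_elements returns an empty list (rather than raising) when nothing matches.
        for frame in driver.find_elements("css selector", "iframe"):
            src = frame.get_attribute("src")
            if src:
                return src
        if time.monotonic() >= deadline:
            raise TimeoutError("no iframe with a src attribute appeared in time")
        time.sleep(poll)
```

With this helper, the last three lines of the script would become browser.get(wait_for_iframe_src(browser)).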