Google Scholar blocks me from using search_pubs

Problem description

I am using PyCharm Community Edition 2020.3.2, scholarly version 1.0.2, and Tor version 1.0.0. I am trying to scrape 700 articles to look up their citation counts. Google Scholar is blocking me from using search_pubs (a scholarly function), although another scholarly function, search_author, still works fine. At first, search_pubs worked correctly. I tried this code:

from scholarly import scholarly
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

After a few runs, it raised the following error:

Traceback (most recent call last):
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-3bbcfb742cb5>", line 1, in <module>
    scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_scholarly.py", line 121, in search_pubs
    return self.__nav.search_publications(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 256, in search_publications
    return _SearchScholarIterator(self, url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 53, in __init__
    self._load_url(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\publication_parser.py", line 58, in _load_url
    self._soup = self._nav._get_soup(url)
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 200, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
  File "C:\Users\binhd\anaconda3\envs\t2\lib\site-packages\scholarly\_navigator.py", line 152, in _get_page
    raise Exception("Cannot fetch the page from Google Scholar.")
Exception: Cannot fetch the page from Google Scholar.

Then I found out the cause: I need to pass Google's CAPTCHA before I can continue fetching data from Google Scholar. Many people suggested using a proxy because my IP had been blocked by Google. I tried switching proxies with FreeProxies():

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

It didn't work, and PyCharm froze for a long time. Then I installed Tor (pip install Tor) and tried again:

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.Tor_External(tor_sock_port=9050, tor_control_port=9051, tor_password="scholarly_password")
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')
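One thing worth checking before blaming scholarly: Tor_External does not start Tor for you — it expects a Tor service that is already running and listening locally (installing a Python package does not launch the Tor daemon). A minimal stdlib check, assuming Tor's default local ports 9050 (SOCKS) and 9051 (control), might look like:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Tor's default SOCKS and control ports (assumption: default local config)
print("SOCKS 9050 listening:", port_open("127.0.0.1", 9050))
print("Control 9051 listening:", port_open("127.0.0.1", 9051))
```

If both print False, Tor is not actually running, and Tor_External will fail regardless of how scholarly is configured.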

That didn't work either. Then I tried SingleProxy():

from scholarly import scholarly, ProxyGenerator
pg = ProxyGenerator()
pg.SingleProxy(https='socks5://127.0.0.1:9050', http='socks5://127.0.0.1:9050')
scholarly.use_proxy(pg)
scholarly.search_pubs('Large Batch Optimization for Deep Learning: Training BERT in 76 minutes')

It didn't work either. I have never tried Luminati because I am not familiar with it. If anyone knows a solution, please help!

Tags: python, proxy, tor, google-scholar

Solution
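One common workaround (not guaranteed, since Google actively blocks scrapers) is to slow down and retry: space the 700 queries out with randomized delays and back off exponentially when a fetch fails, instead of firing them in a burst. A stdlib sketch of that pattern, where `fetch` is a hypothetical callable standing in for a wrapper around scholarly.search_pubs, could look like:

```python
import random
import time

def fetch_with_backoff(fetch, query, max_retries=5, base_delay=2.0):
    """Call fetch(query), retrying with exponential backoff plus jitter.

    fetch is any callable that raises on failure -- e.g. a wrapper
    around scholarly.search_pubs (hypothetical; adapt to your setup).
    """
    for attempt in range(max_retries):
        try:
            return fetch(query)
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~2s, 4s, 8s, ... plus noise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

Sleeping a few seconds between successful queries also helps. If the IP is already flagged, a paid rotating-proxy service is usually more reliable than FreeProxies or Tor; recent versions of scholarly's ProxyGenerator also offer ScraperAPI and Luminati options for this.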

