Google Scholar blocks python scrapy with Captcha

Problem description

I am debugging a short script to fetch citation counts and abstracts for a list of papers. While debugging, I ran into a captcha block, even though I run the script at most once every 4-5 minutes. Here is a minimal working example that reproduces my problem:

import re

import scrapy
from scrapy.crawler import CrawlerProcess

class ResArt_Spider(scrapy.Spider):
    name = "restart_spider"

    def start_requests(self):
        url_start = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=GHG+emission+pathways+until+2300+for+the+1.5%C2%B0C+temperature+rise+target+and+the+mitigation+costs+achieving+the+pathways&btnG="
        yield scrapy.Request(url = url_start, callback = self.parse_metrics)

    def parse_metrics(self, response):
        # scrape to extract abstract and citations
        citation_block = response.css('body > div#gs_top > div#gs_bdy ::text').extract()
        print(citation_block)

if __name__ == "__main__":
    getArt = CrawlerProcess()
    getArt.crawl(ResArt_Spider)
    getArt.start()

For a while, I was able to get a list back and search it for the citations and the abstract. I fetch only one item per run to minimize the requests to Google Scholar, even though I am only debugging and issuing a request every 4-5 minutes.

Here is a truncated version of the response to the scrapy call:

2020-11-02 14:13:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2020-11-02 14:13:02 [scrapy.utils.log] INFO: Versions: ... Python 3.6.2 |Anaconda ...
2020-11-02 14:13:02 [scrapy.crawler] INFO: Overridden settings: {}
2020-11-02 14:13:02 [scrapy.middleware] INFO: Enabled extensions:
...
2020-11-02 14:13:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
...
2020-11-02 14:13:02 [scrapy.middleware] INFO: Enabled spider middlewares:
...
2020-11-02 14:13:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-11-02 14:13:02 [scrapy.core.engine] INFO: Spider opened
2020-11-02 14:13:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-11-02 14:13:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2020-11-02 14:13:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scholar.google.com/scholar?
hl=en&as_sdt=0%2C5&q=GHG+emission+pathways+until+2300+for+the+1.5%C2%B0C+temperature+rise+target+and+
the+mitigation+costs+achieving+the+pathways&btnG=> (referer: None)
['#gs_captcha_ccl{max-width:680px;margin:21px 0;}.gs_el_sm #gs_captcha_ccl{margin:13px 0;}
#gs_captcha_ccl h1{font-size:16px;line-height:24px;font-weight:normal;padding:0 0 16px 0;}',
'function gs_captcha_cb(){grecaptcha.render("gs_captcha_c",
{"sitekey":"6LfFDwUTAAAAAIyC8IeC3aGLqVpvrB6ZpkfmAibj","callback":function()
{document.getElementById("gs_captcha_f").submit()}});};', "Please show you're not a robot", 
"Sorry, we can't verify that you're not a robot when JavaScript is turned off.", 'Please ', 
'enable JavaScript', ' in your browser and reload this page.']
2020-11-02 14:13:03 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-02 14:13:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...

I inserted line breaks into the `[scrapy.core.engine] DEBUG: Crawled (200)` line and into the returned list so that both are easier to read.
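One practical safeguard is to check the extracted text for the captcha page's telltale strings before trying to parse citations out of it. A minimal sketch, assuming a hypothetical `looks_like_captcha` helper of my own; the marker strings come from the scraped output shown in the log above:

```python
def looks_like_captcha(texts):
    """Return True if the extracted text looks like Scholar's captcha page."""
    # Both strings appear in the captcha response captured in the log above.
    markers = ("gs_captcha", "not a robot")
    return any(marker in text for text in texts for marker in markers)

# Strings taken from the captcha response in the log:
blocked = ["Please show you're not a robot", "enable JavaScript"]
# Illustrative stand-in for a normal result page:
normal = ["GHG emission pathways until 2300 ...", "Cited by ..."]

print(looks_like_captcha(blocked))  # True
print(looks_like_captcha(normal))   # False
```

In `parse_metrics`, such a check lets the spider log the block and stop cleanly instead of printing the raw captcha markup.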

Question:

Related (but unhelpful) SO Q&A:

Tags: python, web-scraping, scrapy

Solution

Try Google Cache, and set a `referer` header.

Also, be careful not to send more than 2 requests per second, or you may get blocked:

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/",
}
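The Google Cache suggestion amounts to prefixing the target URL with the cache endpoint so that Scholar itself is never hit. A sketch, assuming the standard `webcache.googleusercontent.com` endpoint (the `cache_url` helper is my own, not part of Scrapy):

```python
def cache_url(url):
    """Rewrite a URL to fetch Google's cached copy instead of the live page."""
    return "https://webcache.googleusercontent.com/search?q=cache:" + url

url_start = "https://scholar.google.com/scholar?hl=en&q=..."
print(cache_url(url_start))
# https://webcache.googleusercontent.com/search?q=cache:https://scholar.google.com/scholar?hl=en&q=...
```

The headers above can then be passed per request via `scrapy.Request(url, headers=headers, ...)`, and the 2 requests/second ceiling can be enforced globally with `CrawlerProcess(settings={"DOWNLOAD_DELAY": 0.5})`.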
