How to stop the crawler

Problem description

I am trying to write a crawler that visits a website and searches for a list of keywords, with a max depth of 2. The scraper should stop as soon as any of the keywords appears on any page, but the problem I am facing is that the crawler does not stop the first time it sees a keyword.

It keeps going even after trying an early return, a break, the CloseSpider exception, and even Python's exit().
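For reference, the depth limit mentioned above is usually enforced through Scrapy's DEPTH_LIMIT setting rather than inside the callback; a minimal sketch, assuming that is how the max depth of 2 is applied here:

# settings.py (a sketch, an assumption about how the depth limit is set):
# pages more than two link-hops away from start_urls are not followed.
DEPTH_LIMIT = 2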

My crawler class:

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    allowed_domains = ["www.roomtoread.org"]
    start_urls = ["https://" + "www.roomtoread.org"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        crawl_count = self.__class__.crawl_count

        wordlist = [
            "sfdc",
            "pardot",
            "Web-to-Lead",
            "salesforce",
            ]

        url = response.url
        contenttype = response.headers.get("content-type", "").decode('utf-8').lower()
        data = response.body.decode('utf-8')

        for word in wordlist:
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                ok = False
                if not ok:
                    if self.__class__.words_found == 0:
                        self.__class__.words_found += 1
                        print(word + "," + url + ";")
                        STOP!

        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []

I want it to stop executing at the point marked STOP! when "if not ok:" is True.
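The code above also calls find_all_substrings, which is not shown in the question. A minimal sketch of what such a helper might look like, assuming it simply returns the start positions of every occurrence of the word in the page body:

def find_all_substrings(text, substring):
    # Assumed helper: collect the start index of every occurrence of
    # `substring` in `text`, scanning left to right with str.find().
    positions = []
    start = text.find(substring)
    while start != -1:
        positions.append(start)
        start = text.find(substring, start + 1)
    return positions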

Tags: web-scraping, scrapy

Solution


When I want to stop a spider, I usually use the scrapy.exceptions.CloseSpider(reason='cancelled') exception from the Scrapy docs.

The example there shows how you can use it:

if 'Bandwidth exceeded' in response.body:
    raise CloseSpider('bandwidth_exceeded')

In your case, something like

if not ok:
    raise CloseSpider('keyword_found')
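Put together with the callback from your question, a minimal sketch could look like the following. The keyword check is simplified to a plain substring test and the reason string 'keyword_found' is arbitrary; everything else mirrors your spider.

from scrapy import Item
from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WebsiteSpider(CrawlSpider):
    name = "webcrawler"
    allowed_domains = ["www.roomtoread.org"]
    start_urls = ["https://www.roomtoread.org"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    def check_buzzwords(self, response):
        wordlist = ["sfdc", "pardot", "Web-to-Lead", "salesforce"]
        data = response.body.decode('utf-8')
        for word in wordlist:
            if word in data:
                # First keyword hit: report it, then ask Scrapy to close the spider.
                print(word + "," + response.url + ";")
                raise CloseSpider('keyword_found')
        return Item()

Note that CloseSpider shuts the spider down gracefully, so requests that are already scheduled or in flight may still be processed before the crawl fully stops; a few extra pages being handled after the keyword is found is expected and not a sign that the stop was ignored.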

Or is this what you meant by the "CloseSpider command" you said you had already tried?

