Using Scrapy, how can I check whether the links on a single page are allowed by the robots.txt file?

Problem description

Using Scrapy, I am crawling a single page (via a script, not from the console) to check whether all of the links on that page are allowed by the robots.txt file.

In the scrapy.robotstxt.RobotParser abstract base class I found the method allowed(url, user_agent), but I don't know how to use it.

import scrapy
from scrapy.linkextractors import LinkExtractor

class TestSpider(scrapy.Spider):
    name = "TestSpider"

    def __init__(self):
        super(TestSpider, self).__init__()
               
    def start_requests(self):
        yield scrapy.Request(url='http://httpbin.org/', callback=self.parse)

    def parse(self, response):
        if 200 <= response.status < 300:
            links = LinkExtractor().extract_links(response)
            for idx, link in enumerate(links):
                # How can I check whether each link is allowed by the robots.txt file?
                # => allowed(link.url, '*')

                # self.crawler.engine.downloader.middleware.middlewares
                # self.crawler  AttributeError: 'TestSpider' object has no attribute 'crawler'
                pass

The "TestSpider" spider is run with the following set in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Go to the project's top-level directory and run:

scrapy crawl TestSpider

Any help is appreciated.

My solution:

import scrapy
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.utils.httpobj import urlparse_cached
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TestSpider(CrawlSpider):
    name = "TestSpider"
    allowed_domains = ["httpbin.org"]  # required: __init__ below reads self.allowed_domains

    def __init__(self):
        super(TestSpider, self).__init__()
        self.le = LinkExtractor(unique=True, allow_domains=self.allowed_domains)
        self._rules = [
            Rule(self.le, callback=self.parse)
        ]

    def start_requests(self):
        yield scrapy.Request(url='http://httpbin.org/', callback=self.parse_robotstxt)

    def parse_robotstxt(self, response):
        # Locate the RobotsTxtMiddleware among the enabled downloader middlewares.
        robotstxt_middleware = None
        for middleware in self.crawler.engine.downloader.middleware.middlewares:
            if isinstance(middleware, RobotsTxtMiddleware):
                robotstxt_middleware = middleware
                break

        # Reuse the parser the middleware has already built for this domain.
        url = urlparse_cached(response)
        netloc = url.netloc
        self._robotsTxtParser = None
        if robotstxt_middleware and netloc in robotstxt_middleware._parsers:
            self._robotsTxtParser = robotstxt_middleware._parsers[netloc]

        return self.parse(response)

    def parse(self, response):
        if 200 <= response.status < 300:
            links = self.le.extract_links(response)
            for idx, link in enumerate(links):
                # Check whether the link target is forbidden by robots.txt
                if self._robotsTxtParser:
                    if not self._robotsTxtParser.allowed(link.url, "*"):
                        print(link.url, 'disallowed by robots.txt')

Tags: python, scrapy

Solution

The parser implementations are a little further up on the page you linked to.

Protego parser

Based on Protego:

  • implemented in Python
  • compliant with Google's robots.txt specification
  • supports wildcard matching
  • uses length-based rules

Scrapy uses this parser by default.

So if you want the same result that Scrapy gives you by default, use Protego.

Usage is as follows (robotstxt is the content of a robots.txt file):

>>> from protego import Protego
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
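
If you want to run this check from inside a spider like the one in the question, here is a minimal sketch. It assumes the site publishes its rules at http://httpbin.org/robots.txt; the spider name RobotsCheckSpider and the callbacks parse_robots / check_links are illustrative names, not part of any Scrapy API.

import scrapy
from protego import Protego
from scrapy.linkextractors import LinkExtractor

class RobotsCheckSpider(scrapy.Spider):
    name = "RobotsCheckSpider"  # illustrative name

    def start_requests(self):
        # Fetch robots.txt first, because its rules are needed before any link check.
        yield scrapy.Request("http://httpbin.org/robots.txt", callback=self.parse_robots)

    def parse_robots(self, response):
        # Build a Protego parser from the raw robots.txt body.
        self.rp = Protego.parse(response.text)
        yield scrapy.Request("http://httpbin.org/", callback=self.check_links)

    def check_links(self, response):
        for link in LinkExtractor(unique=True).extract_links(response):
            # The same can_fetch() check as above, applied to every extracted link.
            if not self.rp.can_fetch(link.url, "*"):
                self.logger.info("%s is disallowed by robots.txt", link.url)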

It is also possible to identify and reuse the robots middleware that is currently in use, but for most use cases this is probably more trouble than it is worth.

Edit:

If you really want to reuse the middleware, your spider can reach the downloader middlewares through self.crawler.engine.downloader.middleware.middlewares.
From there you need to identify the robots middleware (probably by its class name?) and the parser you need (from the middleware's _parsers attribute).
Finally, you would use that parser's allowed() method (the scrapy.robotstxt.RobotParser interface mentioned in the question) to check your links.
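
As a rough sketch of that approach (the helper name report_disallowed_links is made up here, and _parsers is a private attribute that only holds a ready parser once robots.txt for the domain has been downloaded), it could look like this:

from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.utils.httpobj import urlparse_cached

def report_disallowed_links(spider, response, link_extractor):
    # Find the robots middleware among the enabled downloader middlewares.
    robots_mw = next(
        (mw for mw in spider.crawler.engine.downloader.middleware.middlewares
         if isinstance(mw, RobotsTxtMiddleware)),
        None,
    )
    if robots_mw is None:
        return

    # _parsers maps a netloc to a parser object once robots.txt has been
    # downloaded for that domain (it may still hold a Deferred while in flight).
    netloc = urlparse_cached(response).netloc
    parser = robots_mw._parsers.get(netloc)
    if parser is None or not hasattr(parser, "allowed"):
        return

    for link in link_extractor.extract_links(response):
        if not parser.allowed(link.url, "*"):
            spider.logger.info("%s is disallowed by robots.txt", link.url)

A spider callback could then call report_disallowed_links(self, response, self.le), much like the parse() method of the solution spider above.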

