Scrapy and Incapsula

Problem description

I am trying to retrieve data from the website "whoscored.com" using Scrapy with Splash. These are my settings:

BOT_NAME = 'scrapy_matchs'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_matchs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 20
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapy_matchs.pipelines.ScrapyMatchsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 30
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
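
(For illustration, a stripped-down spider that exercises these settings through Splash might look like the one below; the start URL and the parse callback are placeholders, not my actual crawling logic.)

# Illustrative only: a minimal spider wired to the scrapy_splash settings above.
import scrapy
from scrapy_splash import SplashRequest


class MatchsSpider(scrapy.Spider):
    name = 'matchs'

    def start_requests(self):
        # Render the page in Splash before it reaches parse().
        yield SplashRequest('https://www.whoscored.com/',
                            callback=self.parse,
                            args={'wait': 5})

    def parse(self, response):
        # Placeholder: the real spider would extract the match data here.
        self.logger.info('Fetched %s (%d bytes)', response.url, len(response.body))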

Before this I was using Splash on its own, and I could request at least 2 or 3 pages before being blocked by Incapsula. With Scrapy, though, I am blocked immediately after the first request:

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript">
  </script>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=22&amp;xinfo=14-58014137-0%200NNN%20RT%281572446923864%2084%29%20q%280%20-1%20-1%202%29%20r%280%20-1%29%20B17%284%2c200%2c0%29%20U18&amp;incident_id=727001300034907080-167681622137047086&amp;edet=17&amp;cinfo=04000000&amp;rpinfo=0" width="100%">
   Request unsuccessful. Incapsula incident ID: 727001300034907080-167681622137047086
  </iframe>
 </body>
</html>

Why am I blocked so easily? Should I change my settings?

Thanks in advance.

Tags: python, web-scraping, scrapy, splash-js-render

Solution


Is it possible they have a record of your earlier scraping activity, scraping that Scrapy wasn't responsible for? Did you have:

USER_AGENT = 'scrapy_matchs (+http://www.yourdomain.com)'

This part also makes me think of my own web server log files, which contain URLs like github.com/masscan. If the domain is associated with scraping, or it contains the phrase scrapy, I wouldn't feel bad about banning it. Definitely obey the robots.txt rules; a bot that doesn't even check it makes you look bad ;) And I wouldn't use that many user agents. I also like the idea of taking the headers a browser sends to the site by default and using those instead of your own. If I had a website being hit by a lot of crawling, I could imagine filtering users based on whether their request headers look odd or abnormal.
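
As a rough sketch of that idea, something like the following could replace the stripped-down headers in settings.py. The exact values below are only illustrative; copy the real ones from the Network tab of your browser's developer tools.

# Illustrative only: mimic what a real browser actually sends instead of the
# minimal header set above (values copied from a browser session).
DEFAULT_REQUEST_HEADERS = {
    'Accept': ('text/html,application/xhtml+xml,application/xml;q=0.9,'
               'image/avif,image/webp,*/*;q=0.8'),
    'Accept-Encoding': 'gzip, deflate',   # 'none' is an unusual value that stands out
    'Accept-Language': 'en-US,en;q=0.9',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/119.0.0.0 Safari/537.36'),
}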

I would suggest that you...

  1. Run an nmap scan against the site to find out which web server they are using.
  2. Install and set up that same server on your local machine with the most basic configuration. (Turn on all the logging options; most servers ship with some of them off.)
  3. Check that server's log files and compare what your scraping traffic looks like with what a browser connecting to the site looks like.
  4. Then work out how to make the former look exactly like the latter.
  5. If none of that mitigates the problem, don't use Scrapy: just use Selenium with a real user agent to drive your way through the site, and run your crawling code on the pages the browser automation retrieves (see the sketch after this list).
  6. I would also suggest using a different IP, via a proxy or some other method, because your IP may well be on a ban list somewhere.
  7. The AWS free tier would be an easy way to check the site's security: if they let you reach the site through an ssh proxy port you set up on the machine connected to the AWS server, it means they are not banning AWS servers, which I would read as weak security, since practically every AWS server on the planet seems to scan my Pi daily.
  8. Do the work from the library next to a Starbucks, next to a... anywhere with free wifi and a different IP address would be good.
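
A rough sketch of the Selenium fallback from step 5, assuming Chrome and chromedriver are installed locally; the user-agent string and the parse_matches() helper are placeholders for your own choices:

# Illustrative only: drive a real browser, then hand the rendered HTML to your
# existing parsing code. parse_matches() is a hypothetical stand-in for it.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/119.0.0.0 Safari/537.36')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.whoscored.com/')
    html = driver.page_source   # the rendered page, after any JS challenges run
    # parse_matches(html)       # hypothetical: reuse your extraction logic here
finally:
    driver.quit()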
