python - How do I use a proxy in a Scrapy crawler?
Problem description
I am trying to scrape a website that can only be reached through a proxy. I have created a Scrapy project named scrapy_crawler.
I have read that I need to enable HttpProxyMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
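For context: Scrapy's built-in HttpProxyMiddleware picks up the proxy from each request's meta['proxy'] key (or from the http_proxy/https_proxy environment variables), and when the proxy URL embeds credentials it strips them out and sends them as a basic-auth Proxy-Authorization header instead. The sketch below illustrates that splitting step with only the standard library; it is an illustration of the idea, not Scrapy's actual implementation, and the proxy URL is a placeholder.

```python
import base64
from urllib.parse import urlsplit, urlunsplit

def split_proxy_url(proxy_url):
    """Illustrative sketch of what a proxy middleware does with a
    credentialed proxy URL: strip user:password from the netloc and
    turn it into a basic-auth Proxy-Authorization header value.
    (Not Scrapy's actual code.)"""
    parts = urlsplit(proxy_url)
    if parts.username:
        creds = f"{parts.username}:{parts.password or ''}".encode()
        auth_header = b"Basic " + base64.b64encode(creds)
        # Rebuild the proxy URL without the embedded credentials.
        netloc = parts.hostname
        if parts.port:
            netloc += f":{parts.port}"
        bare_url = urlunsplit((parts.scheme, netloc, parts.path,
                               parts.query, parts.fragment))
        return bare_url, auth_header
    return proxy_url, None

# Placeholder credentials, hypothetical host and port:
url, auth = split_proxy_url("http://username:password@myproxy:8080")
# url  -> "http://myproxy:8080"
# auth -> b"Basic dXNlcm5hbWU6cGFzc3dvcmQ="
```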
After that I am a bit lost. I think I need to attach the proxy to each request, but I am not sure where to do that. I tried the following in the middlewares.py file:
def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.
    # Must return only requests (not items).
    for r in start_requests:
        r.meta['proxy'] = 'http://username:password@myproxy:port'
        yield r
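One thing worth noting: process_start_requests is a spider-middleware hook, so it only runs if the class containing it is registered under SPIDER_MIDDLEWARES in settings.py; enabling HttpProxyMiddleware alone does not activate it. Independently of Scrapy, you can sanity-check that the proxy URL itself is usable with only the standard library. The snippet below uses the placeholder credentials from the question; substitute a real host and a numeric port before fetching anything.

```python
import urllib.request

# Placeholder proxy URL from the question; replace username, password,
# host, and port with real values before making a request.
PROXY = "http://username:password@myproxy:8080"

# Route both http and https traffic through the proxy.
handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(handler)

# opener.open("https://mywebsite.com/...") would now go through the proxy;
# it is not called here because the placeholder host does not resolve.
```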
Here is the digtionary.py file for reference:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImdbCrawler(CrawlSpider):
    name = 'digtionary'
    allowed_domains = ['www.mywebsite.com']
    start_urls = ['https://mywebsite.com/digital/pages/start.aspx#']
    rules = (Rule(LinkExtractor(),),)
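Since this is a CrawlSpider, another place the proxy could be attached is the Rule itself: Rule accepts a process_request callable that sees every followed request (in Scrapy 2.0+ it is called with (request, response); older versions pass only the request). A minimal sketch, using the placeholder proxy URL from the question:

```python
# Sketch: tag every request the Rule follows with the proxy, so that
# HttpProxyMiddleware picks it up. The proxy URL is a placeholder.
def attach_proxy(request, response=None):
    request.meta['proxy'] = 'http://username:password@myproxy:port'
    return request

# Inside the spider class this would be wired up as:
# rules = (Rule(LinkExtractor(), process_request=attach_proxy),)
```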
Any help would be greatly appreciated. Thanks in advance.