scrapy - Scrapy stuck in an infinite loop because of process_links
Problem Description
I'm using Scrapy 2.1.0 and want to add parameters to every request via link_filtering. This works, but I'm running into an infinite loop because the duplicate filter seems to be affected by it.
rules = (
    Rule(
        LinkExtractor(
            allow=[r'^(example)?\/(?!ratgeber)[a-z-]+\/(\?p=\d+)?$'],
            restrict_xpaths=['//div[@class="sidebar--categories-navigation"]',  # only navi panel
                             '//div[contains(@class,"panel--paging")]/a'],      # include pagination
        ),
        follow=True,
        process_links='link_filtering',
        callback='parse_item',
    ),
)
Adding the link filtering:
# get the max number of results per category and add n=x results to the url
def link_filtering(self, links):
    for link in links:
        if re.match(r'.*\?.*', link.url) is None:  # add all parameters if there are none
            link.url = "%s?p=1&followSearch=10000&o=1&n=1000" % link.url
        else:  # add the max number of results to the pagination
            link.url = "%s&followSearch=10000&o=1&n=1000" % link.url
    return links
The crawler keeps scraping the same URLs over and over again. How can I prevent this while keeping the added parameters?
Solution
Would it help to canonicalize the rewritten URL?

from w3lib.url import canonicalize_url

and then, inside link_filtering:

link.url = canonicalize_url(link.url)

while keeping the original return links at the end.
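For reference, a minimal sketch of how that suggestion could be folded into the link_filtering method from the question. The parameter string and the regex check are taken from the question; the canonicalize_url call is the only addition, and the ExampleSpider class skeleton around it is just a placeholder:

import re

from scrapy.spiders import CrawlSpider
from w3lib.url import canonicalize_url


class ExampleSpider(CrawlSpider):
    # name, start_urls and the rules from the question go here

    def link_filtering(self, links):
        for link in links:
            if re.match(r'.*\?.*', link.url) is None:  # add all parameters if there are none
                link.url = "%s?p=1&followSearch=10000&o=1&n=1000" % link.url
            else:  # add the max number of results to the pagination
                link.url = "%s&followSearch=10000&o=1&n=1000" % link.url
            # canonicalize_url sorts the query arguments and normalizes escaping,
            # so links pointing at the same page end up as identical URL strings
            link.url = canonicalize_url(link.url)
        return links

The idea is that after canonicalization, repeat visits to the same category page produce the same URL string, which the duplicate filter can then drop while the added parameters stay in place. Whether this alone breaks the loop depends on how the site builds its pagination links.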