Scrapy stuck in an infinite loop because of process_links

Problem description

I am using Scrapy 2.1.0 and want to add parameters to every request via link_filtering. This works, but I run into an infinite loop, because the duplicate filter seems to be affected by the change.

rules = (
    Rule(
        LinkExtractor(
            allow=[r'^(example)?\/(?!ratgeber)[a-z-]+\/(\?p=\d+)?$'],
            restrict_xpaths=['//div[@class="sidebar--categories-navigation"]',  # only the navigation panel
                             '//div[contains(@class,"panel--paging")]/a'],      # include pagination
        ),
        follow=True,
        process_links='link_filtering',
        callback='parse_item'
    ),
)

The link filtering that adds the parameters:

import re

# Get the maximum number of results per category by adding n=x to the URL.
def link_filtering(self, links):
    for link in links:
        if re.match(r'.*\?.*', link.url) is None:  # no query string yet: add all parameters
            link.url = "%s?p=1&followSearch=10000&o=1&n=1000" % link.url
        else:  # pagination link already has a query string: only append the max result count
            link.url = "%s&followSearch=10000&o=1&n=1000" % link.url
    return links

The spider keeps crawling the same URLs over and over again. How can I prevent this while keeping the added parameters?

Tags: scrapy

Solution


from w3lib.url import canonicalize_url

then

link.url = canonicalize_url(link.url)

Would that help?

And keep the original return statement.
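
Putting the two pieces together, a minimal sketch of a revised link_filtering (assuming it still lives on the spider class from the question, with the same parameter values) could look like this:

import re

from w3lib.url import canonicalize_url


def link_filtering(self, links):
    for link in links:
        if re.match(r'.*\?.*', link.url) is None:  # no query string yet: add all parameters
            link.url = "%s?p=1&followSearch=10000&o=1&n=1000" % link.url
        else:  # pagination link already has a query string: only append the max result count
            link.url = "%s&followSearch=10000&o=1&n=1000" % link.url
        # Normalize the modified URL (sorted query arguments, consistent escaping)
        link.url = canonicalize_url(link.url)
    return links

canonicalize_url sorts the query arguments and normalizes the escaping, the idea being that the same logical URL always produces the same string, so the duplicate filter can recognize repeats even after the extra parameters have been appended.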
