Scrapy: editing links extracted by a Rule

Problem description

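I am testing a spider on Amazon that parses products. I have the correct XPath for the product links, but I want to rewrite each link so that it matches "https://www.amazon.com/dp/{}".format("ASIN"), i.e. strip the extra parts from the URL. I also have a regular expression for that, but Scrapy raises an error when I use process_value with Link Extractors. How can I fix this?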

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from myamazon.items import MyamazonItem
import re
class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://amazon.com/']


    rules = (Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a/@href')),
            Rule(LinkExtractor(restrict_xpaths='//a[@class="a-link-normal a-text-normal"]'),callback="parse",
                process_value= lambda i:f"https://www.amazon.com/dp/{re.search('dp/(.*)/',i).groups()[0]}")
        )

Error:

    process_value= lambda i:re.serach('dp/(.*)/',i).groups()[0])
TypeError: __init__() got an unexpected keyword argument 'process_value'

Tags: python, scrapy

Solution


It looks like you are passing the process_value argument to Rule() instead of LinkExtractor().

Let's reformat your code:

rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths='//li[@class="a-last"]/a/@href'
        )
    ),
    Rule(
        LinkExtractor(
            restrict_xpaths='//a[@class="a-link-normal a-text-normal"]'
        ),
        callback="parse",
        process_value=lambda i: f"https://www.amazon.com/dp/{re.search('dp/(.*)/', i).groups()[0]}"
    )
)

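Laid out this way, it is easier to see that process_value is being passed to Rule(). scrapy.spiders.Rule does not accept a process_value argument, but LinkExtractor does, so the argument has to move inside the LinkExtractor(...) call.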

