python - Scrapy - Avoiding duplicate items when recursively crawling multiple pages

Question

What changes should I make to my code so that Scrapy does not retrieve the same items during a deep crawl across multiple pages?

Right now, Scrapy performs the crawl and scrape like this:

Visit Page-A >> ScrapeItem1 & Extract_link_to_Page-B >> Visit Page-B >> ScrapeItem2 & Extract_links_to_Pages-C-D-E >> ScrapeItems2-3-4-5 from Pages-C-D-E

The code looks like this:
def category_page(self, response):
    # keep the selector (no .extract()) so extract_first() works below
    next_page = response.xpath('')
    for item in self.parse_attr(response):
        yield item
    if next_page:
        path = next_page.extract_first()
        nextpage = response.urljoin(path)
        yield scrapy.Request(nextpage, callback=self.category_page)

def parse_attr(self, response):
    item = TradeItem()
    item['NameOfCompany'] = response.xpath('').extract_first().strip()
    item['Country'] = response.xpath('').extract_first().strip()
    item['TrustPt'] = response.xpath('').extract_first().strip()
    company_page = response.xpath('').extract_first()
    if company_page:
        company_page = response.urljoin(company_page)
        request = scrapy.Request(company_page, callback=self.company_data)
        request.meta['item'] = item
        yield request
    else:
        yield item

def company_data(self, response):
    item = response.meta['item']
    item['Address'] = response.xpath('').extract()[1]
    product_page = response.xpath('').extract()[1]
    sell_page = response.xpath('').extract()[2]
    trust_page = response.xpath('').extract()[4]
    if sell_page:
        sell_page = response.urljoin(sell_page)
        request = scrapy.Request(sell_page, callback=self.sell_data)
        request.meta['item3'] = item
        yield request
    if product_page:
        product_page = response.urljoin(product_page)
        request = scrapy.Request(product_page, callback=self.product_data)
        request.meta['item2'] = item
        yield request
    if trust_page:
        trust_page = response.urljoin(trust_page)
        request = scrapy.Request(trust_page, callback=self.trust_data)
        request.meta['item4'] = item
        yield request
    yield item

def product_data(self, response):
    item = response.meta['item2']
    item['SoldProducts'] = response.xpath('').extract()
    yield item

def sell_data(self, response):
    item = response.meta['item3']
    item['SellOffers'] = response.xpath('').extract()
    yield item

def trust_data(self, response):
    item = response.meta['item4']
    item['TrustData'] = response.xpath('').extract()
    yield item
The problem is that the items come out duplicated, because Scrapy emits a partial scrape for each callback/meta item. So I get entries like this:
Step 1:
{'Address': u'',
 'Country': u'',
 'NameOfCompany': u'',
 'TrustPoints': u''}
Step 2:
{'Address': u'',
 'Country': u'',
 'NameOfCompany': u'',
 'SellOffers': [],
 'TrustPoints': u''}
Step 3:
{'Address': u'',
 'Country': u'',
 'NameOfCompany': u'',
 'SellOffers': [],
 'SoldProducts': [u' '],
 'TrustData': [u''],
 'TrustPoints': u''}
Each step repeats the values of the previous step. I know this is caused by Scrapy visiting the URLs multiple times. There is some mistake in my logic that I can't quite pin down.
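The duplication can be reproduced without Scrapy at all: every callback yields the same shared, mutable item, so the exporter records one snapshot per yield. A minimal plain-Python sketch of that control flow (the callback names mirror the spider above; the tuples stand in for scrapy.Request + meta, and the field values are made up for illustration):

```python
# Simulation of the buggy flow: every callback yields the shared item,
# so the "exporter" (the collected list) receives partial duplicates.

def company_data(item):
    item['Address'] = 'addr'
    yield item                          # premature yield -> partial record
    yield ('request', product_data, item)

def product_data(item):
    item['SoldProducts'] = ['p1']
    yield item                          # final yield -> second record

def crawl():
    exported = []
    pending = [('request', company_data, {'NameOfCompany': 'ACME'})]
    while pending:
        _, callback, it = pending.pop()
        for out in callback(it):
            if isinstance(out, tuple):      # a follow-up "request"
                pending.append(out)
            else:
                exported.append(dict(out))  # exporter snapshots the item
    return exported

records = crawl()
# Two records for one company: the first is missing 'SoldProducts'.
```

This matches the output above: each step's record repeats the previous step's fields plus whatever the current callback filled in.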
Answer

Problem solved. The relevant answer is here:

https://stackoverflow.com/a/16177544/11008259

I corrected the code for my case:
def parse_attr(self, response):
    company_page = response.xpath('').extract_first()
    company_page = response.urljoin(company_page)
    request = scrapy.Request(company_page, callback=self.company_data)
    yield request

def company_data(self, response):
    item = TradekeyItem()
    item['Address'] = response.xpath('').extract()[1]
    item['NameOfCompany'] = response.xpath('').extract()[1]
    product_page = response.xpath('').extract()[1]
    product_page = response.urljoin(product_page)
    request = scrapy.Request(product_page, callback=self.product_data, meta={'item': item})
    return request

def product_data(self, response):
    item = response.meta['item']
    item['SoldProducts'] = response.xpath('').extract()
    sell_page = response.xpath('').extract()[2]
    sell_page = response.urljoin(sell_page)
    request = scrapy.Request(sell_page, callback=self.sell_data, meta={'item': item})
    return request

def sell_data(self, response):
    item = response.meta['item']
    item['SellOffers'] = response.xpath('').extract()
    trust_page = response.xpath('').extract()[4]
    trust_page = response.urljoin(trust_page)
    request = scrapy.Request(trust_page, callback=self.trust_data, meta={'item': item})
    return request

def trust_data(self, response):
    item = response.meta['item']
    item['TrustData'] = response.xpath('').extract()
    yield item
We build a chain between the callbacks by yielding the item only at the last step, instead of at every step. Each callback returns a request for the next one, so the item is only written out once all of them have finished running.
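The corrected chaining can be sketched the same Scrapy-free way: each callback mutates the shared item and hands back only a follow-up "request"; only the last callback emits the item, so the exporter sees exactly one complete record. (Again, the callback names mirror the spider, the tuples stand in for scrapy.Request + meta, and the field values are invented for illustration.)

```python
# Simulation of the corrected flow: intermediate callbacks return only a
# follow-up "request"; the item is emitted once, fully populated.

def company_data(item):
    item['Address'] = 'addr'
    return ('request', product_data, item)   # no yield of the item here

def product_data(item):
    item['SoldProducts'] = ['p1']
    return ('request', sell_data, item)

def sell_data(item):
    item['SellOffers'] = ['o1']
    return item                              # last step: emit once, complete

def crawl():
    exported = []
    out = company_data({'NameOfCompany': 'ACME'})
    while isinstance(out, tuple):            # follow the request chain
        _, callback, it = out
        out = callback(it)
    exported.append(out)                     # exactly one finished record
    return exported

records = crawl()
```

In real Scrapy code the same hand-off is what `meta={'item': item}` does (newer Scrapy versions also offer `cb_kwargs` for this); the essential point is that only the final callback yields the item.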