python - How do I scrape containers on a website when their content differs?
Problem description
I want to scrape this website: https://www.dhgate.com/wholesale/electronics-robots/c103032.html
I have written the following Scrapy spider:
import scrapy
from urllib.parse import urljoin

class DhgateSpider(scrapy.Spider):
    name = 'dhgate'
    allowed_domains = ['dhgate.com']
    start_urls = ['https://www.dhgate.com/wholesale/electronics-robots/c103032.html']

    def parse(self, response):
        Product = response.xpath('//*[@class="pro-title"]/a/@title').extract()
        Price = response.xpath('//*[@class="price"]/span/text()').extract()
        Customer_review = response.xpath('//*[@class="reviewnum"]/span/text()').extract()
        Seller = response.xpath('//*[@class="seller"]/a/text()').extract()
        Feedback = response.xpath('//*[@class="feedback"]/span/text()').extract()
        for item in zip(Product, Price, Customer_review, Seller, Feedback):
            scraped_info = {
                'Product': item[0],
                'Price': item[1],
                'Customer_review': item[2],
                'Seller': item[3],
                'Feedback': item[4],
            }
            yield scraped_info
        next_page_url = response.xpath('//*[@class="next"]/@href').extract_first()
        if next_page_url:
            next_page_url = urljoin('https:', next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
The problem is that not every container has a customer-review or feedback element, so the spider only yields products that have all five fields: Product, Price, Customer_review, Seller, and Feedback. I want to scrape every container, and where there is no customer_review I want to store an empty value instead. How do I do that? Thanks.
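The failure mode can be reproduced in plain Python. The page-wide XPath queries return flat lists, so a product with no reviews simply contributes nothing to the reviews list; zip then both truncates to the shortest list and misaligns every row after the gap (the sample data below is invented for illustration):

```python
products = ["Robot A", "Robot B", "Robot C"]
reviews = ["5 Reviews", "2 Reviews"]  # "Robot B" has no reviews on the page

rows = list(zip(products, reviews))
# zip stops at the shortest list, so "Robot C" is dropped entirely,
# and "Robot C"'s review count is wrongly attached to "Robot B":
assert rows == [("Robot A", "5 Reviews"), ("Robot B", "2 Reviews")]
```

This is why extracting each field relative to its own container node, as in the solution below, is the right fix: the association between product and review is kept per container instead of being inferred by list position.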
Solution
Don't use zip. Iterate over one container node per product and extract each field relative to that node; extract_first() then returns None for any field the container lacks, so the rows stay aligned:
def parse(self, response):
    # One iteration per product container; all field XPaths are relative
    # (note the leading ".") so they only search inside the current node.
    for product_node in response.xpath('//div[@id="proList"]/div[contains(@class, "listitem")]'):
        Product = product_node.xpath('.//*[@class="pro-title"]/a/@title').extract_first()
        Price = product_node.xpath('.//*[@class="price"]/span/text()').extract_first()
        Customer_review = product_node.xpath('.//*[@class="reviewnum"]/span/text()').extract_first()
        Seller = product_node.xpath('.//*[@class="seller"]/a/text()').extract_first()
        Feedback = product_node.xpath('.//*[@class="feedback"]/span/text()').extract_first()
        scraped_info = {
            'Product': Product,
            'Price': Price,
            'Customer_review': Customer_review,
            'Seller': Seller,
            'Feedback': Feedback,
        }
        yield scraped_info
    next_page_url = response.xpath('//*[@class="next"]/@href').extract_first()
    if next_page_url:
        next_page_url = urljoin('https:', next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)