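python - Crawled 0 pages, scraped 0 items
Problem description
I have just started learning Python and Scrapy.
My first project is to crawl a website that contains web security information. But when I run it from cmd, it reports
Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
and nothing seems to come out. I would be grateful if someone could help me solve this.
Here are my project files:
Items: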
import scrapy


class ReporteinmobiliarioItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    titulo = scrapy.Field()
    precioAlquiler = scrapy.Field()
    ubicacion = scrapy.Field()
    descripcion = scrapy.Field()
    superficieTotal = scrapy.Field()
    superficieCubierta = scrapy.Field()
    antiguedad = scrapy.Field()
    pass
Spider:
import scrapy
from scrapy.spider import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from reporteInmobiliario.items import ReporteinmobiliarioItem


class reporteInmobiliario(CrawlSpider):
    name = 'reporteInmobiliario'
    allowed_domains = ['zonaprop.com.ar/']
    item_count = 0
    start_urls = ['https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html']
    rules = {
        # For each item
        Rule(LinkExtractor(allow = (), restrict_xpaths = ('//li[@class="pagination-action-next"]/a'))),
        Rule(LinkExtractor(allow = (), restrict_xpaths = ('//h4[@class="aviso-data-title"]')),
             callback = 'parse_item', follow = False)
    }

    def parse_item(self,response):
        rp_item = ReporteinmobiliarioItem()
        rp_item['titulo']= response.xpath('//div[@class="card-title"]/text()').extract()
        rp_item['precioAlquiler'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[2]/div[1]/div[2]/p/strong)').extract()
        rp_item['ubicacion'] = response.xpath('normalize-space(//*[@id="map"]/div[1]/div/ul/li)').extract()
        rp_item['descripcion'] = response.xpath('normalize-space(//*[@id="id-descipcion-aviso"]').extract()
        rp_item['superficieTotal'] = response.xpath('//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[1]/div[4]/div[1]/div[1]/div/ul/li[4]/span)').extract()
        rp_item['superficieCubierta'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[1]/div[4]/div[1]/div[1]/div/ul/li[5]/span)').extract()
        rp_item['antiguedad'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[1]/div[4]/div[1]/div[1]/div/ul/li[6]/span)').extract()
        self.item_count += 1
        if self.item_count > 5:
            raise CloseSpider('item_exceeded')
        yield rp_item
Solution
Always check the log first:
2018-09-09 09:19:21 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.zonaprop.com.ar': <GET https://www.zonaprop.com.ar/propiedades/galpon-de-337-m2-7-79-x-43-30-ma-metros-de-av-43096244.html>
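The offsite middleware is dropping your requests because the entry in allowed_domains ends with a slash ('zonaprop.com.ar/'); it has to be a bare domain name. Your first rule also has an error: the exact test @class="pagination-action-next" does not match, apparently because the attribute on the page holds more than that bare name, so the spider below uses contains(@class, "pagination-action-next") instead. Also don't forget to fix the XPath errors in parse_item (the unbalanced parentheses in the descripcion and superficieTotal expressions)!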
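To see why a trailing slash is enough to filter every request, here is a minimal sketch of host-based filtering. It is a simplified illustration using only the standard library, not Scrapy's actual middleware code; the helper names build_host_regex and is_offsite are hypothetical.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile a pattern accepting each allowed domain and any subdomain of it."""
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(rf"^(.*\.)?({escaped})$")


def is_offsite(url, allowed_domains):
    """Return True when the URL's host matches none of the allowed domains."""
    host = urlparse(url).hostname or ""
    return build_host_regex(allowed_domains).search(host) is None


url = "https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html"

# A hostname never contains "/", so the entry with a trailing slash matches nothing.
print(is_offsite(url, ["zonaprop.com.ar/"]))  # True: the request would be filtered
# The bare domain matches the host and its "www." subdomain.
print(is_offsite(url, ["zonaprop.com.ar"]))   # False: the request would go through
```

The comparison is made against the host part of the URL only, which is why the value in allowed_domains should never include a scheme, a path, or a slash.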
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from reporteInmobiliario.items import ReporteinmobiliarioItem


class reporteInmobiliario(CrawlSpider):
    name = 'reporteInmobiliario'
    # Bare domain, no trailing slash; otherwise every request is filtered as offsite
    allowed_domains = ['zonaprop.com.ar']
    item_count = 0
    start_urls = ['https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html']
    # A tuple (not a set) keeps the rules in the order they are written
    rules = (
        # Follow the "next page" link
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[contains(@class, "pagination-action-next")]/a'))),
        # For each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//h4[@class="aviso-data-title"]')),
             callback='parse_item'),
    )

    def parse_item(self, response):
        rp_item = ReporteinmobiliarioItem()
        rp_item['titulo'] = response.xpath('//div[@class="card-title"]/text()').extract()
        rp_item['precioAlquiler'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[2]/div[1]/div[2]/p/strong)').extract()
        rp_item['ubicacion'] = response.xpath('normalize-space(//*[@id="map"]/div[1]/div/ul/li)').extract()
        # Closing parenthesis of normalize-space() added
        rp_item['descripcion'] = response.xpath('normalize-space(//*[@id="id-descipcion-aviso"])').extract()
        # Stray ")" balanced by wrapping in normalize-space(), like the sibling fields
        rp_item['superficieTotal'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[1]/div[4]/div[1]/div[1]/div/ul/li[4]/span)').extract()
        rp_item['superficieCubierta'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[1]/div[4]/div[1]/div[1]/div/ul/li[5]/span)').extract()
        rp_item['antiguedad'] = response.xpath('normalize-space(//*[@id="layout-content"]/div[1]/div[1]/div[2]/div[1]/div[4]/div[1]/div[1]/div/ul/li[6]/span)').extract()
        # Stop the crawl after five items
        self.item_count += 1
        if self.item_count > 5:
            raise CloseSpider('item_exceeded')
        yield rp_item
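The pagination rule in the spider above matches the class with contains() rather than with an exact comparison. The following sketch shows the difference on a small piece of markup. The markup is hypothetical (the real page was not inspected here, and the extra-class token is invented), and it uses the standard library's ElementTree, which supports exact attribute tests but not contains(), so the class check is done in Python through the hypothetical helper has_class.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the class attribute carries a second token.
SNIPPET = """
<ul>
  <li class="pagination-action-next extra-class"><a href="/page-2.html">Next</a></li>
</ul>
"""

root = ET.fromstring(SNIPPET)

# Exact comparison, as in //li[@class="pagination-action-next"]/a:
# the whole attribute value must be equal, so nothing is found.
exact = root.findall(".//li[@class='pagination-action-next']/a")


def has_class(element, name):
    """Return True when `name` is one of the space-separated class tokens."""
    return name in element.get("class", "").split()


# Token test, which is what the contains(@class, ...) rule is after.
by_token = [li.find("a") for li in root.iter("li")
            if has_class(li, "pagination-action-next")]

print(len(exact))                         # 0
print([a.get("href") for a in by_token])  # ['/page-2.html']
```

Note that XPath's contains() is a plain substring test, so it is slightly looser than the token check shown here; for a class name as specific as this one, the difference is unlikely to matter.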