python - Scrapy CrawlSpider not crawling
Problem
I have read a lot here and on other sites about Scrapy, but I can't solve this problem, so I'm asking you :P Hopefully somebody can help me.
I want to log in on the main client page, then parse all the categories and all the products, and save the title, category, quantity and price of each product.
My code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class combatzone_spider(CrawlSpider):
    name = 'combatzone_spider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']
    rules = (
        Rule(LinkExtractor(allow=r'/category.php?id=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=True),
        Rule(LinkExtractor(allow=r'goods.php?id=\d+'),follow=True,callback='parse_items'),
    )

def init_request(self):
    logging.info("You are in initRequest")
    return Request(url=self,callback=self.login)

def login(self,response):
    logging.info("You are in login")
    return scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

def check_login_response(self,response):
    logging.info("You are in checkLogin")
    if "Hola,XXXX" in response.body:
        self.log("Succesfully logged in.")
        return self.initialized()
    else:
        self.log("Something wrong in login.")

def parse_items(self,response):
    logging.info("You are in item")
    item = scrapy.loader.ItemLoader(article(),response)
    item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
    item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
    item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
    item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
    yield item.load_item()
When I run scrapy crawl combatzone_spider in the terminal, I get this:
(SCRAPY) pi@raspberry:~/SCRAPY/combatzone/combatzone/spiders $ scrapy crawl combatzone_spider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module scrapy.contrib.spiders is deprecated, use scrapy.spiders instead
  from scrapy.contrib.spiders.init import InitSpider
/home/pi/SCRAPY/combatzone/combatzone/spiders/combatzone_spider.py:9: ScrapyDeprecationWarning: Module scrapy.contrib.spiders.init is deprecated, use scrapy.spiders.init instead
  from scrapy.contrib.spiders.init import InitSpider
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: combatzone)
2018-07-24 22:14:53 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.13 (default, Nov 24 2017, 17:33:09) - [GCC 6.3.0 20170516], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Linux-4.9.0-6-686-i686-with-debian-9.5
2018-07-24 22:14:53 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'combatzone.spiders', 'SPIDER_MODULES': ['combatzone.spiders'], 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'combatzone'}
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 22:14:53 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-07-24 22:14:53 [scrapy.core.engine] INFO: Spider opened
2018-07-24 22:14:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 22:14:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7152,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 21, 14, 54, 410938),
 'log_count/INFO': 7,
 'memusage/max': 36139008,
 'memusage/startup': 36139008,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 24, 21, 14, 53, 998619)}
2018-07-24 22:14:54 [scrapy.core.engine] INFO: Spider closed (finished)
The spider doesn't seem to do anything. Any idea why this happens? Thanks a lot, guys :D
Solution
There are two problems:
- The first is the regular expressions: you should escape the "?". For example, /category.php?id=\d+ should be /category.php\?id=\d+ (note the "\?"). With the unescaped patterns, the link extractors never match any URL, so the crawl ends after the start page.
- The second is that you should indent all the methods; otherwise they are not inside the combatzone spider class and are never called.
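To see why the escape matters: LinkExtractor's allow patterns are regular expressions, and in a regex an unescaped "?" means "zero or one of the preceding character" rather than a literal question mark. A quick standalone check with Python's re module (the URL is just an example in the shape the spider crawls):

```python
import re

url = "/category.php?id=12"

# Unescaped, '?' makes the preceding 'p' optional, so the pattern can
# never match the literal question mark in the URL -> no match at all:
print(re.search(r'/category.php?id=\d+', url))             # None

# Escaped, '\?' matches the literal '?' and the URL is recognised:
print(re.search(r'/category.php\?id=\d+', url).group())    # /category.php?id=12
```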
As for the login, I tried to get your code working but failed. I usually override start_requests to log in before the crawl starts.
Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
import logging

class article(Item):
    category = Field()
    title = Field()
    quantity = Field()
    price = Field()

class CombatZoneSpider(CrawlSpider):
    name = 'CombatZoneSpider'
    allowed_domains = ['www.combatzone.es']
    start_urls = ['http://www.combatzone.es/areadeclientes/']
    rules = (
        # escape "?"
        Rule(LinkExtractor(allow=r'category.php\?id=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'&page=\d+'),follow=False),
        Rule(LinkExtractor(allow=r'goods.php\?id=\d+'),follow=False,callback='parse_items'),
    )

    def parse_items(self,response):
        logging.info("You are in item")
        # This is used to print the results
        selector = scrapy.Selector(response=response)
        res = selector.xpath("/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()").extract()
        self.logger.info(res)
        # item = scrapy.loader.ItemLoader(article(),response)
        # item.add_xpath('category','/html/body/div[3]/div[2]/div[2]/a[2]/text()')
        # item.add_xpath('title','/html/body/div[3]/div[2]/div[2]/div/div[2]/h1/text()')
        # item.add_xpath('quantity','//*[@id="ECS_FORMBUY"]/div[1]/ul/li[2]/font/text()')
        # item.add_xpath('price','//*[@id="ECS_RANKPRICE_2"]/text()')
        # yield item.load_item()

    # login part
    # I didn't test whether the login itself works because I have no account,
    # but these callbacks will print something to the console.
    def start_requests(self):
        logging.info("You are in initRequest")
        return [scrapy.Request(url="http://www.combatzone.es/areadeclientes/user.php",callback=self.login)]

    def login(self,response):
        logging.info("You are in login")
        # generate the start_urls again:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
        # yield scrapy.FormRequest.from_response(response,formname='ECS_LOGINFORM',formdata={'username':'XXXX','password':'YYYY'},callback=self.check_login_response)

    # def check_login_response(self,response):
    #     logging.info("You are in checkLogin")
    #     if "Hola,XXXX" in response.body:
    #         self.log("Succesfully logged in.")
    #         return self.initialized()
    #     else:
    #         self.log("Something wrong in login.")
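One pitfall if you re-enable that login check: this thread ran on Python 2.7, where "Hola,XXXX" in response.body works, but on Python 3 response.body is bytes, so testing it against a str raises TypeError; use response.text instead. A minimal sketch of the check as a plain function (the greeting string "Hola,XXXX" is taken from the question and assumed to appear on the logged-in page):

```python
# Sketch of the commented-out check_login_response logic as a plain
# function. On Python 3, pass it response.text (str), not response.body
# (bytes); "Hola,XXXX" is the greeting the question matches against.
def is_logged_in(page_text):
    return "Hola,XXXX" in page_text

print(is_logged_in("<div>Hola,XXXX</div>"))     # True
print(is_logged_in("<div>Identificate</div>"))  # False
```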