How do I scrape Casenet with Scrapy FormRequest?

Problem description

I want to scrape this site: https://www.courts.mo.gov/casenet/cases/searchCases.do?searchType=name

Here is my code:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule  # scrapy.contrib.spiders is deprecated
from Challenge6.items import Challenge6Item


class CasenetSpider(scrapy.Spider):
    name = "casenet"

    def start_requests(self):
        start_urls = [
            "https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name"
        ]
        # Note: Rule/LinkExtractor only take effect as the `rules` class
        # attribute of a CrawlSpider; defined here they are never used.
        rules = (
            Rule(
                LinkExtractor(restrict_xpaths='//a[@class="button next"]'),
                callback="parse",
                follow=True,
            ),
        )
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # formdata keys and values must all be strings
        data = {
            'inputVO.lastName': 'smith',
            'inputVO.firstName': 'fred',
            'inputVO.yearFiled': '2010',
        }
        yield scrapy.FormRequest(
            url="https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name",
            formdata=data,
            callback=self.parse_pages,
        )

    def parse_pages(self, response):
        # select the result rows here, from the response this callback
        # received (a variable local to parse() is not visible here)
        for row in response.xpath('//tr[@align="left"]'):
            text = row.get()
            if "Part Name" not in text or "Address on File" not in text:
                item = Challenge6Item()
                item['name'] = row.xpath(
                    'div[@class="tags"]/a[@class="tag"]/text()'
                ).extract()
                yield item
```
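For reference, `FormRequest` url-encodes the `formdata` dict into the POST body, which is why both keys and values must be plain strings (passing the int `2010` raises an error). A minimal sketch of that encoding with the standard library, using the `inputVO.*` field names from the code above:

```python
# Sketch of what FormRequest does with `formdata`: it url-encodes the
# dict into the POST body, so keys and values must all be strings.
from urllib.parse import urlencode

data = {
    'inputVO.lastName': 'smith',
    'inputVO.firstName': 'fred',
    'inputVO.yearFiled': '2010',  # a string, not the int 2010
}
body = urlencode(data)
print(body)  # inputVO.lastName=smith&inputVO.firstName=fred&inputVO.yearFiled=2010
```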

However, I get this error:

```
/var/www/html/challenge6/Challenge6/Challenge6/spiders/casenet_crawler.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
2018-11-14 17:47:54 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: Challenge6)
2018-11-14 17:47:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.4.0-1066-aws-x86_64-with-Ubuntu-16.04-xenial
2018-11-14 17:47:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Challenge6.spiders', 'SPIDER_MODULES': ['Challenge6.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'Challenge6'}
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-11-14 17:47:55 [scrapy.core.engine] INFO: Spider opened
2018-11-14 17:47:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-14 17:47:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/robots.txt> (failed 1 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/robots.txt> (failed 2 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.courts.mo.gov/robots.txt> (failed 3 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.courts.mo.gov/robots.txt>: []
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
ResponseNeverReceived: []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 1 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 2 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 3 times): []
2018-11-14 17:47:56 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
ResponseNeverReceived: []
2018-11-14 17:47:56 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-14 17:47:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
 'downloader/request_bytes': 1455,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 11, 14, 23, 47, 56, 195277),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'memusage/max': 52514816,
 'memusage/startup': 52514816,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 11, 14, 23, 47, 55, 36009)}
```
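The traceback points at twisted's `ResponseNeverReceived` inside the downloader, not at any spider callback, so a useful first step is to check whether the host answers at all outside Scrapy. A minimal stdlib sketch (the `Mozilla/5.0` User-Agent and the `try_fetch` helper are illustrative, not part of the original code):

```python
# Fetch a URL with plain urllib to separate a network/TLS failure from a
# spider bug: if this also hangs or raises, the problem is not in Scrapy.
import ssl
from urllib.request import Request, urlopen

def try_fetch(url, user_agent="Mozilla/5.0"):
    """Return the first bytes of the response body, or the raised exception."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=10, context=ssl.create_default_context()) as resp:
            return resp.read(200)
    except Exception as exc:  # e.g. ssl.SSLError, urllib.error.URLError
        return exc

# The real check would be:
# try_fetch("https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name")
```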

What am I doing wrong?

Tags: python-2.7, web-scraping, scrapy, scrapy-spider

Solution

