Why can't Scrapy FormRequest log in?

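Problem description

I am trying to log in to https://ptab.uspto.gov/#/login with scrapy.FormRequest. My code is below. When I run it in the terminal, Scrapy outputs no items and reports that it crawled 0 pages. What in my code is preventing the login?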

import scrapy
from ..items import PatentItem
from scrapy.utils.response import open_in_browser

class LoginNeedScraper(scrapy.Spider):
    name = 'ptab'
    start_urls = ('https://ptab.uspto.gov/#/login')


    def parse(self, response):
        return scrapy.FormRequest.from_response(response,
                                                formdata={'userName':'username', 'password':'password'},
                                                callback=self.logged_in)

    def logged_in(self, response):
        open_in_browser ( response )
        item = PatentItem()
        item['message'] = response.css('h1::text').extract()
        return item

Here is the terminal output:

(Scrape) (base) Andrews-MacBook-Pro-5:patent rhodes259$ scrapy crawl ptab -o data.json
2021-03-16 01:10:02 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: patent)
2021-03-16 01:10:02 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.6.3 (v3.6.3:2c5fed86e0, Oct  3 2017, 00:32:08) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1j  16 Feb 2021), cryptography 3.4.6, Platform Darwin-19.6.0-x86_64-i386-64bit
2021-03-16 01:10:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-16 01:10:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'patent',
 'NEWSPIDER_MODULE': 'patent.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['patent.spiders']}
2021-03-16 01:10:02 [scrapy.extensions.telnet] INFO: Telnet Password: 93dadadb5f6c58a8
2021-03-16 01:10:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2021-03-16 01:10:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-16 01:10:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-16 01:10:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-16 01:10:02 [scrapy.core.engine] INFO: Spider opened
2021-03-16 01:10:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-16 01:10:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-16 01:10:02 [scrapy.core.engine] INFO: Closing spider (finished)
2021-03-16 01:10:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.006319,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 16, 5, 10, 2, 926018),
 'log_count/INFO': 10,
 'memusage/max': 60981248,
 'memusage/startup': 60981248,
 'start_time': datetime.datetime(2021, 3, 16, 5, 10, 2, 919699)}
2021-03-16 01:10:02 [scrapy.core.engine] INFO: Spider closed (finished)


Solution


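Clicking the login button sends a POST request to https://ptab.uspto.gov/ptabe2e/rest/login, not to the #/login page URL. Everything after the # is a URL fragment, which is never sent to the server, so the login form is most likely rendered by JavaScript and FormRequest.from_response probably finds no form to fill in the HTML that Scrapy downloads.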

