Scrapy Splash formrequest.formresponse: stuck identifying form data for a dynamically loaded page

Problem description

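I want to extract data from the URL shown in the code below. I already have fully working code in Selenium, and I want to switch to Scrapy for better run-time performance. However, I cannot get any further with my current code: it returns None. So I am blocked at the very first stage, with no way to move forward.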

What I tried:

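This, this, and this. Almost all of the answers point to setting some values manually, but that seems impossible in the case above. Or maybe I am missing something: if I understood how, I could set some values manually, read some via XPath, and leave some blank.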
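The mix described above (some values read from the page, some set by hand, some left blank) can be sketched with the standard library alone. This is only an illustration under assumptions: `SAMPLE_HTML`, its field values, `InputCollector`, and `build_formdata` are invented here and are not taken from the real page. Scrapy's `FormRequest.from_response` does something similar on its own: it pre-fills the fields it finds in the form, and the `formdata` argument only overrides or adds entries.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name/value pair of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if "name" in attr:
            # A missing or empty value attribute becomes an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_formdata(html, overrides):
    """Start from the fields found in the page, then apply manual overrides."""
    collector = InputCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    formdata.update(overrides)
    return formdata


# Invented stand-in for the ASP.NET form; the values are placeholders.
SAMPLE_HTML = """
<form id="form1">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="hidden" name="ctl00$hdnSessionIdleTime" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtFirno" value="" />
</form>
"""

formdata = build_formdata(
    SAMPLE_HTML,
    {
        # Values set by hand, as in the spider further down.
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$ddlDistrict",
        "ctl00$ContentPlaceHolder1$ddlDistrict": "19372",
    },
)
print(formdata)
```

Under this approach the hidden state fields such as `__VIEWSTATE` never need to be typed in; only the fields that differ from what the page already contains are listed as overrides.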

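Interestingly, the data I want to extract so far (a table of records for each unit (district) over the dates set in the form) does not need JavaScript. I mean that if I click the search button in the browser, the data is still populated even with JavaScript disabled. However, I cannot replicate that in Scrapy. So I also tried without Splash. The exact error I mentioned in that question has since been resolved with the help of comments and answers. I am sharing the link here to show my attempt without Splash.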
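If the page really works with JavaScript disabled, the search is an ordinary form POST, so the request body is just the URL-encoded field dictionary. The sketch below builds such a request with the standard library and does not send it. The `__VIEWSTATE` value is a placeholder, and only a few of the form's fields are shown; a real postback would most likely need all of them.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx"

# Shortened field set for illustration; the __VIEWSTATE value is a placeholder.
formdata = {
    "__EVENTTARGET": "ctl00$ContentPlaceHolder1$ddlDistrict",
    "__VIEWSTATE": "placeholder-read-from-first-response",
    "ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom": "03/07/2020",
    "ctl00$ContentPlaceHolder1$ddlDistrict": "19372",
}

# urlencode percent-encodes "$" as %24 and "/" as %2F.
body = urlencode(formdata).encode("ascii")

request = Request(
    URL,
    data=body,
    method="POST",
    headers={
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": URL,
    },
)

# The request is only constructed here, never sent.
print(request.get_method())
print(body.decode("ascii"))
```

Comparing this encoded body with the one the browser sends (visible in the developer tools network tab) is one way to spot a field that is missing or carries a stale value.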

My code:

import scrapy

from scrapy_splash import SplashFormRequest, SplashRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'

    script = '''
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(0.5))
          return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(
            url='https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx',
            headers={
                'Referer': 'https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx'
            },
            endpoint='execute',
            args={
                'lua_source': self.script
            },
            callback=self.parse
        )

    def parse(self, response):
        yield SplashFormRequest.from_response(

            response,
            formid='form1',
            formdata={
                '__EVENTTARGET': "ctl00$ContentPlaceHolder1$ddlDistrict",
                '__EVENTARGUMENT': "",
                '__LASTFOCUS': "",
                '__VIEWSTATE': response.xpath('//*[@id="__VIEWSTATE"]/@value').get(),
                '__VIEWSTATEGENERATOR': "6F2EA376",
                '__PREVIOUSPAGE': response.xpath('//*[@id="__PREVIOUSPAGE"]/@value').get(),
                '__EVENTVALIDATION': response.xpath('//*[@id="__EVENTVALIDATION"]/@value').get(),
                'ctl00$hdnSessionIdleTime': "",
                'ctl00$hdnUserUniqueId': "",
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationFrom': "03/07/2020",
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationFrom_ClientState': "",
                'ctl00$ContentPlaceHolder1$txtDateOfRegistrationTo': "03/07/2020",
                'ctl00$ContentPlaceHolder1$meeDateOfRegistrationTo_ClientState': "",
                'ctl00$ContentPlaceHolder1$ddlDistrict': "19372",
                'ctl00$ContentPlaceHolder1$ddlPoliceStation': "Select",
                'ctl00$ContentPlaceHolder1$txtFirno': "",
                'ctl00$ContentPlaceHolder1$ucRecordView$ddlPageSize': "0",
                'ctl00$ContentPlaceHolder1$ucGridRecordView$txtPageNumber': ""
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        police_stations = response.xpath('//*[@id="ContentPlaceHolder1_ddlPoliceStation"]/@value').get()
        print(police_stations)

Terminal output:

2020-07-16 22:34:15 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: first)
2020-07-16 22:34:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-42-generic-x86_64-with-glibc2.29
2020-07-16 22:34:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-16 22:34:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'first',
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'NEWSPIDER_MODULE': 'first.spiders',
 'SPIDER_MODULES': ['first.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 '
               'Firefox/79.0'}
2020-07-16 22:34:15 [scrapy.extensions.telnet] INFO: Telnet Password: 4b34176c2fa9d5f5
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-16 22:34:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-16 22:34:15 [scrapy.core.engine] INFO: Spider opened
2020-07-16 22:34:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-16 22:34:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-16 22:34:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx via http://localhost:8050/execute> (referer: None)
2020-07-16 22:34:19 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://citizen.mahapolice.gov.in/Citizen/MH/PublishedFIRs.aspx via http://localhost:8050/render.html> (referer: None)
None
2020-07-16 22:34:19 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-16 22:34:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6420,
 'downloader/request_count': 2,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 51224,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 3.544418,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 16, 17, 4, 19, 333815),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'memusage/max': 53067776,
 'memusage/startup': 53067776,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/200': 1,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2020, 7, 16, 17, 4, 15, 789397)}
2020-07-16 22:34:19 [scrapy.core.engine] INFO: Spider closed (finished)

Please advise.

Tags: python, scrapy, screen-scraping, scrapy-splash

Solution
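One possible reason for the `None`, offered as an assumption because it has not been checked against the live site: the `ddl` prefix in `ContentPlaceHolder1_ddlPoliceStation` suggests a `<select>` drop-down, and a `<select>` tag has no `value` attribute of its own. The selectable values sit on its child `<option>` elements, so `/@value` on the `<select>` matches nothing. The sketch below shows the difference on an invented snippet. It uses `xml.etree.ElementTree` instead of Scrapy selectors so that it runs without a Splash instance, and the option values and station names are made up.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the drop-down; the options are made up.
SAMPLE = """
<form id="form1">
  <select id="ContentPlaceHolder1_ddlPoliceStation">
    <option value="Select">Select</option>
    <option value="1001">Station A</option>
    <option value="1002">Station B</option>
  </select>
</form>
"""

root = ET.fromstring(SAMPLE)
select = root.find(".//select[@id='ContentPlaceHolder1_ddlPoliceStation']")

# The <select> tag itself has no value attribute, so this is None.
select_value = select.get("value")

# The values live on the child <option> elements.
option_values = [opt.get("value") for opt in select.findall("option")]
station_names = {opt.get("value"): opt.text for opt in select.findall("option")}

print(select_value)
print(option_values)
```

In the spider, the analogous selector would be `response.xpath('//*[@id="ContentPlaceHolder1_ddlPoliceStation"]/option/@value').getall()`. That only returns station values if the server actually answered the postback with a populated list, which is worth checking by inspecting `response.text` in the callback.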

