Session handling in scrapy-splash with custom headers

Problem description

I am using Scrapy with Splash via scrapy-splash.

After the initial request, I am having trouble staying logged in.

Here is my entire spider class:

import scrapy
from scrapy_splash import SplashRequest
import logging

class MasterSpider(scrapy.Spider):
    name = 'master'
    allowed_domains = ['www.somesite.com']
    start_url = 'https://www.somesite.com/login'

    login_script = '''
    function main(splash, args)
      splash.private_mode_enabled = false

      my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'

      headers = {
        ['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        ['User-Agent'] = my_user_agent,
        ['Accept-Language'] = 'en-US;q=0.9,en;q=0.8',
      }

      splash:set_custom_headers(headers)

      url = args.url

      assert(splash:go(url))

      assert(splash:wait(2))

      -- username input
      username_input = assert(splash:select('#username'))
      username_input:focus()
      username_input:send_text('myusername')
      assert(splash:wait(0.3))

      -- password input
      password_input = assert(splash:select('#password'))
      password_input:focus()
      password_input:send_text('mysecurepass')
      assert(splash:wait(0.3))

      -- the login button
      login_btn = assert(splash:select('#login_btn'))
      login_btn:mouse_click()
      assert(splash:wait(4))

      return {
        html = splash:html(),
        cookies = splash:get_cookies(),
      }

    end
    '''

    fruit_selection_script = '''
    function main(splash, args)
      splash:init_cookies(splash.args.cookies)
      splash.private_mode_enabled = false

      my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'

      headers = {
        ['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        ['User-Agent'] = my_user_agent,
        ['Accept-Language'] = 'en-US;q=0.9,en;q=0.8',
      }

      splash:set_custom_headers(headers)

      url = args.url

      assert(splash:go(url))

      assert(splash:wait(4))

      -- state select input
      state_select = assert(splash:select('select#fruits'))
      state_select:mouse_click()
      state_select:send_keys("<Down>")
      assert(splash:wait(0.2))
      state_select:send_keys("<Enter>")
      assert(splash:wait(0.2))

      -- game select input
      game_select = assert(splash:select('select#type'))
      game_select:mouse_click()
      game_select:send_keys("<Down>")
      assert(splash:wait(0.1))
      game_select:send_keys("<Up>")
      assert(splash:wait(0.1))

      -- the next button
      login_btn = assert(splash:select('input.submit'))
      login_btn:mouse_click()
      assert(splash:wait(4))

      return splash:html()
    end
    '''

    def start_requests(self):
        yield SplashRequest(url = self.start_url, callback = self.post_login, endpoint = 'execute', args = { 'lua_source': self.login_script })

    def post_login(self, response):
        search_link = response.urljoin(response.xpath("(//div[@id='sidebar']/ul/li)[7]/a/@href").get())

        logging.info('about to fire up second splash request')

        with open('temp.html', 'w') as f:
            f.write(response.text)

        yield SplashRequest(url = search_link, callback = self.search, endpoint = 'execute', args = { 'wait': 3, 'lua_source': self.fruit_selection_script })

    def search(self, response):
        logging.info('hey from search!')

        with open('post_search_response.html', 'w') as f:
            f.write(response.text)

    def post_search(self, response):
        logging.info('hey from post_search!')

        with open('post_search_response.html', 'w') as f:
            f.write(response.text)

    def parse(self, response):
        pass
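One detail worth noting about the listing above: because login_script returns a Lua table, post_login receives a scrapy-splash SplashJsonResponse, and the returned table is exposed as response.data. A minimal sketch of pulling the saved cookies out of it (the decoded body below is a made-up stand-in, not real site data):

```python
# Stand-in for response.data in post_login: Splash returns cookies as a
# HAR-like list of dicts ('name', 'value', 'domain', 'path', ...).
data = {
    'html': '<html>...</html>',
    'cookies': [
        {'name': 'sessionid', 'value': 'abc123',
         'domain': 'www.somesite.com', 'path': '/'},
    ],
}

# In the real callback this would be: data = response.data
session_cookies = {c['name']: c['value'] for c in data['cookies']}
print(session_cookies)  # e.g. {'sessionid': 'abc123'}
```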

The scrapy-splash documentation says:

SplashRequest sets session_id automatically for the /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, the /execute endpoint and a compatible Lua rendering script.

If you want to start from the same set of cookies, but "fork" sessions, set request.meta['splash']['new_session_id'] in addition to session_id. Request cookies will be fetched from the cookiejar session_id, but response cookies will be merged back into the new_session_id cookiejar.
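The "fork" mechanism quoted above is driven entirely by keys inside request.meta['splash']. A sketch of the shape involved ('fork-1' is an arbitrary cookiejar name I chose for illustration, not part of the API):

```python
# Sketch of the meta dict that forks a Splash cookie session.
# 'fork-1' is an arbitrary cookiejar name chosen for this example.
splash_meta = {
    'splash': {
        'endpoint': 'execute',
        'args': {'lua_source': '-- your Lua script here'},
        'session_id': 'default',     # request cookies are read from this jar
        'new_session_id': 'fork-1',  # response cookies are merged into this jar
    }
}

# The fork only makes sense when the two jar ids differ:
assert splash_meta['splash']['new_session_id'] != splash_meta['splash']['session_id']
```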

As you can see, I am always using the execute endpoint, so cookie handling should be enabled by default, right? But it isn't working, and I don't know why; I wonder whether it is because I am setting custom headers for the user agent and language?
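For comparison, the "compatible Lua rendering script" the documentation refers to follows the session-handling pattern in the scrapy-splash README: every script starts by loading the jar with splash:init_cookies(splash.args.cookies) and ends by returning the updated jar in a cookies field, so SplashCookiesMiddleware can carry cookies between requests. A stripped-down template of that round-trip (the wait time and page interaction are placeholders):

```python
# Template of a cookie-aware /execute script, following the session-handling
# example in the scrapy-splash README: load the jar first, return it last.
COOKIE_AWARE_TEMPLATE = '''
function main(splash, args)
  splash:init_cookies(splash.args.cookies)  -- load cookies sent by Scrapy

  assert(splash:go(args.url))
  assert(splash:wait(2))

  -- page interaction goes here

  return {
    html = splash:html(),
    cookies = splash:get_cookies(),  -- hand the updated jar back to Scrapy
  }
end
'''

# Both halves of the round-trip must be present for /execute cookie
# handling to work:
assert 'splash:init_cookies(splash.args.cookies)' in COOKIE_AWARE_TEMPLATE
assert 'splash:get_cookies()' in COOKIE_AWARE_TEMPLATE
```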

Right now, when the spider runs the second script (fruit_selection_script), I get a 403 Forbidden error.

What am I missing?

Tags: python, scrapy, scrapy-splash

Solution
