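python - Session handling in scrapy-splash with custom headers
Problem description
I am using Scrapy together with Splash through Scrapy-Splash.
I am having trouble staying logged in after the initial request.
Here is my entire spider class: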
import logging

import scrapy
from scrapy_splash import SplashRequest


class MasterSpider(scrapy.Spider):
    name = 'master'
    allowed_domains = ['www.somesite.com']
    start_url = 'https://www.somesite.com/login'

    # Lua script: fill in the login form, then return the page and the cookies.
    login_script = '''
    function main(splash, args)
        splash.private_mode_enabled = false
        my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'
        headers = {
            ['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            ['User-Agent'] = my_user_agent,
            ['Accept-Language'] = 'en-US;q=0.9,en;q=0.8',
        }
        splash:set_custom_headers(headers)
        url = args.url
        assert(splash:go(url))
        assert(splash:wait(2))
        -- username input
        username_input = assert(splash:select('#username'))
        username_input:focus()
        username_input:send_text('myusername')
        assert(splash:wait(0.3))
        -- password input
        password_input = assert(splash:select('#password'))
        password_input:focus()
        password_input:send_text('mysecurepass')
        assert(splash:wait(0.3))
        -- the login button
        login_btn = assert(splash:select('#login_btn'))
        login_btn:mouse_click()
        assert(splash:wait(4))
        return {
            html = splash:html(),
            cookies = splash:get_cookies(),
        }
    end
    '''

    # Lua script: restore the session cookies, then work through the search form.
    fruit_selection_script = '''
    function main(splash, args)
        splash:init_cookies(splash.args.cookies)
        splash.private_mode_enabled = false
        my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'
        headers = {
            ['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            ['User-Agent'] = my_user_agent,
            ['Accept-Language'] = 'en-US;q=0.9,en;q=0.8',
        }
        splash:set_custom_headers(headers)
        url = args.url
        assert(splash:go(url))
        assert(splash:wait(4))
        -- state select input
        state_select = assert(splash:select('select#fruits'))
        state_select:mouse_click()
        state_select:send_keys("<Down>")
        assert(splash:wait(0.2))
        state_select:send_keys("<Enter>")
        assert(splash:wait(0.2))
        -- game select input
        game_select = assert(splash:select('select#type'))
        game_select:mouse_click()
        game_select:send_keys("<Down>")
        assert(splash:wait(0.1))
        game_select:send_keys("<Up>")
        assert(splash:wait(0.1))
        -- the next button
        login_btn = assert(splash:select('input.submit'))
        login_btn:mouse_click()
        assert(splash:wait(4))
        return splash:html()
    end
    '''

    def start_requests(self):
        yield SplashRequest(
            url=self.start_url,
            callback=self.post_login,
            endpoint='execute',
            args={'lua_source': self.login_script},
        )

    def post_login(self, response):
        search_link = response.urljoin(
            response.xpath("(//div[@id='sidebar']/ul/li)[7]/a/@href").get()
        )
        logging.info('about to fire up second splash request')
        # Dump the post-login page for inspection; "with" closes the file.
        with open('temp.html', 'w') as f:
            f.write(response.text)
        # The attribute defined above is fruit_selection_script,
        # not game_selection_script.
        yield SplashRequest(
            url=search_link,
            callback=self.search,
            endpoint='execute',
            args={'wait': 3, 'lua_source': self.fruit_selection_script},
        )

    def search(self, response):
        logging.info('hey from search!')
        with open('post_search_response.html', 'w') as f:
            f.write(response.text)

    def post_search(self, response):
        logging.info('hey from post_search!')
        with open('post_search_response.html', 'w') as f:
            f.write(response.text)

    def parse(self, response):
        pass
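The scrapy-splash documentation says:
SplashRequest sets session_id automatically for the /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, the /execute endpoint and a compatible Lua rendering script.
If you want to start from the same set of cookies, but then 'fork' sessions, set request.meta['splash']['new_session_id'] in addition to session_id. Request cookies will be fetched from the cookiejar session_id, but response cookies will be merged back to the new_session_id cookiejar.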
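To make the quoted passage concrete, here is a minimal sketch of the request.meta structure it describes. The helper name build_splash_meta and the session names "login" and "search" are hypothetical. Only the keys under meta['splash'] (endpoint, args, session_id, new_session_id) come from the scrapy-splash documentation. The sketch builds a plain dict, so it runs without Scrapy installed.

```python
from typing import Any, Dict, Optional


def build_splash_meta(
    lua_source: str,
    session_id: str = "default",
    new_session_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a request.meta dict for an /execute request (hypothetical helper)."""
    splash: Dict[str, Any] = {
        "endpoint": "execute",
        "args": {"lua_source": lua_source},
        # Cookies sent with the request are read from this cookiejar.
        "session_id": session_id,
    }
    if new_session_id is not None:
        # Cookies returned by the script are merged into this cookiejar instead.
        splash["new_session_id"] = new_session_id
    return {"splash": splash}


# Placeholder Lua source, only here to keep the example short.
LUA = "function main(splash) end"

# One shared session: cookies are read from and written to "login".
shared = build_splash_meta(LUA, session_id="login")

# Forked session: start from the "login" cookies, store new ones under "search".
forked = build_splash_meta(LUA, session_id="login", new_session_id="search")
```

Under these assumptions the resulting dict would be passed as meta= to a request that goes through the scrapy-splash middleware. With SplashRequest and the execute endpoint, the default session_id is already filled in, so this only matters when forking or naming sessions explicitly.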
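As you can see, I always use the execute endpoint, so cookie handling should be enabled by default, right? It does not work, though, and I do not know why. I wonder whether it is because I am setting custom headers for the user agent and language.
Now, when the spider runs the second script (fruit_selection_script), I get a 403 Forbidden error.
What am I missing?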
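One way to narrow this down is to check which cookies the login script actually returned before the second request is sent. The following is a minimal debugging sketch, not a fix. It assumes the cookies arrive as a list of HAR-style dicts (with name, value, domain and path keys), which is the shape splash:get_cookies() produces. The helper name cookie_names_by_domain and the sample data are made up for illustration.

```python
from typing import Dict, Iterable, List, Mapping


def cookie_names_by_domain(
    cookies: Iterable[Mapping[str, object]],
) -> Dict[str, List[str]]:
    """Group cookie names by domain (hypothetical debugging helper)."""
    grouped: Dict[str, List[str]] = {}
    for cookie in cookies:
        domain = str(cookie.get("domain", ""))
        grouped.setdefault(domain, []).append(str(cookie.get("name", "")))
    return grouped


# Made-up sample in the HAR cookie shape; not real data from any site.
sample = [
    {"name": "sessionid", "value": "abc", "domain": "www.somesite.com", "path": "/"},
    {"name": "csrftoken", "value": "xyz", "domain": "www.somesite.com", "path": "/"},
]
summary = cookie_names_by_domain(sample)
print(summary)
```

In post_login, the same helper could be applied to the cookies entry of the decoded script result (assumed here to be available as response.data['cookies']). If no session cookie shows up for the site's domain, the login itself did not succeed, and the 403 is not a cookie-forwarding problem.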
Solution