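web-scraping - Logging in with FormRequest on a login page has no effect (302 redirect)
Question
I am trying to scrape the Indonesian dictionary from KBBI for NLP research. The pages are protected and require logging in first. This is the Scrapy (Python) snippet I am using: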
import scrapy
import re
import pandas as pd
from scrapy.http import FormRequest
from scrapy import Request

class scrape_kamus_kbbi(scrapy.Spider):
    name = "kamus_kbbi"
    list_url = []
    ALP = [chr(x) for x in range(65, 91)]
    sub_directory = "KBBI_FULL_HURUF"
    page_set = [142, 232, 47, 76, 38, 23, 75, 50, 43, 44,
                239, 85, 343, 32, 26, 274, 1, 69, 195, 178,
                30, 11, 17, 2, 5, 7]
    login_url = "https://kbbi.kemdikbud.go.id/Account/Login?ReturnUrl"
    username = "myusername"
    password = "mypassword"
    full_directory = "C:/Users/User/Desktop/Data Science Journey/Data Science with Python/Crawling Script/Indonesian Words/" + sub_directory + ".csv"
    for h in range(26):
        for g in range(1,(page_set[h]+1)):
            text_url = "https://kbbi.kemdikbud.go.id/Cari/Alphabet?masukan=" + str(ALP[h]) + "&masukanLengkap=" + str(ALP[h]) + "&page" + str(g)
            list_url.append(text_url)
    start_urls = [login_url]

    def __init__(self):
        self.words=[]

    def parse(self, response):
        self.log("Login page... Posting username & password")
        formdata = {'Username': self.username, 'Password': self.password}
        return FormRequest.from_response(response, formdata=formdata,
                                         callback=self.after_login)

    def after_login(self, response):
        for i in range(len(self.list_url)):
            yield Request(self.list_url[i], self.parse_page)

    def parse_page(self, response):
        self.log("Logged in... Grab All KBBI Words...")
        kata = response.xpath('.//div[@class="col-md-3"]/a/text()').extract()
        for x in range(len(kata)):
            self.words.append(kata[x])
        kumpulan_kata = pd.DataFrame(self.words, columns=["Kata"])
        kumpulan_kata.to_csv(self.full_directory)

from scrapy import cmdline
cmdline.execute("scrapy runspider scapre_kbbi_kemdikbug.py".split())
But I still get redirected with a 302 status code:
Redirecting (302) to <GET https://kbbi.kemdikbud.go.id/Account/Login?ReturnUrl=%2FCari%2FAlphabet%3Fmasukan%3DG%26masukanLengkap%3DG%26page29> from <GET https://kbbi.kemdikbud.go.id/Cari/Alphabet?masukan=G&masukanLengkap=G&page29>
I don't know what is going wrong here. Can anyone point me in the right direction?
Answer
The form data you are passing to the request is wrong.
It should be:
formdata = {
    'Posel': self.username,
    'KataSandi': self.password
}
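Instead, you are sending the data below. FormRequest.from_response pre-fills every field it finds in the login form and then merges your formdata into it, so the form's real fields, Posel and KataSandi, are posted empty, while Username and Password are added as extra fields that the login form does not define: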
{
    "__RequestVerificationToken": "some_token",
    "Posel": "",
    "KataSandi": "",
    "IngatSaya": false,
    "Username": "{self.username}",
    "Password": "{self.password}"
}