FormRequest login on the login page has no effect (302 redirect)

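Problem description

I am trying to scrape an Indonesian dictionary from KBBI for NLP research. The word pages are protected and require logging in first. This is the snippet I am using with Scrapy in Python: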

import scrapy
import re
import pandas as pd
from scrapy.http import FormRequest
from scrapy import Request

class scrape_kamus_kbbi(scrapy.Spider):
    name = "kamus_kbbi"
    list_url = []
    ALP = [chr(x) for x in range(65, 91)]
    sub_directory = "KBBI_FULL_HURUF"
    page_set = [142, 232, 47, 76, 38, 23, 75, 50, 43, 44, 
                239, 85, 343, 32, 26, 274, 1, 69, 195, 178,
                30, 11, 17, 2, 5, 7]
    login_url = "https://kbbi.kemdikbud.go.id/Account/Login?ReturnUrl"
    username = "myusername"
    password = "mypassword"
    full_directory = "C:/Users/User/Desktop/Data Science Journey/Data Science with Python/Crawling Script/Indonesian Words/" + sub_directory + ".csv"
    for h in range(26):
        for g in range(1,(page_set[h]+1)): 
            text_url = "https://kbbi.kemdikbud.go.id/Cari/Alphabet?masukan=" + str(ALP[h]) + "&masukanLengkap=" + str(ALP[h]) + "&page" + str(g)
            list_url.append(text_url)
    start_urls = [login_url]
    
    def __init__(self):
        self.words=[]
    
    def parse(self, response):
        self.log("Login page... Posting username & password")
        formdata = {'Username': self.username, 'Password': self.password}
        return FormRequest.from_response(response, formdata=formdata, 
                                         callback=self.after_login)

    def after_login(self, response):
        for i in range(len(self.list_url)):
            yield Request(self.list_url[i], self.parse_page)
    
    def parse_page(self, response):
        self.log("Logged in... Grab All KBBI Words...")
        kata = response.xpath('.//div[@class="col-md-3"]/a/text()').extract()
        for x in range(len(kata)):
            self.words.append(kata[x])
        kumpulan_kata = pd.DataFrame(self.words, columns=["Kata"])
        kumpulan_kata.to_csv(self.full_directory)

from scrapy import cmdline
cmdline.execute("scrapy runspider scapre_kbbi_kemdikbug.py".split())

But I still get a 302 redirect:

Redirecting (302) to <GET https://kbbi.kemdikbud.go.id/Account/Login?ReturnUrl=%2FCari%2FAlphabet%3Fmasukan%3DG%26masukanLengkap%3DG%26page29> from <GET https://kbbi.kemdikbud.go.id/Cari/Alphabet?masukan=G&masukanLengkap=G&page29>

I don't know what is going wrong at this point. Can anyone point me in the right direction?

Tags: web-scraping, scrapy

Solution

The form data you are passing to the request uses the wrong field names.

It should look like this:

formdata = {
    'Posel': self.username,
    'KataSandi': self.password
}

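Those are the parameter names the login form expects. FormRequest.from_response pre-populates the request with the fields found in the form and then merges your formdata on top, so keys that match no field are simply appended while the real credential fields stay empty. As a result, you are sending this instead: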

{
    "__RequestVerificationToken": "some_token",
    "Posel": "",
    "KataSandi": "",
    "IngatSaya": false,
    "Username": "{self.username}",
    "Password": "{self.password}"
}
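
One way to catch this kind of mismatch before running the spider is to compare the formdata keys against the input names in the login page's HTML. The following is only a rough sketch using the standard library: it models the merge as "form fields first, formdata on top" and is not Scrapy's actual implementation. The LOGIN_HTML markup is a hypothetical stand-in built from the field names quoted above, and FormFieldCollector and check_formdata are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pairs of all <input> elements in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are treated as empty strings.
            self.fields[name] = attributes.get("value") or ""


def check_formdata(html, formdata):
    """Return (merged payload, formdata keys that match no form field)."""
    collector = FormFieldCollector()
    collector.feed(html)
    merged = {**collector.fields, **formdata}
    unknown = [key for key in formdata if key not in collector.fields]
    return merged, unknown


# Hypothetical login form; only the field names come from the answer above.
LOGIN_HTML = """
<form action="/Account/Login" method="post">
  <input name="__RequestVerificationToken" type="hidden" value="some_token">
  <input name="Posel" type="text">
  <input name="KataSandi" type="password">
  <input name="IngatSaya" type="hidden" value="false">
</form>
"""

# Keys from the question: neither matches a real field name.
wrong_payload, wrong_keys = check_formdata(
    LOGIN_HTML, {"Username": "myusername", "Password": "mypassword"}
)

# Keys from the answer: both match, so the credential fields get filled.
right_payload, right_keys = check_formdata(
    LOGIN_HTML, {"Posel": "myusername", "KataSandi": "mypassword"}
)

print("unmatched keys (question):", wrong_keys)
print("unmatched keys (answer):", right_keys)
```

If the list of unmatched keys is not empty, the login request will most likely be sent with empty credentials, which fits the 302 redirect back to the login page shown in the question.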
