Python web scraping returns different HTML from the browser for a page behind a login

Problem description

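This is my first attempt at web scraping. I am trying to fetch and compile a list of completed katas from my Codewars profile. Anyone can view the completed problems without logging in, but the solutions are shown only when you are logged in to that specific account.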

Here is the inspector preview of the page when logged in, showing the relevant div I am trying to scrape: [screenshot]

The URL of the page is https://www.codewars.com/users/User_Name/completed_solutions, with User_Name replaced by the actual username. The login page is https://www.codewars.com/users/sign_in

I have tried two different ways of getting the divs with the class "list-item solutions". Both attempts are shown below:

#attempt 1
import requests
from bs4 import BeautifulSoup

login_url = "https://www.codewars.com/users/sign_in"
end_url = "https://www.codewars.com/users/Ash-Ozen/completed_solutions"

with requests.Session() as sesh:
    result = sesh.get(login_url)

    soup = BeautifulSoup(result.content, "html.parser")

    token = soup.find("input", {"name": "authenticity_token"})["value"]

    payload = {
        "user[email]": "ph@gmail.com",
        "user[password]": "phpass>",
        "authenticity_token": str(token),
    }

    result = sesh.post(login_url, data=payload) #this logs me in?
    page = sesh.get(end_url) #This navigates me to the target page?

    soup = BeautifulSoup(page.content, "html.parser")
    print(soup.prettify()) # some debugging
    # Examining the print statement shows that the "list-item solutions" is not
    # there. Checking page.url shows the correct url(https://www.codewars.com/users/Ash-Ozen/completed_solutions).

    solutions = soup.find_all("div", class_="list-item solutions")
    # solutions yields an empty list.

#attempt 2
from robobrowser import RoboBrowser
from bs4 import BeautifulSoup

browser = RoboBrowser(history=True)
browser.open("https://www.codewars.com/users/sign_in")
form = browser.get_form()
form["user[email]"].value = "phmail@gmail.com"
form["user[password]"].value = "phpass"
browser.submit_form(form)  # I think robobrowser handles the CSRF token for me?
browser.open("https://www.codewars.com/users/Ash-Ozen/completed_solutions")
soup = browser.parsed  # `parsed` is a property that already holds the BeautifulSoup tree
solutions = soup.find_all("div", class_="list-item solutions")  
print(solutions)  # returns empty list 

I'm not sure how to debug this, or what to look at from here, to get it working.
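One thing worth ruling out before looking at the login itself is the selector. In BeautifulSoup, a `class_` value that contains a space is compared against the whole `class` attribute string, so `class_="list-item solutions"` misses a div that carries an extra class or lists the classes in another order. The sketch below shows the difference on made-up markup; the class lists in `html` are assumptions for illustration, not the real Codewars markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real Codewars class attributes may differ.
html = """
<div class="list-item solutions">exact</div>
<div class="list-item solutions is-expanded">extra class</div>
<div class="solutions list-item">reordered</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A class_ string with a space must equal the full attribute value.
exact = soup.find_all("div", class_="list-item solutions")

# A CSS selector matches any div that has both classes, in any order.
both = soup.select("div.list-item.solutions")

print(len(exact), len(both))
```

If the CSS selector also returns nothing on the fetched page, the div is probably absent from the HTML that `requests` received, which points back at the login or at content rendered by JavaScript.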

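Edit: My initial thought about what is going wrong is that after either POST I am redirected to the dashboard (the behavior after a successful login), but when I then request the final URL I seem to end up with the non-logged-in version of the page.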
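The redirect hypothesis can be checked directly rather than guessed at. The following is a minimal diagnostic sketch, not a fix: `summarize_login` and `ended_on_login_page` are hypothetical helper names, and the demo uses stand-in objects with an assumed dashboard URL and an assumed cookie name so that it runs offline. With a real session, the same function would take the `requests` response and `sesh.cookies`.

```python
from types import SimpleNamespace


def summarize_login(resp, cookies):
    """Collect the facts that show whether a login POST was accepted.

    resp is a requests.Response (or any object with status_code, url
    and history); cookies is the session's cookie jar.
    """
    return {
        "status": resp.status_code,
        "final_url": resp.url,
        # resp.history holds the intermediate redirect responses, oldest first.
        "redirects": [(r.status_code, r.url) for r in resp.history],
        # Iterating a cookie jar yields cookie objects with a .name attribute.
        "cookies": sorted(c.name for c in cookies),
    }


def ended_on_login_page(summary, login_url):
    """Return True if the final URL is still the sign-in page."""
    return summary["final_url"].rstrip("/") == login_url.rstrip("/")


login_url = "https://www.codewars.com/users/sign_in"

# Offline stand-ins; the dashboard URL and cookie name are assumptions.
redirect = SimpleNamespace(status_code=302, url=login_url)
final = SimpleNamespace(
    status_code=200,
    url="https://www.codewars.com/dashboard",
    history=[redirect],
)
jar = [SimpleNamespace(name="_session_id")]

summary = summarize_login(final, jar)
print(summary)
print("still on login page:", ended_on_login_page(summary, login_url))

# With the session from attempt 1 this would be:
#   result = sesh.post(login_url, data=payload)
#   print(summarize_login(result, sesh.cookies))
```

If the final URL is still the sign-in page, or no session cookie appears after the POST, the login was most likely rejected and the later GET is being served to an anonymous visitor.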

Tags: python, web-scraping, beautifulsoup, python-requests, robobrowser
