asp.net - Scraping an ASPX page with authentication using Python 3
Problem description
I am trying to use Python's requests library to scrape an ASPX site and get information from a table inside it.
The same problem is described in "How to web scrape an ASPX page that requires authentication", which has no replies at the time of writing.
My current approach is:
- create a requests session;
- set a request header and fetch the login page with a GET request;
- parse the response with BeautifulSoup;
- copy the hidden form fields and the credentials into a login_data dictionary.
import urllib.parse

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36"}

with requests.session() as session:
    session.headers.update(headers)
    response = session.get(login_url)
    soup = BeautifulSoup(response.content)

    VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
    VIEWSTATEGENERATOR = soup.find(id="__VIEWSTATEGENERATOR")['value']
    EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
    EVENTTARGET = soup.find(id="__EVENTTARGET")['value']
    EVENTARGUEMENT = soup.find(id="__EVENTARGUMENT")['value']
    PREVIOUSPAGE = soup.find(id="__PREVIOUSPAGE")['value']
    CMSESSIONID = soup.find(id="CMSessionId")['value']
    soup.find(id="MasterHeaderPlaceHolder_ctl00_userNameTextbox")['value']

    login_data = {
        "__VIEWSTATE": VIEWSTATE,
        "txtUserName": account_name,
        "txtPassword": account_pass,
        "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR,
        "__EVENTVALIDATION": EVENTVALIDATION,
        "__EVENTTARGET": EVENTTARGET,
        "__EVENTARGUEMENT": EVENTARGUEMENT,
        "__PREVIOUSPAGE": PREVIOUSPAGE,
        "CMSessionId": CMSESSIONID,
        "MasterHeaderPlaceHolder_ctl00_userNameTextbox": account_name,
        "MasterHeaderPlaceHolder_ctl00_passwordTextbox": account_pass,
        "MasterHeaderPlaceHolder_ctl00_tempPasswordTextbox": account_pass,
    }
    login_data_encoded = urllib.parse.urlencode(login_data)  #*
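One thing worth checking in the code above: a browser submits form fields by their name attribute, not their id, and ASP.NET WebForms controls usually have names with $ separators that differ from their underscore-separated ids. The sketch below is one possible way to build the payload from name attributes, collecting every hidden input instead of listing them by hand. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies. The sample HTML, the helper names, and the ctl00$Main$... field names are hypothetical and would need to be replaced with what the real login page contains.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from every hidden <input> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        input_type = (attributes.get("type") or "text").lower()
        # Inputs without a name are never submitted, so they are skipped.
        if name and input_type == "hidden":
            self.fields[name] = attributes.get("value") or ""


def build_login_payload(html, credentials):
    """Merge the page's hidden fields with the credential fields."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(credentials)
    return payload


# Hypothetical login form; a real page would come from session.get(login_url).text
sample_html = """
<form method="post" action="./Login.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$Main$userNameTextbox" id="Main_userNameTextbox" />
  <input type="password" name="ctl00$Main$passwordTextbox" id="Main_passwordTextbox" />
</form>
"""

payload = build_login_payload(
    sample_html,
    {
        # Keys are the inputs' name attributes (assumed), not their ids.
        "ctl00$Main$userNameTextbox": "alice",
        "ctl00$Main$passwordTextbox": "secret",
    },
)
```

On a real site the submit button's name and value may also need to be part of the payload; the browser's network tab shows exactly which fields a successful login posts.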
Next, the login_data dictionary is sent as the data of a POST request to login_url.
The same session is then used to send a GET request to report_url.
response_1 = session.post(login_url, data=login_data)
response_2 = session.get(report_url)
The problem seems to be that the login does not take effect, as the GET request is redirected to the login page.
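To confirm where the flow breaks, it can help to check the final URL of the report response instead of assuming the POST succeeded. This is a minimal sketch, assuming the site sends unauthenticated requests back to the same path as login_url. The helper names, the example.com URLs, and the stub session (which stands in for requests.Session so that the example runs offline) are all hypothetical.

```python
from types import SimpleNamespace
from urllib.parse import urlsplit


def was_redirected_to_login(response, login_url):
    """Return True if the response's final URL has the login page's path."""
    final_path = urlsplit(response.url).path.lower()
    login_path = urlsplit(login_url).path.lower()
    return final_path == login_path


def fetch_report(session, login_url, report_url, payload):
    """Post the login form, then fetch the report and verify we got it."""
    session.post(login_url, data=payload)
    report_response = session.get(report_url)
    if was_redirected_to_login(report_response, login_url):
        raise RuntimeError("Login did not take effect: redirected to login page")
    return report_response


class StubSession:
    """Offline stand-in for requests.Session, used only to exercise the sketch."""

    def __init__(self, final_report_url):
        self._final_report_url = final_report_url

    def post(self, url, data=None):
        return SimpleNamespace(url=url, status_code=200)

    def get(self, url):
        # Pretend the server answered from this final URL after any redirects.
        return SimpleNamespace(url=self._final_report_url, status_code=200)


LOGIN_URL = "https://example.com/Login.aspx"    # hypothetical
REPORT_URL = "https://example.com/Report.aspx"  # hypothetical

report = fetch_report(StubSession(REPORT_URL), LOGIN_URL, REPORT_URL, {})
```

With a real requests.Session passed in, response.history on the report response would additionally list the intermediate redirect responses, which shows whether the server bounced the request.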
Can anyone kindly shed some light on why this is the case? I believe this is the correct flow, but please let me know if I am doing anything wrong or if anything can be improved.
Unfortunately, I am limited to requests or other popular Python 3 libraries, as this is a requirement (referencing "browser".exe files, as suggested in some replies on the subject, is not an option).
Solution