python-3.x - 使用 python 从 .aspx 网站抓取网页
问题描述
我正在尝试从该站点抓取一些数据: https ://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx
我可以使用我的方法获得前 11 页,但由于某种原因,它在第 11 页之后退出。我已经阅读了与 .aspx 相关的其他帖子,但没有看到任何适用于我的情况的帖子。
我是新手,所以我的代码有点冗长,但它完成了工作——有点。我已经调整了标题和其他一些东西,但无法超过第 11 页。对我来说毫无意义。
我相当确定问题出在 viewstate 和 viewgenerator 标头参数上。我不确定如何在循环中为您要访问的页面获取这些信息。我几乎对所有页面使用相同的值。出于某种原因,这种方法一直有效到第 11 页(包括第 11 页),然后它就中断了。这很奇怪,因为看起来每个页面都有不同的视图状态值。
提前致谢。
import pandas as pd
import re
import pandas as pd
import numpy as np
import urllib
from requests import Session
from bs4 import BeautifulSoup
import time
import requests
# List of pages to loop over
page_list = ['Page$1','Page$2','Page$3','Page$4','Page$5','Page$6','Page$7','Page$8','Page$9','Page$10',
'Page$11','Page$12','Page$13','Page$14','Page$15','Page$16','Page$17','Page$18','Page$19','Page$20']
wa_url = 'https://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx'
# Getting header elements from url
session = requests.Session()
session.headers.update({
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
})
val_get = session.get(wa_url)
soup = BeautifulSoup(val_get.content, "html.parser")
tags = soup.find_all('input')
# Header elements I need for the POST request
view_state = tags[3]['value']
view_generator = tags[4]['value']
evnt_validation = tags[6]['value']
no_emps = []
date = []
#Looping through pages of WARN database
for page in page_list:
data = {
# Form data header stuff
"__EVENTTARGET": "ucPSW$gvMain",
"__EVENTARGUMENT": page,
"__LASTFOCUS": "",
"__VIEWSTATE": view_state,
"__VIEWSTATEGENERATOR": view_generator,
"__VIEWSTATEENCRYPTED": "",
"__EVENTVALIDATION": evnt_validation,
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding":"gzip, deflate, br",
"Accept-Language":"en-US,en;q=0.9",
"Cache-Control":"max-age=0",
"Connection":"keep-alive",
"Content-Type":"application/x-www-form-urlencoded",
"Cookie":"_ga=GA1.2.1011893740.1592948578; _gid=GA1.2.1433455844.1592948578",
"Host":"fortress.wa.gov",
"Origin":"https://fortress.wa.gov",
"Referer":"https://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx",
"Sec-Fetch-Dest":"document",
"Sec-Fetch-Mode":"navigate",
"Sec-Fetch-Site":"same-origin",
"Sec-Fetch-User":"?1",
"Upgrade-Insecure-Requests":"1"
}
# Getting data from each page
session = requests.Session()
session.headers.update({
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
})
get_warn_data = session.post(wa_url, data=data)
soup = BeautifulSoup(get_warn_data.content, "html.parser")
# Getting all rows of data and desired table data after some cleaning up
work = soup.find_all('tr')
work = [a.get_text('@') for a in work]
work = [re.sub(r'\n', '', a) for a in work]
work = [re.sub(r'^@|@$', '', a) for a in work]
work = [a.split('@') for a in work]
work = [a for a in work if len(a) == 7]
no_emps_u = [a[3] for a in work]
date_use = [a[6] for a in work]
no_emps.append(no_emps_u)
date.append(date_use)
# Dynamically Updating header values with stuff in current html
# Only applicable for page2 and on
if page != 'Page$1':
tags = soup.find_all('input')
view_state = tags[3]['value']
view_generator = tags[4]['value']
evnt_validation = tags[6]['value']
else:
pass
# Wrapping up results into lists
from pandas.core.common import flatten
WA_WARN_no_emps = list(flatten(no_emps))
WA_WARN_date = list(flatten(date))
更新<<<:按照 Andrej 在评论中的建议,我在 for 循环的末尾添加了 if 语句,用于更新标头值;此添加修复了代码并获取 page_list 中的所有页面。
解决方案
您可以使用此示例从站点获取所有页面(总共 67 个)(它<input>
动态获取所有值 - 因此它变得正确__VIEWSTATE
等):
import requests
from bs4 import BeautifulSoup
url = 'https://fortress.wa.gov/esd/file/warn/Public/SearchWARN.aspx'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
def get_data(soup, page_num):
data = {}
for i in soup.select('input'):
data[i['name']] = i.get('value', '')
del data['ucPSW$btnSearchCompany']
data['__EVENTTARGET'] = 'ucPSW$gvMain'
data['__EVENTARGUMENT'] = 'Page${}'.format(page_num)
data['__LASTFOCUS'] = ''
return data
page = 1
while True:
print('Page {}...'.format(page))
total = 1
for total, tr in enumerate(soup.select('#ucPSW_gvMain > tr:not(:has(table)):has(td)'), 1):
tds = [td.get_text(strip=True) for td in tr.select('td')]
print('{:<3}{:<50}{:<25}{:<15}{:<15}{:<15}{:<15}{:<15}'.format(total, *tds))
if total % 15:
break
page += 1
soup = BeautifulSoup( requests.post(url, get_data(soup, page)).content, 'html.parser' )
印刷:
Page 1...
1 Safran Cabin Materials, LLC Marysville and Newport 6/23/2020 85 Layoff Permanent 6/24/2020
2 Swissport Fueling SeaTac 5/8/2020 69 Layoff Permanent 6/19/2020
3 Swissport USA, Inc SeaTac 5/22/2020 62 Layoff Permanent 6/19/2020
4 Swissport USA, Inc SeaTac 3/20/2020 167 Layoff Temporary 6/19/2020
5 Tool Gauge and Machine Works Tacoma 6/17/2020 59 Layoff Permanent 6/18/2020
6 Hyatt Corporation Motif Seattle Seattle 3/14/2020 91 Layoff Temporary 6/18/2020
7 Jacobsen Daniel's Enterprise, Inc Tacoma 6/12/2020 1 Layoff Permanent 6/18/2020
8 Benchmark Stevenson, LLC d/b/a Skamania Lodge Stevenson 3/18/2020 185 Layoff Temporary 6/17/2020
9 Seattle Art Museum Seattle 7/5/2020 76 Layoff Temporary 6/16/2020
10 Chihuly Garden & Glass Seattle 3/21/2020 97 Layoff Temporary 6/16/2020
11 Seattle Center Seattle 3/21/2020 182 Layoff Temporary 6/16/2020
12 Sekisui Aerospace Renton and Sumner 6/12/2020 111 Layoff Permanent 6/15/2020
13 Pioneer Human Services Seattle 8/14/2020 59 Layoff Permanent 6/15/2020
14 Crista Senior Living Shoreline 8/16/2020 156 Closure Permanent 6/15/2020
15 Hyatt Corporation / Hyatt Regency Bellevue Bellevue 3/15/2020 223 Layoff Temporary 6/15/2020
Page 2...
1 Toray Composite Materials America, Inc Tacoma 8/8/2020 146 Layoff Permanent 6/12/2020
2 Embassy Suites Seattle Bellevue Seattle 6/1/2020 57 Layoff Temporary 6/12/2020
3 Triumph Aerospace Structures Spokane 6/15/2020 12 Layoff Permanent 6/11/2020
4 Hyatt Corporation / Hyatt Regency Lake Washington Renton 6/30/2020 129 Layoff Temporary 6/9/2020
5 Lamb Weston, Inc Connell, WA 6/15/2020 360 Layoff Temporary 6/8/2020
6 Lamb Weston, Inc Warden 6/15/2020 300 Layoff Temporary 6/8/2020
... and so on.
推荐阅读
- android - 在 JSON API 中获取新的插入新闻
- c - 在C中复制3d数组的函数?
- kendo-ui - 在 Kendo Grid 或 TreeList 中使用 CRUD 的三到四级层次结构
- javascript - 带有隐藏表行的 FooTable 未处理错误
- datetime - 多个用户的日期戳(目前仅适用于电子表格的所有者)
- sql - 基于周的事件枢轴
- reactjs - 反应原生 redux 不更新组件
- mysql - Web 应用程序的数据库事务保护 - 推荐还是性能消耗?
- xml - XML 映射 - 删除 @ 登录键名
- c# - 打开列表时为每个 Combobox 项显示工具提示 C# WPF XAML MVVM