json - How do I scrape a site (with Python 3) if all of the site's content lives inside a single tag?
Problem description
Struggling to scrape a very simple website with election results. All of the content lives inside a single <pre>
tag, so, as usual, parsing it into JSON with Python 3 is quite difficult.
The previous year, when I had to scrape this site, I only needed results for two races, so I did something like this:
def scrape_kendall():
    COUNTY_NAME = "Kendall"
    # sets URLs
    KENDALL_RACE_URL = 'https://results.co.kendall.il.us/'
    # gets data
    html = urllib.request.urlopen(KENDALL_RACE_URL).read()
    soup = BeautifulSoup(html, 'html.parser')
    # creates empty list for results info
    kendall_county_results = []
    data = soup.find('pre').text
    precincts_total = 87
    rows = data.splitlines()
    for index, row in enumerate(rows):
        if row.startswith(" PRECINCTS"):
            precincts_reporting = int(row[-2:])
        if row == "COUNTY BOARD MEMBER-DIST.1":
            dist1_race_name = row
            dist1_race_obj = initialize_race_obj(
                dist1_race_name, precincts_reporting, precincts_total, COUNTY_NAME)
        if index >= 115 and index <= 119:  # hard-coded
            cand_index = int(str(index)[-1:]) - 2
            cand_info, full_name, party = get_candidate_info(row)
            first_name, middle_name, last_name = parse_name(full_name)
            votes = get_vote_count(cand_info)
            formatted_candidate_info = get_candidates_in_race_obj(
                first_name, middle_name, last_name,
                votes, party, cand_index)
            dist1_race_obj["reporting_units"][0]['candidates'].append(formatted_candidate_info)
    ...etc
This produces data that looks like this:
[
    {
        "name": "County Board Member-Dist.1",
        "description": "",
        "election_date": "2020-11-03",
        "market": "chinews",
        "uncontested": false,
        "amendment": false,
        "state_postal": "IL",
        "recount": false,
        "reporting_units": [
            {
                "name": "Kendall",
                "level": "county",
                "district_type": "",
                "state_postal": "IL",
                "geo_id": "",
                "electoral_vote_total": 0,
                "precincts_reporting": 0,
                "total_precincts": 87,
                "data_source_update_time": "2020-11-20T20:10:15+0000",
                "candidates": [
                    {
                        "first_name": "Scott",
                        "middle_name": "",
                        "last_name": "Gengler",
                        "vote_count": 14696,
                        "party": "REP",
                        "ballot_order": 3
                    },
                    {
                        "first_name": "Brian",
                        "middle_name": "E.",
                        "last_name": "Debolt",
                        "vote_count": 12867,
                        "party": "REP",
                        "ballot_order": 4
                    },
...etc.
initialize_race_obj, get_candidate_info, parse_name, and get_vote_count are all util functions, some of which also involve hard-coding. Because I only needed results for two races, I was at peace with hard-coding things and using if statements (as above). In the future, I may need results for 10 or 20 races, and in that case I'm not prepared to hard-code or use if statements like that. Any ideas on how to scrape this site more programmatically with Python 3?
Solution
I don't think there is one concrete answer that always works. In your case there are clear delimiters between the different sections, so I would build a manual parser that focuses on extracting those sections.
Some example code I came up with is shown below, but first the steps I took:

1. Fetch the dataset from the website and store it in a local file (saves some bandwidth).
2. Manually locate the point that splits the summary data (the header at the top) from the body with the vote counts.
3. Manually parse the header lines. This breaks if they ever change, but hey, you probably only have to do it once (fingers crossed).
4. Parse the body. I divide the body into sections, each contained between two delimiters. An example section is:

       AURORA MAYOR
       VOTE FOR 1
        (WITH 3 OF 3 PRECINCTS COUNTED)
       RICHARD C. IRVIN . . . . . . . .    237   62.20    207    30     0
       JUDD LOFCHIE . . . . . . . . .       59   15.49     56     3     0
       JOHN LAESCH. . . . . . . . . .       85   22.31     63    22     0

5. Then parse each such section manually; within that I leave room to parse each individual candidate.

Now I haven't completely finished the scraping, but that is part of the fun for you. This should give you a framework for handling an arbitrarily large number of sections and candidates.
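The body-splitting described above relies on blank lines separating the sections. A minimal, self-contained sketch of that itertools.groupby trick (with made-up lines standing in for the real file):

```python
import itertools

# Made-up lines standing in for the real file: a blank line separates sections.
lines = [
    "AURORA MAYOR",
    "VOTE FOR 1",
    "RICHARD C. IRVIN . . .    237",
    "",
    "AURORA ALDERMAN AT LARGE",
    "VOTE FOR 1",
]

# bool('') is False, so each blank line closes a group; `if key` drops the blanks.
sections = [list(group) for key, group in itertools.groupby(lines, key=bool) if key]
print(len(sections))   # → 2
print(sections[0][0])  # → AURORA MAYOR
```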
Code
import itertools
import urllib.request
from argparse import Namespace
from pprint import pprint

from bs4 import BeautifulSoup


def get_data(url, file='data.txt'):
    """ Retrieve the bare bone data from a weblink and store it in the provided file. """
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page.read(), 'html.parser')
        data = soup.find('pre').text
    with open(file, 'w') as file:
        file.write(data)


def clean_data(file='data.txt', header=15, ignore=False):
    """
    Clean the data, where the first n lines are for the header or ignored.

    :param file: (str) Name of the file to load.
    :param header: (int) Number of lines used for the header or skipped when ignore is True.
    :param ignore: (bool) If True, skips the lines indicated by header.
    :return:
    """
    with open(file, 'r') as file:
        data = file.readlines()
    header, body = data[:header], data[header:]
    data_header = generate_header(header)
    data_body = generate_body(body, columns=data_header.columns)
    # pprint(vars(data_header))
    pprint(vars(data_body))


def parse_numbers(line: str, columns, missing: list = None, fill_value='-') -> dict:
    # Split on double spaces, so values that contain a single space survive intact.
    values = list(filter(str.strip, line.split('  ')))
    if len(values) == len(columns):
        return dict(zip(columns, values))
    if all(int(value) == 0 for value in values):
        return dict(zip(columns, ['0'] * len(columns)))
    raise ValueError(f"Unknown handling of missing values."
                     f"\nColumns: {columns}\nLine: {line}\nValues: {values}")


def generate_header(header: list[str]):
    """ Manually parse the header (hopefully only once). """
    clean_header = list(filter(bool, ''.join(header).split('\n')))
    name, description, status = list(filter(str.strip, clean_header[0].split('  ')))
    date = clean_header[1].strip()
    country, state = list(map(str.strip, clean_header[2].split(',')))
    election_date = clean_header[3].strip()
    columns = list(filter(str.strip, clean_header[4].split('  ')))
    summary = {}
    for row in clean_header[5:11]:
        pass  # TODO: parse the summary block as well.
    return Namespace(
            name=name,
            description=description,
            status=status,
            date=date,
            country=country,
            state=state,
            election_date=election_date,
            columns=columns,
            summary=summary
    )


def generate_body(body: list[str], columns=None):
    clean_body = list(map(str.strip, ''.join(body).split('\n')))
    # Group consecutive non-empty lines into sections:
    # https://stackoverflow.com/a/52943710/10961342
    sections = [list(group) for key, group in itertools.groupby(clean_body, key=bool) if key]
    metadata = []
    for section in sections:
        function = section[0]
        vote = [row.startswith('VOTE FOR') for row in section].index(True)  # locate the `VOTE FOR` row
        info = ' '.join(map(str.strip, section[1:vote + 2]))
        candidates = []
        for candidate in section[vote + 2:]:
            name = candidate.split('.')[0].strip()
            numbers = candidate.rsplit('. .')[-1]
            data = parse_numbers(numbers, columns)
            candidates.append({"name": name, "data": data})
        metadata.append({"function": function, "info": info, "candidates": candidates})
    pprint(metadata, sort_dicts=False)
    return Namespace(body=metadata)


if __name__ == '__main__':
    # Retrieve the original data set once, then work from the local copy.
    # get_data('https://results.co.kendall.il.us/')
    clean_data()
Output
[{'function': 'AURORA MAYOR',
  'info': 'VOTE FOR 1 (WITH 3 OF 3 PRECINCTS COUNTED)',
  'candidates': [{'name': 'RICHARD C',
                  'data': {'TOTAL VOTES': '237',
                           ' %': ' 62.20',
                           'ELECTION DAY': ' 207',
                           ' EV, VBM': '30',
                           'PROV, POST': ' 0'}},
                 {'name': 'JUDD LOFCHIE',
                  'data': {'TOTAL VOTES': ' 59',
                           ' %': ' 15.49',
                           'ELECTION DAY': '56',
                           ' EV, VBM': ' 3',
                           'PROV, POST': ' 0'}},
                 {'name': 'JOHN LAESCH',
                  'data': {'TOTAL VOTES': ' 85',
                           ' %': ' 22.31',
                           'ELECTION DAY': '63',
                           ' EV, VBM': '22',
                           'PROV, POST': ' 0'}}]},
 {'function': 'AURORA ALDERMAN AT LARGE',
  'info': 'VOTE FOR 1 (WITH 3 OF 3 PRECINCTS COUNTED)',
  'candidates': [{'name': 'RON WOERMAN',
                  'data': {'TOTAL VOTES': '117',
                           ' %': ' 34.01',
                           'ELECTION DAY': ' 106',
                           ' EV, VBM': '11',
                           'PROV, POST': ' 0'}},
                 {'name': 'BROOKE SHANLEY',
                  'data': {'TOTAL VOTES': '168',
                           ' %': ' 48.84',
                           'ELECTION DAY': ' 136',
                           ' EV, VBM': '32',
                           'PROV, POST': ' 0'}},
                 {'name': 'RAYMOND HULL',
                  'data': {'TOTAL VOTES': ' 59',
                           ' %': ' 17.15',
                           'ELECTION DAY': '52',
                           ' EV, VBM': ' 7',
                           'PROV, POST': ' 0'}}]},
 {'function': 'AURORA ALDERMAN WARD 9',
  'info': 'VOTE FOR 1 (WITH 3 OF 3 PRECINCTS COUNTED)',
  'candidates': [{'name': 'EDWARD J',
                  'data': {'TOTAL VOTES': '339',
                           ' %': '100.00',
                           'ELECTION DAY': ' 285',
                           ' EV, VBM': '54',
                           'PROV, POST': ' 0'}}]},
 {'function': 'JOLIET COUNCILMAN AT LARGE',
  'info': 'VOTE FOR 3 (WITH 7 OF 7 PRECINCTS COUNTED)',
  'candidates': [{'name': 'GLENDA WRIGHT-McCULLUM',
                  'data': {'TOTAL VOTES': ' 96',
                           ' %': '7.81',
                           'ELECTION DAY': '91',
                           ' EV, VBM': ' 5',
                           'PROV, POST': ' 0'}},
                 {'name': 'NICOLE LURRY',
                  'data': {'TOTAL VOTES': ' 77',
                           ' %': '6.27',
                           'ELECTION DAY': '70',
                           ' EV, VBM': ' 7',
                           'PROV, POST': ' 0'}},
                 {'name': 'JEREMY BRZYCKI',
                  'data': {'TOTAL VOTES': ' 90',
                           ' %': '7.32',
                           'ELECTION DAY': '78',
                           ' EV, VBM': '12',
                           'PROV, POST': ' 0'}},
                 {'name': 'CESAR GUERRERO',
                  'data': {'TOTAL VOTES': '106',
                           ' %': '8.62',
                           'ELECTION DAY': '95',
                           ' EV, VBM': '11',
                           'PROV, POST': ' 0'}},
                 {'name': 'ISIAH WILLIAMS JR',
                  'data': {'TOTAL VOTES': ' 47',
                           ' %': '3.82',
                           'ELECTION DAY': '45',
                           ' EV, VBM': ' 2',
                           'PROV, POST': ' 0'}},
                 {'name': 'HUDSON HOLLISTER',
                  'data': {'TOTAL VOTES': ' 84',
                           ' %': '6.83',
                           'ELECTION DAY': '72',
                           ' EV, VBM': '12',
                           'PROV, POST': ' 0'}},
                 {'name': 'JAMES LANHAM',
                  'data': {'TOTAL VOTES': ' 32',
                           ' %': '2.60',
                           'ELECTION DAY': '29',
                           ' EV, VBM': ' 3',
                           'PROV, POST': ' 0'}},
                 {'name': 'ROGER POWELL',
                  'data': {'TOTAL VOTES': ' 56',
                           ' %': '4.56',
                           'ELECTION DAY': '55',
                           ' EV, VBM': ' 1',
                           'PROV, POST': ' 0'}},
                 {'name': 'WARREN C',
                  'data': {'TOTAL VOTES': ' 76',
                           ' %': '6.18',
                           'ELECTION DAY': '66',
                           ' EV, VBM': '10',
                           'PROV, POST': ' 0'}},
                 {'name': 'ROBERT WUNDERLICH',
                  'data': {'TOTAL VOTES': '166',
                           ' %': ' 13.51',
                           'ELECTION DAY': ' 149',
                           ' EV, VBM': '17',
                           'PROV, POST': ' 0'}},
                 {'name': 'JOE CLEMENT',
                  'data': {'TOTAL VOTES': '203',
                           ' %': ' 16.52',
                           'ELECTION DAY': ' 190',
                           ' EV, VBM': '13',
                           'PROV, POST': ' 0'}},
                 {'name': 'JAN QUILLMAN',
                  'data': {'TOTAL VOTES': '196',
                           ' %': ' 15.95',
                           'ELECTION DAY': ' 184',
                           ' EV, VBM': '12',
                           'PROV, POST': ' 0'}}]},
 {'function': 'PLANO MAYOR',
  'info': 'VOTE FOR 1 (WITH 11 OF 11 PRECINCTS COUNTED)',
  'candidates': [{'name': 'ROBERT "BOB" HAUSLER (IND)',
                  'data': {'TOTAL VOTES': '388',
                           ' %': ' 48.50',
                           'ELECTION DAY': ' 336',
                           ' EV, VBM': '52',
                           'PROV, POST': ' 0'}},
                 {'name': 'MIKE RENNELS (IND)',
                  'data': {'TOTAL VOTES': '412',
                           ' %': ' 51.50',
                           'ELECTION DAY': ' 352',
                           ' EV, VBM': '60',
                           'PROV, POST': ' 0'}}]},
 ...
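Not part of the answer above, but to bridge these parsed sections back to the JSON shape from the question, each candidate entry could be mapped with a small helper. `to_candidate` is hypothetical, and since this feed carries no party column, that field is left empty:

```python
def to_candidate(parsed: dict, ballot_order: int) -> dict:
    """Map one parsed candidate (the {'name': ..., 'data': ...} shape above)
    to the candidate shape in the question's JSON. `party` stays empty
    because the feed has no party column."""
    parts = parsed["name"].title().split()
    first = parts[0]
    middle = " ".join(parts[1:-1])                  # middle word(s), if any
    last = parts[-1] if len(parts) > 1 else ""      # single-word names get no last name
    return {
        "first_name": first,
        "middle_name": middle,
        "last_name": last,
        "vote_count": int(parsed["data"]["TOTAL VOTES"]),  # int() tolerates leading spaces
        "party": "",
        "ballot_order": ballot_order,
    }

print(to_candidate({"name": "JUDD LOFCHIE",
                    "data": {"TOTAL VOTES": " 59"}}, 2))
```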