How do I scrape a site when all of its content is in a single tag (using Python 3)?

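Problem description

I'm trying to scrape a very simple website with election results. All of the content is in a single <pre /> tag, so parsing it into JSON with Python 3 in the usual way has been very difficult.

The previous year, when I had to scrape this site, I only needed the results for two races, so I did something like this: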

def scrape_kendall():
    COUNTY_NAME = "Kendall"
    # sets URLs
    KENDALL_RACE_URL = 'https://results.co.kendall.il.us/'
    
    #gets data
    html = urllib.request.urlopen(KENDALL_RACE_URL).read()
    soup = BeautifulSoup(html, 'html.parser')

    # creates empty list for results info
    kendall_county_results = []

    data = soup.find('pre').text
    precincts_total = 87
    rows = data.splitlines()

    for index, row in enumerate(rows):
        if row.startswith(" PRECINCTS"):
            precincts_reporting = int(row[-2:])

        if row == "COUNTY BOARD MEMBER-DIST.1":
            dist1_race_name = row

            dist1_race_obj = initialize_race_obj(dist1_race_name,precincts_reporting,precincts_total,COUNTY_NAME)

        if index >= 115 and index <= 119: # hard-coded
            cand_index = int(str(index)[-1:]) - 2
            cand_info, full_name, party = get_candidate_info(row)
                
            first_name, middle_name, last_name = parse_name(full_name)

            votes = get_vote_count(cand_info)
            
            formatted_candidate_info = get_candidates_in_race_obj( 
                first_name, middle_name, last_name, 
                votes, party, cand_index)

            dist1_race_obj["reporting_units"][0]['candidates'].append(formatted_candidate_info)

...etc

This produces data that looks like this:

[
    {
        "name": "County Board Member-Dist.1",
        "description": "",
        "election_date": "2020-11-03",
        "market": "chinews",
        "uncontested": false,
        "amendment": false,
        "state_postal": "IL",
        "recount": false,
        "reporting_units": [
            {
                "name": "Kendall",
                "level": "county",
                "district_type": "",
                "state_postal": "IL",
                "geo_id": "",
                "electoral_vote_total": 0,
                "precincts_reporting": 0,
                "total_precincts": 87,
                "data_source_update_time": "2020-11-20T20:10:15+0000",
                "candidates": [
                    {
                        "first_name": "Scott",
                        "middle_name": "",
                        "last_name": "Gengler",
                        "vote_count": 14696,
                        "party": "REP",
                        "ballot_order": 3
                    },
                    {
                        "first_name": "Brian",
                        "middle_name": "E.",
                        "last_name": "Debolt",
                        "vote_count": 12867,
                        "party": "REP",
                        "ballot_order": 4
                    },
...etc.

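initialize_race_obj, get_candidate_info, parse_name, and get_vote_count are all util functions, and some of them also involve hard-coding. Because I only needed the results for two races, I was comfortable hard-coding a few things and using if statements (as above). In the future I may need the results for 10 or 20 races, and in that case I'm not prepared to hard-code values or use if statements like these. Any ideas on how to scrape this site with Python 3 in a more programmatic way?

Tags: json, python-3.x, web-scraping

Solution


I don't think there is one concrete answer that always works. In your case, there is a clear blank line between the different sections, so I would write a manual parser that focuses on extracting those sections.

Some example code I came up with is shown below, but first I'll go through the steps I took.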

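  1. Retrieve the data set from the website and store it in a local file (this saves some energy).

  2. Manually find the point that splits the summary data (the header at the top) from the body with the vote counts.

  3. Manually parse the header lines. This will break if the layout changes, but hey, you probably only need to do it once (fingers crossed).

  4. Parse the body. I divide the body into sections, each enclosed between two blank lines. An example section is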

    AURORA MAYOR
    VOTE FOR  1
        (WITH 3 OF 3 PRECINCTS COUNTED)
     RICHARD C. IRVIN .  .  .  .  .  .  .  .        237   62.20           207            30             0
     JUDD LOFCHIE  .  .  .  .  .  .  .  .  .         59   15.49            56             3             0
     JOHN LAESCH.  .  .  .  .  .  .  .  .  .         85   22.31            63            22             0
    
  5. Then parse each section manually; here I left room to parse each candidate.

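I haven't fully finished the scraping, but that part of the fun is left for you. This should still give you a framework for handling an arbitrarily large number of sections and candidates.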
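
One piece left open is the pair of info lines in each section, which carry the precinct counts the question needs. The following is a minimal sketch of how they could be parsed, assuming every section uses the same wording as the sample above ("VOTE FOR  1" and "(WITH 3 OF 3 PRECINCTS COUNTED)"). The helper name parse_info and the returned keys are hypothetical and not part of the answer's code.

```python
import re

# Assumed layout, taken from the sample section:
# "VOTE FOR  1" and "(WITH 3 OF 3 PRECINCTS COUNTED)".
VOTE_FOR = re.compile(r"VOTE FOR\s+(\d+)")
PRECINCTS = re.compile(r"WITH\s+(\d+)\s+OF\s+(\d+)\s+PRECINCTS\s+COUNTED")


def parse_info(info):
    """Extract seat and precinct counts from a section's info text (hypothetical helper)."""
    vote_for = VOTE_FOR.search(info)
    precincts = PRECINCTS.search(info)
    if vote_for is None or precincts is None:
        # Fail loudly rather than guess when the wording differs.
        raise ValueError(f"Unrecognised info text: {info!r}")
    return {
        "vote_for": int(vote_for.group(1)),
        "precincts_reporting": int(precincts.group(1)),
        "total_precincts": int(precincts.group(2)),
    }


print(parse_info("VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)"))
# {'vote_for': 1, 'precincts_reporting': 3, 'total_precincts': 3}
```

Reading the counts from the text this way would replace both the hard-coded total of 87 and the int(row[-2:]) slice in the question.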

Code

import itertools
import urllib.request
from argparse import Namespace
from pprint import pprint

from bs4 import BeautifulSoup


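def get_data(url, file='data.txt'):
    """ Retrieve the bare bone data from a weblink and store it in the provided file.  """
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page.read(), 'html.parser')
    # splitlines() copes with both '\n' and '\r\n' line endings.
    lines = soup.find('pre').text.splitlines()

    with open(file, 'w') as handle:
        # writelines() does not add line endings, so append them explicitly.
        handle.writelines(line + '\n' for line in lines)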


def clean_data(file='data.txt', header=15, ignore=False):
    """
    Load the stored data and split it into a header and a body.

    :param file: (str) Name of the file to load.
    :param header: (int) Number of leading lines that make up the header.
    :param ignore: (bool) Intended to skip the header lines; not implemented yet.
    :return: None, the parsed body is pretty-printed.
    """
    with open(file, 'r') as handle:
        data = handle.readlines()

    header_lines, body_lines = data[:header], data[header:]

    data_header = generate_header(header_lines)
    data_body = generate_body(body_lines, columns=data_header.columns)

    # pprint(vars(data_header))
    pprint(vars(data_body))


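def parse_numbers(line: str, columns, missing: list = None, fill_value='-') -> dict:
    """ Map the numeric fields of a candidate line onto the column names.  """
    # `missing` and `fill_value` are placeholders and are not used yet.
    values = list(filter(str.strip, line.split('  ')))
    if len(values) == len(columns):
        return dict(zip(columns, values))
    # Fewer fields than columns is accepted only when every field is zero.
    if all(float(value) == 0 for value in values):
        return dict(zip(columns, ['0'] * len(columns)))
    raise ValueError(f"Unknown handling of missing values."
                     f"\nColumns: {columns}\nLine: {line}\nValues: {values}")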


def generate_header(header: list[str]):
    """ Manually parse the header (hopefully only once).  """
    # Drop empty lines; the positions used below are tied to the current page layout.
    lines = list(filter(bool, ''.join(header).split('\n')))
    name, description, status = list(filter(str.strip, lines[0].split('  ')))
    date = lines[1].strip()
    county, state = list(map(str.strip, lines[2].split(',')))
    election_date = lines[3].strip()
    columns = list(filter(str.strip, lines[4].split('  ')))

    # Placeholder: the summary rows are not parsed yet.
    summary = {}
    for row in lines[5:11]:
        pass

    return Namespace(
            name=name,
            description=description,
            status=status,
            date=date,
            county=county,
            state=state,
            election_date=election_date,
            columns=columns,
            summary=summary
    )


def generate_body(body: list[str], columns=None):
    clean_body = list(map(str.strip, ''.join(body).split('\n')))
    # https://stackoverflow.com/a/52943710/10961342
    sections = [list(group) for key, group in itertools.groupby(clean_body, key=bool) if key]
    metadata = []

    for section in sections:
        function = section[0]
        vote = [row.startswith('VOTE FOR') for row in section].index(True)  # locate where `VOTE FOR`
        info = ' '.join(map(str.strip, section[1:vote + 2]))
        candidates = []

        for candidate in section[vote + 2:]:
            name = candidate.split('.')[0].strip()
            numbers = candidate.rsplit('.  .')[-1]
            data = parse_numbers(numbers, columns)
            candidates.append({"name": name, "data": data})

        metadata.append({"function": function, "info": info, "candidates": candidates})

    pprint(metadata, sort_dicts=False)
    return Namespace(body=metadata)


if __name__ == '__main__':
    # Retrieve the original data set.
    # get_data('https://results.co.kendall.il.us/')
    clean_data()
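
Because parse_numbers and generate_header split on a literal two-space string, stray spaces survive in keys such as ' %' and in values such as ' 62.20'. A minimal sketch of one way to avoid that is to split the column header on runs of two or more spaces and the numeric fields on any whitespace. The helper names parse_columns and parse_values are hypothetical, and the header line used here is reconstructed from the column names only, so its exact spacing is an assumption.

```python
import re


def parse_columns(header_line):
    """Split a column header on runs of 2+ spaces so names keep their inner spaces."""
    return re.split(r"\s{2,}", header_line.strip())


def parse_values(numbers, columns):
    """Pair whitespace-separated numeric fields with the column names."""
    values = numbers.split()
    if len(values) != len(columns):
        # Unlike parse_numbers, short rows are rejected instead of zero-filled.
        raise ValueError(f"Expected {len(columns)} fields, got {values}")
    return dict(zip(columns, values))


# Hypothetical header line; only the column names are taken from the page.
columns = parse_columns("  TOTAL VOTES     %    ELECTION DAY    EV, VBM    PROV, POST")
row = parse_values("       237   62.20           207            30             0", columns)
print(row)
# {'TOTAL VOTES': '237', '%': '62.20', 'ELECTION DAY': '207', 'EV, VBM': '30', 'PROV, POST': '0'}
```

Column names that contain a single space, such as "EV, VBM", stay intact because only runs of two or more spaces count as separators.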

Output

[{'function': 'AURORA MAYOR',
  'info': 'VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)',
  'candidates': [{'name': 'RICHARD C',
                  'data': {'TOTAL VOTES': '237',
                           ' %': ' 62.20',
                           'ELECTION DAY': ' 207',
                           ' EV, VBM': '30',
                           'PROV, POST': ' 0'}},
                 {'name': 'JUDD LOFCHIE',
                  'data': {'TOTAL VOTES': ' 59',
                           ' %': ' 15.49',
                           'ELECTION DAY': '56',
                           ' EV, VBM': ' 3',
                           'PROV, POST': ' 0'}},
                 {'name': 'JOHN LAESCH',
                  'data': {'TOTAL VOTES': ' 85',
                           ' %': ' 22.31',
                           'ELECTION DAY': '63',
                           ' EV, VBM': '22',
                           'PROV, POST': ' 0'}}]},
 {'function': 'AURORA ALDERMAN AT LARGE',
  'info': 'VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)',
  'candidates': [{'name': 'RON WOERMAN',
                  'data': {'TOTAL VOTES': '117',
                           ' %': ' 34.01',
                           'ELECTION DAY': ' 106',
                           ' EV, VBM': '11',
                           'PROV, POST': ' 0'}},
                 {'name': 'BROOKE SHANLEY',
                  'data': {'TOTAL VOTES': '168',
                           ' %': ' 48.84',
                           'ELECTION DAY': ' 136',
                           ' EV, VBM': '32',
                           'PROV, POST': ' 0'}},
                 {'name': 'RAYMOND HULL',
                  'data': {'TOTAL VOTES': ' 59',
                           ' %': ' 17.15',
                           'ELECTION DAY': '52',
                           ' EV, VBM': ' 7',
                           'PROV, POST': ' 0'}}]},
 {'function': 'AURORA ALDERMAN WARD 9',
  'info': 'VOTE FOR  1 (WITH 3 OF 3 PRECINCTS COUNTED)',
  'candidates': [{'name': 'EDWARD J',
                  'data': {'TOTAL VOTES': '339',
                           ' %': '100.00',
                           'ELECTION DAY': ' 285',
                           ' EV, VBM': '54',
                           'PROV, POST': ' 0'}}]},
 {'function': 'JOLIET COUNCILMAN AT LARGE',
  'info': 'VOTE FOR  3 (WITH 7 OF 7 PRECINCTS COUNTED)',
  'candidates': [{'name': 'GLENDA WRIGHT-McCULLUM',
                  'data': {'TOTAL VOTES': ' 96',
                           ' %': '7.81',
                           'ELECTION DAY': '91',
                           ' EV, VBM': ' 5',
                           'PROV, POST': ' 0'}},
                 {'name': 'NICOLE LURRY',
                  'data': {'TOTAL VOTES': ' 77',
                           ' %': '6.27',
                           'ELECTION DAY': '70',
                           ' EV, VBM': ' 7',
                           'PROV, POST': ' 0'}},
                 {'name': 'JEREMY BRZYCKI',
                  'data': {'TOTAL VOTES': ' 90',
                           ' %': '7.32',
                           'ELECTION DAY': '78',
                           ' EV, VBM': '12',
                           'PROV, POST': ' 0'}},
                 {'name': 'CESAR GUERRERO',
                  'data': {'TOTAL VOTES': '106',
                           ' %': '8.62',
                           'ELECTION DAY': '95',
                           ' EV, VBM': '11',
                           'PROV, POST': ' 0'}},
                 {'name': 'ISIAH WILLIAMS JR',
                  'data': {'TOTAL VOTES': ' 47',
                           ' %': '3.82',
                           'ELECTION DAY': '45',
                           ' EV, VBM': ' 2',
                           'PROV, POST': ' 0'}},
                 {'name': 'HUDSON HOLLISTER',
                  'data': {'TOTAL VOTES': ' 84',
                           ' %': '6.83',
                           'ELECTION DAY': '72',
                           ' EV, VBM': '12',
                           'PROV, POST': ' 0'}},
                 {'name': 'JAMES LANHAM',
                  'data': {'TOTAL VOTES': ' 32',
                           ' %': '2.60',
                           'ELECTION DAY': '29',
                           ' EV, VBM': ' 3',
                           'PROV, POST': ' 0'}},
                 {'name': 'ROGER POWELL',
                  'data': {'TOTAL VOTES': ' 56',
                           ' %': '4.56',
                           'ELECTION DAY': '55',
                           ' EV, VBM': ' 1',
                           'PROV, POST': ' 0'}},
                 {'name': 'WARREN C',
                  'data': {'TOTAL VOTES': ' 76',
                           ' %': '6.18',
                           'ELECTION DAY': '66',
                           ' EV, VBM': '10',
                           'PROV, POST': ' 0'}},
                 {'name': 'ROBERT WUNDERLICH',
                  'data': {'TOTAL VOTES': '166',
                           ' %': ' 13.51',
                           'ELECTION DAY': ' 149',
                           ' EV, VBM': '17',
                           'PROV, POST': ' 0'}},
                 {'name': 'JOE CLEMENT',
                  'data': {'TOTAL VOTES': '203',
                           ' %': ' 16.52',
                           'ELECTION DAY': ' 190',
                           ' EV, VBM': '13',
                           'PROV, POST': ' 0'}},
                 {'name': 'JAN QUILLMAN',
                  'data': {'TOTAL VOTES': '196',
                           ' %': ' 15.95',
                           'ELECTION DAY': ' 184',
                           ' EV, VBM': '12',
                           'PROV, POST': ' 0'}}]},
 {'function': 'PLANO MAYOR',
  'info': 'VOTE FOR  1 (WITH 11 OF 11 PRECINCTS COUNTED)',
  'candidates': [{'name': 'ROBERT "BOB" HAUSLER (IND)',
                  'data': {'TOTAL VOTES': '388',
                           ' %': ' 48.50',
                           'ELECTION DAY': ' 336',
                           ' EV, VBM': '52',
                           'PROV, POST': ' 0'}},
                 {'name': 'MIKE RENNELS (IND)',
                  'data': {'TOTAL VOTES': '412',
                           ' %': ' 51.50',
                           'ELECTION DAY': ' 352',
                           ' EV, VBM': '60',
                           'PROV, POST': ' 0'}}]},
...
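
Names with a middle initial are cut short in this output ('RICHARD C', 'EDWARD J', 'WARREN C') because generate_body splits each candidate line at the first period. A minimal sketch of an alternative is to split at the dot leader instead, assuming the leader always consists of at least two periods that are each followed by whitespace, as in the sample section. The function name split_candidate_line is hypothetical.

```python
import re

# A dot leader: two or more periods, each followed by whitespace (".  .  .  ").
LEADER = re.compile(r"\s*(?:\.\s+){2,}")


def split_candidate_line(line):
    """Split 'NAME .  .  .   237  62.20 ...' into (name, list of numeric fields)."""
    match = LEADER.search(line)
    if match is None:
        # Assumption violated: no dot leader between the name and the numbers.
        raise ValueError(f"No dot leader found in line: {line!r}")
    name = line[:match.start()].strip()
    numbers = line[match.end():].split()
    return name, numbers


sample = " RICHARD C. IRVIN .  .  .  .  .  .  .  .        237   62.20           207            30             0"
print(split_candidate_line(sample))
# ('RICHARD C. IRVIN', ['237', '62.20', '207', '30', '0'])
```

A single period inside a name is not treated as a leader, so "C." stays part of the name. A name that itself ends in a period right before the leader (for example "JR.") would still lose that final period.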
