How do I scrape a website for the winners

Problem description

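Hi, I'm trying to scrape this website with Python 3, and I noticed that the page source gives no clear indication of how I would scrape the names of the winners of these primaries. Can you show me how to use this website to collect a list of all the winners in each MD primary?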

https://elections2018.news.baltimoresun.com/results/

Tags: python, web-scraping

Solution


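Parsing is a little complicated because the results are spread across many subpages. This script collects them and prints the results (all the data is stored in the variable data):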

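from bs4 import BeautifulSoup
import requests

base_url = "https://elections2018.news.baltimoresun.com/results/"
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'lxml')

# Maps (race name, party) -> list of (candidate, votes, percent) rows.
data = {}

# Each race on the index page is a div whose id starts with "race";
# the second hyphen-separated piece of that id names the contest subpage.
for race in soup.select('div[id^=race]'):
    contest_id = race['id'].split('-')[1]
    r = requests.get(f"{base_url}contests/{contest_id}.html")
    s = BeautifulSoup(r.text, 'lxml')
    rows = []
    data[(s.find('h3').text, s.find('div', {'class': 'party-header'}).text)] = rows

    # The candidate, votes and percent cells line up row by row.
    for candidate, votes, percent in zip(s.select('td.candidate'), s.select('td.votes'), s.select('td.percent')):
        rows.append((candidate.text, votes.text, percent.text))

print('Winners:')
for (race, party), rows in data.items():
    # Treats the first row of each contest as the winner; skips contests with no rows.
    if rows:
        print(race, party, rows[0])

# print(data)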

Output:

Winners:
Governor / Lt. Governor Democrat ('Ben Jealous and Susan Turnbull', '227,764', '39.6%')
U.S. Senator Republican ('Tony Campbell', '50,915', '29.2%')
U.S. Senator Democrat ('Ben Cardin', '468,909', '80.4%')
State's Attorney Democrat ('Marilyn J. Mosby', '39,519', '49.4%')
County Executive Democrat ('John "Johnny O" Olszewski, Jr.', '27,270', '32.9%')
County Executive Republican ('Al Redmer, Jr.', '17,772', '55.7%')
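
The script above takes the first row of each contest as the winner, which depends on the site listing candidates in order of votes. If that ordering cannot be relied on, one possible alternative is to compare the vote counts directly. The following is only a minimal sketch under that assumption: parse_votes and pick_winners are hypothetical helper names, not part of the original answer, and the sample dictionary mixes one row taken from the output above with a made-up placeholder row, purely for illustration.

```python
def parse_votes(text):
    """Convert a vote string such as '227,764' into an int."""
    return int(text.replace(",", "").strip())


def pick_winners(data):
    """Return {(race, party): row} where row has the highest vote count."""
    winners = {}
    for key, rows in data.items():
        if not rows:  # contest page with no candidate rows: nothing to pick
            continue
        winners[key] = max(rows, key=lambda row: parse_votes(row[1]))
    return winners


# Sample shaped like the `data` dictionary built by the script above.
# The "Ben Cardin" row is copied from the output shown; the other row
# is a made-up placeholder, deliberately listed first.
sample = {
    ("U.S. Senator", "Democrat"): [
        ("Placeholder Candidate", "1,234", "0.2%"),  # hypothetical row
        ("Ben Cardin", "468,909", "80.4%"),
    ],
}

for (race, party), row in pick_winners(sample).items():
    print(race, party, row)
```

With the real scraper, pick_winners(data) could replace the final loop. Ties go to whichever row appears first, since max returns the first maximal element.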
