How to extract data from a web page using bs4

Problem description

I am trying to scrape chart data from this web page: 'https://cawp.rutgers.edu/women-percentage-2020-candidates'

I tried using this code to extract the data from the charts:

import pandas as pd
import requests
from bs4 import BeautifulSoup

Res = requests.get('https://cawp.rutgers.edu/women-percentage-2020-candidates').text
soup = BeautifulSoup(Res, "html.parser")

# Look for the SVG groups that appear to hold the chart values and legend entries
Values = [i.text for i in soup.find_all('g', {'class': 'igc-graph'}) if i]
Dates = [i.text for i in soup.find_all('g', {'class': 'igc-legend-entry'}) if i]

print(Values, Dates)  # both lists are empty
Data = pd.DataFrame({'Value': Values, 'Date': Dates})  # returns an empty DataFrame

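I want to extract the dates and values from all four bar charts. Could anyone suggest what I need to do to extract the chart data, or whether there is another approach I could try? Thanks.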

Tags: python-3.x, web-scraping, beautifulsoup, python-requests

Solution


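You can try this script to extract some of the data from the page. The charts appear to be Infogram embeds, so their values are not in the HTML that requests downloads from the page itself, which would explain the empty lists. Instead, the script fetches each embed page from e.infogram.com and parses the JSON assigned to window.infographicData: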

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://cawp.rutgers.edu/women-percentage-2020-candidates'
infogram_url = 'https://e.infogram.com/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def find_data(d):
    if isinstance(d, dict):
        for k, v in d.items():
            if k == 'data' and isinstance(v, list):
                yield v
            else:
                yield from find_data(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_data(v)

for i in soup.select('.infogram-embed'):
    print(i['data-title'])

    html_data = requests.get(infogram_url + i['data-id']).text

    data = re.search(r'window\.infographicData=({.*})', html_data).group(1)
    data = json.loads(data)

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in find_data(data):
        print(d)

    print('-' * 80)

Prints:

Candidate Tracker 2020_US House_Proportions
[[['', 'Districts Already Filed'], ['2020', '435']]]
[[['', '2016', '2018', '2020'], ['Filed', '17.8%', '24.2%', '29.1%']], [['', '2016', '2018', '2020'], ['Filed', '25.1%', '32.5%', '37.9%']], [['', '2016', '2018', '2020'], ['Filed', '11.5%', '13.7%', '21.3%']]]
--------------------------------------------------------------------------------
Candidate Tracker Nominees 2020_US House_Proportions
[[['', 'Possible Major-Party Nominations Decided', 'Possible Major-Party Nominations Left to be Decided'], ['2020', '829', '18']]]
[[['', '', '2018', '2020'], ['Percent of Nominees', '', '28.4%', '35.6%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '42.4%', '48.3%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '13.2%', '22.5%']]]
--------------------------------------------------------------------------------
Candidate Tracker 2020_US Senate_Proportions
[[['', 'States with Senate Contests Already Filed'], ['2020', '34']]]
[[['', '', '2018', '2020'], ['Filed', '', '20.9%', '23.9%']], [['', '', '2018', '2020'], ['Filed', '', '32.6%', '31.1%']], [['', '', '2018', '2020'], ['Filed', '', '14%', '17.4%']]]
--------------------------------------------------------------------------------
Candidate Tracker Nominees 2020_US Senate_Proportions
[[['', 'Decided', 'Left to be Decided'], ['2020', '29', '6']]]
[[['', '', '2018', '2020'], ['Percent of Nominees', '', '32.6%', '31.6%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '42.9%', '39.3%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '23.5%', '24.1%']]]
--------------------------------------------------------------------------------
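
The two core steps of the script (pulling the JSON out of the embed page with a regular expression, then walking it recursively for every list stored under a 'data' key) can be tried offline. The sketch below is only illustrative: sample_html is a hypothetical stand-in for an embed page, and its JSON layout ("elements", "chart") is an assumption, not the real Infogram structure.

```python
import json
import re


def find_data(d):
    """Recursively yield every list stored under a 'data' key."""
    if isinstance(d, dict):
        for k, v in d.items():
            if k == 'data' and isinstance(v, list):
                yield v
            else:
                yield from find_data(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_data(v)


# Hypothetical stand-in for an embed page; the JSON layout is an assumption.
sample_html = (
    '<script>window.infographicData='
    '{"elements": [{"chart": {"data": [[["", "2018", "2020"], '
    '["Filed", "20.9%", "23.9%"]]]}}]};</script>'
)

# Same pattern as the script: capture everything from the first "{" after
# the assignment up to the last "}" on that line.
match = re.search(r'window\.infographicData=({.*})', sample_html)
payload = json.loads(match.group(1))

found = list(find_data(payload))
print(found)
```

Because the pattern is greedy, it relies on the JSON object being the last thing on its line that ends with a closing brace. If the real page places other braces after it, json.loads would raise an error and the pattern would need tightening.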

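Since the question asks for dates and values in tabular form, one possible follow-up is to flatten each printed sheet into one record per date. This is a minimal sketch, assuming every sheet has the shape [header_row, value_row, ...] seen in the output; sheet_to_records and the 'Series', 'Date' and 'Value' keys are hypothetical names chosen for illustration.

```python
def sheet_to_records(sheet):
    """Turn one [header_row, value_row, ...] sheet into flat records."""
    header, *rows = sheet
    records = []
    for row in rows:
        series = row[0]  # row label, e.g. 'Filed'
        for date, value in zip(header[1:], row[1:]):
            if not date or not value:
                continue  # skip the blank padding columns seen in the output
            records.append({
                'Series': series,
                'Date': date,
                'Value': float(value.rstrip('%')),  # '17.8%' -> 17.8
            })
    return records


# First sheet of the second list printed under
# 'Candidate Tracker 2020_US House_Proportions'.
sheet = [['', '2016', '2018', '2020'], ['Filed', '17.8%', '24.2%', '29.1%']]
records = sheet_to_records(sheet)
for record in records:
    print(record)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame(records) to get the DataFrame the question was after. The output does not say what each of the three sheets per chart represents, so any extra labelling would have to come from the page itself.
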