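python-3.x - How to extract data from a web page using bs4
Question
I am trying to scrape chart data from this web page: 'https://cawp.rutgers.edu/women-percentage-2020-candidates'
I tried the following code to extract the data from the graph: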
import pandas as pd
import requests
from bs4 import BeautifulSoup
Res = requests.get('https://cawp.rutgers.edu/women-percentage-2020-candidates').text
soup = BeautifulSoup(Res, "html.parser")
Values = [i.text for i in soup.find_all('g', {'class': 'igc-graph'}) if i]
Dates = [i.text for i in soup.find_all('g', {'class': 'igc-legend-entry'}) if i]
print(Values, Dates)  ## both lists are empty
Data = pd.DataFrame({'Value': Values, 'Date': Dates})  ## returns an empty DataFrame
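I want to extract the dates and values from all 4 bar charts. Can anyone suggest what I need to do to extract the graph data, or whether there is another way to get it? Thanks.
Solution
You can try this script to extract some of the data from the page: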
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://cawp.rutgers.edu/women-percentage-2020-candidates'
infogram_url = 'https://e.infogram.com/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def find_data(d):
    # Recursively walk the parsed JSON and yield every list stored under a 'data' key.
    if isinstance(d, dict):
        for k, v in d.items():
            if k == 'data' and isinstance(v, list):
                yield v
            else:
                yield from find_data(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_data(v)

for i in soup.select('.infogram-embed'):
    print(i['data-title'])

    # Each embed's data lives on a separate Infogram page identified by data-id.
    html_data = requests.get(infogram_url + i['data-id']).text

    data = re.search(r'window\.infographicData=({.*})', html_data).group(1)
    data = json.loads(data)

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in find_data(data):
        print(d)

    print('-' * 80)
Prints:
Candidate Tracker 2020_US House_Proportions
[[['', 'Districts Already Filed'], ['2020', '435']]]
[[['', '2016', '2018', '2020'], ['Filed', '17.8%', '24.2%', '29.1%']], [['', '2016', '2018', '2020'], ['Filed', '25.1%', '32.5%', '37.9%']], [['', '2016', '2018', '2020'], ['Filed', '11.5%', '13.7%', '21.3%']]]
--------------------------------------------------------------------------------
Candidate Tracker Nominees 2020_US House_Proportions
[[['', 'Possible Major-Party Nominations Decided', 'Possible Major-Party Nominations Left to be Decided'], ['2020', '829', '18']]]
[[['', '', '2018', '2020'], ['Percent of Nominees', '', '28.4%', '35.6%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '42.4%', '48.3%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '13.2%', '22.5%']]]
--------------------------------------------------------------------------------
Candidate Tracker 2020_US Senate_Proportions
[[['', 'States with Senate Contests Already Filed'], ['2020', '34']]]
[[['', '', '2018', '2020'], ['Filed', '', '20.9%', '23.9%']], [['', '', '2018', '2020'], ['Filed', '', '32.6%', '31.1%']], [['', '', '2018', '2020'], ['Filed', '', '14%', '17.4%']]]
--------------------------------------------------------------------------------
Candidate Tracker Nominees 2020_US Senate_Proportions
[[['', 'Decided', 'Left to be Decided'], ['2020', '29', '6']]]
[[['', '', '2018', '2020'], ['Percent of Nominees', '', '32.6%', '31.6%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '42.9%', '39.3%']], [['', '', '2018', '2020'], ['Percent of Nominees', '', '23.5%', '24.1%']]]
--------------------------------------------------------------------------------
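The question asks for dates and values, so the nested lists printed above still need flattening. Below is a minimal sketch of one way to do that, using the first percentage result from the output as sample input. It assumes each inner table has the shape [header_row, value_row], which holds for the percentage tables shown but not for the count tables such as "Districts Already Filed". The helper name `to_records` and the `chart` index are hypothetical. The output does not say what each of the three tables represents, so the index is only a placeholder.

```python
# Sample input copied from the printed output above (US House, "Filed").
# Each inner table is assumed to be [header_row, value_row].
tables = [
    [['', '2016', '2018', '2020'], ['Filed', '17.8%', '24.2%', '29.1%']],
    [['', '2016', '2018', '2020'], ['Filed', '25.1%', '32.5%', '37.9%']],
    [['', '2016', '2018', '2020'], ['Filed', '11.5%', '13.7%', '21.3%']],
]


def to_records(tables):
    """Flatten [header, values] tables into one dict per (chart, date) pair."""
    records = []
    for chart_index, (header, values) in enumerate(tables):
        label = values[0]  # first cell of the value row, e.g. 'Filed'
        for date, value in zip(header[1:], values[1:]):
            if not date or not value:
                continue  # skip empty placeholder columns
            records.append({
                'chart': chart_index,  # hypothetical placeholder: position in the list
                'label': label,
                'date': date,
                'value': float(value.rstrip('%')),  # '17.8%' -> 17.8
            })
    return records


records = to_records(tables)
for row in records:
    print(row)
```

If pandas is installed, passing `records` to `pd.DataFrame` should give a table with `chart`, `label`, `date` and `value` columns.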