python-3.x - How to extract data from a graph on a web page?
Problem description
I am trying to scrape graph data from this web page: https://cawp.rutgers.edu/women-percentage-2020-candidates
I tried the code below to extract the data from the graph:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Res = requests.get('https://cawp.rutgers.edu/women-percentage-2020-candidates').text
soup = BeautifulSoup(Res, "html.parser")
Values = [i.text for i in soup.findAll('g', {'class': 'igc-graph'}) if i]
Dates = [i.text for i in soup.findAll('g', {'class': 'igc-legend-entry'}) if i]
print(Values, Dates)  ## both lists are empty
Data = pd.DataFrame({'Value': Values, 'Date': Dates})  ## returns an empty DataFrame
I want to extract the Date and Value from all 4 bar graphs. Can anyone suggest what I need to change to extract the graph data, or another method I could try? Thanks.
Solution
The graph is served from this URL: https://e.infogram.com/5bb50948-04b2-4113-82e6-5e5f06236538
You can find the Infogram ID (the path of the target URL) directly in the original page: look for the div with class infogram-embed, whose data-id attribute holds the value:
<div class="infogram-embed" data-id="5bb50948-04b2-4113-82e6-5e5f06236538" data-title="Candidate Tracker 2020_US House_Proportions" data-type="interactive"> </div>
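As a small illustration of that lookup, here is a dependency-free sketch that reads the data-id from the div shown above using only the standard library's html.parser. The class name InfogramIdParser and the variable names are made up for this example, and the HTML string is simply the snippet from the page rather than a live download.

```python
from html.parser import HTMLParser


class InfogramIdParser(HTMLParser):
    """Collects data-id values from tags whose class list includes 'infogram-embed'.

    Hypothetical helper written for this example; it is not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "infogram-embed" in classes and attr_map.get("data-id"):
            self.ids.append(attr_map["data-id"])


# The embed div exactly as it appears in the original page
html = (
    '<div class="infogram-embed" '
    'data-id="5bb50948-04b2-4113-82e6-5e5f06236538" '
    'data-title="Candidate Tracker 2020_US House_Proportions" '
    'data-type="interactive"> </div>'
)

parser = InfogramIdParser()
parser.feed(html)

# Build the embed URL from each ID that was found
infogram_urls = [f"https://e.infogram.com/{infogram_id}" for infogram_id in parser.ids]
print(infogram_urls)
```

Collecting the IDs into a list means the same sketch would also cope with a page that embeds more than one Infogram chart.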
That URL loads a static JSON object embedded in JavaScript. You can use a regex to extract it, then parse the JSON structure to get the rows and columns of each table:
import json
import re

import requests
from bs4 import BeautifulSoup

original_url = "https://cawp.rutgers.edu/women-percentage-2020-candidates"
r = requests.get(original_url)
soup = BeautifulSoup(r.text, "html.parser")

# The embed div carries the Infogram ID in its data-id attribute
infogram_url = f'https://e.infogram.com/{soup.find("div",{"class":"infogram-embed"})["data-id"]}'
r = requests.get(infogram_url)
soup = BeautifulSoup(r.text, "html.parser")

# Pick the script that assigns window.infographicData
script = [
    t
    for t in soup.findAll("script")
    if "window.infographicData" in t.text
][0].text

# Capture the JSON literal between the assignment and the trailing semicolon
extract = re.search(r".*window\.infographicData=(.*);$", script)
data = json.loads(extract.group(1))

entities = data["elements"]["content"]["content"]["entities"]

# Keep only entities that hold chart data, as (sheet names, sheet data) pairs
tables = [
    (entities[key]["props"]["chartData"]["sheetnames"], entities[key]["props"]["chartData"]["data"])
    for key in entities.keys()
    if ("props" in entities[key]) and ("chartData" in entities[key]["props"])
]

data = []
for t in tables:
    for i, sheet in enumerate(t[0]):
        data.append({
            "sheetName": sheet,
            # Map the header row of this sheet to its first data row
            "table": dict([(t[1][i][0][j], t[1][i][1][j]) for j in range(len(t[1][i][0]))])
        })
print(data)
Output:
[{'sheetName': 'Sheet 1',
'table': {'': '2020', 'Districts Already Filed': '435'}},
{'sheetName': 'All',
'table': {'': 'Filed', '2016': '17.8%', '2018': '24.2%', '2020': '29.1%'}},
{'sheetName': 'Democrats Only',
'table': {'': 'Filed', '2016': '25.1%', '2018': '32.5%', '2020': '37.9%'}},
{'sheetName': 'Republicans Only',
'table': {'': 'Filed', '2016': '11.5%', '2018': '13.7%', '2020': '21.3%'}}]
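Since the question asks for Date and Value columns, the output above can be flattened into one row per sheet and year. The following is a minimal sketch that starts from the printed output as a literal, so it runs without network access. The helper name to_rows and the column names Sheet, Date and Value are choices made for this example, and skipping entries that are not percentages (the row label and the "Sheet 1" district count) is an assumption about what is wanted.

```python
# The result printed above, copied in as a literal
data = [
    {"sheetName": "Sheet 1",
     "table": {"": "2020", "Districts Already Filed": "435"}},
    {"sheetName": "All",
     "table": {"": "Filed", "2016": "17.8%", "2018": "24.2%", "2020": "29.1%"}},
    {"sheetName": "Democrats Only",
     "table": {"": "Filed", "2016": "25.1%", "2018": "32.5%", "2020": "37.9%"}},
    {"sheetName": "Republicans Only",
     "table": {"": "Filed", "2016": "11.5%", "2018": "13.7%", "2020": "21.3%"}},
]


def to_rows(sheets):
    """Flatten the sheets into one dict per (sheet, date) percentage value."""
    rows = []
    for sheet in sheets:
        for key, value in sheet["table"].items():
            # Skip the row-label column ('' key) and anything that is not a percentage
            if key == "" or not value.endswith("%"):
                continue
            rows.append({
                "Sheet": sheet["sheetName"],
                "Date": key,
                "Value": float(value.rstrip("%")),  # "17.8%" becomes 17.8
            })
    return rows


rows = to_rows(data)
for row in rows:
    print(row)
```

If pandas is installed, passing the list to pd.DataFrame(rows) should produce the kind of table the question was aiming for, with one line per sheet and year.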