python - Pandas json_normalize: flattening a dictionary with values as columns
Problem description
Here is a sample of the data I am trying to flatten:
location = [{'id': 225,
  'country': 'US',
  'country_code': 'US',
  'country_population': 327167434,
  'province': '',
  'last_updated': '2020-05-06T11:33:46.184263Z',
  'coordinates': {'latitude': '37.0902', 'longitude': '-95.7129'},
  'latest': {'confirmed': 1204351, 'deaths': 71064, 'recovered': 0},
  'timelines': {
    'confirmed': {
      'latest': 1204351,
      'timeline': {
        '2020-01-22T00:00:00Z': 1,
        '2020-01-23T00:00:00Z': 1,
        '2020-01-24T00:00:00Z': 2}
    },
    'deaths': {
      'latest': 71064,
      'timeline': {
        '2020-01-22T00:00:00Z': 0,
        '2020-01-23T00:00:00Z': 0,
        '2020-01-24T00:00:00Z': 0}
    }
  }
}]
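I am unable to get the timelines for confirmed and deaths into a usable format. I either get a column for each date:
pd.json_normalize(
    data = location,
    record_path=['timelines','confirmed','timeline'])
Or rows with the dates but without the count values: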
pd.json_normalize(data = location[0]['timelines']['confirmed'])
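The desired output is something like:
Thanks in advance; any help is much appreciated.
Solution
Taking @Trenton Mckinney's data a step further, we can do all the processing outside pandas and put the finished product into a dataframe:
Step 1: Create a function to process the data: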
import pandas as pd

def extract_data(location, keyword):
    collection = []
    for ent in location:
        # get the timeline data for either 'confirmed' or 'deaths',
        # determined by the keyword
        b = ent.get('timelines').get(keyword).get('timeline')
        # switch from a dict to a list of (date, count) pairs
        items = list(b.items())
        # to each pair in the items list,
        # append the country and population
        for val in items:
            val = val + (ent.get('country'), ent.get('country_population'))
            # collect each result into the collection list
            collection.append(val)
    # create a dataframe
    cols = ['date', 'count', 'country', 'country_population']
    res = pd.DataFrame(collection, columns=cols)
    res = res.set_index(['country', 'country_population'])
    # prefix the column names depending on the keyword
    if keyword == "confirmed":
        res = res.add_prefix('confirmed.timeline.')
    elif keyword == "deaths":
        res = res.add_prefix('deaths.timeline.')
    return res
Step 2: Apply the function once per keyword, 'confirmed' or 'deaths':
confirmed = extract_data(location,'confirmed')
deaths = extract_data(location,'deaths')
Step 3: Concatenate the dataframes along the column axis:
pd.concat([confirmed,deaths],axis=1)
                             confirmed.timeline.date  confirmed.timeline.count  deaths.timeline.date  deaths.timeline.count
country  country_population
US       327167434           2020-01-22T00:00:00Z     1                         2020-01-22T00:00:00Z  0
         327167434           2020-01-23T00:00:00Z     1                         2020-01-23T00:00:00Z  0
         327167434           2020-01-24T00:00:00Z     2                         2020-01-24T00:00:00Z  0
AF       327167435           2020-02-22T00:00:00Z     2                         2020-02-22T00:00:00Z  1
         327167435           2020-02-23T00:00:00Z     2                         2020-02-23T00:00:00Z  1
         327167435           2020-02-24T00:00:00Z     3                         2020-02-24T00:00:00Z  1
AS       327167436           2020-03-22T00:00:00Z     3                         2020-03-22T00:00:00Z  2
         327167436           2020-03-23T00:00:00Z     3                         2020-03-23T00:00:00Z  2
         327167436           2020-03-24T00:00:00Z     4                         2020-03-24T00:00:00Z  2
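The two frames line up in the concatenation only because both were built in the same order: the (country, country_population) index repeats for every date, so it cannot tell the rows apart. A minimal sketch of one way to make the alignment explicit is to put the date into the index as well. The helper name `timeline_counts` and the trimmed single-record `location` below are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Trimmed copy of the sample record; only the fields used below are kept.
location = [{
    'country': 'US',
    'country_population': 327167434,
    'timelines': {
        'confirmed': {'latest': 1204351, 'timeline': {
            '2020-01-22T00:00:00Z': 1,
            '2020-01-23T00:00:00Z': 1,
            '2020-01-24T00:00:00Z': 2}},
        'deaths': {'latest': 71064, 'timeline': {
            '2020-01-22T00:00:00Z': 0,
            '2020-01-23T00:00:00Z': 0,
            '2020-01-24T00:00:00Z': 0}},
    },
}]

def timeline_counts(location, keyword):
    """Hypothetical helper: one count column, indexed by country, population and date."""
    rows = [
        (ent['country'], ent['country_population'], date, count)
        for ent in location
        for date, count in ent['timelines'][keyword]['timeline'].items()
    ]
    index_cols = ['country', 'country_population', 'date']
    df = pd.DataFrame(rows, columns=index_cols + [f'{keyword}.timeline.count'])
    return df.set_index(index_cols)

# With the date in the index, concat matches rows by label rather than by position.
aligned = pd.concat(
    [timeline_counts(location, 'confirmed'), timeline_counts(location, 'deaths')],
    axis=1,
)
print(aligned)
```

This variant also keeps a single date per row instead of one date column per keyword.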
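Update: I liked @DanilaGanchar's feedback and decided to rewrite the code. All of the processing happens outside Pandas until the very last moment, and the run time dropped as well: 853 µs.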
d = []
for entry in location:
    # pull a dictionary with the country and population
    country_population = {'country': entry.get('country'),
                          'country_population': entry.get('country_population')}
    # pull the data for the confirmed and deaths timelines
    confirmed_timeline = entry.get('timelines').get('confirmed').get('timeline').items()
    death_timeline = entry.get('timelines').get('deaths').get('timeline').items()
    # set the keys for the deaths and confirmed dicts
    deaths = ('deaths.timeline.date', 'deaths.timeline.count')
    confirmed = ('confirmed.timeline.date', 'confirmed.timeline.count')
    # attach the keys to the values in deaths and confirmed;
    # these become the column names in the dataframe
    confirmz = (dict(zip(confirmed, conf)) for conf in confirmed_timeline)
    deathz = (dict(zip(deaths, death)) for death in death_timeline)
    # aggregate the data into one dict per row
    # and collect the dicts into one list
    for content, cont in zip(confirmz, deathz):
        content.update(country_population)
        content.update(cont)
        d.append(content)
pd.DataFrame(d)
   confirmed.timeline.date  confirmed.timeline.count  country  country_population  deaths.timeline.date  deaths.timeline.count
0  2020-01-22T00:00:00Z     1                         US       327167434           2020-01-22T00:00:00Z  0
1  2020-01-23T00:00:00Z     1                         US       327167434           2020-01-23T00:00:00Z  0
2  2020-01-24T00:00:00Z     2                         US       327167434           2020-01-24T00:00:00Z  0
3  2020-02-22T00:00:00Z     2                         AF       327167435           2020-02-22T00:00:00Z  1
4  2020-02-23T00:00:00Z     2                         AF       327167435           2020-02-23T00:00:00Z  1
5  2020-02-24T00:00:00Z     3                         AF       327167435           2020-02-24T00:00:00Z  1
6  2020-03-22T00:00:00Z     3                         AS       327167436           2020-03-22T00:00:00Z  2
7  2020-03-23T00:00:00Z     3                         AS       327167436           2020-03-23T00:00:00Z  2
8  2020-03-24T00:00:00Z     4                         AS       327167436           2020-03-24T00:00:00Z  2
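Since the question asks about json_normalize specifically, here is a rough sketch of a pandas-only route, offered as one possible alternative and not as the approach used in the answer above. It lets `pd.json_normalize` flatten everything into one wide row per record, then reshapes the per-date columns with `melt` and `unstack`. The variable names, the regular expression, and the trimmed single-record `location` are assumptions made for illustration.

```python
import pandas as pd

# Trimmed copy of the sample record; only the fields used below are kept.
location = [{
    'country': 'US',
    'country_population': 327167434,
    'timelines': {
        'confirmed': {'latest': 1204351, 'timeline': {
            '2020-01-22T00:00:00Z': 1,
            '2020-01-23T00:00:00Z': 1,
            '2020-01-24T00:00:00Z': 2}},
        'deaths': {'latest': 71064, 'timeline': {
            '2020-01-22T00:00:00Z': 0,
            '2020-01-23T00:00:00Z': 0,
            '2020-01-24T00:00:00Z': 0}},
    },
}]

# One wide row per record, with column names such as
# 'timelines.confirmed.timeline.2020-01-22T00:00:00Z'.
flat = pd.json_normalize(location)

id_cols = ['country', 'country_population']
timeline_cols = [c for c in flat.columns if '.timeline.' in c]

# Wide to long: one row per (record, flattened column name).
melted = flat.melt(id_vars=id_cols, value_vars=timeline_cols,
                   var_name='path', value_name='count')
# Records that lack a given date would show up as NaN here; drop them.
melted = melted.dropna(subset=['count'])

# Split the flattened column name into the keyword and the date.
parts = melted['path'].str.extract(
    r'^timelines\.(?P<kind>\w+)\.timeline\.(?P<date>.+)$')
melted = melted.join(parts)

# One row per country and date, one column per keyword.
tidy = (melted.set_index(id_cols + ['date', 'kind'])['count']
              .unstack('kind')
              .reset_index())
print(tidy)
```

The hand-written loops above may well be faster on large inputs, since the wide intermediate frame grows by one column per distinct date; this sketch trades that for fewer lines of reshaping code.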