json - Converting EIA JSON to a DataFrame - Python 3.6
Question
I am trying to convert the JSON file from http://api.eia.gov/bulk/INTL.zip into a DataFrame. Here is my code:
import os, sys,json
import pandas as pd
sourcePath = r"D:\Learn\EIA\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data']) # Delete if blank/NA
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data]) # DF2.data contains list, converting to Data Frame
Error:
Traceback (most recent call last):
  File "D:\python\pyCharm\EIA\EIAINTL2018May.py", line 11, in <module>
    DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data])
  File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2326, in __setitem__
    self._setitem_array(key, value)
  File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
I am stuck, please help.
The result I need is the dates and values contained in the lists of the DF.data column. I also tried the following:
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data]).iloc[:,0:2] # This not working
Updated code after jezrael's solution:
import os, sys, ast
import pandas as pd
sourcePath = r"C:\sunil_plus\dataset\EIAINTL2018May\8_updation2018Aug2\source\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data'])
DF2['Date'] = [[x[0] for x in item] for item in DF2.data]
DF2['Values'] = [[x[1] for x in item] for item in DF2.data]
DF_All = pd.DataFrame(); DF4 = pd.DataFrame()
for series_id in DF2['series_id']:
    DF3 = DF2.loc[DF2['series_id'] == series_id]
    DF4['DateF'] = [item for item in DF3.Date] # Here I need to convert List values to Rows
    DF4['ValuesF'] = [item for item in DF3.Values] # Here I need to convert List values to Rows
    # Above code not working as expected
    DF3 = DF3[['series_id', 'name', 'units', 'geography', 'f']] # Need only these columns
    DF5 = pd.concat([DF3, DF4], axis=1).ffill() # Concat to get DateF & ValuesF Values
    DF_All = DF_All.append(DF5)
Solution
You can use two list comprehensions to pull out the first and second values of each nested list:
import pandas as pd

sourcePath = r"D:\Learn\EIA\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data'])
DF2['Date'] = [[x[0] for x in item] for item in DF2.data]
DF2['Values'] = [[x[1] for x in item] for item in DF2.data]
print (DF2.head())
series_id name \
0 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
1 INTL.51-8-SRB-MMTCD.A CO2 Emissions from the Consumption of Natural ...
2 INTL.51-8-SSD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
3 INTL.51-8-SUN-MMTCD.A CO2 Emissions from the Consumption of Natural ...
4 INTL.51-8-SVK-MMTCD.A CO2 Emissions from the Consumption of Natural ...
units geography f \
0 Million Metric Tons MKD A
1 Million Metric Tons SRB A
2 Million Metric Tons SSD A
3 Million Metric Tons SUN A
4 Million Metric Tons SVK A
data \
0 [[2015, 0.1], [2014, (s)], [2013, (s)], [2012,...
1 [[2015, 4.1], [2014, 3.5], [2013, 4.2], [2012,...
2 [[2011, --], [2010, --], [2006, --], [2003, --...
3 [[2006, --], [2003, --], [2002, --], [2001, --...
4 [[2015, 9.1], [2014, 8.8], [2013, 11], [2012, ...
Date \
0 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
1 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
2 [2011, 2010, 2006, 2003, 2002, 2001, 2000, 199...
3 [2006, 2003, 2002, 2001, 2000, 1999, 1998, 199...
4 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
Values
0 [0.1, (s), (s), 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, ...
1 [4.1, 3.5, 4.2, 5.2, 4.4, 4.1, 3.2, 4.2, 4.1, ...
2 [--, --, --, --, --, --, --, --, --, --, --, -...
3 [--, --, --, --, --, --, --, --, --, --, --, -...
4 [9.1, 8.8, 11, 10, 11, 12, 10, 12, 12, 13, 14,...
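For context on the original error, here is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the EIA file). It suggests why assigning the raw `data` cells to two columns fails: building a frame from the cells yields one column per [date, value] pair, so the width follows the longest list rather than being 2. The two list comprehensions avoid this by splitting each pair by position.

```python
import pandas as pd

# Hypothetical miniature of the 'data' column: each cell is a list of [date, value] pairs.
toy = pd.DataFrame({
    "series_id": ["A", "B"],
    "data": [
        [["2015", 0.1], ["2014", "(s)"], ["2013", 0.2]],
        [["2015", 4.1], ["2014", 3.5]],
    ],
})

# One column per pair: the width equals the longest list (3 here), not 2,
# which is why assigning this to ['Date', 'Value'] raises the ValueError.
wide = pd.DataFrame([item for item in toy["data"]])
print(wide.shape)

# Splitting each pair by position keeps exactly one list per row.
toy["Date"] = [[pair[0] for pair in item] for item in toy["data"]]
toy["Values"] = [[pair[1] for pair in item] for item in toy["data"]]
print(toy[["series_id", "Date", "Values"]])
```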
Edit: You can repeat the rows, one per [date, value] pair, and create two new columns:
sourcePath = 'INTL.txt'
DF = pd.read_json(sourcePath, lines=True)
cols = ['series_id', 'name', 'units', 'geography', 'f', 'data']
DF2 = DF[cols].dropna(subset=['data'])
DF3 = DF2.join(pd.DataFrame(DF2.pop('data').values.tolist())
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('data')
               ).reset_index(drop=True)
DF3[['Date', 'Value']] = pd.DataFrame(DF3['data'].values.tolist())
# To drop the original data column at the same time, use this instead:
# DF3[['Date', 'Value']] = pd.DataFrame(DF3.pop('data').values.tolist())
print (DF3.head())
series_id name \
0 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
1 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
2 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
3 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
4 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
units geography f data Date Value
0 Million Metric Tons MKD A [2015, 0.1] 2015 0.1
1 Million Metric Tons MKD A [2014, (s)] 2014 (s)
2 Million Metric Tons MKD A [2013, (s)] 2013 (s)
3 Million Metric Tons MKD A [2012, 0.2] 2012 0.2
4 Million Metric Tons MKD A [2011, 0.2] 2011 0.2
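As an alternative to the stack/join approach above, newer pandas versions (0.25 and later) provide `DataFrame.explode`, which repeats the other columns once per list element. The sketch below is not the answer's method; it uses a hypothetical `toy` frame with made-up values, and the final `pd.to_numeric` step is an added assumption about how one might handle markers such as `(s)` and `--`.

```python
import pandas as pd

# Hypothetical miniature of the filtered frame; the values are made up for illustration.
toy = pd.DataFrame({
    "series_id": ["A", "B"],
    "units": ["Million Metric Tons", "Million Metric Tons"],
    "data": [
        [["2015", 0.1], ["2014", "(s)"], ["2013", 0.2]],
        [["2015", 4.1], ["2014", 3.5]],
    ],
})

# explode() produces one row per [date, value] pair and repeats the other columns.
long_df = toy.explode("data").reset_index(drop=True)

# Each cell now holds a single pair; split it into two columns.
long_df[["Date", "Value"]] = pd.DataFrame(long_df["data"].tolist(), index=long_df.index)
long_df = long_df.drop(columns="data")

# Assumption: non-numeric markers should become NaN rather than stay as strings.
long_df["Value"] = pd.to_numeric(long_df["Value"], errors="coerce")
print(long_df)
```

Whether coercing the markers to NaN is appropriate depends on what they mean in the source data, so that last step may need adjusting.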