python - How to explode a column with multiple (dict-like) JSON objects per row in pandas?
Problem description
I have a csv file with several columns. The column of interest contains multiple JSON objects per row. It looks like this:
IN: df=pd.read_csv('filename.tsv',sep='\t')
IN: df
OUT: name RSN model version dt si2 si3 pi1 wi20 wi28 li1 ci1 ai1 ai2 ai3 ad1 wi19 wi27 wan2 wan1 li3 li2 li5 li4 li7 li6 li9 li8 wi22 wi21 wi24 wi23 wi26 wi25 wi30 wi29 wi14 wi13 wi16 wi15 wi17 wi18
0 DE1 RSN JCO4032 R2.15 12-03-21 06:53:32:155 14 46 831 5 149 2 0 NaN NaN NaN NaN 0 0 218419 553198 1754335 32208167 18594 28750 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN
1 DE1 RSN JCO4032 R2.15 12-03-21 06:54:04:343 14 46 863 5 149 2 0 NaN NaN NaN NaN 0 0 9063 209 99335 1941734 1084 1598 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN
2 DE1 RSN JCO4032 R2.15 12-03-21 07:04:07:579 13 46 1469 5 149 2 0 NaN NaN NaN NaN 0 0 152680 18355 1656295 29541773 17201 25804 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN
IN: df.wi17
OUT:
35 NaN
36 NaN
37 [{"mac":"2xx01:xxF","rssi":-60,"txrate...
38 [{"mac":"20:4xx:1F","rssi":-60,"txrate...
39 NaN
Name: wi17, dtype: object
IN: df.wi17[37]
OUT: '[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]'
I convert this column of strings and NaNs to dicts using json.loads:
import json
import numpy as np

def parser2(d):
    if d != d:  # NaN is the only value not equal to itself
        return np.nan
    else:
        return json.loads(d)

df.wi17 = df.wi17.apply(parser2)
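For a single cell, json.loads turns the JSON string into a Python list of dicts. A minimal standalone sketch with a shortened, made-up cell value:

import json

# One shortened, made-up cell value; json.loads yields a list of dicts
cell = '[{"mac":"20:47xx:1F","rssi":-60},{"mac":"E8xx:A0","rssi":-57}]'
print(json.loads(cell)[0]["mac"])  # 20:47xx:1F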
I am looking for an elegant solution to explode these dicts and group them by the unique "mac", which in turn is grouped by the unique "RSN" from the original df.
It should look like this:
... RSN .... mac rssi txrate max_txrate txbytes rxbytes nxn ...
... RSNFDXXXKDF ... 2A:xxxx:sd 30 34 50 2323 34323 1x1 ...
... RSNFDXXXKDF ... 2A:xxxx:sd 50 84 70 20 2334343 1x1 ...
... RSNFDXXXKDF ... 3B:yyyy:sd 45 48 47 40 2334 2x2 ...
... RSNFDXXXKDF ... Nan Nan Nan Nan Nan Nan Nan ...
... ADKNCCJXKDF ... AA:yyyy:sd 45 48 47 40 2334 2x2 ...
Any suggestions?
Solution
Let's use df.explode() on the wi17 column data, then df.apply() + pd.Series(), and then pd.concat:
df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1),
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                axis=1).reset_index()
Here, we use df2.apply(lambda x: pd.Series(x.wi17), axis=1) on df2, which has already been exploded from a list of dicts per row into a single dict per row. Using the lambda function together with pd.Series expands each dict's keys and values into column labels and column values.
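A minimal standalone sketch of both steps, with two made-up dicts in a single row (values shortened):

import pandas as pd

# df.explode turns the list of dicts into one dict per row, repeating the
# original index; pd.Series on each dict maps its keys to labels, which
# apply(..., axis=1) then turns into columns.
mini = pd.DataFrame({'wi17': [[{"mac": "a", "rssi": -60}, {"mac": "b", "rssi": -57}]]})
mini2 = mini.explode('wi17')
print(mini2)  # two rows, both with index 0
print(mini2.apply(lambda x: pd.Series(x.wi17), axis=1))
#   mac  rssi
# 0   a   -60
# 0   b   -57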
Demo run
Test data construction
import pandas as pd

data = {'name': ['DE1', 'DE2', 'DE3'], 'RSN': ['RSNJCO4032', 'RSNJCO4033', 'RSNJCO4034']}
df = pd.DataFrame(data)
df['wi17'] = ['[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]', '[{"mac":"40:17xx:1F","rssi":-62,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"F8xx:B0","rssi":-58,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]', '[{"mac":"60:07xx:1F","rssi":-64,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"A8xx:C0","rssi":-61,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]']
import json
import numpy as np

def parser2(d):
    if d != d:  # NaN check, as in the question
        return np.nan
    else:
        return json.loads(d)

df.wi17 = df.wi17.apply(parser2)
print(df)
name RSN wi17
0 DE1 RSNJCO4032 [{'mac': '20:47xx:1F', 'rssi': -60, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'E8xx:A0', 'rssi': -57, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
1 DE2 RSNJCO4033 [{'mac': '40:17xx:1F', 'rssi': -62, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'F8xx:B0', 'rssi': -58, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
2 DE3 RSNJCO4034 [{'mac': '60:07xx:1F', 'rssi': -64, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'A8xx:C0', 'rssi': -61, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
Run the new code
df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1),
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                axis=1).reset_index()
print(df3)
Output:
name RSN mac rssi txrate max_txrate txbytes rxbytes nxn
0 DE1 RSNJCO4032 20:47xx:1F -60 72.0 72.0 0 0 1x1
1 DE1 RSNJCO4032 E8xx:A0 -57 72.0 72.0 1414810891 808725830 1x1
2 DE2 RSNJCO4033 40:17xx:1F -62 72.0 72.0 0 0 1x1
3 DE2 RSNJCO4033 F8xx:B0 -58 72.0 72.0 1414810891 808725830 1x1
4 DE3 RSNJCO4034 60:07xx:1F -64 72.0 72.0 0 0 1x1
5 DE3 RSNJCO4034 A8xx:C0 -61 72.0 72.0 1414810891 808725830 1x1
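As for the grouping by "mac" within "RSN" that the question asks about, a possible follow-up on df3 (not part of the original answer; the aggregations chosen here are only illustrative):

# Hypothetical aggregation per unique ('RSN', 'mac') pair: mean rssi and
# total rxbytes for each client MAC under each device RSN.
agg = df3.groupby(['RSN', 'mac']).agg({'rssi': 'mean', 'rxbytes': 'sum'})
print(agg)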
Edit
For better system performance (execution time), you can try changing the .apply() call to list(map(...)) as follows:
df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1),
                 pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
                axis=1).reset_index()
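To see what the inner expression builds, a small sketch with two made-up dicts (the duplicated index stands in for two entries exploded from the same original row):

import pandas as pd

# map(pd.Series, ...) turns each dict into a labeled Series; the DataFrame
# constructor then stacks them into rows in a single pass instead of a
# row-by-row apply.
rows = [{"mac": "a", "rssi": -60}, {"mac": "b", "rssi": -57}]
print(pd.DataFrame(list(map(pd.Series, rows)), index=[0, 0]))
#   mac  rssi
# 0   a   -60
# 0   b   -57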
Edit 2
Further fine-tuning of system performance (execution time). Benchmarking shows an execution time more than 20x faster by using pd.json_normalize() to expand the JSON structures into a new Pandas DataFrame that is merged into the original DataFrame:
df2 = df.explode('wi17')
df2['wi17'] = df2['wi17'].fillna({i: {} for i in df2.index})  # as suggested by @TwerkingPanda to handle NaN entries
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True),
                 pd.json_normalize(df2['wi17'])],
                axis=1).reset_index()
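The fillna step matters because pd.json_normalize expects dict-like records. A minimal sketch with one made-up NaN entry (field names are illustrative):

import numpy as np
import pandas as pd

# Replace NaN entries with empty dicts so every row is dict-like; the empty
# dicts come back as all-NaN rows in the normalized frame.
s = pd.Series([{"mac": "a", "rssi": -60}, np.nan])
s = s.fillna({i: {} for i in s.index})
print(pd.json_normalize(s))
#   mac  rssi
# 0    a -60.0
# 1  NaN   NaN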
Benchmarking system performance with 30,000 rows (2 JSON objects per row, so 60,000 JSON objects in total):
df1 = pd.concat([df] * 10000, ignore_index=True)
df2 = df1.explode('wi17')
(1) Benchmark of the original version using .apply() with pd.Series():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1),
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                axis=1).reset_index()
21.9 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(2) Benchmark of the revised version using list(map(...)) with pd.Series() and pd.DataFrame():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1),
                 pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
                axis=1).reset_index()
20.6 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(3) Benchmark of the revised version using pd.json_normalize():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True),
                 pd.json_normalize(df2['wi17'])],
                axis=1).reset_index()
999 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit 3
This version further improves system performance (execution time) while also allowing the original Series index to be preserved for the OP, ensuring data integrity.
Since the JSON objects are one level deep with no nesting, we can use the more efficient DataFrame constructor pd.DataFrame() to expand the JSON fields into columns, as follows:
df3 = pd.concat([df2.drop('wi17', axis=1),
                 pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
                axis=1)
Benchmark of the version using pd.DataFrame():
%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1),
                 pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
                axis=1)
116 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This version is a further improvement of more than 8x over the version using pd.json_normalize(), and more than 180x faster than the version using df.apply() + pd.Series(). Moreover, the index= parameter of pd.DataFrame() gives us the flexibility to keep the original DataFrame's index on the expanded columns.
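A small sketch of the index= point, with two made-up entries exploded from the same original row (index label 0): passing index= keeps that duplicated index on the expanded columns, so the concat lines up without a reset_index(drop=True):

import pandas as pd

# Expanded columns keep the exploded Series' (duplicated) index labels.
rows = pd.Series([{"mac": "a"}, {"mac": "b"}], index=[0, 0])
print(pd.DataFrame(rows.to_list(), index=rows.index))
#   mac
# 0   a
# 0   b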