How to explode a column with multiple (dict-like) JSON objects per row in pandas?

Problem Description

I have a tab-separated file with a number of columns. The column of interest contains multiple JSON objects in each row. It looks like this:

IN: df = pd.read_csv('filename.tsv', sep='\t')
IN: df

OUT: name RSN model version dt  si2 si3 pi1 wi20    wi28    li1 ci1 ai1 ai2 ai3 ad1 wi19    wi27    wan2    wan1    li3 li2 li5 li4 li7 li6 li9 li8 wi22    wi21    wi24    wi23    wi26    wi25    wi30    wi29    wi14    wi13    wi16    wi15    wi17    wi18
   0    DE1 RSN JCO4032 R2.15   12-03-21 06:53:32:155   14  46  831 5   149 2   0   NaN NaN NaN NaN 0   0   218419  553198  1754335 32208167    18594   28750   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   NaN NaN NaN 
   1    DE1 RSN JCO4032 R2.15   12-03-21 06:54:04:343   14  46  863 5   149 2   0   NaN NaN NaN NaN 0   0   9063    209 99335   1941734 1084    1598    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   NaN NaN NaN 
   2    DE1 RSN JCO4032 R2.15   12-03-21 07:04:07:579   13  46  1469    5   149 2   0   NaN NaN NaN NaN 0   0   152680  18355   1656295 29541773    17201   25804   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   NaN NaN NaN 

IN: df.wi17
OUT: 
    35                                                  NaN
    36                                                  NaN
    37    [{"mac":"2xx01:xxF","rssi":-60,"txrate...
    38    [{"mac":"20:4xx:1F","rssi":-60,"txrate...
    39                                                  NaN
Name: wi17, dtype: object

IN: df.wi17[37]
OUT: '[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]'

I convert these strings to Python objects with json.loads, leaving the NaN entries as they are:

import json
import numpy as np

def parser2(d):
    # NaN is the only value that is not equal to itself,
    # so d != d is True exactly when d is NaN
    if d != d:
        return np.nan
    else:
        return json.loads(d)

df.wi17 = df.wi17.apply(parser2)
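
As an aside, the d != d check works because NaN is the only value that is not equal to itself. An equivalent sketch using pandas' scalar pd.isna() check (parser2_isna is just an illustrative name), assuming d is always a string or NaN as it is here:

import json
import numpy as np
import pandas as pd

def parser2_isna(d):
    # pd.isna(d) is True for scalar NaN/None, mirroring the d != d trick
    return np.nan if pd.isna(d) else json.loads(d)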

I am looking for an elegant solution to explode these dicts and group them by unique "mac", which in turn is grouped by the unique "RSN" from the original df.

It should look something like this:

... RSN         .... mac        rssi  txrate  max_txrate  txbytes  rxbytes   nxn  ...
... RSNFDXXXKDF ... 2A:xxxx:sd   30   34      50          2323     34323     1x1  ...
... RSNFDXXXKDF ... 2A:xxxx:sd   50   84      70          20       2334343   1x1  ...
... RSNFDXXXKDF ... 3B:yyyy:sd   45   48      47          40       2334      2x2  ...
... RSNFDXXXKDF ... NaN         NaN   NaN     NaN         NaN      NaN       NaN  ...
... ADKNCCJXKDF ... AA:yyyy:sd   45   48      47          40       2334      2x2  ...

Any suggestions?

Tags: python, pandas

Solution

Let's use df.explode() on the wi17 column's data, then df.apply() + pd.Series(), and finally pd.concat():

# one exploded row per dict in each wi17 list
df2 = df.explode('wi17')
# expand each dict into columns, then glue them back onto the remaining columns
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                 axis=1).reset_index()

Here, we use df2.apply(lambda x: pd.Series(x.wi17), axis=1) on df2, which has already been exploded so that each row of df2 carries a single dict from the original list of dicts. The lambda function together with pd.Series expands each dict's keys and values into column labels and column values.
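
For intuition: applied to a single dict, pd.Series turns the keys into the index and the values into the data, which is where the new column labels come from. A small standalone illustration (not part of the original answer):

import pandas as pd

d = {"mac": "20:47xx:1F", "rssi": -60, "nxn": "1x1"}
print(pd.Series(d))
# mac     20:47xx:1F
# rssi           -60
# nxn            1x1
# dtype: object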

Demo run

Test data construction

import pandas as pd

data = {'name': ['DE1', 'DE2', 'DE3'], 'RSN': ['RSNJCO4032', 'RSNJCO4033', 'RSNJCO4034']}
df = pd.DataFrame(data)
df['wi17'] = [
    '[{"mac":"20:47xx:1F","rssi":-60,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"E8xx:A0","rssi":-57,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]',
    '[{"mac":"40:17xx:1F","rssi":-62,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"F8xx:B0","rssi":-58,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]',
    '[{"mac":"60:07xx:1F","rssi":-64,"txrate":72.0,"max_txrate":72.0,"txbytes":0,"rxbytes":0,"nxn":"1x1"},{"mac":"A8xx:C0","rssi":-61,"txrate":72.0,"max_txrate":72.0,"txbytes":1414810891,"rxbytes":808725830,"nxn":"1x1"}]',
]

import json
import numpy as np

def parser2(d):
    if d != d:          # True only when d is NaN
        return np.nan
    else:
        return json.loads(d)

df.wi17 = df.wi17.apply(parser2)

print(df)

  name         RSN                                                                                                                                                                                                                                                wi17
0  DE1  RSNJCO4032  [{'mac': '20:47xx:1F', 'rssi': -60, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'E8xx:A0', 'rssi': -57, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
1  DE2  RSNJCO4033  [{'mac': '40:17xx:1F', 'rssi': -62, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'F8xx:B0', 'rssi': -58, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]
2  DE3  RSNJCO4034  [{'mac': '60:07xx:1F', 'rssi': -64, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 0, 'rxbytes': 0, 'nxn': '1x1'}, {'mac': 'A8xx:C0', 'rssi': -61, 'txrate': 72.0, 'max_txrate': 72.0, 'txbytes': 1414810891, 'rxbytes': 808725830, 'nxn': '1x1'}]

Run the new code

df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                 axis=1).reset_index()
print(df3)

Output:

  name         RSN         mac  rssi  txrate  max_txrate     txbytes    rxbytes  nxn
0  DE1  RSNJCO4032  20:47xx:1F   -60    72.0        72.0           0          0  1x1
1  DE1  RSNJCO4032     E8xx:A0   -57    72.0        72.0  1414810891  808725830  1x1
2  DE2  RSNJCO4033  40:17xx:1F   -62    72.0        72.0           0          0  1x1
3  DE2  RSNJCO4033     F8xx:B0   -58    72.0        72.0  1414810891  808725830  1x1
4  DE3  RSNJCO4034  60:07xx:1F   -64    72.0        72.0           0          0  1x1
5  DE3  RSNJCO4034     A8xx:C0   -61    72.0        72.0  1414810891  808725830  1x1
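
From here, the grouping the OP asked for (unique "mac" within each unique "RSN") can be sketched with a plain groupby; the aggregation functions below are placeholders, assuming mean signal strength and summed byte counters are wanted:

# one group per (RSN, mac) pair; dropna=False (pandas >= 1.1) keeps rows whose wi17 was NaN
grouped = (df3.groupby(['RSN', 'mac'], dropna=False)
              .agg({'rssi': 'mean', 'txbytes': 'sum', 'rxbytes': 'sum'})
              .reset_index())
print(grouped)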

Edit

For better system performance (execution time), you can try changing the .apply() call to list(map(...)) as follows:

df2 = df.explode('wi17')
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
                 axis=1).reset_index()

Edit 2

Further fine-tuning of system performance (execution time). Benchmarking shows that using pd.json_normalize() to expand the JSON structures into a new pandas DataFrame, which is then merged back into the original DataFrame, achieves a more than 20x faster execution time:

df2 = df.explode('wi17')
df2['wi17'] = df2['wi17'].fillna({i: {} for i in df2.index})  # as suggested by @TwerkingPanda to handle NaN entries. 
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True), 
                 pd.json_normalize(df2['wi17'])],
                 axis=1).reset_index()
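
The dict comprehension inside fillna() is needed because Series.fillna() interprets a dict argument as a mapping from index labels to fill values, not as a single fill value. A tiny standalone sketch of that behaviour:

import numpy as np
import pandas as pd

s = pd.Series([{"mac": "E8xx:A0"}, np.nan])
# {index_label: fill_value}: the NaN at label 1 becomes an empty dict,
# which pd.json_normalize() then turns into an all-NaN row
print(s.fillna({i: {} for i in s.index}))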

Benchmarking the system performance with 30,000 rows (each row holding 2 JSON objects, so 60,000 JSON objects in total):

df1 = pd.concat([df] * 10000, ignore_index=True)
df2 = df1.explode('wi17')
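
A quick sanity check on the sizes, assuming the three-row test frame from above:

print(len(df1), len(df2))   # 30000 rows before explode, 60000 after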

(1) Benchmark of the original version using .apply() with pd.Series()

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 df2.apply(lambda x: pd.Series(x.wi17), axis=1)],
                 axis=1).reset_index()
21.9 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(2) Benchmark of the revised version using list(map(...)) with pd.Series() and pd.DataFrame()

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(list(map(pd.Series, df2['wi17'])), index=df2.index)],
                 axis=1).reset_index()
20.6 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(3) Benchmark of the revised version using pd.json_normalize()

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1).reset_index(drop=True), 
                 pd.json_normalize(df2['wi17'])],
                 axis=1).reset_index()
999 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit 3

This version improves system performance (execution time) even further, while also letting the OP keep the original Series index to preserve data integrity.

Since the JSON objects are one level deep with no nesting, we can use the more efficient DataFrame constructor pd.DataFrame() to expand the JSON fields into columns, as follows:

df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
                 axis=1)
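
Given a list of dicts, pd.DataFrame() builds the columns directly from the dict keys in a single constructor call, which is what makes this the fastest variant. A quick standalone illustration (note that, as in Edit 2, NaN entries would presumably still need the fillna() treatment before .to_list()):

import pandas as pd

records = [{"mac": "20:47xx:1F", "rssi": -60},
           {"mac": "E8xx:A0", "rssi": -57}]
# index=[37, 37] mimics the duplicated index that df.explode() produces
print(pd.DataFrame(records, index=[37, 37]))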

Benchmark of the version using pd.DataFrame()

%%timeit
df3 = pd.concat([df2.drop('wi17', axis=1), 
                 pd.DataFrame(df2['wi17'].to_list(), index=df2.index)],
                 axis=1)
116 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This version is a further 8x+ improvement over the pd.json_normalize() version, and more than 180x faster than the df.apply() + pd.Series() version. Moreover, the index= parameter of pd.DataFrame() gives us the flexibility to keep the original DataFrame's index on the expanded columns.

