Merging DataFrame rows while dropping NaN values

Problem description

I have this DataFrame:

(Image: the DataFrame I currently get)

I got it by writing this code:

import pandas as pd  # note: DataFrame.append, used below, was removed in pandas 2.0

df = pd.DataFrame(columns = ['Step Number' , 'CAN_Send' , 'CAN_Values'])
can = [{'ta1': ('atpcinfolamp_co', '3')}, {'ta2': ('xyz_signal', '4')}, {'ta2': ('abc_signal', '5')}]
keys = []
for can_signals in can:
    for key,value in can_signals.items():
        if key not in keys:
            keys.append(key)
            df = df.append({'Step Number' : key} , ignore_index = True)
            df = df.append({'CAN_Send' : value[0]} , ignore_index = True)
            df = df.append({'CAN_Values' : value[1]} , ignore_index = True)
        else:
            df = df.append({'CAN_Send' : value[0]} , ignore_index = True)
            df = df.append({'CAN_Values' : value[1]} , ignore_index = True)
df

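I need a DataFrame that looks like this:

(Image: the desired output DataFrame)

I can't figure out how to merge the rows while dropping the NaN values at the same time.

I tried something like this: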

df = df.groupby('Step Number')[['CAN_Send' , 'CAN_Values']]

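But this doesn't work: there is no numeric operation that would turn the groupby object back into a frame, because my values are strings, and every way of dropping NaN that I tried ends up wiping out my whole DataFrame.

Any help with this is much appreciated!

Thanks in advance!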

Tags: python, pandas, dataframe, data-science, data-manipulation

Solution


You can first fill the missing values of Step Number with .ffill(). Then groupby() on Step Number and aggregate() the remaining 2 columns with dropna(), as follows:

df['Step Number'] = df['Step Number'].ffill()

df_out = (df.groupby('Step Number', as_index=False)
            .agg(lambda x: x.dropna(how='all'))
            .apply(pd.Series.explode)
         )

Result:

print(df_out)

  Step Number         CAN_Send CAN_Values
0         ta1  atpcinfolamp_co          3
1         ta2       xyz_signal          4
1         ta2       abc_signal          5
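
Depending on the pandas version, agg() may reject a function that returns more than one value per group. I have not checked this against every release, so treat that as an assumption. A variant that sidesteps the question is to aggregate each group into a plain Python list and then explode both columns together. This is only a sketch: it assumes pandas 1.3 or later (for multi-column explode), and the staircase frame is rebuilt by hand here for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hand-built copy of the staircase frame from the question (illustrative data).
df = pd.DataFrame({
    'Step Number': ['ta1', nan, nan, 'ta2', nan, nan, nan, nan],
    'CAN_Send':    [nan, 'atpcinfolamp_co', nan, nan, 'xyz_signal', nan, 'abc_signal', nan],
    'CAN_Values':  [nan, nan, '3', nan, nan, '4', nan, '5'],
})

# Give every row the step it belongs to.
df['Step Number'] = df['Step Number'].ffill()

df_out = (df.groupby('Step Number', as_index=False)
            .agg(lambda s: s.dropna().tolist())    # one list of non-NaN values per group and column
            .explode(['CAN_Send', 'CAN_Values'])   # unpack both list columns in lockstep
            .reset_index(drop=True))
print(df_out)
```

The multi-column explode needs the two lists in each row to have the same length, which holds here because every CAN_Send entry comes with exactly one CAN_Values entry.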

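Edit

For your new dataset, you can use the following code. It also works for the previous dataset and, in general, for the structure created by your program logic.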

df['Step Number'] = df['Step Number'].ffill()
df['CAN_Send'] = df['CAN_Send'].ffill(limit=1)
df['CAN_Values'] = df['CAN_Values'].bfill(limit=1)
df = df.dropna().drop_duplicates()

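Demo

Data preparation:

Your code, slightly tweaked so that your logic works. Otherwise, if a key appears more than once with other keys in between (e.g. the keys arrive in the sequence ta1, ta2, ta1), your existing logic fails to add a new Step Number row for that key (e.g. the last ta1), because it is already in the keys list.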

df = pd.DataFrame(columns = ['Step Number' , 'CAN_Send' , 'CAN_Values'])
#can = [{'ta1': ('atpcinfolamp_co', '3')}, {'ta2': ('xyz_signal', '4')}, {'ta2': ('abc_signal', '5')}]
can = [{'ta1': ('atpcinfolamp_co', '3')}, {'ta1': ('hdcinfolamp_co', '5')}, {'ta2': ('xyz_signal', '4')}, {'ta2': ('abc_signal', '5')}] 
#keys = []
last_key = ''
for can_signals in can:
    for key,value in can_signals.items():
        if key != last_key:
#            keys.append(key)
            last_key = key
            df = df.append({'Step Number' : key} , ignore_index = True)
#            df = df.append({'CAN_Send' : value[0]} , ignore_index = True)
#            df = df.append({'CAN_Values' : value[1]} , ignore_index = True)
#        else:
#            df = df.append({'CAN_Send' : value[0]} , ignore_index = True)
#            df = df.append({'CAN_Values' : value[1]} , ignore_index = True)
        df = df.append({'CAN_Send' : value[0]} , ignore_index = True)
        df = df.append({'CAN_Values' : value[1]} , ignore_index = True)

df

  Step Number         CAN_Send CAN_Values
0         ta1              NaN        NaN
1         NaN  atpcinfolamp_co        NaN
2         NaN              NaN          3
3         NaN   hdcinfolamp_co        NaN
4         NaN              NaN          5
5         ta2              NaN        NaN
6         NaN       xyz_signal        NaN
7         NaN              NaN          4
8         NaN       abc_signal        NaN
9         NaN              NaN          5
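
On pandas 2.0 and later, DataFrame.append no longer exists, so the loop above will not run as written. The same staircase frame can be produced by collecting the row dicts in a list and building the DataFrame once at the end. A minimal sketch, where build_staircase is a hypothetical helper name of mine and not part of the original code:

```python
import pandas as pd

can = [{'ta1': ('atpcinfolamp_co', '3')}, {'ta1': ('hdcinfolamp_co', '5')},
       {'ta2': ('xyz_signal', '4')}, {'ta2': ('abc_signal', '5')}]

def build_staircase(can):
    """Hypothetical helper: same row layout as the loop above, without DataFrame.append."""
    rows = []
    last_key = None
    for can_signals in can:
        for key, value in can_signals.items():
            if key != last_key:
                # A new step starts whenever the key differs from the previous one.
                last_key = key
                rows.append({'Step Number': key})
            rows.append({'CAN_Send': value[0]})
            rows.append({'CAN_Values': value[1]})
    # Keys missing from a row dict become NaN in that column.
    return pd.DataFrame(rows, columns=['Step Number', 'CAN_Send', 'CAN_Values'])

df = build_staircase(can)
print(df)
```

Building the frame once from a list also avoids copying the whole DataFrame on every appended row.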

Run the new code:

df['Step Number'] = df['Step Number'].ffill()
df['CAN_Send'] = df['CAN_Send'].ffill(limit=1)
df['CAN_Values'] = df['CAN_Values'].bfill(limit=1)
df = df.dropna().drop_duplicates()

Result:

print(df)

  Step Number         CAN_Send CAN_Values
1         ta1  atpcinfolamp_co          3
3         ta1   hdcinfolamp_co          5
6         ta2       xyz_signal          4
8         ta2       abc_signal          5
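
To see why the limit=1 fills pair things up correctly, it can help to look at the frame just before dropna(). The sketch below rebuilds the demo frame by hand (illustrative data copied from the printout above) and keeps the intermediate state in a separate variable:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'Step Number': ['ta1', nan, nan, nan, nan, 'ta2', nan, nan, nan, nan],
    'CAN_Send':    [nan, 'atpcinfolamp_co', nan, 'hdcinfolamp_co', nan,
                    nan, 'xyz_signal', nan, 'abc_signal', nan],
    'CAN_Values':  [nan, nan, '3', nan, '5', nan, nan, '4', nan, '5'],
})

filled = df.copy()
filled['Step Number'] = filled['Step Number'].ffill()
# Push each signal name one row down, onto the row that holds its value.
filled['CAN_Send'] = filled['CAN_Send'].ffill(limit=1)
# Pull each value one row up, onto the row that holds its signal name.
filled['CAN_Values'] = filled['CAN_Values'].bfill(limit=1)
print(filled)   # each signal/value pair now appears as two identical complete rows

# The step-header rows still contain NaN and are dropped; the duplicate of each pair is removed.
result = filled.dropna().drop_duplicates()
print(result)
```

One consequence, as far as I can tell: drop_duplicates() compares whole rows, so a step that genuinely sends the same signal with the same value twice would collapse into a single row.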

Edit 2

Actually, given the structure of your source data can, you can get to the desired DataFrame directly in a simpler way, as follows:

can = [{'ta1': ('atpcinfolamp_co', '3')}, {'ta1': ('hdcinfolamp_co', '5')}, {'ta2': ('xyz_signal', '4')}, {'ta2': ('abc_signal', '5')}] 

data = {
    'Step Number': [list(x.keys())[0] for x in can],
    'CAN_Send': [list(x.values())[0][0] for x in can],
    'CAN_Values': [list(x.values())[0][1] for x in can],
}
df = pd.DataFrame(data)
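
The same idea can also be written as a single pass that turns every (key, (signal, value)) item into one row tuple. This is just a sketch of an equivalent formulation; unlike the version above it would also pick up dicts that hold more than one key, which is an assumption about data the question does not show.

```python
import pandas as pd

can = [{'ta1': ('atpcinfolamp_co', '3')}, {'ta1': ('hdcinfolamp_co', '5')},
       {'ta2': ('xyz_signal', '4')}, {'ta2': ('abc_signal', '5')}]

# One (step, signal, value) tuple per dict item, in the original order.
rows = [(key, signal, value)
        for can_signals in can
        for key, (signal, value) in can_signals.items()]

df = pd.DataFrame(rows, columns=['Step Number', 'CAN_Send', 'CAN_Values'])
print(df)
```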
