Combining different columns

Question

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo", None, None, "doggo", None, None],
                   'floofer': ["floofer", None, None, "floofer", None, None, None],
                   'pupper': [None, None, "pupper", None, None, None, None],
                   'puppo': [None, None, None, None, None, None, "puppo"]})

I want to combine the last 4 columns and produce:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                    'vote':[5,4,5,1,10,1,9],
                    'categories': ["floofer","doggo","pupper","floofer","doggo",None,"puppo"]})

Any guidance is appreciated.

Tags: python, pandas, dataframe

Answer


If each row has at most one non-None value across the category columns, this works:

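# Category columns to combine
cols = ['doggo', 'floofer', 'pupper', 'puppo']
# All remaining columns (note: difference() returns them sorted)
cols1 = df.columns.difference(cols)
# Forward-fill across each row, keep the last column, and join it back
df2 = df[cols1].join(df[cols].ffill(axis=1).iloc[:, -1].rename('Categories'))
print(df2)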
   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo
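
Since this approach relies on every row having at most one category set, it may be worth checking that assumption before using it. A minimal sketch of such a check follows; the names `counts` and `single_valued` are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo", None, None, "doggo", None, None],
                   'floofer': ["floofer", None, None, "floofer", None, None, None],
                   'pupper': [None, None, "pupper", None, None, None, None],
                   'puppo': [None, None, None, None, None, None, "puppo"]})
cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Number of non-missing category values in each row
counts = df[cols].notna().sum(axis=1)

# True when no row has more than one category set
single_valued = bool(counts.le(1).all())
print(single_valued)
```

If the check prints False, forward-filling would silently keep only the right-most value in the affected rows, so one of the multi-value solutions further down is probably the better fit.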

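Explanation

First select only the category columns and forward-fill the missing values along each row, so that the expected value ends up in the last column: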

print (df[cols].ffill(axis=1))
  doggo  floofer   pupper    puppo
0   None  floofer  floofer  floofer
1  doggo    doggo    doggo    doggo
2   None     None   pupper   pupper
3   None  floofer  floofer  floofer
4  doggo    doggo    doggo    doggo
5   None     None     None     None
6   None     None     None    puppo

Then select the last column by position:

print (df[cols].ffill(axis=1).iloc[:, -1])
0    floofer
1      doggo
2     pupper
3    floofer
4      doggo
5       None
6      puppo
Name: puppo, dtype: object
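
The two steps above can be wrapped into a small function. This is only a sketch: `last_valid_per_row` is a hypothetical helper name, and `drop(columns=cols)` is used in place of `columns.difference` as an assumption that the original column order should be preserved.

```python
import pandas as pd

def last_valid_per_row(frame, cols, name='Categories'):
    """Hypothetical helper: forward-fill across `cols`, keep the last column."""
    filled = frame[cols].ffill(axis=1)       # step 1: push values to the right
    return filled.iloc[:, -1].rename(name)   # step 2: take the last column by position

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo", None, None, "doggo", None, None],
                   'floofer': ["floofer", None, None, "floofer", None, None, None],
                   'pupper': [None, None, "pupper", None, None, None, None],
                   'puppo': [None, None, None, None, None, None, "puppo"]})
cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Drop the category columns and attach the combined one
df2 = df.drop(columns=cols).join(last_valid_per_row(df, cols))
print(df2)
```

Unlike `df.columns.difference(cols)`, which returns the remaining column names sorted, `drop` leaves them in their existing order. For this data both give `id`, `vote`.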

If a row can hold multiple values, here is a solution in which the output is built from the names of the category columns:

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo1", None, "doggo2", "doggo3", None, None],
                   'floofer': ["floofer1", None, None, "floofer2", None, None, None],
                   'pupper': [None, None, "pupper1", None, None, None, None],
                   'puppo': ["puppo1", None, None, None, None, None, "puppo2"]})
print(df)
   id  vote   doggo   floofer   pupper   puppo
0   1     5    None  floofer1     None  puppo1
1   2     4  doggo1      None     None    None
2   3     5    None      None  pupper1    None
3   4     1  doggo2  floofer2     None    None
4   5    10  doggo3      None     None    None
5   6     1    None      None     None    None
6   7     9    None      None     None  puppo2


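import numpy as np

# Boolean mask of non-missing cells, "multiplied" by the column names:
# True keeps the name, False yields '', and the row sum concatenates them
s = (df[cols].notnull()
            .dot(pd.Index(cols) + ', ')
            .str.strip(', ')
            .rename('Categories')
            .replace('', np.nan)
            )
# Assign to a new name so df keeps its category columns for the next example
df2 = df[cols1].join(s)
print(df2)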
   id  vote      Categories
0   1     5  floofer, puppo
1   2     4           doggo
2   3     5          pupper
3   4     1  doggo, floofer
4   5    10           doggo
5   6     1             NaN
6   7     9           puppo
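
The `dot` call works because multiplying a string by True returns the string, multiplying it by False returns an empty string, and the "sum" of strings is concatenation. The following plain-Python sketch reproduces that for a single row (the row with id 4); `row_mask` and `pieces` are illustrative names only:

```python
cols = ['doggo', 'floofer', 'pupper', 'puppo']
row_mask = [True, True, False, False]  # which category cells are non-missing

# True * 'doggo, ' -> 'doggo, '   and   False * 'pupper, ' -> ''
pieces = [flag * (name + ', ') for flag, name in zip(row_mask, cols)]

# Concatenate, then strip the trailing separator characters
joined = ''.join(pieces).strip(', ')
print(joined)  # doggo, floofer
```

Note that `strip(', ')` removes any leading or trailing commas and spaces rather than the exact substring `', '`, which is fine here because the column names neither start nor end with those characters.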

Another solution, for when the expected output should come from the cell values rather than the column names:

s = pd.Series(df[cols].add(', ').fillna('').values.sum(axis=1), 
                  index=df.index, name='Categories').str.strip(', ')
df = df[cols1].join(s)
print (df)
   id  vote        Categories
0   1     5  floofer1, puppo1
1   2     4            doggo1
2   3     5           pupper1
3   4     1  doggo2, floofer2
4   5    10            doggo3
5   6     1                  
6   7     9            puppo2
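
Note that the row with no categories ends up as an empty string here rather than NaN. If NaN is preferred, one possible alternative (not from the original answer, and likely slower on large frames because it uses a row-wise `apply`) is to join the non-missing values of each row and then replace empty strings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'vote': [5, 4, 5, 1, 10, 1, 9],
                   'doggo': [None, "doggo1", None, "doggo2", "doggo3", None, None],
                   'floofer': ["floofer1", None, None, "floofer2", None, None, None],
                   'pupper': [None, None, "pupper1", None, None, None, None],
                   'puppo': ["puppo1", None, None, None, None, None, "puppo2"]})
cols = ['doggo', 'floofer', 'pupper', 'puppo']

s = (df[cols]
     .apply(lambda row: ', '.join(row.dropna()), axis=1)  # join non-missing values per row
     .replace('', np.nan)                                 # rows with no category become NaN
     .rename('Categories'))

df2 = df.drop(columns=cols).join(s)
print(df2)
```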
