Indexing a DataFrame whose MultiIndex is shown as tuples, and subtracting the column named by the third element of the index

Problem description

I have a DataFrame such as:

               a     b     c
    (0,0,a)   1.0   2.0   3.0
    (0,0,b)   4.0   5.0   6.0
    (0,0,c)   7.0   8.0   9.0
    (0,1,a)  10.0  11.0  12.0
    (0,1,b)  13.0  14.0  15.0
    (0,1,c)  16.0  17.0  18.0
    (1,0,a)  19.0  20.0  21.0
    (1,0,b)  22.0  23.0  24.0
    (1,0,c)  26.0  27.0  28.0

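It is a df with a 3-level MultiIndex, shown here as tuples. I want to add a new column that holds, for each row, the sum of the row minus the value in the column whose name equals the third element of the index tuple, for example: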

               a     b     c   new
    (0,0,a)   1.0   2.0   3.0   5.0
    (0,0,b)   4.0   5.0   6.0  10.0
    (0,0,c)   7.0   8.0   9.0  15.0
    (0,1,a)  10.0  11.0  12.0  23.0
    (0,1,b)  13.0  14.0  15.0  28.0
    (0,1,c)  16.0  17.0  18.0  33.0
    (1,0,a)  19.0  20.0  21.0  41.0
    (1,0,b)  22.0  23.0  24.0  46.0
    (1,0,c)  26.0  27.0  28.0  53.0

I had the same df with a single index (the tuple stored in an 'index' column), and there this worked:

    df['new'] = df.apply(lambda row: sum(row[1:]) - row[row['index'][2]], 1)

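But now I need to change some columns, so I have to move to a MultiIndex. What can I do? Should I change back to a single index, and if so, how? Or can I keep the MultiIndex on my df?

Thanks

Tags: python-3.x, pandas, dataframe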

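Solution


Use sum for each row, then subtract the values picked out by DataFrame.lookup, where the column to look up in each row is the third element of the index tuple, obtained with str[2]: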

print(df.columns) 
MultiIndex([('a',),
            ('b',),
            ('c',)],
           )

# convert the one-level MultiIndex columns to a plain Index
df.columns = df.columns.get_level_values(0)
print(df.columns) 
Index(['a', 'b', 'c'], dtype='object')

df['new'] = df.sum(axis=1) - df.lookup(df.index, df.index.str[2])
print (df)
              a     b     c   new
(0, 0, a)   1.0   2.0   3.0   5.0
(0, 0, b)   4.0   5.0   6.0  10.0
(0, 0, c)   7.0   8.0   9.0  15.0
(0, 1, a)  10.0  11.0  12.0  23.0
(0, 1, b)  13.0  14.0  15.0  28.0
(0, 1, c)  16.0  17.0  18.0  33.0
(1, 0, a)  19.0  20.0  21.0  41.0
(1, 0, b)  22.0  23.0  24.0  46.0
(1, 0, c)  26.0  27.0  28.0  53.0
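
DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the line above fails on current versions. A minimal sketch of the same calculation with plain NumPy indexing follows. The helper name sum_minus_labelled is hypothetical, and the sketch assumes that the column labels are unique and that every third element matches a column. It reads the third element with a list comprehension, which works both for a flat Index of tuples and for a real MultiIndex, so the MultiIndex can stay on the df.

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question.
tuples = [(i, j, k) for i, j in [(0, 0), (0, 1), (1, 0)] for k in "abc"]
data = [
    [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0], [13.0, 14.0, 15.0], [16.0, 17.0, 18.0],
    [19.0, 20.0, 21.0], [22.0, 23.0, 24.0], [26.0, 27.0, 28.0],
]
# tupleize_cols=False keeps the tuples in a flat Index instead of a MultiIndex.
df = pd.DataFrame(data, index=pd.Index(tuples, tupleize_cols=False),
                  columns=list("abc"))


def sum_minus_labelled(frame):
    """Row sum minus the value in the column named by the index's 3rd element.

    Hypothetical helper; assumes unique column labels and that every
    third element is an existing column.
    """
    # Iterating a MultiIndex or a flat Index of tuples yields tuples.
    labels = [key[2] for key in frame.index]
    col_pos = frame.columns.get_indexer(labels)  # -1 marks a missing label
    if (col_pos == -1).any():
        raise KeyError("some third elements do not match any column")
    row_pos = np.arange(len(frame))
    picked = frame.to_numpy()[row_pos, col_pos]  # one value per row
    return frame.sum(axis=1) - picked


# Same data with a real 3-level MultiIndex: the helper works unchanged.
df_multi = df.set_axis(pd.MultiIndex.from_tuples(tuples), axis=0)

out = df.assign(new=sum_minus_labelled(df))
out_multi = df_multi.assign(new=sum_minus_labelled(df_multi))
print(out)
```

The helper is called before the new column is assigned, so the row sum covers only a, b and c.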

Edit: another possible problem is that the third element of some tuples does not match any column:

print(df) 
              a     b     c
(0, 0, d)   1.0   2.0   3.0 <- d matches no column
(0, 0, e)   4.0   5.0   6.0 <- e matches no column
(0, 0, c)   7.0   8.0   9.0
(0, 1, a)  10.0  11.0  12.0
(0, 1, b)  13.0  14.0  15.0
(0, 1, c)  16.0  17.0  18.0
(1, 0, a)  19.0  20.0  21.0
(1, 0, b)  22.0  23.0  24.0
(1, 0, c)  26.0  27.0  28.0

import numpy as np

# get the third element of each index tuple
s = df.index.str[2]
# dict of the labels that match no column, each mapped to NaN
new = dict.fromkeys(np.setdiff1d(s, df.columns), np.nan)
print (new)
{'d': nan, 'e': nan}

# add the missing labels as all-NaN columns so that lookup can find them
df1 = df.assign(**new)
print (df1)
              a     b     c   d   e
(0, 0, d)   1.0   2.0   3.0 NaN NaN
(0, 0, e)   4.0   5.0   6.0 NaN NaN
(0, 0, c)   7.0   8.0   9.0 NaN NaN
(0, 1, a)  10.0  11.0  12.0 NaN NaN
(0, 1, b)  13.0  14.0  15.0 NaN NaN
(0, 1, c)  16.0  17.0  18.0 NaN NaN
(1, 0, a)  19.0  20.0  21.0 NaN NaN
(1, 0, b)  22.0  23.0  24.0 NaN NaN
(1, 0, c)  26.0  27.0  28.0 NaN NaN


# use df1 for both the sum and the lookup (sum skips the NaN columns)
df['new'] = df1.sum(axis=1) - df1.lookup(df1.index, s)
print (df)
              a     b     c   new
(0, 0, d)   1.0   2.0   3.0   NaN
(0, 0, e)   4.0   5.0   6.0   NaN
(0, 0, c)   7.0   8.0   9.0  15.0
(0, 1, a)  10.0  11.0  12.0  23.0
(0, 1, b)  13.0  14.0  15.0  28.0
(0, 1, c)  16.0  17.0  18.0  33.0
(1, 0, a)  19.0  20.0  21.0  41.0
(1, 0, b)  22.0  23.0  24.0  46.0
(1, 0, c)  26.0  27.0  28.0  53.0
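
On pandas 2.0 and later, where DataFrame.lookup no longer exists, the non-matching case can be handled without adding helper columns. The sketch below is one possible approach, not the answer's own method: Index.get_indexer returns -1 for labels that match no column, and those rows are left as NaN. The helper name sum_minus_labelled_or_nan is hypothetical, and unique column labels are assumed.

```python
import numpy as np
import pandas as pd

# Example data from the edit: 'd' and 'e' do not match any column.
tuples = [(0, 0, "d"), (0, 0, "e"), (0, 0, "c"),
          (0, 1, "a"), (0, 1, "b"), (0, 1, "c"),
          (1, 0, "a"), (1, 0, "b"), (1, 0, "c")]
data = [
    [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0], [13.0, 14.0, 15.0], [16.0, 17.0, 18.0],
    [19.0, 20.0, 21.0], [22.0, 23.0, 24.0], [26.0, 27.0, 28.0],
]
df = pd.DataFrame(data, index=pd.Index(tuples, tupleize_cols=False),
                  columns=list("abc"))


def sum_minus_labelled_or_nan(frame):
    """Like the lookup approach, but NaN where the label matches no column.

    Hypothetical helper; assumes unique column labels.
    """
    labels = [key[2] for key in frame.index]
    col_pos = frame.columns.get_indexer(labels)  # -1 for unmatched labels
    found = col_pos != -1
    values = frame.to_numpy(dtype=float)
    picked = np.full(len(frame), np.nan)         # default: nothing to subtract
    picked[found] = values[np.flatnonzero(found), col_pos[found]]
    return frame.sum(axis=1) - picked            # NaN propagates for unmatched rows


result = sum_minus_labelled_or_nan(df)
out = df.assign(new=result)
print(out)
```

Rows whose label is missing end up as NaN because subtracting NaN from the row sum yields NaN, which mirrors the output shown above.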
