python pandas: fill NaN with the maximum value or leave blank

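Problem description

I have a problem with a large DataFrame. Here is a small snippet. I want to fill the last column, E, with the maximum value if there are any values, and otherwise leave it empty. This is the data: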

import numpy as np

d = {'A': [4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074,
   4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074, 4000074],
 'B': ['SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746','SP000796746',
   'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746',
   'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746', 'SP000796746'],
 'C': [201926, 201926, 201926, 201926, 201926, 201926, 201909,201909, 201909, 201909, 201909,
   201909, 201933, 201933, 201933, 201933, 201933, 201933],
 'D': [-1, 0, 1, 2, 3, 4, -1, 0, 1, 2, 3, 4, -1, 0, 1, 2, 3, 4],
 # 18 entries, matching the other columns and the table below
 'E': [np.nan, 1000, 1000, np.nan, np.nan, np.nan,
   np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
   np.nan, np.nan, np.nan, 3000, 3000, np.nan]}

It looks like this:

       A            B       C     D       E
0   4000074  SP000796746  201926 -1     NaN
1   4000074  SP000796746  201926  0  1000.0
2   4000074  SP000796746  201926  1  1000.0
3   4000074  SP000796746  201926  2     NaN
4   4000074  SP000796746  201926  3     NaN
5   4000074  SP000796746  201926  4     NaN
6   4000074  SP000796746  201909 -1     NaN
7   4000074  SP000796746  201909  0     NaN
8   4000074  SP000796746  201909  1     NaN
9   4000074  SP000796746  201909  2     NaN
10  4000074  SP000796746  201909  3     NaN
11  4000074  SP000796746  201909  4     NaN
12  4000074  SP000796746  201933 -1     NaN
13  4000074  SP000796746  201933  0     NaN
14  4000074  SP000796746  201933  1     NaN
15  4000074  SP000796746  201933  2  3000.0
16  4000074  SP000796746  201933  3  3000.0
17  4000074  SP000796746  201933  4     NaN

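My goal is to fill column E with the maximum value whenever there is any value within a block where D runs from -1 to 4. If there is none, the block should stay empty. So it should look like this: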

      A            B       C      D       E
0   4000074  SP000796746  201926 -1  1000.0
1   4000074  SP000796746  201926  0  1000.0
2   4000074  SP000796746  201926  1  1000.0
3   4000074  SP000796746  201926  2  1000.0
4   4000074  SP000796746  201926  3  1000.0
5   4000074  SP000796746  201926  4  1000.0
6   4000074  SP000796746  201909 -1     NaN
7   4000074  SP000796746  201909  0     NaN
8   4000074  SP000796746  201909  1     NaN
9   4000074  SP000796746  201909  2     NaN
10  4000074  SP000796746  201909  3     NaN  
11  4000074  SP000796746  201909  4     NaN
12  4000074  SP000796746  201933 -1  3000.0
13  4000074  SP000796746  201933  0  3000.0
14  4000074  SP000796746  201933  1  3000.0
15  4000074  SP000796746  201933  2  3000.0
16  4000074  SP000796746  201933  3  3000.0
17  4000074  SP000796746  201933  4  3000.0

My code looks like this:

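import pandas as pd

df = pd.DataFrame(d)

# Row labels where a new block starts (D == -1)
indx = df[df['D'] == -1].index.values

for i, j in zip(indx[:-1], indx[1:]):
    # .loc slicing is inclusive, so i:j-1 covers exactly one block
    df.loc[i:j-1, 'E'] = df.loc[i:j-1, 'E'].max()

    # The last block has no following -1, so handle it separately
    if j == indx[-1]:
        df.loc[j:, 'E'] = df.loc[j:, 'E'].max()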

It does not work for very large DataFrames... Maybe someone has an idea for different code, or a correction to mine.

Thanks!!

Edit: with 0 as the fill value instead of the maximum, it should look like this:

      A            B       C      D       E
0   4000074  SP000796746  201926 -1       0
1   4000074  SP000796746  201926  0  1000.0
2   4000074  SP000796746  201926  1  1000.0
3   4000074  SP000796746  201926  2       0
4   4000074  SP000796746  201926  3       0
5   4000074  SP000796746  201926  4       0
6   4000074  SP000796746  201909 -1     NaN
7   4000074  SP000796746  201909  0     NaN
8   4000074  SP000796746  201909  1     NaN
9   4000074  SP000796746  201909  2     NaN
10  4000074  SP000796746  201909  3     NaN  
11  4000074  SP000796746  201909  4     NaN
12  4000074  SP000796746  201933 -1       0
13  4000074  SP000796746  201933  0       0
14  4000074  SP000796746  201933  1       0
15  4000074  SP000796746  201933  2  3000.0
16  4000074  SP000796746  201933  3  3000.0
17  4000074  SP000796746  201933  4       0

Tags: python, pandas, dataframe

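Solution

You can use groupby.transform with max, where the groups are created with cumsum at each new -1 in column D. Then fillna the original column with the result.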

df['E'] = df['E'].fillna(df['E'].groupby(df['D'].eq(-1).cumsum()).transform('max'))
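
To make the one-liner above easier to follow, here is a minimal self-contained sketch that splits it into three steps on a shortened, made-up sample. The names `sample`, `group_key`, and `group_max` are illustrative and not part of the original answer, and the sketch assumes every block starts with a row where D equals -1.

```python
import numpy as np
import pandas as pd

# Shortened illustrative sample: three blocks, each starting at D == -1.
sample = pd.DataFrame({
    "D": [-1, 0, 1, -1, 0, 1, -1, 0, 1],
    "E": [np.nan, 1000, np.nan, np.nan, np.nan, np.nan, np.nan, 3000, 2000],
})

# Step 1: each -1 in D marks the start of a block; cumsum turns the
# True/False markers into a running block id (1, 1, 1, 2, 2, 2, ...).
group_key = sample["D"].eq(-1).cumsum()

# Step 2: broadcast each block's maximum of E back onto every row of the block.
# A block that contains only NaN gets NaN as its maximum.
group_max = sample["E"].groupby(group_key).transform("max")

# Step 3: fill only the missing entries; values already present are kept.
sample["E"] = sample["E"].fillna(group_max)

print(sample)
```

Note that fillna only touches missing entries, so the 2000 in the last block stays as it is, whereas the loop in the question would overwrite it with the block maximum. In the question's data all values inside a block are equal, so both approaches should give the same result there.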

Edit: to fill with zeros instead, you can do:

mask = df['E'].groupby(df['D'].eq(-1).cumsum()).transform('any')
df.loc[mask, 'E'] = df.loc[mask, 'E'].fillna(0)
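
As a variation on the snippet above, the sketch below builds the mask from `notna()` instead of from the truthiness of the values, so a block whose only entry is 0 still counts as having a value. Whether that is the desired behaviour is an assumption, since the question does not say; `sample`, `group_key`, and `has_value` are illustrative names.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the last block contains a genuine 0 rather than NaN.
sample = pd.DataFrame({
    "D": [-1, 0, 1, -1, 0, 1, -1, 0, 1],
    "E": [np.nan, 1000, np.nan, np.nan, np.nan, np.nan, np.nan, 0, np.nan],
})

# Running block id: a new block starts at every D == -1.
group_key = sample["D"].eq(-1).cumsum()

# True for every row of a block that holds at least one non-missing E value.
has_value = sample["E"].notna().groupby(group_key).transform("any")

# Fill missing entries with 0 only inside those blocks; all-NaN blocks stay NaN.
sample.loc[has_value, "E"] = sample.loc[has_value, "E"].fillna(0)

print(sample)
```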
