Pandas - What is the most efficient way to filter a DataFrame using pandas.Series.all?

Problem description

Consider the following code:

import pandas as pd
data = []
val = 0
for ind_1 in range(1000):
    for ind_2 in range(1000):
        data.append({'ind_1': ind_1, 'ind_2': ind_2,
                     'val': val})
        val += 1
df_mi = pd.DataFrame(data).set_index(['ind_1', 'ind_2'])

This creates df_mi, a DataFrame with a MultiIndex:

In [90]: df_mi                                                                                       
Out[90]: 
                val
ind_1 ind_2        
0     0           0
      1           1
      2           2
      3           3
      4           4
...             ...
999   995    999995
      996    999996
      997    999997
      998    999998
      999    999999

[1000000 rows x 1 columns]

Now I want to filter rows by requiring a condition to hold for all values within each ind_1 group:

In [116]: bool_filter_ind_1 = (df_mi['val'] < 999997).all(level='ind_1')                             

In [117]: bool_filter_ind_1                                                                          
Out[117]: 
ind_1
0       True
1       True
2       True
3       True
4       True
       ...  
995     True
996     True
997     True
998     True
999    False
Name: val, Length: 1000, dtype: bool

In [118]: ind_1_filtered = bool_filter_ind_1.index[bool_filter_ind_1]                                

In [119]: ind_1_filtered                                                                             
Out[119]: 
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            989, 990, 991, 992, 993, 994, 995, 996, 997, 998],
           dtype='int64', name='ind_1', length=999)

The result is correct, but df_mi.loc[ind_1_filtered] is relatively slow:

In [120]: timeit df_mi_filtered = df_mi.loc[ind_1_filtered]                                          
4.73 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [121]: df_mi_filtered                                                                             
Out[121]: 
                val
ind_1 ind_2        
0     0           0
      1           1
      2           2
      3           3
      4           4
...             ...
998   995    998995
      996    998996
      997    998997
      998    998998
      999    998999

[999000 rows x 1 columns]

Is there a faster way to perform the same filtering?

Tags: pandas, indexing

Solution


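You can use one of the following approaches.

The first idea is to invert the condition (df_mi['val'] >= 999997) and collect every ind_1 value that has at least one row failing the threshold. Then build a mask over the first level of the original index with Index.isin, negate it, and filter with boolean indexing: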

def new(df_mi):
    lvl0 = df_mi.index.get_level_values(0)
    return df_mi[~lvl0.isin(lvl0[(df_mi['val'] >= 999997)].unique())]

In [240]: %timeit (new(df_mi))
51.5 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
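The same idea can be followed step by step on a small frame. This is only an illustrative sketch: the 4 x 3 frame `small` and the threshold of 10 are made-up stand-ins for df_mi and 999997, and `.to_numpy()` is added so the boolean mask is a plain array.

```python
import pandas as pd

# Hypothetical small stand-in for df_mi: 4 x 3 MultiIndex, val = 0..11.
index = pd.MultiIndex.from_product([range(4), range(3)],
                                   names=['ind_1', 'ind_2'])
small = pd.DataFrame({'val': range(12)}, index=index)
threshold = 10  # assumed value, plays the role of 999997

# First-level label of every row.
lvl0 = small.index.get_level_values(0)

# ind_1 labels with at least one row that violates val < threshold.
bad = lvl0[(small['val'] >= threshold).to_numpy()].unique()

# Keep only the rows whose ind_1 label is not among the violating labels.
filtered = small[~lvl0.isin(bad)]

print(list(bad))
print(filtered)
```

Only the last group (ind_1 == 3) contains values of 10 or more, so it is the only group dropped.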

Another idea is to use GroupBy.transform with 'all' to build a row-level mask, and again filter with boolean indexing:

In [241]: %timeit df_mi[(df_mi['val'] < 999997).groupby(level='ind_1').transform('all')]
97.3 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
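To see what the transformed mask looks like, here is a minimal sketch on a small frame. The frame `small` and the threshold of 10 are assumptions chosen for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical small stand-in for df_mi: 4 x 3 MultiIndex, val = 0..11.
index = pd.MultiIndex.from_product([range(4), range(3)],
                                   names=['ind_1', 'ind_2'])
small = pd.DataFrame({'val': range(12)}, index=index)
threshold = 10  # assumed value, plays the role of 999997

# Per-row condition, then broadcast the per-group "all" back to every row:
# a row is True only if every row in its ind_1 group satisfies the condition.
mask = (small['val'] < threshold).groupby(level='ind_1').transform('all')

# The mask shares the index of the frame, so it can be used directly.
filtered = small[mask]

print(mask)
print(filtered)
```

Note that the row with val == 9 is dropped even though it passes the condition itself, because other rows in its group do not.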

The original solution, for comparison:

def orig(df_mi):
    bool_filter_ind_1 = (df_mi['val'] < 999997).all(level='ind_1')  
    ind_1_filtered = bool_filter_ind_1.index[bool_filter_ind_1]
    return df_mi.loc[ind_1_filtered]

In [242]: %timeit orig(df_mi)
11.2 s ± 405 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
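The Series.all(level='ind_1') call in the original solution depends on the pandas version: to my knowledge the level argument was deprecated in pandas 1.3 and removed in 2.0. A version-independent way to get the same per-group result is to group by the index level first. The sketch below assumes a small made-up frame `small` and a threshold of 10 in place of df_mi and 999997.

```python
import pandas as pd

# Hypothetical small stand-in for df_mi: 4 x 3 MultiIndex, val = 0..11.
index = pd.MultiIndex.from_product([range(4), range(3)],
                                   names=['ind_1', 'ind_2'])
small = pd.DataFrame({'val': range(12)}, index=index)
threshold = 10  # assumed value, plays the role of 999997

# One boolean per ind_1 label: True if all rows in the group pass.
bool_filter_ind_1 = (small['val'] < threshold).groupby(level='ind_1').all()

# Labels of the groups that pass entirely.
ind_1_filtered = bool_filter_ind_1[bool_filter_ind_1].index

print(bool_filter_ind_1)
print(list(ind_1_filtered))
```

This only replaces the step that computes the labels; the slow part measured above is the subsequent df_mi.loc[ind_1_filtered] lookup, which the two mask-based approaches avoid.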
