Filter dataframe by minimum number of values in groups

Problem description

I have the following dataframe structure:

#----------------------------------------------------------#
# Generate dataframe mock example.
import numpy as np
import pandas as pd

# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b']) 

# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4], 
                   [5, 7, 2, 8, 9, 6], 
                   [5, 3, 4, 8, 4, 6]]).transpose()

# introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan

# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4], 
                   [5, 7, 2, 8, 9, 6], 
                   [5, 3, 4, 8, 4, 6]]).transpose()

# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis = 1)

# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]

# Set index as group column.
df = df.set_index('Groups')

# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna = False).reset_index() 

# Change column names.
test_stack_df = test_stack_df.rename(columns = {'level_1': 'IDs',
                                                0: 'Values'})

#----------------------------------------------------------#

Original dataframe - 'df' before stacking:

Groups  1   2   3   4   5   6
a       3   5   5   3   5   5
a      nan nan  3   4   7   3
a       6   2  nan  6   2   4
b       8   8   8   8   8   8
b      10   9   4  10   9   4
b       4   6   6   4   6   6

I would like to keep only the columns that have at least 3 valid (non-NaN) values in every group ('a' and 'b'). For the table above, the final output should contain only columns 4, 5 and 6. I am currently using the following method:

# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()

    # Generate list with the size of each 'Group', in order of appearance.
    grp_num = df_idx_reset.groupby('Groups', sort=False).size().to_list()

    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])

    # Check that both 'Groups' contain at least 3 non-NaN values per 'ID'.
    # Use 'and' rather than '&': '&' binds tighter than '>=' and breaks the comparison.
    valid_IDs = (expt_class_1['Values'].count() >= 3 and
                 expt_class_2['Values'].count() >= 3)

    # Return True or False.
    return valid_IDs

# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)

# Transpose original dataframe.
df_T = df.transpose()

# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()

I could achieve this with minimal fuss by calling pandas.DataFrame.dropna() with thresh=6. However, that would not account for groups of different sizes or let me specify the minimum number of values per group, which the current code does.
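For reference, the dropna approach mentioned above might look like the following minimal sketch. It rebuilds the example frame with the NaNs fixed at the positions shown in the table (an assumption, since the original places them randomly) and keeps every column with at least 6 non-NaN values overall, ignoring the groups entirely.

```python
import numpy as np
import pandas as pd

# Example frame from the question, with NaNs fixed where the table shows them.
values = [[3, 5, 5, 3, 5, 5],
          [np.nan, np.nan, 3, 4, 7, 3],
          [6, 2, np.nan, 6, 2, 4],
          [8, 8, 8, 8, 8, 8],
          [10, 9, 4, 10, 9, 4],
          [4, 6, 6, 4, 6, 6]]
df = pd.DataFrame(values, columns=[1, 2, 3, 4, 5, 6],
                  index=pd.Index(list('aaabbb'), name='Groups'))

# Keep columns that have at least 6 non-NaN values in total.
# This threshold is global: it knows nothing about 'Groups'.
df_thresh = df.dropna(axis=1, thresh=6)
print(df_thresh)
```

Because the threshold applies to the whole column, it only matches the per-group requirement when each group has exactly 3 rows.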

For larger dataframes, e.g. 4000+ columns, the code is slow: filtering takes about 20 seconds. I have tried alternative methods that access the original dataframe directly using groupby and transform, but could not get any of them to work.

Is there a simpler and faster method? Thanks for your time!

EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!

Tags: python-3.x, pandas, dataframe, filtering, pandas-groupby

Solution

This should solve the problem:

df.notna().groupby(level='Groups').sum().ge(3).all(axis=0)

Output:

1    False
2    False
3    False
4     True
5     True
6     True
dtype: bool
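
The boolean Series above still has to be applied to the columns. One way to do that is sketched below, under two assumptions: the frame is rebuilt with NaNs fixed at the positions shown in the question's table, and `min_valid` is a hypothetical parameter name introduced here for the minimum count.

```python
import numpy as np
import pandas as pd

# Example frame from the question, with NaNs fixed where the table shows them.
values = [[3, 5, 5, 3, 5, 5],
          [np.nan, np.nan, 3, 4, 7, 3],
          [6, 2, np.nan, 6, 2, 4],
          [8, 8, 8, 8, 8, 8],
          [10, 9, 4, 10, 9, 4],
          [4, 6, 6, 4, 6, 6]]
df = pd.DataFrame(values, columns=[1, 2, 3, 4, 5, 6],
                  index=pd.Index(list('aaabbb'), name='Groups'))

min_valid = 3  # hypothetical parameter: minimum non-NaN values per group

# Count the non-NaN values for each group and column.
valid_counts = df.notna().groupby(level='Groups').sum()

# Keep a column only if every group reaches the minimum.
keep = valid_counts.ge(min_valid).all(axis=0)

# Select the columns flagged True; no transposing is needed.
df_filtered = df.loc[:, keep]
print(df_filtered)
```

This works on the original wide frame, so neither the stacked copy nor a Python-level function call per column is needed, and groups of different sizes are handled because the counts are computed per group. `df.groupby(level='Groups').count()` should give the same per-group counts as `valid_counts`.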
