python-3.x - Filter dataframe by minimum number of values in groups
Problem description
I have the following dataframe structure:
#----------------------------------------------------------#
# Generate dataframe mock example.
import numpy as np
import pandas as pd

# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# introduce nan into dataframe 1 (one randomly chosen row per column).
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis=1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna=False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns={'level_1': 'IDs',
                                              0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups    1    2    3    4    5    6
a         3    5    5    3    5    5
a       nan  nan    3    4    7    3
a         6    2  nan    6    2    4
b         8    8    8    8    8    8
b        10    9    4   10    9    4
b         4    6    6    4    6    6
I would like to filter the columns so that each group ('a' and 'b') has at least 3 valid (non-NaN) values. The final output should contain only columns 4, 5 and 6. I am currently using the following method:
# Function to compute a boolean value for each 'ID'.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with size of each 'Group'.
    grp_num = df_idx_reset['Groups'].value_counts().to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 non-NaN values per 'ID'.
    # count() ignores NaN; 'and' avoids the precedence problem of '&' with '>='.
    valid_IDs = (expt_class_1['Values'].count() >= 3
                 and expt_class_2['Values'].count() >= 3)
    # Return True or False.
    return valid_IDs

# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold of 6 (thresh=6). However, that wouldn't account for different-sized groups or let me specify the minimum number of values per group, which the current code does.
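For reference, a minimal sketch of that dropna approach on a small stand-in frame (the frame `demo` is a made-up example, not the data above). The threshold applies to the whole column, so it cannot tell which group is short of values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: two groups of three rows each.
demo = pd.DataFrame(
    {1: [3, np.nan, 6, 8, 10, 4],
     2: [3, 4, 6, 8, 10, 4],
     3: [np.nan, np.nan, 6, 8, 10, 4]},
    index=pd.Index(['a', 'a', 'a', 'b', 'b', 'b'], name='Groups'),
)

# Keep only columns with at least 6 non-NaN values in total.
kept_all = demo.dropna(axis=1, thresh=6)

# A lower threshold still counts over the whole column: column 1 passes
# with 5 valid values even though group 'a' only has 2 of them.
kept_5 = demo.dropna(axis=1, thresh=5)

print(list(kept_all.columns))
print(list(kept_5.columns))
```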
For larger dataframes, e.g. 4000+ columns, the code is a little slow: the filtering takes ~20 seconds to complete. I have tried alternative methods that access the original dataframe directly using groupby & transform, but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!
Solution
This will solve the problem for you:
df.notna().groupby(level='Groups').sum().ge(3).all(axis=0)
Output:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
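To actually drop the columns, the boolean Series can be passed to df.loc as a column mask. A minimal sketch, assuming the 'Groups' index layout from the question; the helper name `filter_columns_by_group_count` and its `min_valid` parameter are illustrative and not part of pandas, and the example frame is rebuilt with the NaN positions fixed to match the printed table:

```python
import numpy as np
import pandas as pd

def filter_columns_by_group_count(df, min_valid=3, level='Groups'):
    """Keep columns with at least `min_valid` non-NaN values in every group.

    Hypothetical helper: the name and parameters are illustrative.
    """
    # Count non-NaN values per group and per column.
    counts = df.notna().groupby(level=level).sum()
    # A column is kept only if every group meets the minimum.
    keep = counts.ge(min_valid).all()
    return df.loc[:, keep]

# Rebuild the example frame from the question with fixed NaN positions.
df = pd.DataFrame(
    [[3, 5, 5, 3, 5, 5],
     [np.nan, np.nan, 3, 4, 7, 3],
     [6, 2, np.nan, 6, 2, 4],
     [8, 8, 8, 8, 8, 8],
     [10, 9, 4, 10, 9, 4],
     [4, 6, 6, 4, 6, 6]],
    index=pd.Index(['a', 'a', 'a', 'b', 'b', 'b'], name='Groups'),
    columns=[1, 2, 3, 4, 5, 6],
)

df_filtered = filter_columns_by_group_count(df, min_valid=3)
print(list(df_filtered.columns))  # [4, 5, 6]
```

Because the counts are computed per group, this works for groups of different sizes, and the minimum can be changed through `min_valid` without touching the rest of the code.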