Filtering a DataFrame for rows that meet certain conditions

Problem description

I have the following df:

df = {'Modality': {('002_S_0413', '1', '6/21/2017', 'DTI'): 1,
  ('002_S_0413', '1', '6/21/2017', 'FLAIR'): 1,
  ('002_S_0413', '1', '6/21/2017', 'T1'): 1,
  ('002_S_0413', '3', '8/27/2019', 'DTI'): 1,
  ('002_S_0413', '3', '8/27/2019', 'FLAIR'): 1,
  ('002_S_0413', '3', '8/27/2019', 'T1'): 1,
  ('002_S_1261', '1', '3/15/2017', 'DTI'): 1,
  ('002_S_1261', '1', '3/15/2017', 'FLAIR'): 1,
  ('002_S_1261', '1', '3/15/2017', 'T1'): 1,
  ('002_S_1261', '2', '4/24/2018', 'DTI'): 1,
  ('002_S_1261', '2', '4/24/2018', 'FLAIR'): 1,
  ('002_S_1261', '2', '4/24/2018', 'T1'): 1,
  ('002_S_1261', '3', '5/01/2019', 'DTI'): 1,
  ('002_S_1261', '3', '5/01/2019', 'FLAIR'): 1,
  ('002_S_1261', '3', '5/01/2019', 'T1'): 1,
  ('002_S_1280', '1', '3/13/2017', 'DTI'): 1,
  ('002_S_1280', '1', '3/13/2017', 'FLAIR'): 1,
  ('002_S_1280', '3', '3/06/2019', 'DTI'): 1,
  ('002_S_4213', '1', '8/14/2017', 'FLAIR'): 1,
  ('002_S_4213', '1', '8/14/2017', 'T1'): 1},
 'Phase': {('002_S_0413', '1', '6/21/2017', 'DTI'): 1,
  ('002_S_0413', '1', '6/21/2017', 'FLAIR'): 1,
  ('002_S_0413', '1', '6/21/2017', 'T1'): 1,
  ('002_S_0413', '3', '8/27/2019', 'DTI'): 1,
  ('002_S_0413', '3', '8/27/2019', 'FLAIR'): 1,
  ('002_S_0413', '3', '8/27/2019', 'T1'): 1,
  ('002_S_1261', '1', '3/15/2017', 'DTI'): 1,
  ('002_S_1261', '1', '3/15/2017', 'FLAIR'): 1,
  ('002_S_1261', '1', '3/15/2017', 'T1'): 1,
  ('002_S_1261', '2', '4/24/2018', 'DTI'): 1,
  ('002_S_1261', '2', '4/24/2018', 'FLAIR'): 1,
  ('002_S_1261', '2', '4/24/2018', 'T1'): 1,
  ('002_S_1261', '3', '5/01/2019', 'DTI'): 1,
  ('002_S_1261', '3', '5/01/2019', 'FLAIR'): 1,
  ('002_S_1261', '3', '5/01/2019', 'T1'): 1,
  ('002_S_1280', '1', '3/13/2017', 'DTI'): 1,
  ('002_S_1280', '1', '3/13/2017', 'FLAIR'): 1,
  ('002_S_1280', '3', '3/06/2019', 'DTI'): 1,
  ('002_S_4213', '1', '8/14/2017', 'FLAIR'): 1,
  ('002_S_4213', '1', '8/14/2017', 'T1'): 1}}
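
The dict above has the shape produced by `DataFrame.to_dict()`, so it has to be turned back into a DataFrame before any index-based filtering can run. A minimal sketch, using only a few of the rows for brevity and assuming the four index levels are named Subject ID, Visit, Study Date and Description (the names are inferred from the output shown in the answer, not given in the dict itself):

```python
import pandas as pd

# A shortened copy of the dict above (only four of the rows).
data = {
    'Modality': {
        ('002_S_0413', '1', '6/21/2017', 'DTI'): 1,
        ('002_S_0413', '1', '6/21/2017', 'FLAIR'): 1,
        ('002_S_0413', '1', '6/21/2017', 'T1'): 1,
        ('002_S_4213', '1', '8/14/2017', 'FLAIR'): 1,
    },
    'Phase': {
        ('002_S_0413', '1', '6/21/2017', 'DTI'): 1,
        ('002_S_0413', '1', '6/21/2017', 'FLAIR'): 1,
        ('002_S_0413', '1', '6/21/2017', 'T1'): 1,
        ('002_S_4213', '1', '8/14/2017', 'FLAIR'): 1,
    },
}

# The tuple keys become the row index.
df = pd.DataFrame(data)

# Rebuild the index explicitly so the level names are set.
# These names are an assumption taken from the answer's output.
df.index = pd.MultiIndex.from_tuples(
    list(df.index),
    names=['Subject ID', 'Visit', 'Study Date', 'Description'],
)

print(df)
```

With the level names in place, calls such as `df.reset_index(['Visit', 'Description'])` can refer to the levels by name.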

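Sorry, I can't show the headers here, but they look like the image below:

[image]

I am stuck at this step and would really appreciate your help!

I need code that, for each Subject ID, looks at the Description column. If DTI, T1 and FLAIR all appear in a single visit, keep that visit and drop the rest. If they appear in more than one visit, keep the visit with the smallest Visit value and drop the rest. If no single visit contains DTI, T1 and FLAIR, drop those rows as well. In other words, for each Subject ID I want the smallest Visit value that has all three Description values (DTI, T1 and FLAIR), with everything else removed.

My output should look like this:

[image]

Thanks!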

Tags: python, pandas, dataframe

Solution


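Assuming we need to test that 'Description' consists of exactly the 3 values 'DTI', 'FLAIR', 'T1', and that merely checking that a group contains n values is not sufficient: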

# Remove Description and Visit from MultiIndex
new_df = df.reset_index(['Visit', 'Description'])
# Create Set of Values to Check against
check_values = {'DTI', 'FLAIR', 'T1'}
# Create Boolean Index
m = (
        new_df.groupby(level=[0, 1])['Description'].transform(
            lambda g: set(g) == check_values and len(g) == len(check_values)
        )
        & new_df.groupby(level=0)['Visit'].transform('min').eq(new_df['Visit'])
)

# Filter Dataframe with Index and Fix MultiIndex
new_df = new_df[m].set_index(['Visit', 'Description'], append=True)

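The mask has two parts:

  1. Check that each group's unique values match the expected set and that the group's length matches the expected length:
set(g) == check_values and len(g) == len(check_values)
  2. Check that this 'Visit' is the minimum for each 'Subject ID':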
new_df.groupby(level=0)['Visit'].transform('min').eq(new_df['Visit'])
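
To see what each part of the mask does in isolation, here is a small sketch on made-up inputs. The helper name `is_exact_match` and the toy `visits` frame are hypothetical and exist only for illustration:

```python
import pandas as pd

check_values = {'DTI', 'FLAIR', 'T1'}

def is_exact_match(g):
    # True only when the group holds each expected value exactly once.
    return set(g) == check_values and len(g) == len(check_values)

complete = pd.Series(['DTI', 'FLAIR', 'T1'])
duplicated = pd.Series(['DTI', 'DTI', 'FLAIR', 'T1'])  # right set, wrong length
missing = pd.Series(['DTI', 'FLAIR'])                  # wrong set

print(is_exact_match(complete))    # True
print(is_exact_match(duplicated))  # False
print(is_exact_match(missing))     # False

# Second part: flag the rows whose Visit equals the subject's minimum Visit.
visits = pd.DataFrame(
    {'Subject ID': ['a', 'a', 'a', 'b'], 'Visit': ['1', '1', '3', '2']}
).set_index('Subject ID')
is_min = visits.groupby(level=0)['Visit'].transform('min').eq(visits['Visit'])
print(is_min.tolist())  # [True, True, False, True]
```

The length check is what rejects the `duplicated` case: its set of values matches, but it has four rows instead of three.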

Output (new_df):

                                         Modality  Phase
Subject ID Study Date Visit Description                 
002_S_0413 6/21/2017  1     DTI                 1      1
                            FLAIR               1      1
                            T1                  1      1
002_S_1261 3/15/2017  1     DTI                 1      1
                            FLAIR               1      1
                            T1                  1      1
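
One caveat: the second condition takes the minimum 'Visit' over all of a subject's rows, complete or not. That gives the expected result for the sample data, but a subject whose earliest visit is incomplete and whose later visit is complete would be dropped entirely. The sketch below is one possible variant, not the method above: it filters to complete visits first and only then takes the minimum. The toy data (subjects S1 to S3) is hypothetical, groups are formed by 'Subject ID' and 'Visit' rather than by study date, and 'Visit' is converted to integers on the assumption that a label such as '10' should sort after '2'.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the question's frame.
rows = [
    ('S1', '1', '6/21/2017', 'DTI'),
    ('S1', '1', '6/21/2017', 'FLAIR'),
    ('S1', '1', '6/21/2017', 'T1'),
    ('S1', '3', '8/27/2019', 'DTI'),
    ('S1', '3', '8/27/2019', 'FLAIR'),
    ('S1', '3', '8/27/2019', 'T1'),
    ('S2', '1', '3/13/2017', 'DTI'),    # S2 never has a complete visit
    ('S2', '1', '3/13/2017', 'FLAIR'),
    ('S3', '1', '3/15/2017', 'DTI'),    # S3: earliest visit is incomplete
    ('S3', '2', '4/24/2018', 'DTI'),    # S3: later visit is complete
    ('S3', '2', '4/24/2018', 'FLAIR'),
    ('S3', '2', '4/24/2018', 'T1'),
]
index = pd.MultiIndex.from_tuples(
    rows, names=['Subject ID', 'Visit', 'Study Date', 'Description']
)
df = pd.DataFrame({'Modality': 1, 'Phase': 1}, index=index)

check_values = {'DTI', 'FLAIR', 'T1'}

flat = df.reset_index()
# Compare visits numerically instead of as strings (an assumption about the labels).
flat['Visit'] = flat['Visit'].astype(int)

# Mark rows that belong to a (subject, visit) group holding exactly the three values.
complete = flat.groupby(['Subject ID', 'Visit'])['Description'].transform(
    lambda g: set(g) == check_values and len(g) == len(check_values)
).astype(bool)

# Keep only complete visits, then pick each subject's earliest one.
candidates = flat[complete]
earliest = candidates.groupby('Subject ID')['Visit'].transform('min')
result = candidates[candidates['Visit'].eq(earliest)].set_index(
    ['Subject ID', 'Study Date', 'Visit', 'Description']
)

print(result)
```

In this toy data S1 keeps visit 1, S2 is dropped, and S3 keeps visit 2 because its visit 1 is missing FLAIR and T1.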
