Retrieve all rows for a specific text after grouping by application and user ID

Problem description

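When a user completes a step digitally, the column is_digitally_signed changes to YES. What I am trying to do: if any step was completed digitally, I want to retrieve all rows with the same application_id and user_id. Please check my desired output below.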

R code to reproduce my dataset

library(data.table)
library(dplyr)

# The three statuses and dates repeat once per application, so repeat them explicitly
df <- data.table(application_id = c(1,1,1,2,2,2,3,3,3), 
                 user_id = c(123,123,123,456,456,456,789,789,789), 
                 application_status = rep(c("incomplete", "details_verified", "complete"), times = 3),
                 date = rep(c("01/01/2018", "02/01/2018", "03/01/2018"), times = 3),
                 is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL")) %>%
  mutate(date = as.Date(date, "%d/%m/%Y"))

With output

df
  application_id user_id application_status       date is_digitally_signed
              1     123         incomplete  2018-01-01                NULL
              1     123   details_verified  2018-01-02                NULL
              1     123           complete  2018-01-03                 YES
              2     456         incomplete  2018-01-01                NULL
              2     456   details_verified  2018-01-02                NULL
              2     456           complete  2018-01-03                NULL
              3     789         incomplete  2018-01-01                NULL
              3     789   details_verified  2018-01-02                 YES
              3     789           complete  2018-01-03                NULL

My (unsuccessful) attempt

df %>% group_by(application_id,user_id) %>% filter_all(all.vars(. == "YES"))

Desired result

application_id user_id application_status       date is_digitally_signed
              1     123         incomplete 2018-01-01                NULL
              1     123   details_verified 2018-01-02                NULL
              1     123           complete 2018-01-03                 YES
              3     789         incomplete 2018-01-01                NULL
              3     789   details_verified 2018-01-02                 YES
              3     789           complete 2018-01-03                NULL

Tags: r, dplyr, data-manipulation

Solution


dplyr

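We can use filter with any, which checks whether a given group has at least one record with is_digitally_signed == "YES". Because the condition is evaluated once per group, every row of a matching group is kept.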

library(dplyr)

df %>% 
  group_by(application_id, user_id) %>%
  filter(any(is_digitally_signed == "YES"))
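
To see what any contributes here, the following minimal sketch computes the same condition with mutate instead of filter. The data frame toy and the column group_has_yes are illustrative names, not part of the original data. Within each group, any returns a single TRUE or FALSE that applies to every row of that group, which is why filter keeps or drops whole groups.

```r
library(dplyr)

# Made-up data: group 1 has one "YES", group 2 has none
toy <- data.frame(
  application_id = c(1, 1, 2, 2),
  user_id = c(123, 123, 456, 456),
  is_digitally_signed = c("NULL", "YES", "NULL", "NULL"),
  stringsAsFactors = FALSE
)

# Attach the group-level result of any() to each row instead of filtering on it
flagged <- toy %>%
  group_by(application_id, user_id) %>%
  mutate(group_has_yes = any(is_digitally_signed == "YES")) %>%
  ungroup()

flagged$group_has_yes  # TRUE TRUE FALSE FALSE
```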

Or use the all function to keep only the groups in which not every value of is_digitally_signed is "NULL":

df %>% 
  group_by(application_id, user_id) %>%
  filter(!all(is_digitally_signed == "NULL"))
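
The two filters agree on this dataset because is_digitally_signed only takes the values "YES" and "NULL". As a hedged illustration, the sketch below adds a hypothetical third value, "NO", which does not occur in the original data, to show where the two conditions would diverge. The names toy, by_any and by_not_all are illustrative.

```r
library(dplyr)

# Hypothetical data: group 2 contains "NO", a value not present in the original dataset
toy <- data.frame(
  application_id = c(1, 1, 2, 2),
  user_id = c(123, 123, 456, 456),
  is_digitally_signed = c("NULL", "YES", "NULL", "NO"),
  stringsAsFactors = FALSE
)

# Keep groups with at least one "YES"
by_any <- toy %>%
  group_by(application_id, user_id) %>%
  filter(any(is_digitally_signed == "YES")) %>%
  ungroup()

# Keep groups that are not entirely "NULL"
by_not_all <- toy %>%
  group_by(application_id, user_id) %>%
  filter(!all(is_digitally_signed == "NULL")) %>%
  ungroup()

nrow(by_any)      # 2: only group 1 is kept
nrow(by_not_all)  # 4: group 2 is kept as well, because "NO" is not "NULL"
```

If the column can hold anything other than these two values, the any version states the intent more directly.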

data.table

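We can also use data.table, since you have already loaded the data as a data.table. Here .I holds the row numbers of each group, so the inner call collects the row indices of the matching groups and the outer call subsets by them: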

library(data.table)
dt = setDT(df)
dt[dt[,.I[any(is_digitally_signed == "YES")], by=.(application_id, user_id)]$V1,]

Or with .SD, the subset of the data for each group:

dt[,.SD[any(is_digitally_signed == "YES")], by=.(application_id, user_id)]
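
A closely related variant, sketched here as an assumption rather than as part of the original answer, returns .SD only when the group condition holds. An if without else yields NULL for the other groups, and data.table drops groups whose j expression returns NULL. The dataset is rebuilt inline (without the date column) so the block runs on its own, and res is an illustrative name.

```r
library(data.table)

# Rebuild the question's dataset, minus the date column, for a self-contained example
dt <- data.table(
  application_id = rep(1:3, each = 3),
  user_id = rep(c(123, 456, 789), each = 3),
  application_status = rep(c("incomplete", "details_verified", "complete"), times = 3),
  is_digitally_signed = c("NULL", "NULL", "YES", "NULL", "NULL", "NULL", "NULL", "YES", "NULL")
)

# Return the group's rows only when the group has at least one "YES";
# groups where the condition is FALSE return NULL and are dropped
res <- dt[, if (any(is_digitally_signed == "YES")) .SD,
          by = .(application_id, user_id)]

res
```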

Output (shown for the dplyr version):

# A tibble: 6 x 5
# Groups:   application_id, user_id [2]
  application_id user_id application_status date       is_digitally_signed
           <dbl>   <dbl> <fct>              <date>     <fct>              
1              1     123 incomplete         2018-01-01 NULL               
2              1     123 details_verified   2018-01-02 NULL               
3              1     123 complete           2018-01-03 YES                
4              3     789 incomplete         2018-01-01 NULL               
5              3     789 details_verified   2018-01-02 YES                
6              3     789 complete           2018-01-03 NULL
