How do I filter with different conditions on different rows?

Problem description

Data structure:

library(tidyverse)

df <- tribble(
  ~"group", ~"word",
  1,"apple",
  1,"orange",
  1,"apple cider",
  1,"orange juice",
  1,"pear",
  1,"pear",
  2,"apple",
  2,"pear",
  3,"orange juice",
  3,"apple",
  4,"pear",
  4,"guava"
  )

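I want to use str_detect to filter the "word" column for "apple" and "orange". Rows should be returned only if their group contains both the word "apple" and the word "orange".

Desired output: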

# A tibble: 6 x 2
  group word        
  <dbl> <chr>       
1     1 apple       
2     1 orange      
3     1 apple cider 
4     1 orange juice
5     3 orange juice
6     3 apple       

Thank you very much!

Tags: r, dplyr, tidyverse

Solution


An option with str_extract_all and n_distinct:

library(dplyr)
library(stringr)
df %>% 
    group_by(group) %>% 
    filter((n_distinct(unlist(str_extract_all(word, "apple|orange"))) >1) &
          str_detect(word, 'apple|orange'))
# A tibble: 6 x 2
# Groups:   group [2]
#  group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice
#5     3 orange juice
#6     3 apple       

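Explanation

After grouping by 'group', we extract every 'apple' or 'orange' match from 'word' with str_extract_all (which returns a list by default), unlist the list, and count the distinct elements with n_distinct. Checking whether that count is greater than 1 gives the first condition. It is combined with a second condition that checks whether each 'word' value contains 'apple' or 'orange' (str_detect). In short, the first condition keeps only the groups that contain both words, and the second drops the remaining non-matching rows within those groups. For example, if we use only the first expression: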

df %>% 
     group_by(group) %>% 
     filter((n_distinct(unlist(str_extract_all(word, "apple|orange"))) >1))
# A tibble: 8 x 2
# Groups:   group [2]
#  group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice
#5     1 pear      # // not needed, but it was kept 
#6     1 pear      # // because it is checking on distinct element  
#7     3 orange juice
#8     3 apple     

And with only the second expression:

df %>% 
     group_by(group) %>% filter(str_detect(word, 'apple|orange'))
# A tibble: 7 x 2
# Groups:   group [3]
#  group word        
#  <dbl> <chr>       
#1     1 apple       
#2     1 orange      
#3     1 apple cider 
#4     1 orange juice
#5     2 apple    # // also keeps group 2 that includes only apple    
#6     3 orange juice
#7     3 apple      

Combining the two conditions with & removes group 2 as well as rows such as 'pear' in the 'word' column.
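To make the group-level condition easier to inspect, the following sketch prints the number of distinct matches per group and then applies the same filter written with explicit any() checks. The any() formulation and the names match_counts and result are illustrative assumptions, not part of the original answer; it should select the same rows on this data.

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tribble(
  ~group, ~word,
  1, "apple",
  1, "orange",
  1, "apple cider",
  1, "orange juice",
  1, "pear",
  1, "pear",
  2, "apple",
  2, "pear",
  3, "orange juice",
  3, "apple",
  4, "pear",
  4, "guava"
)

# Group-level check: how many distinct keywords ("apple", "orange")
# were matched anywhere in each group
match_counts <- df %>%
  group_by(group) %>%
  summarise(
    n_matched = n_distinct(unlist(str_extract_all(word, "apple|orange")))
  )
print(match_counts)

# Alternative formulation (assumption, not from the original answer):
# require each keyword to appear somewhere in the group, then keep
# only the rows that match one of the keywords
result <- df %>%
  group_by(group) %>%
  filter(
    any(str_detect(word, "apple")),
    any(str_detect(word, "orange")),
    str_detect(word, "apple|orange")
  ) %>%
  ungroup()
print(result)
```

Only groups whose n_matched is greater than 1 pass the first condition. Note that the n_distinct version would need adjusting for more than two keywords (for example, comparing against the number of keywords), whereas the any() version gains one extra check per keyword.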

