Filter/search across different rows grouped by a specific column in R

Question

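I have a dataset similar to the reprex below, in which each subject has more than one row for their hobbies, favorite foods, and fields of study.

For example, I am trying to find the people who have hiking as a hobby and meat as a food. (In the example below, only subject c meets this criterion.)

Is there a way to do this in dplyr or another package?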


dd <- structure(list(
  ID = c("a", "a", "a", "a",
         "b", "b", "b", "b", "b", "b",
         "c", "c", "c", "c", "c", "c"),
  itemType = c("hobby", "hobby", "study", "food",
               "hobby", "hobby", "study", "study", "food", "food",
               "hobby", "hobby", "study", "study", "study", "food"),
  details = c("hiking, bike", "reading", "math, art", "cheese, bread",
              "writing", "computer", "english", "science", "meat, rice", "cheese",
              "reading", "swimming, hiking", "math, philosophy", "computer",
              "social", "pasta, meat")
), class = "data.frame", row.names = c(NA, -16L))


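If I just try a simple dplyr filter like the one below, it of course does not work: it returns no rows. Is there another argument, or something else I can add, to make it work?

I have never used a database package, but would one be useful in this case?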

dd %>% 
  filter( str_detect( details, "hiking") &
            str_detect(details, "meat"))

Tags: r, search, filter, dplyr

Solution


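If we need to subset the 'ID's whose 'details' contain both 'hiking' and 'meat', group by 'ID', apply str_detect for 'hiking' and for 'meat', wrap each result in any(), and combine the two conditions with & (separating conditions with a comma inside filter() has the same effect):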

library(dplyr)
library(stringr)
dd %>%
  group_by(ID) %>%
  filter(any(str_detect(details, 'hiking')), any(str_detect(details, 'meat')))

Output:

# A tibble: 6 x 3
# Groups:   ID [1]
#  ID    itemType details         
#  <chr> <chr>    <chr>           
#1 c     hobby    reading         
#2 c     hobby    swimming, hiking
#3 c     study    math, philosophy
#4 c     study    computer        
#5 c     study    social          
#6 c     food     pasta, meat     

Update

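If we want the detection to be restricted to a subgroup as well, one option is to subset the 'details' column with == on 'itemType', so that str_detect is applied only to those elements: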

dd %>% 
     group_by(ID) %>%
     filter(any(str_detect(details[itemType == 'hobby'], 'hiking')),
            any(str_detect(details[itemType == 'food'], 'meat')))
# A tibble: 6 x 3
# Groups:   ID [1]
#  ID    itemType details         
#  <chr> <chr>    <chr>           
#1 c     hobby    reading         
#2 c     hobby    swimming, hiking
#3 c     study    math, philosophy
#4 c     study    computer        
#5 c     study    social          
#6 c     food     pasta, meat     
 

In base R, use ave together with grepl:

subset(dd, as.logical(ave(details, ID, 
  FUN = function(x) any(grepl('hiking', x)) & any(grepl('meat', x)))))
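
Because 'details' is a character column, ave() stores the logical result of FUN as the strings "TRUE"/"FALSE", which is why the as.logical() wrapper is needed above. As a rough sketch of an alternative that avoids this round trip, the qualifying IDs can be computed first and then used to subset. The data frame here is a reduced copy of dd, and the names `has_both`, `keep_ids`, and `res` are illustrative only, not part of the original answer.

```r
# Reduced copy of the example data (assumption: same structure as dd)
dd_small <- data.frame(
  ID = c("a", "a", "b", "b", "c", "c", "c"),
  itemType = c("hobby", "food", "hobby", "food", "hobby", "hobby", "food"),
  details = c("hiking, bike", "cheese, bread", "writing", "meat, rice",
              "reading", "swimming, hiking", "pasta, meat"),
  stringsAsFactors = FALSE
)

# One logical value per ID: does any row mention hiking AND any row mention meat?
has_both <- vapply(
  split(dd_small$details, dd_small$ID),
  function(x) any(grepl("hiking", x, fixed = TRUE)) &&
    any(grepl("meat", x, fixed = TRUE)),
  logical(1)
)

# Keep every row belonging to an ID that satisfies both conditions
keep_ids <- names(has_both)[has_both]
res <- dd_small[dd_small$ID %in% keep_ids, ]
print(res)
```

Computing one flag per ID keeps the result logical throughout, at the cost of an extra %in% lookup when subsetting.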

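The filter in the question returned no rows because no single element of 'details' contains both 'hiking' and 'meat', and & performs an element-wise comparison. Instead, for each 'ID' we need to wrap each condition on 'details' in any() and then combine the results with &.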
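
The difference can be seen on the 'details' values of a single ID. This is a minimal base R sketch using grepl in place of str_detect; the vector below is a hypothetical excerpt of subject c's rows, and the variable names are illustrative.

```r
# Hypothetical excerpt: three 'details' values belonging to one ID
details <- c("reading", "swimming, hiking", "pasta, meat")

has_hiking <- grepl("hiking", details, fixed = TRUE)  # FALSE TRUE FALSE
has_meat   <- grepl("meat", details, fixed = TRUE)    # FALSE FALSE TRUE

# Element-wise &: asks whether the SAME row contains both words
row_wise <- has_hiking & has_meat
print(row_wise)    # FALSE FALSE FALSE

# any() first: asks whether the group as a whole contains both words
group_wise <- any(has_hiking) & any(has_meat)
print(group_wise)  # TRUE
```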

