R: count the sum of partial string matches across multiple columns

Question

I am working with an untidy summer camp registration form. The form output looks like this:

          leaders         teen_adventure
1 camp, overnight                   <NA>
2            <NA>                   <NA>
3 camp, overnight camp, float, overnight

I want to generate new columns that total the number of times each possible answer appears in a row.

          leaders         teen_adventure camps overnights floats
1 camp, overnight                   <NA>     1          1      0
2            <NA>                   <NA>     0          0      0
3 camp, overnight camp, float, overnight     2          2      1

I feel in my bones that there is a dplyr solution, something like:

reprex %>%
  mutate(camps = sum(case_when(
    str_detect(select(., everything()), "camp") ~ 1,
    TRUE ~ 0
  )))

Or perhaps something using across().

Here is a sample dataset:

# data
reprex <- structure(list(leaders = c("camp, overnight", NA, "camp, overnight"), 
          teen_adventure = c(NA, NA, "camp, float, overnight")), 
          row.names = c(NA, -3L), class = "data.frame")

Tags: r, dplyr, stringr

Answer


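We can loop over the columns with map, extract the words in each cell with str_extract_all, tabulate the resulting list with mtabulate, and then summarise the counts for each original row with sum.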

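library(dplyr)
library(purrr)
library(stringr)
library(qdapTools)   # provides mtabulate()
library(data.table)  # provides rowid()

reprex %>% 
   # for each column, extract the words in every cell and tabulate them;
   # .id = 'grp' records which column each block of rows came from
   map_dfr(~ str_extract_all(.x, "\\w+") %>%
             mtabulate, .id = 'grp') %>%
   # rowid(grp) numbers the rows within each column, i.e. the original row index
   group_by(grp = rowid(grp)) %>% 
   # add up the counts across columns for each original row
   summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), 
       .groups = 'drop') %>%
   select(-grp) %>% 
   bind_cols(reprex, .)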

Output:

#            leaders         teen_adventure camp overnight float
#1 camp, overnight                   <NA>    1         1     0
#2            <NA>                   <NA>    0         0     0
#3 camp, overnight camp, float, overnight    2         2     1
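
As a point of comparison, here is a minimal sketch of an alternative that needs only stringr and dplyr. It is not part of the original answer. It assumes the set of possible answers is known in advance (the `answers` vector below is that assumption) and counts, for each row, how many columns contain each answer as a substring.

```r
library(dplyr)
library(stringr)

reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight")
)

# Assumption: the possible answers are known up front
answers <- c("camp", "overnight", "float")

counts <- sapply(answers, function(a) {
  # one logical column per source column; NA cells yield NA
  hits <- sapply(reprex, function(col) str_detect(col, fixed(a)))
  # count the columns that matched in each row, ignoring NA cells
  rowSums(hits, na.rm = TRUE)
})

result <- bind_cols(reprex, as.data.frame(counts))
result
```

For the sample data this gives the same counts as the output above. Because the match is a plain substring test, an answer such as "camp" would also match a longer value like "campfire". If that matters, a word-boundary regex could be used instead of fixed().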
