Looping over group_by when passing in variable names

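Question

I am trying to write an R function that extracts the proportion of positives in a community based on column values. More specifically, I have a dataset in which each row is a person. For simplicity, columns 1-5 contain information about personal characteristics, column 6 contains the ZIP code, column 7 contains the phone number they called to report a positive result, column 8 contains the day of the week, and column 9 contains the state. The goal is to compute the proportion and count of positives aggregated at the level of ZIP code, phone number, day of the week, and state. For any single category, I have successfully built a group-and-summarize function (below) using code from https://edwinth.github.io/blog/dplyr-recipes/. Given a data frame and a column name, it groups by the distinct values in that column and summarizes the count and proportion of positives.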

group_and_summarize <- function(x, ...) {
  grouping = rlang::quos(...)
  temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(positive, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

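The problem is that the code fails completely when I try to aggregate across multiple columns. I currently have four fields to group by, but I expect about 15 columns once the data is fully collected. My strategy is to store each result as a separate element of a list for later use. I tried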

output = vector(mode = "list", length = length(aggregate_cols)) #aggregate_cols lists columns needing count and proportion.
    #aggregate_cols = c("ZIP_CODE", "PHONE_NUMBER", "DAY", "STATE")
for(i in 1:length(aggregate_cols)){
output[i] = group_and_summarize(df,aggregate_cols[i])
          }

but received the following warning messages

Warning messages:
1: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
2: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
3: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
4: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length

Testing the first value

> i=1
> group_and_summarize(df,aggregate_cols[i])
# A tibble: 1 x 3
  `aggregate_cols[i]`  proportion number
  <chr>                 <dbl>  <int>
1 ZIP_CODE              0.168   5600

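Any ideas on how to fix this? I can't think of a good approach involving map or the apply family of functions, although I am open to those.

Edit:

Reproducible code is below.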

group_and_summarize_demo <- function(x, ...) {
  grouping = quos(...)
  temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(am, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

cars_cols = c("gear", "cyl")
output = vector(mode = "list", length = length(cars_cols))
for(i in 1:length(cars_cols)){
  output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
}


> group_and_summarize_demo(mtcars, cyl)
# A tibble: 3 x 3
    cyl cyl_proportion cyl_count
  <dbl>          <dbl>     <int>
1     4          0.727        11
2     6          0.429         7
3     8          0.143        14
> cars_cols = c("gear", "cyl")
> output = vector(mode = "list", length = length(cars_cols))
> for(i in 1:length(cars_cols)){
+   output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
+ }
 Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "function" 
> cars_cols[1]
[1] "gear"
> group_and_summarize_demo(mtcars, cars_cols[1])
# A tibble: 1 x 3
  `cars_cols[1]` `cars_cols[1]_proportion` `cars_cols[1]_count`
  <chr>                              <dbl>                <int>
1 gear                               0.406                   32

I don't understand why this differs from running group_and_summarize_demo(mtcars, cyl); I suspect that understanding this would resolve the error.

Tags: r, dplyr, lapply

Answer


Outside the loop, you pass the name directly to the function:

group_and_summarize_demo(mtcars, cyl)

In your loop, however, you pass the name as a string:

group_and_summarize_demo(mtcars, "cyl") # wrong result: groups by the constant "cyl", not by the column
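
To make the difference concrete, here is a minimal sketch (assuming dplyr and rlang are installed; the object names by_symbol, by_string and by_syms are only illustrative). It compares what group_by() does with a bare name captured by quos(), a string captured by quos(), and a string converted by syms():

```r
library(dplyr)

# Bare name: quos() captures a reference to the column `cyl`.
by_symbol <- mtcars %>%
  group_by(!!!rlang::quos(cyl)) %>%
  summarise(number = n())

# String: quos() captures the constant "cyl", so every row
# falls into the same single group.
by_string <- mtcars %>%
  group_by(!!!rlang::quos("cyl")) %>%
  summarise(number = n())

# syms() converts the string into a symbol, which again refers to the column.
by_syms <- mtcars %>%
  group_by(!!!rlang::syms("cyl")) %>%
  summarise(number = n())

nrow(by_symbol) # one row per distinct value of cyl
nrow(by_string) # a single row covering the whole data set
nrow(by_syms)   # same grouping as by_symbol
```

This matches the one-row result with a count of 32 shown in the question.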

In fact, strings are easier to work with in this setting. To make it work, use syms() instead of quos():

group_and_summarize_demo <- function(x, ..., quosure=TRUE) {
  if(quosure)
    grouping = quos(...)
  else
    grouping = syms(...)
  temp = x %>% 
    group_by(!!!grouping) %>% 
    summarise(proportion = mean(am, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

group_and_summarize_demo(mtcars, cyl)
group_and_summarize_demo(mtcars, "cyl", quosure=F)

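Obviously, in your final code you should pick one of the two and stick with it.

Edit:

If you only ever pass one variable at a time, using the ellipsis seems like overkill and complicates things. Also, your example does not appear to work with multiple variables (group_and_summarize_demo(mtcars, cyl, vs)). You may want to consider the following improvements: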
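
As a side note, the "number of items to replace is not a multiple of replacement length" warnings in the question appear to come from a separate issue: assigning into a list slot with output[i] rather than output[[i]]. The sketch below uses only base R and a small made-up data frame (summary_tbl is a stand-in, not data from the question) to show the difference:

```r
# Stand-in for a summary tibble: a data frame is a list of columns.
summary_tbl <- data.frame(gear = c(3, 4, 5), count = c(15L, 12L, 5L))

output <- vector(mode = "list", length = 2)

# `[` replaces a sub-list. R tries to fit the two columns of summary_tbl
# into one slot, keeps only the first column, and emits the warning.
output[1] <- summary_tbl

# `[[` replaces a single element, so the whole data frame is stored.
output[[2]] <- summary_tbl

class(output[[1]]) # a plain vector (the first column only)
class(output[[2]]) # a data frame
```

So even after switching to syms(), the loop in the question would likely need output[[i]] on the left-hand side to keep each full result.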

library(tidyverse)

group_and_summarize_demo <- function(x, gp_col) {
  gp_col = sym(gp_col)
  temp = x %>% 
    group_by(!!gp_col) %>% 
    summarise("{{gp_col}}_proportion" := mean(am, na.rm = TRUE), 
              "{{gp_col}}_count" := n()) %>% 
    filter(!is.na(!!gp_col))
  temp
}

c("gear", "cyl") %>%  
  map(~group_and_summarize_demo(mtcars, .x)) #try map_dfc() also
#> [[1]]
#> # A tibble: 3 x 3
#>    gear gear_proportion gear_count
#>   <dbl>           <dbl>      <int>
#> 1     3           0             15
#> 2     4           0.667         12
#> 3     5           1              5
#> 
#> [[2]]
#> # A tibble: 3 x 3
#>     cyl cyl_proportion cyl_count
#>   <dbl>          <dbl>     <int>
#> 1     4          0.727        11
#> 2     6          0.429         7
#> 3     8          0.143        14

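Created on 2021-04-27 by the reprex package (v2.0.0)

Here I used the templating feature of dplyr::summarise() via the := operator. I also used purrr::map() in place of the for loop, where the current element of the iteration is referred to as .x.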
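
Since the question also mentions the apply family, here is a rough base-R equivalent using lapply(). It is only a sketch: summarize_by() is a hypothetical helper written for this illustration, with plain output column names to keep it short, and it assumes dplyr and rlang are installed. Naming the input vector with setNames() means each result can be retrieved by column name later.

```r
library(dplyr)

# Hypothetical helper: group by the column named in the string `gp_col`.
summarize_by <- function(x, gp_col) {
  gp <- rlang::sym(gp_col) # convert the string to a symbol
  x %>%
    group_by(!!gp) %>%
    summarise(proportion = mean(am, na.rm = TRUE), count = n()) %>%
    filter(!is.na(!!gp))
}

cars_cols <- c("gear", "cyl")

# lapply() keeps the names of its input, so the result is a named list.
results <- lapply(setNames(cars_cols, cars_cols),
                  function(col) summarize_by(mtcars, col))

names(results)
results$cyl
```

A named list like this may be easier to work with once the number of grouping columns grows to the 15 or so mentioned in the question.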

