How can I summarise and count non-missing, non-zero, non-unique values within groups in R?

Problem description

I have the following dataset:

df1 <- structure(list(group_id = c(3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 
29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 
29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29), 
    score = c(35, 0, 37.5, 51.9, 43, 41, 36.9, 44.4, 27.5, 41.5, 
    60, 39.4, 39.5, 50, 55, 57.8, 44.7, 60.2, 40.4, 62.5, 61.1, 
    53.9, 67.2, 43.9, 37.6, 58.4, 34.1, 56.4, 41.5, 54.4, 50.3, 
    36.8, 41.4, 37.2, 51.3, 50.7, 75.4, 62.9, NA, 54.5, 53.9, 
    59.5, 24.5, 22.7, 53, 35.8, 28, 39.4, 44.5, NA, NA, 55.9, 
    52.5, 36, 43.5, 42.9, 25.5, 35, 46, NA, 60.2, 65.6, 30.5, 
    37.1, 49.1, 70.4, 34.1, 45.4, 30.8, 38.6, 28.7, 39.8, 38.5, 
    0, 72.6, 0, NA, 54.6, 0, 69.8, 31.6, 55.9, 47.3, 34.3, 0, 
    40.8, 69.7, 61.5, 48.6, 59.3, 0, 67.2, 52, 57, 0, NA, 0, 
    51.7, 47.1, 0)), row.names = c(NA, -100L), groups = structure(list(
    .rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
        10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
        21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
        32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 
        43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 
        54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 
        65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 
        76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L, 
        87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 
        98L, 99L, 100L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl", 
"data.frame"))

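I need to create a new dataset with exactly one row per group. The information I need is summarised in two columns. The first column gives the mean score per group; only non-zero, non-missing values should enter the mean. The second column gives the number of non-zero, non-missing scores per group.

I do not want a distinct/unique count, just a plain count: if two scores share the same non-missing, non-zero value, they must still be counted twice.

The expected result is: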

df2 <- structure(list(group_id = c(3L, 10L, 16L, 26L, 27L, 29L), score = c(43.04, 
49.56, 44.86, 49.05, 35.28, 54.18), n_individuals = c(14L, 20L, 
21L, 8L, 5L, 17L)), class = "data.frame", row.names = c(NA, -6L
))

What I tried:

library(dplyr)
df2 <- df1 %>%
   mutate(score = case_when( 
      score == 0 ~ NA_real_,                                #recode zeros as missing values
      TRUE ~ score)) %>%                             
   group_by(group_id) %>%                                   #group by group_id
   summarise(score = mean(score, na.rm = TRUE),             #mean score
                    n_individuals = count(score))           #n of individuals with valid score

What I get:

Error: Problem with `summarise()` input `n_inviduals`. x no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')" i Input `n_inviduals` is `count(score)`. i The error occured in group 1: group_id = 3.


Solution


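`count()` expects a tibble or data.frame as its input, not a numeric vector, which is why calling it on `score` inside `summarise()` fails. Inside `summarise()`, use `n()` if you want the total number of rows in each group. If you want the number of non-NA elements of `score` instead, build a logical vector with `is.na()` and add it up with `sum()`: TRUE counts as 1 and FALSE as 0, so the sum is the number of TRUEs. Because `df1` is stored as a `rowwise_df`, the code calls `ungroup()` before grouping by `group_id`.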
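To see why summing a logical vector gives a count, here is a minimal base R sketch. The vector `x` and the names `valid`, `n_valid` and `mean_valid` are made up for illustration and are not taken from `df1`.

```r
# Hypothetical toy vector with a zero, a missing value and a duplicate
x <- c(35, 0, NA, 41.5, 41.5)

# TRUE where the value is present and non-zero
valid <- !is.na(x) & x != 0

# TRUE counts as 1 and FALSE as 0, so sum() counts the valid entries;
# the duplicated 41.5 is counted twice, as the question requires
n_valid <- sum(valid)

# Mean over the valid entries only
mean_valid <- mean(x[valid])
```

The same logical expression is what `sum()` receives per group in the dplyr pipeline.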

library(dplyr)
df1 %>%
  ungroup() %>%                                   # df1 is a rowwise_df; drop that grouping first
  mutate(score = case_when(
    score == 0 ~ NA_real_,                        # recode zeros as missing values
    TRUE ~ score)) %>%
  group_by(group_id) %>%
  summarise(n_individuals = sum(!is.na(score) & score != 0),  # count of valid scores, duplicates included
            score = mean(score, na.rm = TRUE))                # mean of the valid scores

Output:

# A tibble: 6 x 3
#  group_id n_individuals score
#*    <dbl>         <int> <dbl>
#1        3            14  43.0
#2       10            20  49.6
#3       16            21  44.9
#4       26             8  49.0
#5       27             5  35.3
#6       29            17  54.2
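An alternative, sketched here on a small made-up data frame (`toy` and `res` are hypothetical names, not part of the question), is to drop the invalid rows first and then count with `n()`. One difference to keep in mind: a group whose scores are all zero or missing would disappear from the result instead of showing a count of 0.

```r
library(dplyr)

# Hypothetical miniature of df1: two groups with zeros, an NA and a duplicate
toy <- tibble(
  group_id = c(1, 1, 1, 1, 2, 2, 2),
  score    = c(40, 0, NA, 40, 10, 30, 0)
)

res <- toy %>%
  filter(!is.na(score), score != 0) %>%  # keep only valid scores
  group_by(group_id) %>%
  summarise(
    n_individuals = n(),   # rows left in the group; duplicates count separately
    score = mean(score)    # no na.rm needed, the NAs are already filtered out
  )
```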
