r - How to summarise and count non-missing, non-zero, non-unique values within groups in R?
Question
I have the following dataset:
df1 <- structure(list(group_id = c(3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 16, 16, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27,
29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29),
score = c(35, 0, 37.5, 51.9, 43, 41, 36.9, 44.4, 27.5, 41.5,
60, 39.4, 39.5, 50, 55, 57.8, 44.7, 60.2, 40.4, 62.5, 61.1,
53.9, 67.2, 43.9, 37.6, 58.4, 34.1, 56.4, 41.5, 54.4, 50.3,
36.8, 41.4, 37.2, 51.3, 50.7, 75.4, 62.9, NA, 54.5, 53.9,
59.5, 24.5, 22.7, 53, 35.8, 28, 39.4, 44.5, NA, NA, 55.9,
52.5, 36, 43.5, 42.9, 25.5, 35, 46, NA, 60.2, 65.6, 30.5,
37.1, 49.1, 70.4, 34.1, 45.4, 30.8, 38.6, 28.7, 39.8, 38.5,
0, 72.6, 0, NA, 54.6, 0, 69.8, 31.6, 55.9, 47.3, 34.3, 0,
40.8, 69.7, 61.5, 48.6, 59.3, 0, 67.2, 52, 57, 0, NA, 0,
51.7, 47.1, 0)), row.names = c(NA, -100L), groups = structure(list(
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L,
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L,
43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L,
54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L,
65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L,
76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L,
87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L,
98L, 99L, 100L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -100L), class = c("tbl_df",
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"))
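I need to create a new dataset with a single row per group. The information I need is summarised in two columns. The first column shows the mean score per group; only non-zero, non-missing values should enter the mean. The second column holds the number of non-zero, non-missing scores per group.
I don't want a distinct/unique count but a plain count: if two scores share the same non-missing, non-zero value, both must still be counted.
The expected result is: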
df2 <- structure(list(group_id = c(3L, 10L, 16L, 26L, 27L, 29L), score = c(43.04,
49.56, 44.86, 49.05, 32.28, 54.18), n_individuals = c(14L, 20L,
21L, 8L, 5L, 17L)), class = "data.frame", row.names = c(NA, -6L
))
What I tried:
library(dplyr)
df2 <- df1 %>%
  mutate(score = case_when(
    score == 0 ~ NA_real_,   # recode zeros as missing
    TRUE ~ score)) %>%
  group_by(group_id) %>%     # group by group_id
  summarise(score = mean(score, na.rm = TRUE),   # mean score
            n_individuals = count(score))        # n of individuals with valid score
What I get:
Error: Problem with `summarise()` input `n_inviduals`. x no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')" i Input `n_inviduals` is `count(score)`. i The error occured in group 1: group_id = 3.
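Answer
count expects a tibble or data.frame as its input, not a numeric vector, which is why it fails inside summarise. Here we can use n() if we want the total number of rows per group. If we want the number of non-NA elements in 'score', create a logical vector with is.na and take its sum: TRUE counts as 1 and FALSE as 0, so sum returns the number of TRUE values.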
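As a minimal base-R sketch of that counting idea, using a small made-up vector (the values and the names valid, n_valid, n_distinct and mean_valid are illustrative, not taken from the question's data):

```r
# Hypothetical toy scores: one zero, one NA, and a duplicated value
score <- c(35, 0, NA, 51.9, 51.9)

# TRUE where the score is neither missing nor zero
valid <- !is.na(score) & score != 0

# sum() coerces TRUE to 1 and FALSE to 0, so this counts the valid scores;
# the duplicated 51.9 is counted twice
n_valid <- sum(valid)

# For contrast, a distinct count would collapse the duplicate
n_distinct <- length(unique(score[valid]))

# Mean over the valid scores only
mean_valid <- mean(score[valid])
```

Here n_valid is 3 while n_distinct is 2, which is the difference between the plain count asked for in the question and a unique count.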
library(dplyr)
df1 %>%
  ungroup() %>%              # drop the rowwise structure stored in df1
  mutate(score = case_when(
    score == 0 ~ NA_real_,   # recode zeros as missing
    TRUE ~ score)) %>%
  group_by(group_id) %>%
  summarise(n_individuals = sum(!is.na(score) & score != 0),
            score = mean(score, na.rm = TRUE))
Output:
# A tibble: 6 x 3
# group_id n_individuals score
#* <dbl> <int> <dbl>
#1 3 14 43.0
#2 10 20 49.6
#3 16 21 44.9
#4 26 8 49.0
#5 27 5 35.3
#6 29 17 54.2
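An alternative that is sometimes easier to read is to drop the invalid rows first and then count with n(). This is only a sketch on a made-up tibble (toy and res are hypothetical names), not the answer's method, and it assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data: group 1 has a zero, group 2 has an NA and a duplicate
toy <- tibble(
  group_id = c(1, 1, 1, 2, 2, 2),
  score    = c(10, 0, 20, NA, 30, 30)
)

res <- toy %>%
  filter(!is.na(score), score != 0) %>%   # keep only valid scores
  group_by(group_id) %>%
  summarise(score = mean(score),          # mean of the remaining scores
            n_individuals = n())          # rows left per group, duplicates included
```

One design caveat: a group with no valid scores at all disappears from the result with this approach, whereas the sum(!is.na(score)) version keeps it with a count of 0.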