Summarizing over all possible combinations of variables

Problem description

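I have a data frame of individual-level data for a country. It contains each person's county or city of residence, sex, age, race, and cancer status. I want to aggregate the data into a new data frame that is ordered by county and stratified by age (in categories), sex, and race; that is, I want to create subgroups defined by the combinations of these variables. The original data is structured like the made-up data below.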

    structure(list(Person_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40), County_ID = c(1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6), Age = c(39, 
21, 65, 87, 19, 16, 48, 52, 31, 19, 24, 44, 38, 
39, 40, 27, 69, 71, 52, 53, 80, 23, 
21, 29, 38, 34, 39, 73, 54, 50, 52, 
43, 55, 57, 37, 24, 44, 37, 38, 
40), Sex = c("F", "F", "F", "M", "M", "M", "F", 
"M", "M", "F", "F", "F", "M", "M", "F", "F", "M", "M", "M", "M", 
"M", "F", "F", "F", "M", "F", "F", "M", "M", "M", "F", "F", "F", 
"F", "F", "F", "F", "F", "M", "M"), Race = c(1, 2, 1, 2, 3, 3, 
3, 1, 1, 2, 2, 1, 2, 1, 2, 3, 3, 3, 2, 1, 2, 2, 3, 1, 3, 2, 3, 
1, 2, 3, 3, 1, 2, 2, 2, 3, 1, 1, 2, 2), `Cancer-status` = c(0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)), row.names = c(NA, 
-40L), class = c("tbl_df", "tbl", "data.frame"))

It has a structure similar to this:

Person_ID  County_ID  Age  Sex  Race  Cancer_status
1          1          30        1     1
2          1          41        2     0
3          1          19   F    1     0
4          1          37   F    3     1
5          2          28   F    3     0
6          3          65        1     1

where Cancer_status is a dummy (binary) variable and Race is a factor variable.

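I would like a new data frame in the following format (similar to the structure of pennLC$data in the SpatialEpi package). The cancer and population counts are ordered by county and stratified by three variables (race, sex, and age). The new age variable is a factor (categorical) variable.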

County  cancer  pop_county  Race  Sex  Age
1       0       1492        1     F    Under 40
1       0       365         1     F    40-59
1       1       68          1     F    60-69
1       0       73          1     F    70+
1       0       23351       2     F    Under 40
1       5       12136       2     F    40-59

Thanks,

Tags: r, tidyr

Solution


I'm assuming you want a dplyr solution. Given your sample data, try this:

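library(dplyr)

# DF is the data frame built from the structure(...) call in the question
DF %>%
  # bin Age into [0,40), [40,60), [60,70), [70,Inf); right = FALSE closes each interval on the left
  mutate(Age = cut(Age, c(0, 40, 60, 70, Inf), right = FALSE)) %>%
  group_by(County_ID, Race, Sex, Age) %>%
  # number of cancer cases and number of people in each stratum
  summarize(cancer = sum(`Cancer-status`), pop_county = n()) %>%
  ungroup()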
# # A tibble: 37 x 6
#    County_ID  Race Sex   Age      cancer pop_county
#        <dbl> <dbl> <chr> <fct>     <dbl>      <int>
#  1         1     1 F     [0,40)        0          1
#  2         1     1 F     [60,70)       0          1
#  3         1     2 F     [0,40)        0          1
#  4         1     2 M     [70,Inf)      0          1
#  5         1     3 M     [0,40)        0          1
#  6         2     1 M     [0,40)        1          1
#  7         2     1 M     [40,60)       0          1
#  8         2     2 F     [0,40)        1          2
#  9         2     3 F     [40,60)       0          1
# 10         2     3 M     [0,40)        0          1
# # ... with 27 more rows
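
The summary above only contains the strata that actually occur in the data. If a row is wanted for every possible combination of the grouping variables, one option is `complete()` from tidyr. The sketch below is only an illustration on a small made-up data frame; the names `toy`, `counts`, and `all_strata` and the column `Cancer` are hypothetical and not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical individual-level data (not the asker's data)
toy <- tibble(
  County_ID = c(1, 1, 2),
  Sex       = c("F", "M", "F"),
  Cancer    = c(0, 1, 1)
)

# Same kind of summary as in the answer: one row per observed stratum
counts <- toy %>%
  group_by(County_ID, Sex) %>%
  summarize(cancer = sum(Cancer), pop_county = n(), .groups = "drop")

# complete() adds the combinations missing from the summary
# (here County_ID 2 with Sex "M") and fills their counts with 0 instead of NA
all_strata <- counts %>%
  complete(County_ID, Sex, fill = list(cancer = 0, pop_county = 0L))

print(all_strata)
```

Applied to the answer's result, the analogous call would be `complete(County_ID, Race, Sex, Age, fill = list(cancer = 0, pop_county = 0L))`, which should give one row for each of the 6 x 3 x 2 x 4 = 144 combinations in the sample data.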

You'll need to relabel the levels of the Age factor.
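
One way to do that relabelling, as a sketch rather than the answerer's own code, is to pass `labels` to `cut()` when the bins are created. The label text is taken from the desired output in the question; the data frame `ages` and the column `Age_group` are hypothetical names used only for this illustration.

```r
library(dplyr)

# Hypothetical ages, chosen to hit every band including the boundaries
ages <- tibble(Age = c(19, 39, 40, 59, 60, 69, 70, 87))

relabelled <- ages %>%
  mutate(Age_group = cut(
    Age,
    breaks = c(0, 40, 60, 70, Inf),
    labels = c("Under 40", "40-59", "60-69", "70+"),
    right  = FALSE  # intervals are [lower, upper), so 40 falls in "40-59"
  ))

print(relabelled)
```

In the answer's pipeline, the same `labels` argument could be added to the existing `cut()` call inside `mutate()`, so the grouped output carries the readable labels directly.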

