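r - Randomly assign samples to groups in R
Question
I have a large dataset containing some demographic information for each individual from different cities. I want to create a variable (e.g. class) that assigns individuals in the same age group within a city to groups of about 20 (~15-25) people. Here is R code that generates an example of my data: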
set.seed(10)
ID = seq(1:10000)
df <- as.data.frame(ID)
df$City <- cut(runif(10000, 0,100),breaks = c(0,7,20,35,47,55,61,74,85,91,100),include.lowest = T,right = F, labels = c("City 1","City 2","City 3","City 4","City 5","City 6","City 7","City 8","City 9","City 10"))
df$Age_Group <- cut(runif(10000, 0,100),breaks = c(0,10,20,30,40,50,60,70,80,90,101),include.lowest = T,right = F, labels = c("0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90+"))
table(df$Age_Group,df$City)
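I want df$class to group individuals of the same age group and city. The class values need to keep counting up across all age groups and cities. How can I do this?
Thanks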
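Answer
The caret package can help you with this. It will try to create n partitions while respecting categories such as Age and City. Given the imbalanced nature of the input, it will not be perfect, but you can choose the number of partitions (aka folds) and see what suits your needs; I chose 5. Note that the call below passes only df$Age_Group to createFolds, so the folds are stratified by age group; City is not part of the stratification.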
require(caret)
#> Loading required package: caret
#> Loading required package: lattice
#> Loading required package: ggplot2
set.seed(10)
ID = seq(1:10000)
df <- as.data.frame(ID)
df$City <- cut(runif(10000, 0,100),breaks = c(0,7,20,35,47,55,61,74,85,91,100),include.lowest = T,right = F, labels = c("City 1","City 2","City 3","City 4","City 5","City 6","City 7","City 8","City 9","City 10"))
df$Age_Group <- cut(runif(10000, 0,100),breaks = c(0,10,20,30,40,50,60,70,80,90,101),include.lowest = T,right = F, labels = c("0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90+"))
# table(df$Age_Group, df$City)
# k = 5 folds; list = FALSE returns an integer vector of fold ids, one per row
df$class <- caret::createFolds(df$Age_Group, k = 5, list = FALSE)
table(df$class, df$City, df$Age_Group)
#> , , = 0-9
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 18 27 28 29 15 8 22 21 9 21
#> 2 16 29 31 27 9 10 19 23 12 22
#> 3 12 20 26 26 20 11 30 22 12 18
#> 4 9 27 24 28 13 12 24 31 12 17
#> 5 10 22 36 31 13 13 23 24 11 15
#>
#> , , = 10-19
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 13 22 13 22 11 9 38 18 22 23
#> 2 12 23 34 21 13 7 26 22 16 16
#> 3 14 25 30 25 13 7 30 23 11 12
#> 4 13 29 31 19 22 17 23 16 9 11
#> 5 17 22 24 23 18 20 22 15 9 20
#>
#> , , = 20-29
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 14 28 31 24 12 10 35 22 12 14
#> 2 9 32 22 29 15 9 30 19 18 19
#> 3 18 35 25 17 14 13 22 18 19 21
#> 4 15 26 33 25 11 15 37 20 1 19
#> 5 14 20 31 32 12 14 23 16 18 21
#>
#> , , = 30-39
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 13 28 29 22 24 14 24 19 18 21
#> 2 15 28 31 32 19 14 21 25 16 12
#> 3 17 30 28 22 20 9 22 29 14 21
#> 4 18 26 33 23 10 16 23 24 13 26
#> 5 13 26 40 24 12 8 25 21 20 23
#>
#> , , = 40-49
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 16 26 41 16 19 13 19 18 16 22
#> 2 18 23 36 32 8 12 28 15 16 18
#> 3 19 27 29 23 11 16 33 13 15 21
#> 4 13 21 30 29 18 18 26 19 9 23
#> 5 9 34 27 27 17 9 27 22 11 23
#>
#> , , = 50-59
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 21 28 28 21 15 10 25 26 21 8
#> 2 12 17 24 25 20 20 25 32 14 13
#> 3 19 27 35 30 10 8 19 24 13 17
#> 4 19 23 30 23 19 11 19 25 16 18
#> 5 15 37 38 18 10 15 23 25 9 13
#>
#> , , = 60-69
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 12 29 31 25 14 15 12 27 11 20
#> 2 12 22 29 25 18 14 22 20 11 24
#> 3 11 27 30 21 15 16 22 23 15 16
#> 4 17 21 32 20 12 12 24 28 11 19
#> 5 12 27 37 31 11 11 17 16 17 18
#>
#> , , = 70-79
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 10 23 27 36 13 7 29 20 13 17
#> 2 25 19 27 27 18 8 25 17 10 20
#> 3 12 17 27 26 13 5 34 24 14 23
#> 4 12 28 34 22 15 8 28 21 14 13
#> 5 17 30 40 23 13 11 21 17 7 16
#>
#> , , = 80-89
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 10 27 26 34 17 16 23 19 8 16
#> 2 17 19 33 16 19 19 16 31 12 14
#> 3 14 24 27 23 14 10 25 23 12 23
#> 4 12 25 30 33 14 16 19 14 12 20
#> 5 24 24 25 26 20 6 18 20 13 20
#>
#> , , = 90+
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 16 21 30 25 20 15 31 23 10 11
#> 2 15 25 34 28 16 13 25 19 10 17
#> 3 12 23 30 26 19 14 24 23 13 18
#> 4 13 30 30 24 15 10 23 25 14 18
#> 5 13 16 24 24 23 17 30 23 18 15
Created on 2020-05-08 by the reprex package (v0.3.0)
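The folds above are spread over the whole data set, so each fold contains people from every city. If the goal is groups of about 20 people inside each City and Age_Group combination, with class numbers that continue from one combination to the next, a base R sketch along the following lines may be closer to what the question asks. This is not the answer's method: the helper name assign_classes and its target argument are made up for illustration, and it assumes the data frame has columns named City and Age_Group as in the question. Combinations with fewer than about 30 people end up in a single group, which can fall outside the 15-25 range.

```r
# Hypothetical helper (not from the original post): split every
# City x Age_Group cell into random groups of roughly `target` people.
assign_classes <- function(df, target = 20) {
  # One integer id per City x Age_Group combination that occurs in the data.
  cell <- as.integer(interaction(df$City, df$Age_Group, drop = TRUE))
  # Number of groups per cell, chosen so each group holds about `target` rows.
  n_groups <- pmax(1L, as.integer(round(tabulate(cell) / target)))
  # Offsets make the class numbers continue from one cell to the next.
  offset <- cumsum(c(0L, head(n_groups, -1)))
  cls <- integer(nrow(df))
  for (k in seq_along(n_groups)) {
    idx <- which(cell == k)
    # Balanced labels 1..n_groups[k], shuffled at random within the cell.
    labels <- rep_len(seq_len(n_groups[k]), length(idx))
    cls[idx] <- offset[k] + labels[sample.int(length(idx))]
  }
  cls
}

# Usage with the data frame from the question:
# set.seed(10)                    # for a reproducible assignment
# df$class <- assign_classes(df)
# range(table(df$class))          # smallest and largest group size
```

Within a cell the group sizes differ by at most one person, because the labels are repeated in order before being shuffled. Rounding the number of groups keeps sizes near the target; using ceiling() or floor() instead would bias the groups smaller or larger.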