r - dplyr sample_n from a single group
Problem description
I have some data where a summary of the number of observations per group looks like this:
# A tibble: 14 x 3
# Groups: status [2]
status year n
<dbl> <dbl> <int>
1 0 2010 4593
2 0 2011 10990
3 0 2012 27711
4 0 2013 99989
5 0 2014 95407
6 0 2015 89010
7 0 2016 72289
8 1 2010 584
9 1 2011 785
10 1 2012 640
11 1 2013 667
12 1 2014 377
13 1 2015 460
14 1 2016 104
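One class has far more observations than the other. How can I randomly sample class 0 without touching class 1? That is, I want to keep all class 1 observations and randomly sample 4593 class 0 observations per year (4593 is the smallest yearly count for class 0).
Using group_by(status, year) and then sample_n() does not work, because 4593 is larger than the number of rows in each class 1 group.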
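The failure described above can be reproduced on a small scale. The sketch below uses a hypothetical toy data frame (toy is not part of the question's data) and wraps the call in tryCatch() so the error is captured rather than stopping the script. It assumes sample_n() with its default replace = FALSE.

```r
library(dplyr)

# Hypothetical toy data: three class-0 years of 5 rows, three class-1 years of 2 rows
toy <- tibble(
  status = rep(c(0, 1), times = c(15, 6)),
  year   = c(rep(2010:2012, each = 5), rep(2010:2012, each = 2))
)

# Requesting 5 rows per group fails because the class-1 groups only have 2 rows
result <- tryCatch(
  toy %>% group_by(status, year) %>% sample_n(size = 5),
  error = function(e) e
)

# TRUE when the grouped sample_n() call raised an error
inherits(result, "error")
```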
A random sample of my data:
structure(list(status = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
year = c(2013, 2014, 2012, 2013, 2016, 2013, 2015, 2014,
2013, 2016, 2015, 2016, 2011, 2014, 2016, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2012, 2016, 2016, 2012, 2016, 2015,
2013, 2014, 2015, 2013, 2015, 2015, 2014, 2015, 2011, 2014,
2013, 2012, 2011, 2016, 2015, 2015, 2015, 2014, 2012, 2013,
2015, 2012, 2015, 2016, 2015, 2013, 2014, 2014, 2014, 2013,
2013, 2016, 2016, 2013, 2015, 2012, 2014, 2014, 2013, 2015,
2014, 2016, 2016, 2014, 2012, 2016, 2013, 2010, 2011, 2014,
2016, 2013, 2016, 2014, 2014, 2013, 2013, 2013, 2016, 2016,
2012, 2014, 2013, 2015, 2016, 2013, 2013, 2015, 2013, 2014,
2013, 2015, 2013, 2013, 2011, 2014, 2016, 2013, 2010, 2012,
2014, 2012, 2011, 2011, 2013, 2015, 2014, 2010, 2010, 2013,
2010, 2014, 2011, 2011, 2014, 2013, 2014, 2015, 2015, 2013,
2014, 2013, 2011, 2013, 2014, 2013, 2011, 2013, 2012, 2015,
2012, 2012, 2012, 2010, 2013, 2013, 2011, 2011, 2011, 2012,
2016, 2013, 2011, 2011, 2012, 2012, 2014, 2010, 2013, 2014,
2011, 2012, 2010, 2012, 2012, 2011, 2015, 2011, 2011, 2013,
2015, 2010, 2015, 2011, 2015, 2015, 2012, 2012, 2013, 2012,
2014, 2014, 2012, 2012, 2014, 2010, 2011, 2013, 2014, 2012,
2013, 2016, 2014, 2012, 2012, 2013, 2010, 2012, 2013, 2014,
2014, 2011)), groups = structure(list(status = c(0, 1), .rows = structure(list(
1:100, 101:200), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), row.names = c(NA, -200L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
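Solution
I think this will work. dat is your example data frame. The code below splits the data frame by status, then uses imap to decide whether each piece needs sampling. If the name of the list element is "0", that piece is sampled within each year; otherwise it is returned unchanged. You can change size = 1 to the minimum group size in your actual data frame (4593 in the question).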
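library(dplyr)
library(purrr)

dat2 <- dat %>%
  # Split into a named list with one data frame per status ("0" and "1")
  split(f = .$status) %>%
  # imap() passes each list element as x and its name as y
  imap(function(x, y){
    # Only downsample the class whose list name is "0"
    if (y %in% "0"){
      x <- x %>%
        group_by(status, year) %>%
        sample_n(size = 1)
    }
    return(x)
  }) %>%
  # Stack the sampled class 0 rows and the untouched class 1 rows
  bind_rows()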
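As an alternative to splitting, the same idea can be sketched with filter() and bind_rows() alone: downsample class 0 within each year, keep class 1 as it is, and stack the two pieces. This is only a sketch under assumptions: toy, n_min, and balanced are hypothetical names, the toy data stands in for the real data frame, and slice_sample() requires dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(42)  # only so the sketch is reproducible

# Hypothetical toy data: class 0 is far more frequent than class 1
toy <- tibble(
  status = c(rep(0, 60), rep(1, 9)),
  year   = c(rep(2010:2012, times = c(10, 20, 30)),
             rep(2010:2012, each = 3))
)

# Smallest yearly count within class 0 (plays the role of 4593 in the question)
n_min <- toy %>%
  filter(status == 0) %>%
  count(year) %>%
  pull(n) %>%
  min()

# Downsample class 0 within each year; leave class 1 untouched
balanced <- bind_rows(
  toy %>%
    filter(status == 0) %>%
    group_by(year) %>%
    slice_sample(n = n_min) %>%
    ungroup(),
  toy %>% filter(status == 1)
)

# Inspect the resulting group sizes
count(balanced, status, year)
```

Because n_min is the smallest class 0 group size, every sampled group has at least that many rows, so the sample-size problem from the question does not arise. The dplyr documentation also states that slice_sample() silently truncates n to the group size when sampling without replacement, so grouping by status and year and calling slice_sample(n = 4593) directly may work as well; verify this on your installed dplyr version before relying on it.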