How to handle empty subsets in R (dplyr)

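Problem description

I want to calculate the average volume per group by subsetting my data frame between two date ranges.

I have noticed that not all products have records between these two dates, and I need some way to handle them: right now they come through as blanks, and ideally they would be filled with 0 instead.

Here is a reproducible example: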

product <- c("A","A","A","A","B","B","B","B","C")
date1 <- c("2020-06-20","2020-06-26","2020-08-20","2020-08-22","2020-06-26","2020-06-27","2020-08-21","2020-08-22","2019-06-20")
winA_start <- rep("2020-06-24", times = 9)
winA_end <- rep("2020-06-30", times = 9)
winB_start <- rep("2020-08-19", times = 9)
winB_end <- rep("2020-08-26", times = 9)
vol <- c(200,201,600,800,800,1000,50,100,208)
x <- data.frame(product,date1,winA_start,winA_end,winB_start,winB_end,vol)
x[,2:6] <- lapply(x[,2:6], as.Date)

I want to run the following code to calculate the averages for window A and window B:

library(dplyr)

y <- x %>% 
  group_by(product) %>% 
  mutate(WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), 
         WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digits = 0))

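This works fine except for product C, whose date1 falls in neither window, so both averages come back as NaN.

Instead of returning NaN, is there a way to catch these cases and return 0 gracefully?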


Tags: r, dplyr, subset

Solution


If you just want to change the missing results to zero, you can do that in the same mutate() statement.

y <- x %>% 
  group_by(product) %>% 
  mutate(WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), 
         WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digits = 0), 
         WIN_A_AVG = case_when(is.na(WIN_A_AVG) ~ 0, 
                               TRUE ~ WIN_A_AVG), 
         WIN_B_AVG = case_when(is.na(WIN_B_AVG) ~ 0, 
                               TRUE ~ WIN_B_AVG))

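Here, case_when() simply says that when the result is missing, replace it with zero, and otherwise leave it unchanged. Note that is.na() is TRUE for both NA and NaN, so this also turns an average that is NA because of missing values in vol into 0. If you only want to catch the cases where a window contains no observations, you need something like the code below. It first counts the observations in each window and then replaces the average with 0 only where that count is zero. This is the better choice if vol may contain missing data, because the average is then NA and you may want to keep that NA as a check.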

y <- x %>% 
  group_by(product) %>% 
  mutate(WIN_A_N = length(vol[date1 >= winA_start & date1 <= winA_end]), 
         WIN_B_N = length(vol[date1 >= winB_start & date1 <= winB_end]), 
         WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), 
         WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digits = 0), 
         WIN_A_AVG = case_when(WIN_A_N == 0 ~ 0, 
                               TRUE ~ WIN_A_AVG), 
         WIN_B_AVG = case_when(WIN_B_N == 0 ~ 0, 
                               TRUE ~ WIN_B_AVG)) %>%
  select(-WIN_A_N, -WIN_B_N)
y
# A tibble: 9 x 9
# Groups:   product [3]
#  product date1      winA_start winA_end   winB_start winB_end     vol WIN_A_AVG WIN_B_AVG
#  <chr>   <date>     <date>     <date>     <date>     <date>     <dbl>     <dbl>     <dbl>
#1 A       2020-06-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26   200       201       700
#2 A       2020-06-26 2020-06-24 2020-06-30 2020-08-19 2020-08-26   201       201       700
#3 A       2020-08-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26   600       201       700
#4 A       2020-08-22 2020-06-24 2020-06-30 2020-08-19 2020-08-26   800       201       700
#5 B       2020-06-26 2020-06-24 2020-06-30 2020-08-19 2020-08-26   800       900        75
#6 B       2020-06-27 2020-06-24 2020-06-30 2020-08-19 2020-08-26  1000       900        75
#7 B       2020-08-21 2020-06-24 2020-06-30 2020-08-19 2020-08-26    50       900        75
#8 B       2020-08-22 2020-06-24 2020-06-30 2020-08-19 2020-08-26   100       900        75
#9 C       2019-06-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26   208         0         0
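
The NaN in the question comes from base R itself: mean() of a zero-length numeric vector returns NaN. As a rough sketch, the "count first, then average" idea can also be wrapped in a small helper so the window condition is written only once per column. The names window_mean, winA, winB, and y2 are hypothetical and introduced only for this illustration; they are not part of dplyr or of the original answer, and the window bounds are stored as two small vectors instead of data frame columns to keep the example short.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): average `values` over the
# elements flagged by the logical vector `in_window`; return `empty` when
# nothing is flagged, because mean() of an empty vector would be NaN.
window_mean <- function(values, in_window, empty = 0) {
  selected <- values[in_window]
  if (length(selected) == 0) {
    return(empty)
  }
  # An NA inside `values` still yields NA here, so real gaps stay visible.
  round(mean(selected), digits = 0)
}

# Same products, dates, and volumes as in the question.
x <- data.frame(
  product = c("A", "A", "A", "A", "B", "B", "B", "B", "C"),
  date1 = as.Date(c("2020-06-20", "2020-06-26", "2020-08-20", "2020-08-22",
                    "2020-06-26", "2020-06-27", "2020-08-21", "2020-08-22",
                    "2019-06-20")),
  vol = c(200, 201, 600, 800, 800, 1000, 50, 100, 208)
)

# Assumption: window bounds kept as (start, end) pairs, not as columns.
winA <- as.Date(c("2020-06-24", "2020-06-30"))
winB <- as.Date(c("2020-08-19", "2020-08-26"))

y2 <- x %>%
  group_by(product) %>%
  mutate(
    WIN_A_AVG = window_mean(vol, date1 >= winA[1] & date1 <= winA[2]),
    WIN_B_AVG = window_mean(vol, date1 >= winB[1] & date1 <= winB[2])
  ) %>%
  ungroup()
```

With this data, y2 should carry the same WIN_A_AVG and WIN_B_AVG values as the output above, including 0 for product C. Passing empty = NA_real_ instead of the default would keep empty windows distinguishable from a true average of zero, if that matters for later analysis.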
