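r - Conditional summarise based on conditions and a group repeating every month, changing the date interval range with dplyr
Problem description
I am trying to create a column with case_when for each id that meets the following conditions, computed with summarise: the total amount (in a specific month) is at least 10, and there are at least two distinct dates (in a specific month).
The idea is to create a new column named 2020-01 that is 1 if these conditions are met and 0 otherwise.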
library(dplyr)
df <- data.frame(
date = as.Date(c("2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02", "2020-01-05", "2020-01-08", "2020-02-18", "2020-02-18", "2020-03-01", "2020-03-02", "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02")),
id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
amount = c(1, 5, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 2, 5, 5)
)
To do this, I can create a vector containing all ids that meet the condition, as follows:
df_2020_01 <- df %>%
filter(date >= as.Date("2020-01-01") & date <= as.Date("2020-01-31")) %>%
group_by(id) %>%
summarise(
amount_sum = sum(amount),
date_distinct = n_distinct(date)
) %>%
ungroup() %>%
filter(amount_sum >= 10 & date_distinct >= 2) %>%
pull(id) # pull() returns the ids as a plain vector; select() would return a data frame
With this vector, I can build an overview of all ids, using case_when to assign 1 if the id meets the condition:
df_overview <- df %>%
distinct(id) %>%
mutate(`2020-01` =
case_when(id %in% df_2020_01 ~ 1,
TRUE ~ 0))
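Now I would like to continue this exercise and create an additional column 2020-02, but with one difference: the date interval range (defined above as 2020-01-01 to 2020-01-31) should vary. For ids that met the condition in the first month (2020-01), the count of amount_sum and date_distinct starts from scratch (from 2020-02-01 to 2020-02-29). For ids that did not meet the condition in the first month (A and C), the count of amount_sum and date_distinct should run from the beginning (i.e. 2020-01-01 to 2020-02-29).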
In this case, id A would meet the condition, because between 2020-01-01 and 2020-02-29 it has amount_sum = 16 and date_distinct = 3.
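The figures quoted for id A can be checked directly. This is a minimal sketch that rebuilds the sample df from the question so it runs on its own; check_feb is a name introduced here only for illustration.

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(
  date = as.Date(c(
    "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02",
    "2020-01-05", "2020-01-08", "2020-02-18", "2020-02-18", "2020-03-01", "2020-03-02",
    "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02"
  )),
  id = rep(c("A", "B", "C"), each = 6),
  amount = c(1, 5, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 2, 5, 5)
)

# Two-month window (January and February) for the ids that missed January
check_feb <- df %>%
  filter(id %in% c("A", "C"),
         date >= as.Date("2020-01-01"),
         date <= as.Date("2020-02-29")) %>%
  group_by(id) %>%
  summarise(amount_sum = sum(amount),
            date_distinct = n_distinct(date),
            .groups = "drop")

check_feb
# id A: amount_sum = 16, date_distinct = 3
# id C: amount_sum = 14, date_distinct = 3
```

Both ids clear the two thresholds (sum >= 10, at least two distinct dates) once January is included in the window.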
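The idea is to continue this exercise, but the maximum interval should be two months. This means that for the third column, 2020-03, if the id did not meet the requirements in 2020-01 and 2020-02, the date interval range should be 2020-02-01 to 2020-03-31. If it met them in 2020-01 but not in 2020-02, the same range applies (2020-02-01 to 2020-03-31). But if the id met the requirements in 2020-02, the date interval range is only 2020-03-01 to 2020-03-31.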
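To recap: I need to create a data frame with unique ids and one year-month column for every month contained in my dataset; each column should receive 1 if these conditions are met (0 otherwise):
- amount_sum (in the specific month) >= 10 and date_distinct (in the specific month) >= 2 (group_by = id).
- The date interval range should be 1 or 2 months (depending on whether the condition was met in the previous month).
- If the condition was met in the previous month, amount_sum and date_distinct for the next month should start from zero (one month, the month under analysis). If not, the date interval range for amount_sum and date_distinct should be two months.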
Desired output:
id  2020-01  2020-02  2020-03
A   0        1        0
B   1        0        1
C   0        1        1
I hope I have explained my problem clearly enough. Thanks in advance!
Solution
Revised new answer (starting after 2 months)
library(tidyverse)
library(lubridate)
df <- data.frame(
date = as.Date(c("2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02", "2020-01-05", "2020-01-08", "2020-02-18", "2020-02-18", "2020-03-01", "2020-03-02", "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02")),
id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
amount = c(1, 5, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 2, 5, 5)
)
# function to calculate if condition is met for a given months range
calc_id <- function(.dat, m1, m2 = NULL) {
extr_date <- m1
if(is.null(m2)) {
m2 <- extr_date
} else {
# the window covers m2 calendar months, ending with the month of m1
m2 <- extr_date %m-% months(m2 - 1)
}
dat_end <- extr_date %m+% months(1)
dat_start <- m2
temp1 <- .dat %>%
filter(date < dat_end,
date >= dat_start)
if (nrow(temp1) == 0) return(NA)
temp2 <- temp1 %>%
summarise(
amount_sum = sum(amount),
date_distinct = n_distinct(date)
) %>%
filter(amount_sum >= 10 & date_distinct >= 2)
if (nrow(temp2) > 0) {
return(1)
} else {
return(0)
}
}
# function which decides which months range to choose
comb_calc <- function(.dat, m, mdiff) {
lag_date <- m %m-% months(1)
lag_date2 <- m %m-% months(2)
# added condition to return NA if one of the two preceding months is NA
if (is.na(calc_id(.dat, lag_date2)) || is.na(calc_id(.dat, lag_date))) {
return(NA)
} else if (calc_id(.dat, lag_date) == 0) {
calc_id(.dat, m1 = m, m2 = mdiff)
} else {
calc_id(.dat, m1 = m)
}
}
# rearrange data
df %>%
nest_by(id) %>%
crossing(Date = floor_date(df$date, "month")) %>%
rowwise(id) %>%
# call comb_calc and choose number of months (here 2)
mutate(res = comb_calc(data, Date, 2)) %>%
select(-data) %>%
pivot_wider(names_from = Date,
values_from = res) %>%
rename_with(~ str_sub(., 1, 7), matches("^\\d{4}-\\d{2}"))
#> # A tibble: 3 x 4
#> id `2020-01` `2020-02` `2020-03`
#> <chr> <dbl> <dbl> <dbl>
#> 1 A NA NA 0
#> 2 B NA NA 1
#> 3 C NA NA 1
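Created on 2020-06-29 by the reprex package (v0.3.0)
New answer (works for a custom number of months)
To handle not just two months but any number of months, I changed the approach. It makes use of two custom functions.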
library(tidyverse)
library(lubridate)
df <- data.frame(
date = as.Date(c("2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02", "2020-01-05", "2020-01-08", "2020-02-18", "2020-02-18", "2020-03-01", "2020-03-02", "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02")),
id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
amount = c(1, 5, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 2, 5, 5)
)
# function to calculate if condition is met for a given months range
calc_id <- function(.dat, m1, m2 = NULL) {
extr_date <- m1
if(is.null(m2)) {
m2 <- extr_date
} else {
# the window covers m2 calendar months, ending with the month of m1
m2 <- extr_date %m-% months(m2 - 1)
}
dat_end <- extr_date %m+% months(1)
dat_start <- m2
temp1 <- .dat %>%
filter(date < dat_end,
date >= dat_start)
if (nrow(temp1) == 0) return(NA)
temp2 <- temp1 %>%
summarise(
amount_sum = sum(amount),
date_distinct = n_distinct(date)
) %>%
filter(amount_sum >= 10 & date_distinct >= 2)
if (nrow(temp2) > 0) {
return(1)
} else {
return(0)
}
}
# function which decides which months range to choose
comb_calc <- function(.dat, m, mdiff) {
lag_date <- m %m-% months(1)
if (!is.na(calc_id(.dat, lag_date)) && calc_id(.dat, lag_date) == 0) {
calc_id(.dat, m1 = m, m2 = mdiff)
} else {
calc_id(.dat, m1 = m)
}
}
# rearrange data
df %>%
nest_by(id) %>%
crossing(Date = floor_date(df$date, "month")) %>%
rowwise(id) %>%
# call comb_calc and choose number of months (here 2)
mutate(res = comb_calc(data, Date, 2)) %>%
select(-data) %>%
pivot_wider(names_from = Date,
values_from = res,
values_fill = 0) %>%
rename_with(~ str_sub(., 1, 7), matches("^\\d{4}-\\d{2}"))
#> # A tibble: 3 x 4
#> id `2020-01` `2020-02` `2020-03`
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0 1 0
#> 2 B 1 0 1
#> 3 C 0 1 1
Created on 2020-06-29 by the reprex package (v0.3.0)
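Note that comb_calc decides the window by re-checking the previous month on its own one-month window. The recap in the question can also be read as carrying each month's resulting flag forward, so that a month passed on a two-month window resets the count. The following is a rough sketch of that reading in base R, not the answer's method; flag_months and month_starts are hypothetical names, and the months are assumed to be consecutive.

```r
# Same sample data as in the question
df <- data.frame(
  date = as.Date(c(
    "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02",
    "2020-01-05", "2020-01-08", "2020-02-18", "2020-02-18", "2020-03-01", "2020-03-02",
    "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02"
  )),
  id = rep(c("A", "B", "C"), each = 6),
  amount = c(1, 5, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 2, 5, 5)
)

# First day of every month covered by the data (assumed consecutive)
first_days <- as.Date(format(df$date, "%Y-%m-01"))
month_starts <- seq(min(first_days), max(first_days), by = "month")

# Hypothetical helper: walk through the months of one id in order and
# remember whether the previous month's flag was 1.
flag_months <- function(dat, month_starts) {
  res <- integer(length(month_starts))
  met_prev <- TRUE  # the first month has nothing to look back on
  for (i in seq_along(month_starts)) {
    # one-month window after a success, otherwise previous month + this month
    start <- if (met_prev) month_starts[i] else month_starts[i - 1]
    end <- seq(month_starts[i], by = "1 month", length.out = 2)[2]
    win <- dat[dat$date >= start & dat$date < end, ]
    res[i] <- as.integer(sum(win$amount) >= 10 && length(unique(win$date)) >= 2)
    met_prev <- res[i] == 1
  }
  res
}

# One row per id, one column per year-month
res <- t(sapply(split(df, df$id), flag_months, month_starts = month_starts))
colnames(res) <- format(month_starts, "%Y-%m")
res
#   2020-01 2020-02 2020-03
# A       0       1       0
# B       1       0       1
# C       0       1       1
```

On the sample data both readings give the same table; they can differ when an id passes a month only thanks to the two-month window, as C does in 2020-02.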
Old answer (works for a two-month window)
library(tidyverse)
df <- data.frame(
date = as.Date(c("2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02", "2020-01-05", "2020-01-08", "2020-02-18", "2020-02-18", "2020-03-01", "2020-03-02", "2020-01-01", "2020-01-01", "2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02")),
id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
amount = c(1, 5, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 2, 5, 5)
)
calc_id <- function(.dat) {
.dat %>%
group_by(id) %>%
summarise(
amount_sum = sum(amount),
date_distinct = n_distinct(date)
) %>%
ungroup() %>%
filter(amount_sum >= 10 & date_distinct >= 2) %>%
pull(id)
}
df %>%
mutate(month = paste(lubridate::year(date), lubridate::month(date), sep = "-")) %>%
nest_by(month) %>%
ungroup() %>%
mutate(data2 = lag(data)) %>%
rowwise(month) %>%
mutate(data2 = list(bind_rows(data, data2)),
res = list(calc_id(data)),
id = list(calc_id(data2))) %>%
ungroup() %>%
mutate(res2 = lag(res, default = list(""))) %>%
unnest(res) %>%
unnest(res2) %>%
unnest(id) %>%
filter(! id == res2) %>%
select(month, id) %>%
distinct() %>%
mutate(val = 1) %>%
pivot_wider(names_from = month,
values_from = val,
values_fill = 0) %>%
arrange(id)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 4
#> id `2020-1` `2020-2` `2020-3`
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0 1 0
#> 2 B 1 0 1
#> 3 C 0 1 1
Created on 2020-06-27 by the reprex package (v0.3.0)
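One caveat if the old answer is applied to a longer period: its month key is built with paste(), giving labels such as 2020-1, and lag(data) relies on the nested rows being in chronological order. Keys that are not zero-padded sort as text, so October would land before February. A small sketch of the difference, using made-up dates and base R only:

```r
# Made-up dates spanning single- and double-digit months
dates <- as.Date(c("2020-02-15", "2020-10-03", "2020-01-20"))

# Key in the style of paste(year, month, sep = "-"): not zero-padded
unpadded <- paste(as.integer(format(dates, "%Y")),
                  as.integer(format(dates, "%m")),
                  sep = "-")

# Zero-padded key: text order equals chronological order
padded <- format(dates, "%Y-%m")

sort(unpadded, method = "radix")  # "2020-1"  "2020-10" "2020-2"
sort(padded, method = "radix")    # "2020-01" "2020-02" "2020-10"
```

With the three months in the sample data the order is unaffected, so the output shown above stays the same.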