r - Calculating a sum based on date differences by grouping on 2 or more columns
Problem description
Suppose I have a dataset like the one below:
| id | Date | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334 | 6/15/2018 | Simon | NA | 1948 | 0 |
| 334 | 6/20/2018 | Simon | 5 | 4290 | 6238 |
| 334 | 8/17/2018 | Simon | 58 | 4260 | 8550 |
| 334 | 8/20/2018 | Simon | 3 | 79 | 4339 |
| 334 | 8/7/2018 | Wang | NA | 2145 | 0 |
| 334 | 8/9/2018 | Wang | 2 | 4192 | 6337 |
| 5006 | 3/4/2019 | Wang | NA | 1700 | 0 |
| 5006 | 3/7/2019 | Wang | 3 | 335 | 2035 |
| 5006 | 5/5/2019 | Wang | 59 | 4400 | 4735 |
| 5006 | 5/9/2019 | Wang | 4 | 2700 | 7100 |
| 5006 | 5/14/2019 | Wang | 5 | 4355 | 7055 |
| 5006 | 5/17/2019 | Wang | 3 | 3100 | 7455 |
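I need to get the transactions made by the same buyer under the same ID where the sum of the amounts on consecutive rows is >= 5000 and the two dates are at most 5 days apart (<= 5 days). For example, in the dataset above, Simon has transactions on 6/15/2018 and 6/20/2018, which are 5 days apart, and their ConsecutiveSum is also >= 5000. The transactions on 8/17/2018 and 8/20/2018 are also within 5 days of each other, but their ConsecutiveSum is not >= 5000, so I do not want those transactions in the output. In addition, Wang's transactions on 5/5/2019 and 5/9/2019 are within 5 days of each other, but following the post "Calculating the sum of a column if the difference between consecutive rows meets a condition" I only get the 5/9/2019 transaction and not the 5/5/2019 one. How can I restructure the code to include such transactions?
Here is the code I have so far: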
library(dplyr)

df <- data.frame(
  id = c("334","334","334","334","334","334","5006","5006","5006","5006","5006","5006"),
  Date = c("6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/7/2018","8/9/2018","3/4/2019",
           "3/7/2019","5/5/2019","5/9/2019","5/14/2019","5/17/2019"),
  Buyer = c("Simon", "Simon", "Simon", "Simon", "Chang", "Chang", "Chang", "Chang", "Chang",
            "Chang","Chang","Chang"),
  diff = c("NA","5","58","3","NA","2","NA","3","59","4","5","3"),
  Amount = c("1948","4290","4260","79","2145","4192","1700","335","4400","2700","4355","3100"),
  ConsecutiveSum = c("0","6238","8550","4339","0","6337","0","2035","4735","7100","7055","7455"),
  stringsAsFactors = FALSE)
df$Date <- as.Date(df$Date, '%m/%d/%Y')
df$Amount <- as.numeric(df$Amount)
df$diff <- as.numeric(df$diff)
df$ConsecutiveSum <- as.numeric(df$ConsecutiveSum)
# Attempt: add each row's Amount to the previous one within every Buyer/id group
df_sum <- df %>%
  group_by(Buyer, id) %>%
  mutate(rank = dense_rank(Date)) %>%
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  filter(diff <= 5 & ConsecutiveSum >= 5000 | ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000)
My expected output should look like this:
| id | Date | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334 | 6/15/2018 | Simon | NA | 1948 | 0 |
| 334 | 6/20/2018 | Simon | 5 | 4290 | 6238 |
| 334 | 8/7/2018 | Wang | NA | 2145 | 0 |
| 334 | 8/9/2018 | Wang | 2 | 4192 | 6337 |
| 5006 | 5/5/2019 | Wang | 59 | 4400 | 4735 |
| 5006 | 5/9/2019 | Wang | 4 | 2700 | 7100 |
| 5006 | 5/14/2019 | Wang | 5 | 4355 | 7055 |
| 5006 | 5/17/2019 | Wang | 3 | 3100 | 7455 |
Solution
Here is one possibility using the helper variables `keep1` and `keep2`. First run all the lines of the example up to `df$ConsecutiveSum <- as.numeric(df$ConsecutiveSum)`, then:
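library(dplyr)
library(tidyr)  # for replace_na()
# keep1 flags a row that closes a qualifying pair; keep2 flags the row just before it
df %>% replace_na(list(diff = 0)) %>%
  mutate(keep1 = ifelse(ConsecutiveSum >= 5000 & diff <= 5, 1, 0)) %>%
  mutate(keep2 = ifelse(lead(keep1, default = 0) == 1, 1, 0)) %>%
  filter(keep1 == 1 | keep2 == 1) %>% select(-keep1, -keep2)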
The result is:
    id       Date Buyer diff Amount ConsecutiveSum
1  334 2018-06-15 Simon    0   1948              0
2  334 2018-06-20 Simon    5   4290           6238
3  334 2018-08-07 Chang    0   2145              0
4  334 2018-08-09 Chang    2   4192           6337
5 5006 2019-05-05 Chang   59   4400           4735
6 5006 2019-05-09 Chang    4   2700           7100
7 5006 2019-05-14 Chang    5   4355           7055
8 5006 2019-05-17 Chang    3   3100           7455
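The pipeline above relies on the `diff` and `ConsecutiveSum` columns that were typed in by hand, and it calls `lead()` without grouping, so it only works because the first row of every Buyer/id group has a ConsecutiveSum of 0. As a rough sketch of a variant that may be more robust, the same two columns can be derived from `Date` and `Amount` inside each Buyer/id group. The names `tx`, `hit`, `keep` and `flagged` are illustrative and not part of the original answer, the buyer names follow the question's code (Chang rather than Wang), and the fourth date is assumed to be 8/20/2018 as in the question's table.

```r
library(dplyr)

# Raw transactions only: diff and ConsecutiveSum are computed below.
tx <- data.frame(
  id     = c(rep("334", 6), rep("5006", 6)),
  Date   = as.Date(c("6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018",
                     "8/7/2018", "8/9/2018", "3/4/2019", "3/7/2019",
                     "5/5/2019", "5/9/2019", "5/14/2019", "5/17/2019"),
                   format = "%m/%d/%Y"),
  Buyer  = c(rep("Simon", 4), rep("Chang", 8)),
  Amount = c(1948, 4290, 4260, 79, 2145, 4192,
             1700, 335, 4400, 2700, 4355, 3100),
  stringsAsFactors = FALSE
)

flagged <- tx %>%
  group_by(Buyer, id) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(
    # Days since the previous transaction in the same group (NA on the first row)
    diff = as.numeric(difftime(Date, lag(Date), units = "days")),
    # Amount of this row plus the previous row (NA on the first row)
    ConsecutiveSum = Amount + lag(Amount),
    # hit: this row closes a pair that is within 5 days and sums to >= 5000
    hit = !is.na(diff) & diff <= 5 & ConsecutiveSum >= 5000,
    # keep: the row closes a qualifying pair or opens the next one
    keep = hit | lead(hit, default = FALSE)
  ) %>%
  ungroup() %>%
  filter(keep) %>%
  select(-hit, -keep)

print(flagged)
```

On the sample data this should keep the same eight transactions as the answer above, although the rows come out ordered by group (Chang before Simon) and the first row of each group shows NA instead of 0 for `diff` and `ConsecutiveSum`. Because `lag()` and `lead()` run inside `group_by(Buyer, id)`, a pair can never span two different buyers or IDs.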