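r - Efficient and faster way to merge and group by in R
Question
I have two datasets that I want to inner join and then group. My data has more than 100 million rows, and the inner join is a Cartesian (many-to-many) merge, which makes the final dataset even larger. I would like to know how to do this efficiently and quickly. My current merge code runs for a very long time.
Sample dataset (this example has no Cartesian matches, but my original dataset does)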
library("data.table")
library("sqldf")
library("purrr")
orders <- data.table(date = as.POSIXct(c('2012-08-28','2012-08-29','2012-09-01', '2012-08-30')),
first_name = as.character(c('John','George','Henry', 'Markel')),
last_name = as.character(c('Doe','Smith','Smith', 'Markel')),
qty = c(10,50,6, 0))
dates <- data.table(date = seq(from = as.POSIXct('2012-08-28'),
to = as.POSIXct('2012-09-07'), by = 'day'),
week = seq(from = 1, to = 11, by = 1))
The equivalent sqldf code I have (this is much slower):
final_data <- sqldf("select first_name,
last_name,
week,
sum(qty) as total_qty
from orders a inner join dates b
on a.date = b.date
where a.first_name = a.last_name
group by first_name,
last_name,
week
having sum(qty) = 0;")
Equivalent data.table code (must match the sqldf output):
final_data_2 <- merge(
x = orders[ first_name == last_name,]
, y = dates
, all = FALSE
, allow.cartesian = TRUE) %>%
.[, total_qty := sum(qty), by = .(first_name, last_name, week) ] %>%
.[total_qty == 0, .(first_name, last_name, week, total_qty)]
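One detail worth checking against the sqldf output: `:=` adds `total_qty` by reference and keeps every joined row, while SQL `group by` returns one row per group. On the sample data the two agree because each group has a single row, but with Cartesian matches the `:=` version would repeat rows. The following is a minimal sketch, not the asker's code, that aggregates in `j` instead. The `tz = "UTC"` argument and the name `final_data_3` are assumptions added for illustration.

```r
library(data.table)

# Sample tables from the question (tz = "UTC" is an assumption, added for reproducibility)
orders <- data.table(
  date = as.POSIXct(c("2012-08-28", "2012-08-29", "2012-09-01", "2012-08-30"), tz = "UTC"),
  first_name = c("John", "George", "Henry", "Markel"),
  last_name = c("Doe", "Smith", "Smith", "Markel"),
  qty = c(10, 50, 6, 0)
)
dates <- data.table(
  date = seq(as.POSIXct("2012-08-28", tz = "UTC"),
             as.POSIXct("2012-09-07", tz = "UTC"), by = "day"),
  week = 1:11
)

# Filter first, inner join on date, then collapse to one row per group
joined <- merge(orders[first_name == last_name], dates,
                by = "date", allow.cartesian = TRUE)
agg <- joined[, .(total_qty = sum(qty)), by = .(first_name, last_name, week)]

# Equivalent of the SQL HAVING clause
final_data_3 <- agg[total_qty == 0]
```

Because `.(total_qty = sum(qty))` returns one row per `by` group, the later filter plays the role of `having sum(qty) = 0`.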
Alternative code from @manoftheshark:
orders[dates, on = 'date', allow.cartesian = TRUE][, total_qty := sum(qty), by = .(first_name, last_name, week)][total_qty == 0, .(first_name, last_name, week, total_qty)]
Answer
I'm not sure how this will scale to the full dataset, but microbenchmark on the test data shows a 15-30% improvement:
orders[dates, on = 'date', allow.cartesian = TRUE][, total_qty := sum(qty), by = .(first_name, last_name, week)][total_qty == 0, .(first_name, last_name, week, total_qty)]
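Two things to keep in mind with the one-liner above: `orders[dates, on = 'date']` keeps every row of `dates` (unmatched rows get `NA`) unless `nomatch = NULL` is given, and it does not apply the `first_name == last_name` filter from the SQL. Since the Cartesian blow-up is the main cost, one option that may help on large data is to shrink both tables before joining. The sketch below assumes the aggregate is a plain sum, so it can be rewritten as a weighted sum over pre-aggregated rows. The toy data, the helper column `n`, and the names `o`, `d` and `res` are hypothetical, and this has not been benchmarked on the full dataset.

```r
library(data.table)

# Hypothetical toy data with a duplicated date in `dates`, so the join is many-to-many
orders <- data.table(
  date = as.Date(c("2012-08-30", "2012-08-30", "2012-08-31")),
  first_name = c("Markel", "Markel", "Ann"),
  last_name = c("Markel", "Markel", "Ann"),
  qty = c(0, 0, 5)
)
dates <- data.table(
  date = as.Date(c("2012-08-30", "2012-08-30", "2012-08-31")),
  week = c(3, 3, 4)
)

# 1. Filter, then collapse orders to one row per name and date
o <- orders[first_name == last_name,
            .(qty = sum(qty)), by = .(first_name, last_name, date)]

# 2. Collapse dates to one row per (date, week); n records how many rows each stands for
d <- dates[, .(n = .N), by = .(date, week)]

# 3. Inner join (nomatch = NULL drops unmatched rows), weight each sum by n,
#    aggregate to one row per group, then apply the HAVING-style filter
res <- o[d, on = "date", nomatch = NULL, allow.cartesian = TRUE][
  , .(total_qty = sum(qty * n)), by = .(first_name, last_name, week)][
  total_qty == 0]
```

The join now touches at most one row per name and date on one side and one row per date and week on the other, and `sum(qty * n)` gives the same totals the full Cartesian join would. This shortcut only holds for additive aggregates such as `sum`.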