An efficient and faster way to merge and group in R

Problem description

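I have two datasets that I want to inner join and then group. My data has more than 100 million rows, and because the inner join is a Cartesian merge, the joined result grows even larger. I would like to know how to do this efficiently and quickly. My current merge code runs for a very long time.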

Sample dataset (this example has no Cartesian matches, but my original dataset does)

library("data.table")
library("sqldf")
library("purrr")

orders <- data.table(date = as.POSIXct(c('2012-08-28','2012-08-29','2012-09-01', '2012-08-30')),
                     first_name = as.character(c('John','George','Henry', 'Markel')),
                     last_name = as.character(c('Doe','Smith','Smith', 'Markel')),
                     qty = c(10,50,6, 0))

dates <- data.table(date = seq(from = as.POSIXct('2012-08-28'),
                               to = as.POSIXct('2012-09-07'), by = 'day'),
                    week = seq(from = 1, to = 11, by = 1))

The equivalent sqldf code I have (this is much slower):

final_data <- sqldf("select first_name,
       last_name,
       week,
       sum(qty) as total_qty 
from orders a inner join dates b
on a.date = b.date
where a.first_name = a.last_name
group by first_name,
         last_name,
         week
having sum(qty) = 0;")

The equivalent data.table code (must match the sqldf output):

final_data_2 <- merge(
    x = orders[ first_name == last_name,]
  , y = dates
  , all = FALSE
  , allow.cartesian = TRUE) %>%
  .[, total_qty := sum(qty), by = .(first_name, last_name, week) ] %>% 
  .[total_qty == 0, .(first_name, last_name, week, total_qty)]

Alternative code from @manoftheshark:

orders[dates, on = 'date', allow.cartesian = TRUE][, total_qty := sum(qty), by = .(first_name, last_name, week)][total_qty == 0, .(first_name, last_name, week, total_qty)]

Tags: r, postgresql, data.table, sqldf

Solution


I'm not sure how this will scale to the full dataset, but a microbenchmark run on the test data shows a 15-30% improvement:

orders[dates, on = 'date', allow.cartesian = TRUE][, total_qty := sum(qty), by = .(first_name, last_name, week)][total_qty == 0, .(first_name, last_name, week, total_qty)]
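One further variation that may be worth trying, offered as a sketch rather than a benchmarked result: `orders[dates, on = 'date']` keeps every row of `dates` by default (it is not an inner join), the one-liner above does not apply the `first_name == last_name` filter from the SQL, and `:=` keeps one row per joined row whereas `GROUP BY` returns one row per group. The version below filters before joining, uses `nomatch = NULL` for an inner join (this assumes data.table 1.12.0 or later; older versions use `nomatch = 0`), and aggregates once per group. The names `same_name`, `joined`, and `final_data_3` are hypothetical, and the time zone is pinned to UTC only to make the example reproducible.

```r
library(data.table)

# Same sample data as in the question; tz = "UTC" is an assumption added
# here for reproducibility (the question uses the local time zone).
orders <- data.table(
  date = as.POSIXct(c("2012-08-28", "2012-08-29", "2012-09-01", "2012-08-30"), tz = "UTC"),
  first_name = c("John", "George", "Henry", "Markel"),
  last_name = c("Doe", "Smith", "Smith", "Markel"),
  qty = c(10, 50, 6, 0)
)

dates <- data.table(
  date = seq(from = as.POSIXct("2012-08-28", tz = "UTC"),
             to = as.POSIXct("2012-09-07", tz = "UTC"), by = "day"),
  week = seq(from = 1, to = 11, by = 1)
)

# WHERE a.first_name = a.last_name: filter first so fewer rows reach the join
same_name <- orders[first_name == last_name]

# INNER JOIN on date: nomatch = NULL drops rows of `dates` without a match
joined <- same_name[dates, on = "date", nomatch = NULL, allow.cartesian = TRUE]

# GROUP BY ... HAVING sum(qty) = 0: one row per group, then filter the groups
final_data_3 <- joined[, .(total_qty = sum(qty)),
                       by = .(first_name, last_name, week)][total_qty == 0]

print(final_data_3)
```

Whether this is faster on the full dataset is untested; it could be timed against the other versions with `microbenchmark::microbenchmark()` on a representative sample.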
