Best practices for inner joins on large datasets

Problem description

I am trying to inner join two large datasets using dplyr::inner_join. I am working on a powerful machine with 40+ cores, but I am not sure I am actually taking advantage of it, since I am not parallelizing the task in any way. How should I approach this problem, which takes a long time to run?

Best

Tags: r, parallel-processing, dplyr, bigdata

Solution


I don't think the inner join itself will be a performance problem for your two 3.5M-row datasets, unless the key columns contain duplicated values (repeated values in the join columns), in which case the final dataset after the join could blow up toward 3.5M * 3.5M rows.
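
If you want a quick check up front for whether duplicated keys will blow up the join, you can count the rows per key in each table and sum the products; a minimal sketch, assuming the data_1/data_2 and id_1 names used in the code below:

library(dplyr)

# An inner join on id_1 produces sum(n_1 * n_2) rows over the matching keys,
# so a huge total here warns of a key-duplication blow-up before you run the join.
dup_1 <- data_1 %>% count(id_1, name = "n_1")
dup_2 <- data_2 %>% count(id_1, name = "n_2")
inner_join(dup_1, dup_2, by = "id_1") %>%
    summarise(expected_join_rows = sum(as.numeric(n_1) * n_2))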

In general, R functions do not use multiple cores. To exploit them, you have to split the data into batches that can be processed independently, then combine the partial results and carry out any further calculation. Here is pseudo-code using the dplyr & doParallel libraries:

library(dplyr)
library(doParallel)

# Parallel configuration #####
cpuCount <- 10
# Note that doParallel replicates your environment to each worker process,
# so if your environment holds 10GB of data and you use 10 cores,
# processing the data in parallel can require 10GB x 10 = 100GB of RAM.
registerDoParallel(cpuCount)
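
# On the 40+ core machine from the question you could instead derive the
# worker count from the hardware (an optional tweak, not in the original
# answer), e.g.:
#   cpuCount <- max(1, parallel::detectCores() - 2)  # leave cores for the OS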

# data_1: 3.5M rows; key column is id_1, value column is value_1
# data_2: 3.5M rows; key columns are id_1 & id_2

# Goal is to calculate some stats/summary of value_1 for each combination of id_1 + id_2
id_1_unique <- unique(data_1$id_1)
batchStep <- 1000
batch_id_1 <- seq(1, length(id_1_unique), by = batchStep)

# Do the join for each batch of id_1 values, then summarise and return the result.
# foreach returns a list; for this pseudo-code it is a list of data frames,
# which can be combined with bind_rows.
summaryData <- bind_rows(foreach(start = batch_id_1, .packages = "dplyr") %dopar% {
    end <- min(start + batchStep - 1, length(id_1_unique))
    batch_id_1_current <- id_1_unique[start:end]
    batch_data_1 <- data_1 %>% filter(id_1 %in% batch_id_1_current)
    joined_data <- inner_join(batch_data_1, data_2, by = "id_1")
    joined_data %>%
        group_by(id_1, id_2) %>%
        # calculation code here
        summarise(calculated_value_1 = sum(value_1)) %>%
        ungroup()
})
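
Once the batched run has finished, it is worth releasing the workers that registerDoParallel started; a small follow-up using doParallel's own helper:

# Free the implicit cluster created by registerDoParallel(cpuCount)
stopImplicitCluster()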
