首页 > 解决方案 > 将所有行粘贴在一起的最快方法

问题描述

我想将所有行逐列粘贴到同一个单元格中

例如,我有一个如下表:

library(tibble)

tibble::tribble(
  ~Col1, ~Col2, ~Col3,
  "AA",     "AA",    "AB",
  "AB",     "AB",    "BB",
  "BC",     "BB",    "AA"
  )
Col1  Col2  Col3

AA     AA    AB

AB     AB    BB
 
BC     BB    AA

我想要的输出是一个 3X1 表,如下所示:

Col1 AAABBC

Col2 AAABBB

Col3 ABBBAA 

但是,实际情况更复杂,因为我的原始表有 600,000 行和 2000 列。我想知道实现这一目标的最快方法是什么。我尝试了循环,但它需要很长时间才能完成逐列粘贴。

任何帮助表示赞赏,谢谢!

标签: rdata-wrangling

解决方案


如果您有足够的内存来存储数据的多个实例,那么这种使用doParallel包的方法可能会奏效。我在这里使用tidyverse家庭。


library(tidyverse)
library(doParallel)

n <- 1000
# Generate a 1000 rows df with ~3000 columns
big_table <- do.call("rbind", replicate(n, data, simplify = FALSE))
lapply(1:10, function(x) {big_table <<- bind_cols(big_table, big_table); return(x)})

# Get the list of column names
col_list <- names(big_table)
# Define number of cores you want to process
number_of_parallel_cores <- 4
col_group <- split(col_list, sort(rep_len(1:number_of_parallel_cores, length(col_list))))

# Running the code with timer
system.time({
  registerDoParallel(number_of_parallel_cores)
  combine_data <- bind_rows(foreach(i_col_group = col_group) %dopar% {
    big_table %>%
      select(one_of(i_col_group)) %>%
      summarize(across(.fns = paste, collapse = "")) %>%
      pivot_longer(cols = everything(), names_to = "col_names", values_to = "values")
  })
})

定时

   user  system elapsed 
  1.291   0.291   0.898 

输出

   col_names values                                                                                                                                                                                                               
   <chr>     <chr>                                                                                                                                                                                                                
 1 Col1...1  AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
 2 Col2...2  AAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAA…
 3 Col3...3  ABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAAB…
 4 Col1...4  AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
 5 Col2...5  AAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAA…
 6 Col3...6  ABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAAB…
 7 Col1...7  AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
 8 Col2...8  AAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAA…
 9 Col3...9  ABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAAB…
10 Col1...10 AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
# … with 3,062 more rows

但是我发现与并行设置相比,简单地绑定在一起要快得多。猜猜这个操作的间接成本是不可行的

system.time(
  big_table %>%
    select(one_of(col_list)) %>%
    summarize(across(.fns = paste, collapse = "")) %>%
    pivot_longer(cols = everything(), names_to = "col_names", values_to = "values")
)

   user  system elapsed 
  0.021   0.000   0.022 

推荐阅读