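r - Fastest way to paste all rows together
Problem description
I want to paste all rows together, column by column, into a single cell.
For example, I have a table like this: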
library(tibble)
tibble::tribble(
~Col1, ~Col2, ~Col3,
"AA", "AA", "AB",
"AB", "AB", "BB",
"BC", "BB", "AA"
)
Col1 Col2 Col3
AA AA AB
AB AB BB
BC BB AA
The output I want is a 3×1 table like this:
Col1 AAABBC
Col2 AAABBB
Col3 ABBBAA
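However, the real case is more complex: my original table has 600,000 rows and 2,000 columns. I would like to know the fastest way to achieve this. I tried a loop, but pasting column by column takes a very long time to finish.
Any help is appreciated, thanks!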
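Solution
If you have enough memory to hold multiple copies of the data, this approach using the doParallel package may work. I am using the tidyverse family of packages here.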
library(tidyverse)
library(doParallel)

# Example table from the question
data <- tribble(
  ~Col1, ~Col2, ~Col3,
  "AA", "AA", "AB",
  "AB", "AB", "BB",
  "BC", "BB", "AA"
)
n <- 1000
# Generate a 3000-row data frame, then double the columns 10 times (3 * 2^10 = 3072 columns)
big_table <- do.call("rbind", replicate(n, data, simplify = FALSE))
for (i in 1:10) big_table <- bind_cols(big_table, big_table)
# Get the list of column names
col_list <- names(big_table)
# Define the number of cores to use
number_of_parallel_cores <- 4
# Split the column names into one contiguous group per core
col_group <- split(col_list, sort(rep_len(1:number_of_parallel_cores, length(col_list))))
# Run the code with a timer
system.time({
  registerDoParallel(number_of_parallel_cores)
  combine_data <- bind_rows(
    # .packages loads dplyr and tidyr on workers that do not inherit the session
    foreach(i_col_group = col_group, .packages = c("dplyr", "tidyr")) %dopar% {
      big_table %>%
        select(all_of(i_col_group)) %>%
        summarize(across(everything(), ~ paste(.x, collapse = ""))) %>%
        pivot_longer(cols = everything(), names_to = "col_names", values_to = "values")
    }
  )
})
# Release the implicitly created workers
stopImplicitCluster()
Timing
user system elapsed
1.291 0.291 0.898
Output
col_names values
<chr> <chr>
1 Col1...1 AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
2 Col2...2 AAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAA…
3 Col3...3 ABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAAB…
4 Col1...4 AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
5 Col2...5 AAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAA…
6 Col3...6 ABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAAB…
7 Col1...7 AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
8 Col2...8 AAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAA…
9 Col3...9 ABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAABBBAAAB…
10 Col1...10 AAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAAABBCAA…
# … with 3,062 more rows
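However, I found that simply pasting all the columns in a single call is much faster than the parallel setup. My guess is that the overhead of parallelizing this operation outweighs any gain.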
system.time(
  big_table %>%
    select(all_of(col_list)) %>%
    summarize(across(everything(), ~ paste(.x, collapse = ""))) %>%
    pivot_longer(cols = everything(), names_to = "col_names", values_to = "values")
)
user system elapsed
0.021 0.000 0.022
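For comparison, the same non-parallel idea can be sketched in base R with vapply(), which calls paste(collapse = "") once per column. This is only an illustrative alternative, not something timed in the answer above; the names small_table, pasted and result are made up for this example, and whether it is faster on a 600,000 × 2,000 table would need to be measured.

```r
library(tibble)

# The small example table from the question (the name is hypothetical).
small_table <- tribble(
  ~Col1, ~Col2, ~Col3,
  "AA",  "AA",  "AB",
  "AB",  "AB",  "BB",
  "BC",  "BB",  "AA"
)

# Collapse each column into one string. vapply() also checks that every
# column yields exactly one character value.
pasted <- vapply(small_table, paste, character(1), collapse = "")

# Turn the named character vector into a two-column table,
# using the same column names as the answer's output.
result <- tibble(col_names = names(pasted), values = unname(pasted))
result
```

On the example table this gives one row per column, with AAABBC, AAABBB and ABBBAA as the values, matching the output requested in the question.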