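r - Recursively summing data frames over matching rows
Question
I want to combine a set of data frames into a single data frame by summing a column over rows whose key variables match, rather than appending columns.
For example, given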
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))
I want to match on "A" and "B" and sum the values of "x". For this example, I can get the desired result as follows:
library(plyr)
library(dplyr)
# rename columns so that join_all preserves them all:
colnames(df1)[3] <- "x1"
colnames(df2)[3] <- "x2"
colnames(df3)[3] <- "x3"
# join the data frames by matching "A" and "B" values:
res <- join_all(list(df1, df2, df3), by = c("A", "B"), type = "full")
# get the sums and drop superfluous columns:
arrange(res, A, B) %>%
rowwise() %>%
mutate(x = sum(x1, x2, x3, na.rm = TRUE)) %>%
select(A, B, x)
Result:
A B x
<dbl> <dbl> <dbl>
1 0 1 6
2 0 2 3
3 0 5 5
4 1 1 9
5 1 2 5
6 1 3 7
7 1 4 3
8 2 1 7
9 2 2 2
10 2 4 0
11 2 5 3
A more general solution is
library(dplyr)
# function to get the desired result for two data frames:
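my_merge <- function(df1, df2)
{
  # full outer join on the keys; the two "x" columns become x.x and x.y
  m1 <- merge(df1, df2, by = c("A", "B"), all = TRUE)
  # sum row by row on the merged frame m1 (not on the global `res`)
  m1 <- rowwise(m1) %>%
    mutate(x = sum(x.x, x.y, na.rm = TRUE)) %>%
    select(A, B, x)
  return(m1)
}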
l1 <- list(df2, df3) # omit the first data frame
res <- df1 # initial value of the result
for(df in l1) res <- my_merge(res, df) # call the function repeatedly
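Is there a more efficient way to combine a large number of data frames? Ideally it should be recursive, i.e. it should avoid joining all the data frames into one huge data frame before computing the sums.
Answer
A simpler option is to bind the rows of the datasets, group by the columns of interest, and get the summarised output by taking the sum of "x":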
library(tidyverse)
bind_rows(df1, df2, df3) %>%
group_by(A, B) %>%
summarise(x = sum(x))
# A tibble: 11 x 3
# Groups: A [?]
# A B x
# <dbl> <dbl> <dbl>
# 1 0 1 6
# 2 0 2 3
# 3 0 5 5
# 4 1 1 9
# 5 1 2 5
# 6 1 3 7
# 7 1 4 3
# 8 2 1 7
# 9 2 2 2
#10 2 4 0
#11 2 5 3
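For comparison, the same stack-then-aggregate idea can be sketched in base R without any packages. This is only a sketch: the names `dfs`, `stacked` and `res_base` are made up for illustration, and the data frames are redefined so the block runs on its own (the question's code renames their "x" columns).

```r
# Example data from the question, redefined so this block is self-contained
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

dfs <- list(df1, df2, df3)        # hypothetical name: any list of data frames
stacked <- do.call(rbind, dfs)    # stack all rows (column names must agree)

# sum x within each (A, B) combination
res_base <- aggregate(x ~ A + B, data = stacked, FUN = sum)

# aggregate() lets the first grouping variable vary fastest, so reorder by A, then B
res_base <- res_base[order(res_base$A, res_base$B), ]
rownames(res_base) <- NULL
res_base
```

Note that the formula interface of `aggregate()` drops rows containing NA by default, which does not matter for this example but may for real data.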
If there are many objects in the global environment whose names match the pattern "df" followed by some digits:
mget(ls(pattern= "^df\\d+")) %>%
bind_rows %>%
group_by(A, B) %>%
summarise(x = sum(x))
As the OP mentioned memory constraints, it would be more efficient to do the join first and then use rowSums (or + with reduce):
mget(ls(pattern= "^df\\d+")) %>%
reduce(full_join, by = c("A", "B")) %>%
transmute(A, B, x = rowSums(.[3:5], na.rm = TRUE)) %>%
arrange(A, B)
# A B x
#1 0 1 6
#2 0 2 3
#3 0 5 5
#4 1 1 9
#5 1 2 5
#6 1 3 7
#7 1 4 3
#8 2 1 7
#9 2 2 2
#10 2 4 0
#11 2 5 3
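If the intermediate result should never hold more than two "x" columns at a time, which is closer to the "recursive" idea in the question, one possible sketch is to fold the list pairwise with base R's `Reduce`. The helper name `merge_sum` and its `keys` argument are hypothetical, and the sketch assumes each data frame has at most one row per (A, B) combination, as in the example.

```r
# Example data from the question, redefined so this block is self-contained
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

# hypothetical helper: merge the running total with the next data frame
# and collapse the two x columns back into a single one
merge_sum <- function(acc, df, keys = c("A", "B")) {
  m <- merge(acc, df, by = keys, all = TRUE)    # full outer join: x.x and x.y
  m$x <- rowSums(m[, c("x.x", "x.y")], na.rm = TRUE)
  m[, c(keys, "x")]                             # keep only keys and the new total
}

# Reduce() calls merge_sum(merge_sum(df1, df2), df3), and so on for longer lists
total <- Reduce(merge_sum, list(df1, df2, df3))
total <- total[order(total$A, total$B), ]
rownames(total) <- NULL
total
```

Whether this actually uses less memory than binding all rows at once depends on how much the keys overlap, so it is worth measuring on the real data.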
This can also be done with data.table:
library(data.table)
rbindlist(mget(ls(pattern= "^df\\d+")))[, .(x = sum(x)), by = .(A, B)]
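With `by`, data.table returns the groups in order of first appearance, so the rows will not line up with the sorted outputs shown earlier. A small variation, sketched here under the assumption that the data frames are held in a list (the names `dfs` and `res_dt` are made up for illustration), uses `keyby` to sort the result by the grouping columns:

```r
library(data.table)

# Example data from the question, redefined so this block is self-contained
df1 <- data.frame(A = c(0,0,1,1,1,2,2), B = c(1,2,1,2,3,1,5), x = c(2,3,1,5,3,7,0))
df2 <- data.frame(A = c(0,1,1,2,2,2), B = c(1,1,3,2,4,5), x = c(4,8,4,1,0,3))
df3 <- data.frame(A = c(0,1,2), B = c(5,4,2), x = c(5,3,1))

dfs <- list(df1, df2, df3)   # hypothetical name: any list of data frames

# keyby groups like by, but also sorts the result by A and B and sets them as the key
res_dt <- rbindlist(dfs)[, .(x = sum(x)), keyby = .(A, B)]
res_dt
```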