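r - How to optimize dplyr code by group when all the calculations are the same
Problem description
I have the following data frame, which is a subset of a larger data frame with more than 3 million rows.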
df <- data.frame(
  Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
  SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
  Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
  Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
  Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1)
)
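For each Group, I want to compute the same five quantities for every SubGroup: 1) the sum of Value for the SubGroup, 2) the sum of Value for the SubGroup where Dup == 1, 3) the sum of Value for the SubGroup and Type Z, 4) the sum of Value for the SubGroup and Type Y, 5) the sum of Value for the SubGroup and Type X.
This is my current code using dplyr: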
library(dplyr)

# One row per Group; the same five sums are repeated for each SubGroup
result <- df %>%
  group_by(Group) %>%
  summarise(SubGroup.A.Total = sum(Value[SubGroup=="A"]),
            SubGroup.A.Dup = sum(Value[SubGroup=="A" & Dup==1]),
            SubGroup.A.TypeZ = sum(Value[SubGroup=="A" & Type=="Z"]),
            SubGroup.A.TypeY = sum(Value[SubGroup=="A" & Type=="Y"]),
            SubGroup.A.TypeX = sum(Value[SubGroup=="A" & Type=="X"]),
            SubGroup.B.Total = sum(Value[SubGroup=="B"]),
            SubGroup.B.Dup = sum(Value[SubGroup=="B" & Dup==1]),
            SubGroup.B.TypeZ = sum(Value[SubGroup=="B" & Type=="Z"]),
            SubGroup.B.TypeY = sum(Value[SubGroup=="B" & Type=="Y"]),
            SubGroup.B.TypeX = sum(Value[SubGroup=="B" & Type=="X"]),
            SubGroup.C.Total = sum(Value[SubGroup=="C"]),
            SubGroup.C.Dup = sum(Value[SubGroup=="C" & Dup==1]),
            SubGroup.C.TypeZ = sum(Value[SubGroup=="C" & Type=="Z"]),
            SubGroup.C.TypeY = sum(Value[SubGroup=="C" & Type=="Y"]),
            SubGroup.C.TypeX = sum(Value[SubGroup=="C" & Type=="X"]))
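However, I would really like to know whether there is a more efficient way to do this, in terms of both run time and number of lines of code. Since I am essentially repeating the same five formulas for each SubGroup, I assume there is a simpler way to write it.
Solution
This answer uses pivot_wider, which at the time of writing was only available in the development version of tidyr (it has since been released in tidyr 1.0.0).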
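You are grouping by SubGroup for every variable you want to compute, so it is worth adding it to group_by. For some of the variables you are also grouping by Type, so I suppose you could do a separate grouping by Group, SubGroup, Type and join those variables back in, but it is not clear whether that would be worth it.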
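For illustration, the separate-grouping idea could look roughly like the sketch below. It is not part of the original answer: the names totals, by_type and joined are hypothetical, and the sketch assumes a released tidyr (1.0.0 or later) for pivot_wider. Whether it is actually faster than the single group_by on the full data set would need to be measured.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
  SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
  Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
  Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
  Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1)
)

# Sums that do not depend on Type: one row per Group/SubGroup
totals <- df %>%
  group_by(Group, SubGroup) %>%
  summarise(Total = sum(Value), Dup = sum(Value[Dup == 1])) %>%
  ungroup()

# Sums per Type, spread into one column per Type (X, Y, Z);
# Group/SubGroup/Type combinations that never occur are filled with 0
by_type <- df %>%
  group_by(Group, SubGroup, Type) %>%
  summarise(Value = sum(Value)) %>%
  ungroup() %>%
  pivot_wider(names_from = Type, values_from = Value,
              values_fill = list(Value = 0))

# Join the two summaries back together (hypothetical name: joined)
joined <- left_join(totals, by_type, by = c("Group", "SubGroup"))
```

The result has one row per Group/SubGroup pair with the columns Total, Dup and one column per Type; it would still need a final pivot_wider on SubGroup to reach one row per Group.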
Sticking with just group_by(Group, SubGroup), you can do this:
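library(dplyr)
library(tidyr)

df %>%
  # Compute the five sums once per Group/SubGroup pair
  group_by(Group, SubGroup) %>%
  summarize(
    Total = sum(Value),
    Dup = sum(Value[Dup == 1]),
    X = sum(Value[Type == "X"]),
    Y = sum(Value[Type == "Y"]),
    Z = sum(Value[Type == "Z"])
  ) %>%
  # Spread SubGroup into columns: one row per Group,
  # one column per summary/SubGroup combination (e.g. Total_A)
  pivot_wider(
    names_from = SubGroup,
    values_from = c(-Group, -SubGroup)
  )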
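If the column names from the question (SubGroup.A.Total, SubGroup.A.TypeZ, and so on) matter, a variant along the following lines might reproduce them. This is a sketch under assumptions rather than part of the original answer: it relies on the names_glue argument of pivot_wider, which as far as I know requires tidyr 1.1.0 or later, and the object name wide is hypothetical.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
  SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
  Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
  Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
  Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1)
)

wide <- df %>%
  group_by(Group, SubGroup) %>%
  summarise(
    Total = sum(Value),
    TypeZ = sum(Value[Type == "Z"]),
    TypeY = sum(Value[Type == "Y"]),
    TypeX = sum(Value[Type == "X"]),
    # Computed last so the new Dup column does not shadow the original
    # Dup flag in any of the expressions above
    Dup = sum(Value[Dup == 1])
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = SubGroup,
    values_from = c(Total, Dup, TypeZ, TypeY, TypeX),
    # Build names such as "SubGroup.A.Total" from the SubGroup value
    # and the name of the summary column (.value)
    names_glue = "SubGroup.{SubGroup}.{.value}"
  )
```

The columns come out grouped by summary (all Total columns first, then Dup, and so on) rather than by SubGroup as in the question, so a final select() would be needed if the exact column order is important.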