
Question

This is my first question on Stack Overflow, so please feel free to critique the question itself.

For each row in the dataset, I want to sum `team_points` over the rows that come before it (i.e. rows with a lower `match_ID`):

That way I get the accumulated number of points going into that match, for that team, season and simulation ID, i.e. cumsum(simulation$team_points).

I am having problems implementing the second condition (match_ID lower than the current row) without using a very slow for loop.

The data looks like this:

match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool                1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley                  1

My current solution is:

simulation$accumulated_points <- 0

for (row in 1:nrow(simulation)) {
  simulation$accumulated_points[row] <-
    sum(simulation$team_points[simulation$season == simulation$season[row] &
                               simulation$match_ID < simulation$match_ID[row] &
                               simulation$simulation_ID == simulation$simulation_ID[row] &
                               simulation$team == simulation$team[row]], na.rm = TRUE)
}

This works, but it is obviously far too slow to use on a large dataset, and I can't figure out how to speed it up. What would be a good solution here?

Tags: r

Solution


Explicit for loops are usually slow in interpreted languages like R and are best avoided where possible. Instead, use "vectorized" operations, which apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages often rely on optimized C/C++ code and linear algebra libraries under the hood, so the interpreter overhead is paid once per vector call instead of once per element, and the compiled code can additionally use SIMD instructions to process several elements at a time. The result is typically orders of magnitude faster than the equivalent loop written in R. You can find more information about vectorization in this question.
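As a minimal illustration of the difference (the variable names here are just for the demo):

```r
x <- runif(1e6)

# loop version: one interpreted iteration per element
loop_sum <- 0
for (i in seq_along(x)) loop_sum <- loop_sum + x[i]

# vectorized version: a single call into optimized C code
vec_sum <- sum(x)

all.equal(loop_sum, vec_sum)  # TRUE, but sum() runs far faster
```

Wrapping each version in `system.time()` makes the gap obvious even at this modest size.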

In your specific case the loop is doubly expensive, because every iteration re-scans the entire data frame to build the subset, making the whole computation quadratic in the number of rows. You could instead use dplyr to do the grouped work in one pass:

library(dplyr)

simulation %>%
  # you want to perform the same operation for each of the groups
  group_by(team, season, simulation_ID) %>%
  # within each group, order the rows by match_ID (ascending)
  arrange(match_ID, .by_group = TRUE) %>%
  # take the vector team_points in each group and calculate its cumsum;
  # subtracting the current match's points reproduces the strict `<`
  # in your loop (points accumulated *before* each match)
  mutate(points = cumsum(team_points) - team_points)

The code above essentially decomposes the team_points column into one vector per group that you care about, then applies a single, highly optimized operation (cumsum) to each of them.
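If you prefer to stay in base R, the same grouped cumulative sum can be sketched with ave(); the toy data frame below is hypothetical, just to make the snippet self-contained:

```r
# toy data with the same columns as the question (made-up values)
simulation <- data.frame(
  match_ID      = c(1, 2, 3, 1, 2, 3),
  season        = "2020-2021",
  simulation_ID = c(1, 1, 1, 2, 2, 2),
  team          = "Liverpool",
  team_points   = c(3, 1, 0, 0, 3, 3)
)

# sort once so cumsum runs in match order within each group
simulation <- simulation[order(simulation$team, simulation$season,
                               simulation$simulation_ID,
                               simulation$match_ID), ]

# ave() applies FUN to team_points within each group and returns a
# vector aligned with the original rows; subtracting the current
# match's points reproduces the strict `<` in the loop
simulation$accumulated_points <-
  ave(simulation$team_points,
      simulation$team, simulation$season, simulation$simulation_ID,
      FUN = cumsum) - simulation$team_points

simulation$accumulated_points  # 0 3 4 0 0 3
```

This avoids the dplyr dependency at the cost of slightly less readable grouping syntax.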

