How to transform my code to improve speed (for loop)

Problem description

Update: I have reduced the code to its key elements to keep it short.

The function impact_calc is very slow (26 seconds for a data frame of 100,000 records). I believe the main reason is the for loop (perhaps apply or map would help?). Below I simulate the data, define the impact_calc function, and time a run.

library(dplyr)
library(data.table)
library(reshape2)

###########################################################
# Start Simulate Data
###########################################################


BuySell <- function(m = 40, s = 4) {
  S <- pmax(round(rnorm(10, m, s), 2), 0)
  S.sorted <- sort(S)
  data.frame(buy = rev(head(S.sorted, 5)), sell = tail(S.sorted, 5))
}

number_sates <- 10000

lst <- replicate(number_sates, BuySell(), simplify = FALSE)

# assemble prices data frame

prices <- as.data.frame(data.table::rbindlist(lst))
prices$state_id <- rep(1:number_sates, each = 5)
prices$level <- rep(1:5, times = number_sates)

prices$quantities <- round(runif(number_sates * 5, 100000, 1000000), 0)
# reshape to long format
prices_long <- reshape2::melt(prices,
  id.vars = c("state_id", "quantities", "level"),
  value.name = "price"
) %>%
  rename("side" = "variable") %>%
  setDT()

###########################################################
# End  Simulate Data
###########################################################

Here is the very slow function impact_calc.


##########################################################
# function to optimize

impact_calc <- function(data, required_quantity) {

  # get best buy and sell
  best_buy <- data[, .SD, .SDcols = c("price", "side", "level")][side == "buy" & level == 1][1, "price"][[1]]
  best_sell <- data[, .SD, .SDcols = c("price", "side", "level")][side == "sell" & level == 1][1, "price"][[1]]

  # calculate mid
  mid <- 0.5 * (best_buy + best_sell)

  # buys
  remaining_qty <- required_quantity
  impact <- 0

  data_buy <- data[, ,][side == "buy"]
  levels <- data_buy[, ,][side == "buy"][, level]

  # i think this for loop is slow!
  for (level in levels) {
    price_difference <- mid - data_buy$price[level]
    if (data_buy$quantities[level] >= remaining_qty) {
      impact <- impact + remaining_qty * price_difference
      remaining_qty <- 0
      break
    } else {
      impact <- impact + data_buy$quantities[level] * price_difference
      remaining_qty <- remaining_qty - data_buy$quantities[level]
    }
  }

  rel_impact <- impact / required_quantity / mid

  return_list <- list("relative_impact" = rel_impact)
}

Timing the run:

start_time <- Sys.time()
impact_buys <- prices_long[, impact_calc(.SD, 600000), by = .(state_id)]
end_time <- Sys.time()

end_time - start_time
# for 100000 data frame it takes
#Time difference of 26.54057 secs

Thanks for your help!

Tags: r, for-loop, data.table

Solution


The OP's suspicion is correct: by replacing the for loop with vectorized operations, the computation can be sped up by a factor of more than 100:

required_quantity <- 600000
setDT(prices)
library(bench)
mark(
  orig = prices_long[, impact_calc(.SD, required_quantity), by = .(state_id)],
  mod1 = prices_long[, impact_calc2(.SD, required_quantity), by = .(state_id)],
  vec_w = prices[, {
    mid <- 0.5 * (buy[1L] + sell[1L])
    tmp <- cumsum(quantities) - required_quantity
    list(relative_impact = 
           sum(pmin(quantities, pmax(0, quantities - tmp)) * (mid - buy)) / 
           required_quantity / mid)
  }, by = .(state_id)],
  min_time = 1.0
)
# A tibble: 3 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory    time  gc    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>    <lis> <list>
1 orig          28.1s    28.1s    0.0356    2.21GB     1.39     1    39      28.1s <data.ta~ <Rprofme~ <bch~ <tibb~
2 mod1          13.1s    13.1s    0.0762  658.42MB     1.45     1    19     13.12s <data.ta~ <Rprofme~ <bch~ <tibb~
3 vec_w       175.1ms  196.9ms    5.19    440.19KB     2.59     6     3      1.16s <data.ta~ <Rprofme~ <bch~ <tibb~

Besides the speedup, the vectorized version vec_w allocates significantly less memory (roughly 5000 times less).

Note that the vectorized version vec_w works on the original prices dataset in wide format, so there is no need to reshape the data from wide to long.
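Extracted from the benchmark call, the vectorized computation can also be run standalone on the wide-format prices table (a sketch, assuming prices has been converted to a data.table with setDT() as above):

```r
library(data.table)
setDT(prices)
required_quantity <- 600000

impact_vec <- prices[, {
  # mid price from the best (level 1) buy and sell quotes
  mid <- 0.5 * (buy[1L] + sell[1L])
  # overshoot of the cumulative quantities beyond the requirement
  tmp <- cumsum(quantities) - required_quantity
  # quantity actually taken at each level, capped so the total
  # equals required_quantity, weighted by the buy-side price impact
  list(relative_impact =
         sum(pmin(quantities, pmax(0, quantities - tmp)) * (mid - buy)) /
         required_quantity / mid)
}, by = .(state_id)]
```

The result is one relative_impact value per state_id, matching the output of impact_calc().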

The second benchmark case mod1 is a version of impact_calc() in which the code outside the for loop has been modified to make better use of data.table syntax. These small modifications alone already yield a 2x speedup.

The results are identical, as verified by mark().

Explanation of vec_w

If I understand correctly, the OP consumes the quantities at each level in order until required_quantity is reached. The last level is only counted partially, to the extent needed to exactly satisfy required_quantity.

In the vectorized version, this can be achieved with nested ifelse() calls, as the following example shows:

library(data.table)
r <- 5
dt <- data.table(q = 1:4)
dt[, csq := cumsum(q)]
dt[, tmp := csq - r]
dt[, aq := ifelse(tmp < 0, q, ifelse(q - tmp > 0, q - tmp, 0))][]
   q csq tmp  aq
1: 1   1  -4   1
2: 2   3  -2   2
3: 3   6   1   2
4: 4  10   5   0

The temporary variable tmp holds the difference between the cumulative sum of the quantities q and the required quantity r.

The first ifelse() tests whether the cumulative sum of the quantities q is still below the required quantity r. If so, q is used without deduction. If not, only the part of q that is needed to fill the cumulative sum up to the required quantity r is used as the actual quantity aq.

The second ifelse() ensures that q minus the deduction is either positive (the case of the partially filled level) or zero (for the remaining levels below).

The actual quantities aq = c(1, 2, 2, 0) derived in the previous steps do indeed sum to the requested quantity r = 5.
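This can be checked directly on the toy table:

```r
# sum of the actual quantities taken equals the requirement r
dt[, sum(aq)]
# [1] 5
```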


Now, these ifelse() constructs can be replaced by pmin() and pmax():

dt[, aq := pmin(q, pmax(q - tmp, 0))]

I have verified in a separate benchmark (not posted here) that the pmin()/pmax() approach is faster than the nested ifelse().
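A minimal sketch of such a comparison, using bench::mark() as above (the data here is made up for illustration; mark() also verifies that both expressions return identical results):

```r
library(bench)

# illustrative data: non-negative quantities and their overshoot
q <- runif(1e6, 0, 100)
tmp <- cumsum(q) - 5e7

mark(
  nested_ifelse = ifelse(tmp < 0, q, ifelse(q - tmp > 0, q - tmp, 0)),
  pmin_pmax     = pmin(q, pmax(q - tmp, 0))
)
```

The two expressions are equivalent for non-negative q: when tmp < 0, pmax(q - tmp, 0) = q - tmp exceeds q, so pmin() returns q; otherwise pmin() returns max(q - tmp, 0).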

Explanation of mod1

In the function impact_calc(), several lines of code can be modified to use data.table syntax more effectively.

So,

best_buy <- data[, .SD,.SDcols = c("price", "side", "level")][side == "buy" & level == 1][1, "price"][[1]]
best_sell <- data[, .SD,.SDcols = c("price", "side", "level")][side == "sell" & level == 1][1, "price"][[1]]

becomes

best_buy <- data[side == "buy" & level == 1, first(price)]
best_sell <- data[side == "sell" & level == 1, first(price)]

data_buy <- data[, ,][side == "buy"]
levels <- data_buy[, ,][side == "buy"][, level]

becomes

data_buy <- data[side == "buy"]
levels <- data[side == "buy", level]

I was surprised to find that these modifications outside the for loop already improve the speed substantially.
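For completeness, here is a sketch of what impact_calc2 (the mod1 variant benchmarked above) could look like, assembled from the modifications shown; only the lines outside the for loop differ from impact_calc():

```r
impact_calc2 <- function(data, required_quantity) {

  # best quotes via data.table filtering instead of .SDcols chains
  best_buy <- data[side == "buy" & level == 1, first(price)]
  best_sell <- data[side == "sell" & level == 1, first(price)]
  mid <- 0.5 * (best_buy + best_sell)

  remaining_qty <- required_quantity
  impact <- 0

  data_buy <- data[side == "buy"]
  levels <- data[side == "buy", level]

  # the original for loop, unchanged
  for (level in levels) {
    price_difference <- mid - data_buy$price[level]
    if (data_buy$quantities[level] >= remaining_qty) {
      impact <- impact + remaining_qty * price_difference
      remaining_qty <- 0
      break
    } else {
      impact <- impact + data_buy$quantities[level] * price_difference
      remaining_qty <- remaining_qty - data_buy$quantities[level]
    }
  }

  list(relative_impact = impact / required_quantity / mid)
}
```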

