Efficient way to do a row-wise mutate with sample

Problem description

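For every 0 in x, I want to insert a random number between 1 and 10, but I am looking for an efficient way to do this with dplyr and/or data.table, because I have a very large dataset (10m rows).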

library(tidyverse)
df <- data.frame(x = 1:10)
df[4, 1] = 0
df[6, 1] = 0
df
#     x
# 1   1
# 2   2
# 3   3
# 4   0
# 5   5
# 6   0
# 7   7
# 8   8
# 9   9
# 10 10

This does not work, because it replaces every 0 with the same value:

set.seed(1)
df %>% 
  mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
#     x x2
# 1   1  1
# 2   2  2
# 3   3  3
# 4   0  9
# 5   5  5
# 6   0  9
# 7   7  7
# 8   8  8
# 9   9  9
# 10 10 10
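
A small base R sketch of why this happens, as far as I can tell: `sample(1:10, 1)` is evaluated once and returns a single number, which `ifelse()` then recycles to every position where the test is TRUE. The vector `x` mirrors the example above; `single_draw` and `recycled` are names made up for this illustration.

```r
set.seed(1)
x <- c(1, 2, 3, 0, 5, 0, 7, 8, 9, 10)

# sample() is called once, so there is only one random number to go around
single_draw <- sample(1:10, 1)
length(single_draw)

# ifelse() recycles that single number to every position where x == 0
recycled <- ifelse(x == 0, single_draw, x)
recycled[x == 0]
```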

It can be done with rowwise, but that is slow on large datasets:

set.seed(1)
#use rowwise
df %>% 
  rowwise() %>% 
  mutate(x2 = ifelse(x == 0, sample(1:10, 1), x))
#        x    x2
#    <dbl> <dbl>
#  1     1     1
#  2     2     2
#  3     3     3
#  4     0     9
#  5     5     5
#  6     0     4
#  7     7     7
#  8     8     8
#  9     9     9
# 10    10    10

Any suggestions for speeding this up?

Thanks

Tags: r, dplyr, data.table

Solution


Using the solutions suggested so far, with a slightly modified dataset, I set up a microbenchmark:

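library(data.table)
library(tidyverse)

# 100,000 rows; roughly half of the x values are set to 0
df <- data.frame(x = 1:100000, y = rbinom(100000, size = 1, 0.5)) %>% 
  mutate(x = ifelse(y == 0, 0, x)) %>% 
  dplyr::select(-y)
# setDT() converts df by reference, so df itself is a data.table from here on
dt <- setDT(df)

test <- microbenchmark::microbenchmark(
  # base R: draw one value per zero and assign in place
  base1 = {
    df$x[df$x == 0] <- sample(1:10, sum(df$x == 0), replace = TRUE)
  },
  # dplyr, vectorized: same idea inside mutate()
  dplyr1 = {
     df %>% 
      mutate(x2 = replace(x, which(x == 0), sample(1:10, sum(x == 0), replace = TRUE)))
  },
  # dplyr, one group per row: sample() is called once for every row
  dplyr2 = {
    df %>% group_by(id = row_number()) %>%
      mutate(across(c(x), .fns = list(x2 = ~ ifelse(. == 0, sample(1:10, 1, replace = TRUE), .)))) %>%
      ungroup() %>% select(-id)
  },
  # data.table: update the matching rows by reference
  data.table = {
    dt[x == 0, x := sample(1:10, .N, replace = TRUE)]
  },
  times = 500L
)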
test
# Unit: microseconds
#        expr        min         lq          mean      median         uq        max neval cld
#       base1      733.7      785.9      979.0938      897.25     1137.0     1839.4   500  a 
#      dplyr1     5207.1     5542.1     6129.2276     5967.85     6476.0    21790.7   500  a 
#      dplyr2 15963406.4 16156889.2 16367969.8704 16395715.00 16518252.9 19276215.5   500  b
#  data.table     1547.4     2229.3     2422.1278     2455.60     2573.7    15076.0   500  a 
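
To make the vectorized idea behind base1 and dplyr1 concrete, here is a minimal base R sketch on the 10-row example from the question. The mask name `zero` and the column name `x2` are chosen only for this illustration.

```r
set.seed(1)
df <- data.frame(x = c(1, 2, 3, 0, 5, 0, 7, 8, 9, 10))

# Logical mask of the rows that need a random value
zero <- df$x == 0

# Draw exactly one value per zero in a single call, then assign them in one step
df$x2 <- df$x
df$x2[zero] <- sample(1:10, sum(zero), replace = TRUE)
df
```

The draws are independent, so two zeros can still receive the same number by chance. `replace = TRUE` matters as soon as there are more than 10 zeros, because sampling more than 10 values from 1:10 without replacement is an error.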

I expected data.table to be the fastest, but the base R solution appears to be the best (assuming I set up microbenchmark correctly?).
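
One thing that may be worth checking in the setup: base1 assigns back into df and the data.table expression updates dt by reference. If microbenchmark evaluates its expressions in the calling environment, as I believe it does, then after the first run of each expression there are no zeros left, and the remaining iterations mostly time the no-op case. A minimal sketch of a variant without side effects follows; `fill_zeros` is a hypothetical helper name, not a function from any package.

```r
# fill_zeros() is a hypothetical helper: it returns a filled copy
# and leaves its input untouched, so every call sees the same zeros
fill_zeros <- function(x, pool = 1:10) {
  zero <- x == 0
  x[zero] <- sample(pool, sum(zero), replace = TRUE)  # changes the local copy only
  x
}

set.seed(1)
x <- c(1, 2, 3, 0, 5, 0, 7, 8, 9, 10)
filled <- fill_zeros(x)

sum(x == 0)       # the input still contains its zeros
sum(filled == 0)  # the returned vector does not
```

Timing a function like this includes the cost of copying the vector, so the numbers are not directly comparable with an in-place update; the point is only that each iteration then does the same amount of work.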

Edit, based on @chinsoon12's comment

With 1e5 rows:

Unit: microseconds
       expr    min      lq     mean  median      uq     max neval cld
      base1  730.4  839.30 1380.465 1238.00 1322.85 28977.3   500  a 
 data.table 1394.8 1831.85 2030.215 1946.95 2060.40 29821.9   500  b

With 1e6 rows:

Unit: milliseconds
       expr    min      lq      mean   median       uq      max neval cld
      base1 9.8703 11.6596 16.030715 11.76195 12.04145 326.0118   500  b
 data.table 2.3772  2.7939  3.855672  3.04700  3.25900  61.4083   500  a 

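At 1e5 rows base1 is still ahead, but at 1e6 rows data.table is the fastest (median of about 3.0 ms versus 11.8 ms for base1).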

