首页 > 解决方案 > 加速 dplyr 管道,包括在更大的表上使用 mutate_if 和 if_else 检查

问题描述

我编写了一些代码来执行过采样,这意味着我在 data.frame 中复制了我的观察结果并为复制添加了噪声,因此它们不再完全相同。我很高兴它现在按预期工作,但是......它太慢了。我只是在学习dplyr,对data.table一无所知,但我希望有办法改进我的功能。我正在为 100 个 data.frames 的函数运行此代码,其中可能包含大约 10,000 列和 400 行。

这是一些玩具数据:

library(tidyverse)

train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))

这是复制每一行给定次数的代码,以及一个确定稍后添加的噪声是正还是负的函数:

# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]

# create a flip function
flip <- function() {
  sample(c(-1,1), 1)
}

在相关的“太慢”代码中,我将 row.names 子集为添加的“。” 过滤重复。比我只选择数字列。我逐行浏览这些列,如果它们为 0,则保持不变。如果不是,则添加一定数量(此处为 +- 1 %)。稍后,我将这个数据集与原始数据集结合起来,得到我的过采样 data.frame。

# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>% 
  rownames_to_column(var = "rowname") %>%
  filter(grepl("\\.", row.names(train_oversampled))) %>% 
  rowwise() %>%
  mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0,. + (. * flip() * 0.01 ))) %>%
  ungroup() %>%
  column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)

我假设有更快的方法使用例如data.table,但是让这段代码运行已经很困难了,我不知道如何提高它的性能。


编辑:

该解决方案在固定值下工作得非常好,但在 for 循环中调用我收到“粘贴错误(Sample,n,sep = “。”):找不到对象'Sample'

要复制的代码:

library(data.table)

train_set <- data.frame(
  x = c(rep(0, 10)), 
  y = c(0:9), 
  z = c(rep("Factor1", 10)))

# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)

for(current_table in train_list) {
  setDT(current_table, keep.rownames="Sample")
  cols <- names(current_table)[sapply(current_table, is.numeric)]
  noised_copies <- lapply(c(1,2), function(n) {
    copy(current_table)[,
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")), 
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
# As this is an example, I did not write anything to actually 
# store the results, so I have to remove the object
rm(train_noised)
}

Sample任何想法为什么现在找不到该列?

标签: rdplyrdata.table

解决方案


这是一个更矢量化的方法,使用data.table

library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
    copy(train_set)[,
        c("Sample", cols) := c(.(paste(Sample, n, sep=".")), 
            .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
        .SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)

使用 data.table 版本 >= 1.12.9,您可以is.numeric直接传递给.SDcols参数,也可以通过更短的方式(例如(.SD)names(.SD))传递到:=


地址 OP 的更新帖子:

问题是,虽然data.frame列表中的每个都转换为 a data.tabletrain_list但未更新。for您可以在循环之前使用左绑定更新列表:

library(data.table)

train_set <- data.frame(
    x = c(rep(0, 10)), 
    y = c(0:9), 
    z = c(rep("Factor1", 10)))

# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))

train_list <- lapply(train_list, setDT, keep.rownames="Sample")

for(current_table in train_list) {
    cols <- names(current_table)[sapply(current_table, is.numeric)]
    noised_copies <- lapply(c(1,2), function(n) {
        copy(current_table)[,
            c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
                .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
            .SDcols=cols]
    })
    train_noised <- rbindlist(c(noised_copies, train_list), use.names=FALSE)
    # As this is an example, I did not write anything to actually
    # store the results, so I have to remove the object
    rm(train_noised)
}

推荐阅读