How can I optimize this for loop for larger data in R?

Problem description

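I have some reproducible data (my original data set contains about 2,000,000 rows). At that size my for loop becomes inefficient and takes a very long time to run. I would like to know whether there is a more efficient way to process this data. My code, together with the reproducible data, is below.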

#----Reproducible data example--------------------#
#Upload first data set#
words1<-c("How","did","Quebec","nationalists","see","their","province","as","a","nation","in","the","1960s")
words2<-c("Why","does","volicty","effect","time",'?',NA,NA,NA,NA,NA,NA,NA)
words3<-c("How","do","I","wash","a","car",NA,NA,NA,NA,NA,NA,NA)
library<-c("The","the","How","see","as","a","for","then","than","example")
embedding1<-c(.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
embedding2<-c(.1,.5,.4,.8,.9,.3,.98,.73,.48,.56)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")

#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1, embedding2)
names(df2)<-c("library","embedding1","embedding2")
df2$meanembedding <- rowMeans(df2[c("embedding1","embedding2")], na.rm = TRUE)
df2<-df2[,-c(2,3)]

#-----Find columns--------#
l=ncol(df)
names<-names(df)
head(names)
classes<-sapply(df[,c(1:l)],class)
head(classes)

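#------Combine and match library to training data------#
List <- list()
for (name in names) {
  # Take one column of df and wrap it in a data frame whose column is "df1"
  df1 <- df[, name]
  df1 <- as.data.frame(df1)
  # Left join against the lookup table df2 on the word
  x_train2 <- merge(x = df1, y = df2,
                    by.x = "df1", by.y = "library", all.x = TRUE, sort = FALSE)
  # Drop the key column, keeping only meanembedding
  x_train2 <- x_train2[, -1]
  x_train2 <- as.data.frame(x_train2)
  names(x_train2) <- name
  List[[length(List) + 1]] <- x_train2
}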

Tags: r, optimization

Solution


A better approach is to use lapply:

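myList2 <- lapply(names(df), function(x) {
  # Merge one column of df (kept as a data.frame) with the lookup table df2,
  # then drop the key column so that only meanembedding remains.
  y <- merge(x = df[, x, drop = FALSE],
             y = df2,
             by.x = x,
             by.y = "library",
             all.x = TRUE,
             sort = FALSE)[, -1, drop = FALSE]
  names(y) <- x  # name the remaining column after the source column
  y
})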

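We loop over the vector names(df), subset one column at a time, and merge it with df2. drop = FALSE prevents the one-column data.frame from being simplified to a vector. Finally we overwrite the column name. The output is a list.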
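Because the question concerns about 2,000,000 rows, it may also be worth comparing against a plain lookup with match() instead of merge(). This is a swapped-in technique, not part of the original answer, and the toy data below is illustrative rather than the question's data. A minimal sketch, assuming df2 has already been reduced to the columns library and meanembedding:

```r
# Small stand-ins for the question's data (values are illustrative).
df <- data.frame(
  words1 = c("How", "did", "see", "a"),
  words2 = c("Why", "the", NA, NA),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  library = c("The", "the", "How", "see", "as", "a"),
  meanembedding = c(0.30, 0.55, 0.55, 0.80, 0.90, 0.30),
  stringsAsFactors = FALSE
)

# match() gives, for each word, the index of its first occurrence in
# df2$library, or NA when the word is absent. Indexing meanembedding with
# that result yields the looked-up value, or NA for unmatched words.
lookup <- function(words) df2$meanembedding[match(words, df2$library)]

# One numeric column per original column, rows in the original order.
embedded <- as.data.frame(lapply(df, lookup))
print(embedded)
```

Two differences from merge() are worth keeping in mind. With sort = FALSE, merge() does not promise to keep the input row order, whereas the match() lookup returns values position by position. And if df2$library contained duplicate keys, merge() would return one row per matching pair, while match() uses only the first match.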

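Postscript: as @RuiBarradas pointed out, technically you do not need drop = FALSE if you use df[x] instead. But I think it is helpful to know the drop = FALSE option for cases where you need to subset both rows and columns, as with df[, x].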
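To make the difference concrete, here is a minimal sketch on a toy data frame (the data and the variable names a, b, c1 and d are illustrative, not taken from the question):

```r
df <- data.frame(
  words1 = c("How", "did"),
  words2 = c("Why", "does"),
  stringsAsFactors = FALSE
)
x <- "words1"

a  <- df[, x]                  # matrix-style indexing: simplified to a vector
b  <- df[, x, drop = FALSE]    # matrix-style indexing, simplification disabled
c1 <- df[x]                    # list-style indexing: always a data.frame
d  <- df[1, x, drop = FALSE]   # row and column subset, still a data.frame

print(class(a))
print(class(b))
print(class(c1))
print(dim(d))
```

The last line is the case the postscript has in mind: list-style indexing with df[x] cannot select rows, so drop = FALSE is the way to keep a data.frame when rows and a single column are selected together.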

