FOR Loop Optimization in R

Problem Description

I have the following code, which takes forever to run on my 80k-row CBP table. Can anyone help me optimize my loop? I'm simply trying to find duplicates that share the same values in some (not all) columns, get the number of duplicates, and return the ids of each duplicate:

for (row in 1:nrow(CBP)){

    subs <- subset(CBP,
                   CBP$Lower_Bound__c == CBP[row, "Lower_Bound__c"] &
                   CBP$Price_Book__c  == CBP[row, "Price_Book__c"]  &
                   CBP$Price__c       == CBP[row, "Price__c"]       &
                   CBP$Product__c     == CBP[row, "Product__c"]     &
                   CBP$Department__c  == CBP[row, "Department__c"]  &
                   CBP$UOM__c         == CBP[row, "UOM__c"]         &
                   CBP$Upper_Bound__c == CBP[row, "Upper_Bound__c"])

    if (nrow(subs) > 1){
        CBP[row,]$dup    <- nrow(subs)
        CBP[row,]$dupids <- paste(subs[,"Id"], collapse = ",")
    }
    print(row)

}

Tags: r, loops

Solution


I have a hard time understanding your example. However, here is a simple data.table approach that may work in your case. You can create a counter variable (nsame in the example) that counts how many times a combination of several variables (var1 and var2 in the example) has occurred so far. Then just take the row indices where that counter is greater than 1.

library(data.table)

# generate some example data
dt <- data.table(
    var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
    var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
    var3 = 1:9
)

# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]

# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
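
To map this back onto the question, here is a minimal sketch under the assumption that CBP is the question's data.frame with the columns referenced in the loop (including Id, dup and dupids). Converting it to a data.table lets you compute the group size and the concatenated Ids in one grouped pass instead of one subset() call per row:

library(data.table)

# assumption: CBP already exists with the columns used in the original loop
setDT(CBP)

# the columns that define a duplicate, taken from the question's loop
dup_cols <- c("Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
              "Department__c", "UOM__c", "Upper_Bound__c")

# group size and comma-separated Ids for every combination of the key columns
CBP[ , `:=`(dup    = .N,
            dupids = paste(Id, collapse = ",")),
     by = dup_cols]

# keep the flags only for rows that actually have duplicates, as in the loop
CBP[dup <= 1, c("dup", "dupids") := .(NA_integer_, NA_character_)]

This avoids the 80k subset() calls, which is where the original loop spends nearly all of its time.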
