How can I replace missing values in a column with values computed from the same column?

Question

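I want to update the missing values in a column by group, where the replacement value (the median) is computed from that same column. As an example, here is a reprex based on the Titanic train.csv dataset. What I want to do is replace the missing values in the Age column with the median age of the group defined by the Sex, Embarked and Pclass variables (I converted these variables to factors before computing the median ages).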

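The only way I can think of is to subset the rows with missing ages, fill them in with the group median age, and then rbind the result back onto the rest of the dataset.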

    library(data.table)
    train <- fread("https://raw.githubusercontent.com/nybbles/kaggle/master/train.csv")
    train[,sex := factor(sex)]
    train[,survived := factor(survived, labels = c("did not survive","survived"))]
    train[embarked == "",embarked := "S"]
    train[,embarked := factor(embarked,labels = c("Cherbourg","Queenstown","Southampton"))]
    train[,pclass := factor(pclass,ordered = TRUE,levels = c(3,2,1))]

    train[is.na(age),.N] # 177 missing in age column
    age_imputed <- train[!is.na(age),.(age = median(age)),.(sex,embarked,pclass)] #Step 1
    age_missing <- train[is.na(age)] #Step 2
    train <- train[!is.na(age)] #Step 3
    age_missing[,age:=NULL] #Step 4
    age_missing <- age_imputed[age_missing,on = c("sex","embarked","pclass")] #Step 5
    train <- rbindlist(list(train,age_missing), use.names = TRUE)# Step 6

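Is there instead a "faster" way to do this by reference, rather than by subsetting the data? Subsetting the data and rbinding it back together seems like unnecessary manipulation to me, and it is error-prone. I tried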

    train[,Age := ifelse(is.na(Age),age_imputed$Age[which(age_imputed$Sex == train$Sex & age_imputed$Embarked == train$Embarked & age_imputed$Pclass == train$Pclass)],
                         Age)]

as a potential solution, but kept running into various errors.

Tags: r, data.table

Solution


You can try an update join, where the left table (.SD) is the set of rows with NA in Age:

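    # Note: the setup code in the question uses lowercase column names (age, sex, ...);
    # use whichever capitalization your table actually has.
    train[is.na(Age), Age :=
        age_imputed[.SD, on=.(Sex, Embarked, Pclass), x.Age]
    ]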
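
To see the update join work without downloading the CSV, here is a minimal sketch on a small made-up table. The names `toy` and `alt` and the data values are hypothetical and stand in for the Titanic data; only two grouping columns are used to keep it short. The second half shows a grouped assignment as an alternative; it assumes a data.table version that provides `fifelse` and is not part of the answer above.

```r
library(data.table)

# Hypothetical toy data standing in for the Titanic table (not the real file)
toy <- data.table(
  Sex    = c("female", "female", "female", "male", "male", "male"),
  Pclass = c(1, 1, 1, 3, 3, 3),
  Age    = c(30, NA, 40, 20, 22, NA)
)
alt <- copy(toy)  # untouched copy for the alternative approach below

# Group medians computed from the non-missing ages only
age_imputed <- toy[!is.na(Age), .(Age = median(Age)), by = .(Sex, Pclass)]

# Update join: .SD holds only the rows of toy with a missing Age, and
# x.Age is the Age column of age_imputed (the "x" table of the inner join).
# The assignment happens by reference, so no subsetting or rbind is needed.
toy[is.na(Age), Age := age_imputed[.SD, on = .(Sex, Pclass), x.Age]]

# Alternative (an assumption, not from the answer): compute the median
# inside each group and fill the gaps in a single grouped assignment.
alt[, Age := fifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    by = .(Sex, Pclass)]

print(toy)
```

The update join keeps the lookup table `age_imputed` around, which is handy if the medians are needed elsewhere; the grouped version avoids the intermediate table entirely.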
