Imputing missing strings in R

Problem description

I have a large data set in which about 20% of the string values are missing.

NAME      |  AREA
--------------------------
Andy      |  Sales
Andy      |  NA
Andy      |  Sales
Andy      |  Sales
Andy      |  NA
Andy      |  Sales
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  NA
Sandy     |  Construction
Sandy     |  Construction
Wendy     |  Planting
Wendy     |  Driving
Wendy     |  NA
Wendy     |  NA
Wendy     |  NA

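For most of my data it is fairly obvious what the missing value should be: Andy works in Sales and Sandy works in Construction. Wendy, however, has more than one observed area, so we cannot be sure about her.

The result I want is: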

NAME      |  AREA
--------------------------
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Andy      |  Sales
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Sandy     |  Construction
Wendy     |  Planting
Wendy     |  Driving
Wendy     |  NA
Wendy     |  NA
Wendy     |  NA

Which imputation package is best suited to this? Or do you perhaps have a better solution?

Thanks in advance!

Tags: r

Solution


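Perhaps you could try a conditional fill based on the number of distinct values in each group: fill the missing entries only when a group has exactly one distinct non-missing value, and leave the group untouched otherwise.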

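library(dplyr)

df %>%
  group_by(NAME) %>%
  # Fill only when the group has exactly one distinct non-NA value.
  # Take the first non-NA value, so a group whose first row is NA
  # is not filled with NA.
  mutate(AREA = if (n_distinct(AREA, na.rm = TRUE) == 1) first(AREA[!is.na(AREA)]) else AREA)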


#   NAME  AREA        
#   <fct> <fct>       
# 1 Andy  Sales       
# 2 Andy  Sales       
# 3 Andy  Sales       
# 4 Andy  Sales       
# 5 Andy  Sales       
# 6 Andy  Sales       
# 7 Sandy Construction
# 8 Sandy Construction
# 9 Sandy Construction
#10 Sandy Construction
#11 Sandy Construction
#12 Wendy Planting    
#13 Wendy Driving     
#14 Wendy NA          
#15 Wendy NA          
#16 Wendy NA      

Data

df <- structure(list(NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Andy", "Sandy", 
"Wendy"), class = "factor"), AREA = structure(c(4L, NA, 4L, 4L, 
NA, 4L, 1L, 1L, NA, 1L, 1L, 3L, 2L, NA, NA, NA), .Label = 
c("Construction", "Driving", "Planting", "Sales"), 
class = "factor")), class = "data.frame", row.names = c(NA, -16L))    
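
If you would rather not depend on dplyr, the same rule can be sketched in base R. This is only a minimal sketch under a few assumptions: the columns are plain character vectors rather than factors, the helper name fill_if_unambiguous is made up for illustration, and the result is written to a new column AREA_filled so the original values stay available for comparison.

```r
# Example data in the same shape as the question, as character columns
# (an assumption; the dput() data above uses factors).
df <- data.frame(
  NAME = c(rep("Andy", 6), rep("Sandy", 5), rep("Wendy", 5)),
  AREA = c("Sales", NA, "Sales", "Sales", NA, "Sales",
           "Construction", "Construction", NA, "Construction", "Construction",
           "Planting", "Driving", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs with the group's only observed value.
# If the group has zero or several distinct observed values, x is returned as is.
fill_if_unambiguous <- function(x) {
  observed <- unique(x[!is.na(x)])
  if (length(observed) == 1) x[is.na(x)] <- observed
  x
}

# ave() applies the helper within each NAME group and preserves row order.
df$AREA_filled <- ave(df$AREA, df$NAME, FUN = fill_if_unambiguous)
```

Andy and Sandy each have a single observed area, so their gaps are filled; Wendy has two, so her missing rows stay NA.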
