R: cleaning an unbalanced data frame

Question

My data frame looks like this:

c1 c2 c3 c4

T1 NA NA NA
NA a  NA NA
NA NA B  NA
NA NA NA b
T2 NA NA NA
NA NA C  NA
NA NA NA c

I would like to turn it into this:

c1 c2 c3 c4

T1 a  B b
T2 NA C c

I tried something like the following, which I found in another post, but I don't think it applies to my problem. Any help would be appreciated.

stri_list2matrix(lapply(., function(x) x[x!='NA']), fill='', byrow=FALSE)

Tags: r, dataframe, dplyr

Solution


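Here is an option using lapply from base R. Loop over the columns of the dataset and remove the NA elements with is.na (which returns a logical vector used for subsetting). Then pad each element of the resulting list with NA at the end, up to the maximum length of the list elements, and combine them with cbind.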

lst1 <- lapply(df1, function(x) x[!is.na(x)])
do.call(cbind, lapply(lst1, `length<-`, max(lengths(lst1))))
#    c1   c2  c3  c4 
#[1,] "T1" "a" "B" "b"
#[2,] "T2" NA  "C" "c"
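
To see the two steps in isolation, here is a minimal sketch on a single vector. The vector x is a hypothetical example, not part of df1. It also illustrates why comparing against the string 'NA', as in the attempt from the question, does not drop real missing values.

```r
# Hypothetical example vector; not taken from df1
x <- c("T1", NA, NA, "T2")

# Comparing with the string "NA" yields NA for the missing entries,
# and subsetting by NA keeps an NA in the result
x[x != "NA"]
# [1] "T1" NA   NA   "T2"

# is.na() returns TRUE/FALSE only, so the missing values are dropped
kept <- x[!is.na(x)]
kept
# [1] "T1" "T2"

# Assigning a larger length pads the end of the vector with NA
length(kept) <- 3
kept
# [1] "T1" "T2" NA
```

The call lapply(lst1, `length<-`, max(lengths(lst1))) applies the same length assignment to every column, so all columns end up equally long before cbind.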

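It can also be done with cbind.fill from rowr together with map from purrr (rowr may no longer be on CRAN, in which case it would need to be installed from the CRAN archive):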

library(purrr)
library(rowr)
map(df1, ~ .x[!is.na(.x)]) %>%
    reduce(cbind.fill, fill = NA) %>%
    set_names(names(df1))
#  c1   c2 c3 c4
#1 T1    a  B  b
#2 T2 <NA>  C  c

Or reshape to 'long' format while dropping the NA rows, then reshape back to 'wide' format:

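library(dplyr)   # group_by, mutate, row_number, select
library(tidyr)   # pivot_longer, pivot_wider
df1 %>% 
     # long format: one row per non-NA value
     pivot_longer(everything(), values_drop_na = TRUE) %>% 
     group_by(name) %>% 
     # position of each value within its original column
     mutate(rn = row_number()) %>%
     ungroup() %>%
     pivot_wider(names_from = name, values_from = value) %>%
     select(-rn)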
# A tibble: 2 x 4
#  c1    c2    c3    c4   
#  <chr> <chr> <chr> <chr>
#1 T1    a     B     b    
#2 T2    <NA>  C     c    

Or with melt/dcast from data.table:

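library(data.table)
# setDT() converts df1 to a data.table by reference; := adds the rn column in place
# melt() drops the NA rows; rowid(variable) numbers the values within each column
dcast(melt(setDT(df1)[, rn := seq_len(.N)], id.var = 'rn',
        na.rm = TRUE), rowid(variable) ~ variable, value.var = 'value')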

Data

df1 <- structure(list(c1 = c("T1", NA, NA, NA, "T2", NA, NA), c2 = c(NA, 
"a", NA, NA, NA, NA, NA), c3 = c(NA, NA, "B", NA, NA, "C", NA
), c4 = c(NA, NA, NA, "b", NA, NA, "c")), class = "data.frame",
row.names = c(NA, 
-7L))
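
For repeated use, the base R approach could be wrapped in a small helper that returns a data frame instead of a character matrix. This is only a sketch: the function name collapse_na and the toy data frame are assumptions made for illustration and do not appear in the original answer.

```r
# Hypothetical helper; the name collapse_na is an assumption
collapse_na <- function(df) {
  # Drop the NA values column by column
  lst <- lapply(df, function(x) x[!is.na(x)])
  # Pad every column with NA up to the longest remaining length
  n <- max(lengths(lst))
  padded <- lapply(lst, `length<-`, n)
  # Rebuild a data frame from the equally long columns
  as.data.frame(padded, stringsAsFactors = FALSE)
}

# Hypothetical toy input with the same shape as the problem
toy <- data.frame(c1 = c("T1", NA, "T2"),
                  c2 = c(NA, "a", NA),
                  stringsAsFactors = FALSE)
collapse_na(toy)
#   c1   c2
# 1 T1    a
# 2 T2 <NA>
```

Note that the padding is added at the end of each column, so values are aligned by their position within the column rather than by the T1/T2 block they came from. That matches the desired output here, but it would misalign values if an earlier block lacked an entry that a later block has.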
