Merge row data in R when there are many NAs

Problem description

I have the following data table:

library(data.table)
library(dplyr)  # needed for %>%, mutate(), across() and na_if() below
# Example table
table <- data.table(ID = c("Entity_A","Entity_A","Entity_B","Entity_B"),
                  Level = c("Individual_1","Individual_2","Individual_1","Individual_2"),
                  Amount1 = c("100","100","120","n.a."),
                  Amount2 = c("n.a.","40","30","30"),
                  Amount3 =c("20","n.a.","40","n.a."),
                  Amount4 =c("10","n.a.","n.a.","n.a.")
                  )
# Convert the "n.a." strings into real NA values
table <- table %>% mutate(across(where(is.character), ~na_if(., "n.a.")))
# Count the number of NAs in each row
table$na_count <- apply(table, 1, function(x) sum(is.na(x)))
# Show example table
table
         ID        Level Amount1 Amount2 Amount3 Amount4 na_count
1: Entity_A Individual_1     100    <NA>      20      10        1
2: Entity_A Individual_2     100      40    <NA>    <NA>        2
3: Entity_B Individual_1     120      30      40    <NA>        1
4: Entity_B Individual_2    <NA>      30    <NA>    <NA>        3

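For each entity (Entity_A, Entity_B, etc. in the "ID" column), I want to take the values that are available in the rows with the most NAs (according to the "na_count" column) and merge them into the corresponding row with the fewest NAs, provided there is actually information to merge. The resulting data frame would be: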

         ID        Level Amount1 Amount2 Amount3 Amount4
1: Entity_A Individual_1     100      40      20      10
2: Entity_B Individual_1     120      30      40    <NA>

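For example, for Entity_A, Amount2 is NA in the first row (Individual_1, the row with the fewest NAs for Entity_A), but it is available in the second row (Individual_2, the row with the most NAs for Entity_A). So the code should fill the first row with what is available in the second row. For Entity_B, row 4 holds no additional information that could be merged, so the final row stays the same as row 3. Can anyone help?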

Tags: r, na

Solution


arrange the data by ID and na_count, fill the NA values within each ID, and then select the first row of each group.

library(dplyr)
library(tidyr)

table %>%
  arrange(ID, na_count) %>%
  group_by(ID) %>%
  fill(starts_with('Amount'), .direction = 'updown') %>%
  slice(1L) %>%
  ungroup() %>%
  dplyr::select(-na_count)

#  ID       Level        Amount1 Amount2 Amount3 Amount4
#  <chr>    <chr>        <chr>   <chr>   <chr>   <chr>  
#1 Entity_A Individual_1 100     40      20      10     
#2 Entity_B Individual_1 120     30      40      NA     
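
Since the question starts from a data.table, the same idea can probably be expressed without dplyr/tidyr as well. The following is only a sketch under a few assumptions: the object name dt, the column vector cols and the helper first_non_na are illustrative names that do not appear in the original post, and the "n.a." strings are written directly as NA to keep the example self-contained. After ordering by ID and na_count, taking the first non-NA value of each column within an ID should give the same result as fill followed by slice(1L).

```r
library(data.table)

# Same example data as in the question, with "n.a." already written as NA
dt <- data.table(
  ID      = c("Entity_A", "Entity_A", "Entity_B", "Entity_B"),
  Level   = c("Individual_1", "Individual_2", "Individual_1", "Individual_2"),
  Amount1 = c("100", "100", "120", NA),
  Amount2 = c(NA, "40", "30", "30"),
  Amount3 = c("20", NA, "40", NA),
  Amount4 = c("10", NA, NA, NA)
)

# Count the NAs in each row
dt[, na_count := rowSums(is.na(.SD))]

# Hypothetical helper: first non-NA value of a vector, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Columns to collapse (every column except the grouping key and na_count)
cols <- c("Level", paste0("Amount", 1:4))

# Order so the row with the fewest NAs comes first within each ID,
# then keep the first non-NA value of every column per ID
merged <- dt[order(ID, na_count),
             lapply(.SD, first_non_na),
             by = ID,
             .SDcols = cols]

print(merged)
```

Because Level has no NAs, it is taken from the row with the fewest NAs in each ID, while each Amount column falls back to the next row only where the first one is NA. If two rows of the same ID hold different non-NA values for a column, this sketch keeps the value from the row with fewer NAs.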
