Why do I get duplicate rows when I use merge()?

Question

Why do some rows get duplicated when I merge two data sets? Here is an example:

# Output of dput(head(OverLaps)) and dput(head(comp)):
OverLaps<-structure(list(
               SAMPN = c("   19", "   19", "   19", "   78","  102", "  102"), 
               id = 1:6,
               overlap = c("3", NA, "1", NA, NA, NA),
               PERNO = structure(c(1L, 2L, 2L, 1L, 1L, 1L),
                 .Label = c("1","2", "3", "4", "5", "6", "7"),
                 class = "factor")),
               row.names = c(NA, 6L), class = "data.frame")

comp<-structure(list(
               SAMPN = c("   19", "   19", "   19", "   19","   78", "  102"), 
               MODE1 = structure(c(2L, 2L, 2L, 3L, 4L, 2L),
                  .Label = c("1", "2", "3", "4"), class = "factor"),
               PERNO = structure(c(1L, 2L, 2L, 2L, 1L, 1L),
                  .Label = c("1", "2", "3", "4", "5", "6", "7"),
                  class = "factor"),
               PLANO = structure(c(1L, 1L, 4L, 5L, 1L, 1L),
                  .Label = c(" 2", " 3", " 4", " 5", " 6", " 7", " 8", " 9", 
                  "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", 
                  "20", "21", "22", "23", "24", "27"), class = "factor"),
               loop = structure(c(2L,2L, 2L, 3L, 2L, 2L),
                  .Label = c("1", "2", "3", "4", "5", "6", "7", "8"),
                  class = "factor")),
               row.names = c(11L, 12L, 13L, 14L, 69L, 125L),
               class = "data.frame")

I merge them like this:

OverLaps1<-merge( OverLaps,comp, all.y = TRUE)

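If you look at the output, the id column in OverLaps is unique for every row. After the merge, however, several rows share the same id, so some rows have been duplicated.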

  SAMPN PERNO id overlap MODE1
1    19     1  1       3     2
2    19     2  2    <NA>     2
3    19     2  2    <NA>     2
4    19     2  2    <NA>     3
5    19     2  3       1     2
6    19     2  3       1     2

Structure:

OverLaps:

str(OverLaps)
'data.frame':   1676 obs. of  6 variables:
 $ SAMPN   : chr  "   19" "   19" "   19" "   19" ...
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ overlap : chr  "4" NA NA "1" ...
 $ PERNO   : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 1 1 1 1 2 2 ...

comp:

str(comp[1:5])
'data.frame':   1763 obs. of  5 variables:
 $ SAMPN: chr  "   19" "   19" "   19" "   19" ...
 $ MODE1: Factor w/ 4 levels "1","2","3","4": 2 2 2 3 4 2 2 2 2 4 ...
 $ PERNO: Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 1 1 1 1 2 2 ...
 $ PLANO: Factor w/ 24 levels " 2"," 3"," 4",..: 1 1 4 5 1 1 7 8 9 2 ...
 $ loop : Factor w/ 8 levels "1","2","3","4",..: 2 2 2 3 2 2 2 3 2 2 ...

Tags: r, dataframe

Answer


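The problem is that the key columns are not unique in either data frame: the same combination of SAMPN and PERNO occurs more than once on both sides. A merge pairs every matching row on one side with every matching row on the other, so joining them creates duplicates. In your sample, SAMPN 19 with PERNO 2 appears twice in OverLaps and three times in comp, which gives 2 × 3 = 6 rows for that key.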
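
To see where the extra rows come from, it can help to count how often each key combination occurs on each side. The sketch below uses small made-up data frames (`left` and `right` are hypothetical stand-ins for OverLaps and comp) and base R only. For each key, the number of merged rows is the product of the two counts.

```r
# Toy data (hypothetical): the key (SAMPN = "19", PERNO = "2") occurs
# twice in `left` and three times in `right`.
left <- data.frame(
  SAMPN = c("19", "19", "19"),
  PERNO = c("1", "2", "2"),
  id = 1:3,
  stringsAsFactors = FALSE
)
right <- data.frame(
  SAMPN = c("19", "19", "19", "19"),
  PERNO = c("1", "2", "2", "2"),
  MODE1 = c("2", "2", "2", "3"),
  stringsAsFactors = FALSE
)

# Count the rows per key combination on each side.
left_counts <- aggregate(n_left ~ SAMPN + PERNO,
                         data = transform(left, n_left = 1L), FUN = sum)
right_counts <- aggregate(n_right ~ SAMPN + PERNO,
                          data = transform(right, n_right = 1L), FUN = sum)

# Each key contributes n_left * n_right rows to the merged result.
key_counts <- merge(left_counts, right_counts)
key_counts$n_merged <- key_counts$n_left * key_counts$n_right
print(key_counts)

merged <- merge(left, right, all.y = TRUE)
print(nrow(merged))
```

Any key with a count above 1 on both sides is a many-to-many match, and those are the rows that look duplicated after the merge.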

I can't tell which data frame is OverLaps and which is comp, but assuming OverLaps is the first and comp is the second, you can use the dplyr package and do a left_join:

library(dplyr)
OverLaps$SAMPN<-as.character(OverLaps$SAMPN) # need to have the same type of variable across the dataframes.
OverLaps1<-left_join(OverLaps,comp,by=c('SAMPN'='SAMPN','PERNO'='PERNO')) # these are the overlapping keys in each dataframe.

   SAMPN id overlap PERNO MODE1 PLANO loop
1     19  1       3     1     2     2    2
2     19  2    <NA>     2     2     2    2
3     19  2    <NA>     2     2     5    2
4     19  2    <NA>     2     3     6    3
5     19  3       1     2     2     2    2
6     19  3       1     2     2     5    2
7     19  3       1     2     3     6    3
8     78  4    <NA>     1     4     2    2
9    102  5    <NA>     1     2     2    2
10   102  6    <NA>     1     2     2    2
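
The as.character() call above matters because join keys have to be comparable across the two data frames. A quick base R check of the key columns might look like the sketch below. Here `a` and `b` are hypothetical data frames, and trimming the padded SAMPN values is an assumption about what a clean key should look like, not something the original data necessarily needs.

```r
keys <- c("SAMPN", "PERNO")

# Hypothetical data: `a` has padded SAMPN strings and a factor PERNO,
# `b` has unpadded strings and a character PERNO.
a <- data.frame(SAMPN = c("   19", "   78"),
                PERNO = factor(c("1", "2")),
                stringsAsFactors = FALSE)
b <- data.frame(SAMPN = c("19", "78"),
                PERNO = c("1", "2"),
                stringsAsFactors = FALSE)

# Compare the class of each key column side by side.
key_classes <- data.frame(
  key = keys,
  a = vapply(a[keys], function(col) class(col)[1], character(1)),
  b = vapply(b[keys], function(col) class(col)[1], character(1)),
  row.names = NULL
)
print(key_classes)

# Without normalising, the padded "   19" never equals "19", so nothing matches.
before <- merge(a, b, by = keys)

# Normalise the keys: strip the padding and convert the factor to character.
a$SAMPN <- trimws(a$SAMPN)
a$PERNO <- as.character(a$PERNO)
after <- merge(a, b, by = keys)

print(c(before = nrow(before), after = nrow(after)))
```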

However, if SAMPN is the only key column that the two data frames in your str() output share, use the following instead:

library(dplyr)
OverLaps$SAMPN<-as.character(OverLaps$SAMPN) # need to have the same type of variable across the dataframes.
OverLaps1<-left_join(OverLaps,comp,by=c('SAMPN'='SAMPN'))
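
If the goal is to end up with exactly one row per id, one option is to make the key unique on the comp side before merging. This is only a sketch on made-up data (`left` and `right` are hypothetical stand-ins for OverLaps and comp). Keeping the first row per key is an assumption, and the right choice depends on which MODE1 row you actually want.

```r
# Toy data (hypothetical): `right` has three rows for the key ("19", "2").
left <- data.frame(
  SAMPN = c("19", "19", "19"),
  PERNO = c("1", "2", "2"),
  id = 1:3,
  stringsAsFactors = FALSE
)
right <- data.frame(
  SAMPN = c("19", "19", "19", "19"),
  PERNO = c("1", "2", "2", "2"),
  MODE1 = c("2", "2", "2", "3"),
  stringsAsFactors = FALSE
)

keys <- c("SAMPN", "PERNO")

# Assumption: keep only the first row of `right` for each key combination.
right_unique <- right[!duplicated(right[keys]), ]

# Each row of `left` can now match at most one row of `right_unique`.
merged <- merge(left, right_unique, by = keys, all.x = TRUE)
print(merged)

# Sanity checks: no id is repeated and the row count equals nrow(left).
print(anyDuplicated(merged$id) == 0)
print(nrow(merged) == nrow(left))
```

Dropping rows this way discards the other MODE1 values for the same key, so check that this loss is acceptable before using it on the real data.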
