r - Why do I get duplicated rows from the merge function?
Problem description
Why do some rows get duplicated when I merge two datasets? Here is an example:
dput(head(OverLaps))
OverLaps<-structure(list(
SAMPN = c(" 19", " 19", " 19", " 78"," 102", " 102"),
id = 1:6,
overlap = c("3", NA, "1", NA, NA, NA),
PERNO = structure(c(1L, 2L, 2L, 1L, 1L, 1L),
.Label = c("1","2", "3", "4", "5", "6", "7"),
class = "factor")),
row.names = c(NA, 6L), class = "data.frame")
comp<-structure(list(
SAMPN = c(" 19", " 19", " 19", " 19"," 78", " 102"),
MODE1 = structure(c(2L, 2L, 2L, 3L, 4L, 2L),
.Label = c("1", "2", "3", "4"), class = "factor"),
PERNO = structure(c(1L, 2L, 2L, 2L, 1L, 1L),
.Label = c("1", "2", "3", "4", "5", "6", "7"),
class = "factor"),
PLANO = structure(c(1L, 1L, 4L, 5L, 1L, 1L),
.Label = c(" 2", " 3", " 4", " 5", " 6", " 7", " 8", " 9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
"20", "21", "22", "23", "24", "27"), class = "factor"),
loop = structure(c(2L,2L, 2L, 3L, 2L, 2L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8"),
class = "factor")),
row.names = c(11L, 12L, 13L, 14L, 69L, 125L),
class = "data.frame")
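I merge them as follows:
OverLaps1<-merge(OverLaps, comp, all.y = TRUE)
If you look at the output, the id column in OverLaps is unique for every row. After the merge, however, several rows share the same id, so some rows are repeated.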
  SAMPN PERNO id overlap MODE1
1    19     1  1       3     2
2    19     2  2    <NA>     2
3    19     2  2    <NA>     2
4    19     2  2    <NA>     3
5    19     2  3       1     2
6    19     2  3       1     2
Structure:
OverLaps:
str(OverLaps)
'data.frame': 1676 obs. of 6 variables:
$ SAMPN : chr " 19" " 19" " 19" " 19" ...
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ overlap : chr "4" NA NA "1" ...
$ PERNO : Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 1 1 1 1 2 2 ...
comp:
str(comp[1:5])
'data.frame': 1763 obs. of 5 variables:
$ SAMPN: chr " 19" " 19" " 19" " 19" ...
$ MODE1: Factor w/ 4 levels "1","2","3","4": 2 2 2 3 4 2 2 2 2 4 ...
$ PERNO: Factor w/ 7 levels "1","2","3","4",..: 1 2 2 2 1 1 1 1 2 2 ...
$ PLANO: Factor w/ 24 levels " 2"," 3"," 4",..: 1 1 4 5 1 1 7 8 9 2 ...
$ loop : Factor w/ 8 levels "1","2","3","4",..: 2 2 2 3 2 2 2 3 2 2 ...
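Solution
The problem is that the key columns (SAMPN and PERNO) are not unique in either data frame. When you join them, each row on one side is paired with every row on the other side that has the same key, and that creates the repeated rows.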
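A minimal sketch of the effect, using two small hypothetical data frames (`left`, `right`, `key`, and `value` are illustrative names, not from the question). A key that appears m times on one side and n times on the other produces m × n rows in the result:

```r
# Hypothetical toy data: key "a" appears twice on each side, "b" once.
left  <- data.frame(key = c("a", "a", "b"), id = 1:3)
right <- data.frame(key = c("a", "a", "b"), value = c(10, 20, 30))

merged <- merge(left, right, by = "key")

# "a" contributes 2 x 2 = 4 rows, "b" contributes 1 x 1 = 1 row.
nrow(merged)      # 5
table(merged$id)  # ids 1 and 2 each appear twice, id 3 once
```

The `id` column is unique in `left`, yet it repeats in `merged`, which is the same pattern seen in the question.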
I can't tell from the question which data frame is OverLaps and which is comp, but assuming OverLaps is the first and comp is the second, we can use the dplyr package and do a left_join:
library(dplyr)
OverLaps$SAMPN<-as.character(OverLaps$SAMPN) # need to have the same type of variable across the dataframes.
OverLaps1<-left_join(OverLaps,comp,by=c('SAMPN'='SAMPN','PERNO'='PERNO')) # these are the overlapping keys in each dataframe.
   SAMPN id overlap PERNO MODE1 PLANO loop
1     19  1       3     1     2     2    2
2     19  2    <NA>     2     2     2    2
3     19  2    <NA>     2     2     5    2
4     19  2    <NA>     2     3     6    3
5     19  3       1     2     2     2    2
6     19  3       1     2     2     5    2
7     19  3       1     2     3     6    3
8     78  4    <NA>     1     4     2    2
9    102  5    <NA>     1     2     2    2
10   102  6    <NA>     1     2     2    2
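To see where the ten rows above come from, one option is to count how often each SAMPN/PERNO pair occurs on each side and multiply the counts. The sketch below uses base R only and copies just the key columns from the question's sample data; the names `OverLaps_keys`, `comp_keys`, `n_left`, `n_right`, and `rows_after_join` are introduced here for illustration.

```r
# Key columns copied from the question's sample data (other columns omitted,
# PERNO stored as character instead of factor for brevity).
OverLaps_keys <- data.frame(
  SAMPN = c(" 19", " 19", " 19", " 78", " 102", " 102"),
  PERNO = c("1", "2", "2", "1", "1", "1")
)
comp_keys <- data.frame(
  SAMPN = c(" 19", " 19", " 19", " 19", " 78", " 102"),
  PERNO = c("1", "2", "2", "2", "1", "1")
)

# Count how many rows carry each key pair on each side.
n_left  <- aggregate(n_left ~ SAMPN + PERNO,
                     data = transform(OverLaps_keys, n_left = 1), FUN = sum)
n_right <- aggregate(n_right ~ SAMPN + PERNO,
                     data = transform(comp_keys, n_right = 1), FUN = sum)

# Each key pair yields n_left * n_right rows in the joined result.
counts <- merge(n_left, n_right, by = c("SAMPN", "PERNO"))
counts$rows_after_join <- counts$n_left * counts$n_right

counts
sum(counts$rows_after_join)  # 10
```

The pair SAMPN " 19" / PERNO "2" occurs twice in OverLaps and three times in comp, so it alone accounts for six of the ten rows.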
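However, if SAMPN is the only column shared by the two data frames, as in your structure output, then you want to use the following instead: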
library(dplyr)
OverLaps$SAMPN<-as.character(OverLaps$SAMPN) # need to have the same type of variable across the dataframes.
OverLaps1<-left_join(OverLaps,comp,by=c('SAMPN'='SAMPN'))
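If the goal is exactly one output row per id, the right-hand table needs at most one row per key before joining. The question does not say which of the matching comp rows should be kept, so the sketch below simply keeps the first one per SAMPN/PERNO pair; that choice is an assumption, as is the name `comp_unique`. The data is a simplified copy of the question's sample (character columns instead of factors, some columns omitted).

```r
# Simplified copies of the question's sample data.
OverLaps <- data.frame(
  SAMPN = c(" 19", " 19", " 19", " 78", " 102", " 102"),
  PERNO = c("1", "2", "2", "1", "1", "1"),
  id    = 1:6
)
comp <- data.frame(
  SAMPN = c(" 19", " 19", " 19", " 19", " 78", " 102"),
  PERNO = c("1", "2", "2", "2", "1", "1"),
  MODE1 = c("2", "2", "2", "3", "4", "2")
)

# Assumption: keeping the first comp row per key pair is acceptable.
comp_unique <- comp[!duplicated(comp[c("SAMPN", "PERNO")]), ]

# With unique keys on the right, every OverLaps row matches at most once.
result <- merge(OverLaps, comp_unique, by = c("SAMPN", "PERNO"), all.x = TRUE)

nrow(result) == nrow(OverLaps)  # TRUE
anyDuplicated(result$id)        # 0
```

If the other comp rows carry information that matters (for example different MODE1 values), summarising them per key instead of dropping them may be the better choice.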