r - Merging two datasets on a shared column?
Problem description
I have two datasets like this:
training.csv
last_name   ob1   ob2
Adam        2:01  2:02
Barry, S    3:30  2:50
Barry, D          2:45
Charlie     4:00
Don         2:00  1:50
Earl        2:50  2:30
Johnson, A  2:57  2:54
Johnson, T  3:15  3:10
and
racing.csv
last_name  first_name  1mile-time  500m-time
Barry      Sue         4:45        1:50
Don        Regan       4:35        0:50
Earl       Sage        4:50        1:30
Johnson    Adam        4:37        1:54
Johnson    Terry       4:50        2:10
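So I used merge(training, racing, by = "last_name", all = TRUE), but some people share a last name. When a last name is shared, training.csv records it as the last name followed by a comma and the first initial.
Another important thing to note is that not everyone who trained got to race, so some names in training.csv do not appear in racing.csv.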
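To make the mismatch concrete, here is a minimal sketch using a reduced version of the two tables (the trimmed data frames are an assumption for brevity, not the full files):

```r
# Reduced, illustrative copies of the two tables (not the full CSV files)
training <- data.frame(last_name = c("Barry, S", "Barry, D", "Don"),
                       ob1 = c("3:30", "", "2:00"),
                       stringsAsFactors = FALSE)
racing <- data.frame(last_name = c("Barry", "Don"),
                     first_name = c("Sue", "Regan"),
                     stringsAsFactors = FALSE)

# Full outer join on the raw last_name column
naive <- merge(training, racing, by = "last_name", all = TRUE)
print(naive)
```

The racing row for "Barry" ends up on a line of its own instead of joining "Barry, S", because the two key strings are not identical.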
Desired output
last_name   first_name  ob1   ob2   1mile-time  500m-time
Adam        Bob         2:01  2:02
Barry, S    Sue         3:30  2:50  4:45        1:50
Barry, D    Derrick           2:45
Charlie     Charles     4:00
Don         Regan       2:00  1:50  4:35        0:50
Earl        Sage        2:50  2:30  4:50        1:30
Johnson, A  Adam        2:57  2:54  4:37        1:54
Johnson, T  Terry       3:15  3:10  4:50        2:10
Solution
You can try the following solution. Using the data you provided as input, we have:
#Data
df1 <- structure(list(last_name = c("Adam", "Barry, S", "Barry, D",
"Charlie", "Don", "Earl", "Johnson, A", "Johnson, T"), ob1 = c("2:01",
"3:30", "", "4:00", "2:00", "2:50", "2:57", "3:15"), ob2 = c("2:02",
"2:50", "2:45", "", "1:50", "2:30", "2:54", "3:10")), class = "data.frame", row.names = c(NA,
-8L))
df2 <- structure(list(last_name = c("Barry", "Don", "Earl", "Johnson",
"Johnson"), first_name = c("Sue", "Regan", "Sage", "Adam", "Terry"
), `1mile.time` = c("4:45", "4:35", "4:50", "4:37", "4:50"),
`500m.time` = c("1:50", "0:50", "1:30", "1:54", "2:10")), class = "data.frame", row.names = c(NA,
-5L))
Now the code:
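# Trim stray whitespace from the name columns
df1$last_name <- trimws(df1$last_name)
df2$last_name <- trimws(df2$last_name)
df2$first_name <- trimws(df2$first_name)
# Find racing rows whose last name already appears verbatim in the training data
index2 <- which(df2$last_name %in% df1$last_name)
df2$Empty <- NA
df2$Empty[index2] <- df2$last_name[index2]
# For the remaining rows, build the "Last, F" key used in the training data
df2$Empty <- ifelse(is.na(df2$Empty), paste0(df2$last_name, ', ', substring(df2$first_name, 1, 1)), df2$Empty)
# Use the new key as last_name
df3 <- df2
df3$last_name <- df3$Empty
df3$Empty <- NULL
# Left join: keep every training row, whether or not it has a racing match
Merged <- merge(df1, df3, by = 'last_name', all.x = TRUE)
# Reorder columns: last_name, first_name, ob1, ob2, 1mile.time, 500m.time
Merged <- Merged[, c(1, 4, 2, 3, 5, 6)]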
You will end up with this:
   last_name first_name  ob1  ob2 1mile.time 500m.time
1       Adam       <NA> 2:01 2:02       <NA>      <NA>
2   Barry, D       <NA>      2:45       <NA>      <NA>
3   Barry, S        Sue 3:30 2:50       4:45      1:50
4    Charlie       <NA> 4:00            <NA>      <NA>
5        Don      Regan 2:00 1:50       4:35      0:50
6       Earl       Sage 2:50 2:30       4:50      1:30
7 Johnson, A       Adam 2:57 2:54       4:37      1:54
8 Johnson, T      Terry 3:15 3:10       4:50      2:10
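We use an index to check which last names already match, then create a new id for the merge. Keep in mind that not every last name in your data has a corresponding first name, so first_name will be NA for some rows.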
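The key-building step can also be written as a single vectorised expression. The following is only a sketch under the same assumptions as above (shared last names in the training data always follow the "Last, F" pattern); the helper name build_key is hypothetical, and the data frames are reduced to the columns the key needs:

```r
# Hypothetical helper: build a merge key for the racing data that follows
# the "Last, F" convention used in the training data.
build_key <- function(last, first, known_keys) {
  # Keep the bare last name when the training data uses it as-is;
  # otherwise append ", " plus the first initial.
  ifelse(last %in% known_keys,
         last,
         paste0(last, ", ", substr(first, 1, 1)))
}

training <- data.frame(
  last_name = c("Adam", "Barry, S", "Barry, D", "Charlie",
                "Don", "Earl", "Johnson, A", "Johnson, T"),
  stringsAsFactors = FALSE
)
racing <- data.frame(
  last_name  = c("Barry", "Don", "Earl", "Johnson", "Johnson"),
  first_name = c("Sue", "Regan", "Sage", "Adam", "Terry"),
  stringsAsFactors = FALSE
)

racing$last_name <- build_key(racing$last_name, racing$first_name,
                              training$last_name)
print(racing$last_name)

# Racing keys that still have no counterpart in the training data
unmatched <- setdiff(racing$last_name, training$last_name)
print(unmatched)
```

Checking unmatched before calling merge() is a cheap way to spot racers whose key would otherwise be dropped silently by all.x = TRUE; with the data above it is empty.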