Combining two data sets with one shared column?

Problem description

I have two data sets like this:

training.csv
last_name   ob1   ob2 
Adam        2:01  2:02
Barry, S    3:30  2:50
Barry, D          2:45
Charlie     4:00  
Don         2:00  1:50
Earl        2:50  2:30
Johnson, A  2:57  2:54
Johnson, T  3:15  3:10

racing.csv
last_name    first_name   1mile-time   500m-time
Barry        Sue          4:45         1:50
Don          Regan        4:35         0:50
Earl         Sage         4:50         1:30
Johnson      Adam         4:37         1:54
Johnson      Terry        4:50         2:10

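So I used merge(training, racing, by = "last_name", all = TRUE), but some people share a last name. Where a last name is shared, training.csv records it as the last name followed by a comma and the first initial.

Another important thing to note is that not everyone who trains gets to race, so some names appear in training.csv but not in racing.csv.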
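
To make the mismatch concrete, here is a minimal sketch on a cut-down subset of the data above. The data frames training and racing, and the result name naive, are stand-ins chosen for illustration; they are not the actual files.

```r
# Cut-down copies of the two tables above (illustration only)
training <- data.frame(
  last_name = c("Barry, S", "Barry, D", "Don"),
  ob1 = c("3:30", "", "2:00"),
  stringsAsFactors = FALSE
)
racing <- data.frame(
  last_name = c("Barry", "Don"),
  first_name = c("Sue", "Regan"),
  stringsAsFactors = FALSE
)

# Full outer join on the raw last_name column
naive <- merge(training, racing, by = "last_name", all = TRUE)

# "Don" matches, but "Barry" never equals "Barry, S" or "Barry, D",
# so Sue's race row ends up on a line of its own with no training times.
naive[naive$last_name == "Barry", ]
```

The join compares the key strings exactly, so the racing row for Sue Barry cannot attach to "Barry, S" until both tables use the same key.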

Desired output

last_name   first_name   ob1   ob2   1mile-time   500m-time
Adam        Bob          2:01  2:02
Barry, S    Sue          3:30  2:50   4:45         1:50
Barry, D    Derrick            2:45
Charlie     Charles      4:00  
Don         Regan        2:00  1:50   4:35         0:50
Earl        Sage         2:50  2:30   4:50         1:30
Johnson, A  Adam         2:57  2:54   4:37         1:54
Johnson, T  Terry        3:15  3:10   4:50         2:10

Tags: r, csv, dplyr, etl, data-cleaning

Solution

You can try the following solution. Using the data you provided as input, we have:

#Data
df1 <- structure(list(last_name = c("Adam", "Barry, S", "Barry, D", 
"Charlie", "Don", "Earl", "Johnson, A", "Johnson, T"), ob1 = c("2:01", 
"3:30", "", "4:00", "2:00", "2:50", "2:57", "3:15"), ob2 = c("2:02", 
"2:50", "2:45", "", "1:50", "2:30", "2:54", "3:10")), class = "data.frame", row.names = c(NA, 
-8L))
df2 <- structure(list(last_name = c("Barry", "Don", "Earl", "Johnson", 
"Johnson"), first_name = c("Sue", "Regan", "Sage", "Adam", "Terry"
), `1mile.time` = c("4:45", "4:35", "4:50", "4:37", "4:50"), 
    `500m.time` = c("1:50", "0:50", "1:30", "1:54", "2:10")), class = "data.frame", row.names = c(NA, 
-5L))

Now the code:

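# Trim stray whitespace from the name columns
df1$last_name <- trimws(df1$last_name)
df2$last_name <- trimws(df2$last_name)
df2$first_name <- trimws(df2$first_name)
# Rows of df2 whose last name appears verbatim in df1 keep that name as the key
index2 <- which(df2$last_name %in% df1$last_name)
df2$Empty <- NA
df2$Empty[index2] <- df2$last_name[index2]
# Every other row gets a "Last, F" key (last name, comma, first initial)
df2$Empty <- ifelse(is.na(df2$Empty), paste0(df2$last_name, ', ', substring(df2$first_name, 1, 1)), df2$Empty)
# Use the new key as last_name and drop the helper column
df3 <- df2
df3$last_name <- df3$Empty
df3$Empty <- NULL
# Left join: keep every row of df1, matched or not
Merged <- merge(df1, df3, by = 'last_name', all.x = TRUE)
# Reorder columns: last_name, first_name, ob1, ob2, then the race times
Merged <- Merged[, c(1, 4, 2, 3, 5, 6)]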

You will end up with this:

   last_name first_name  ob1  ob2 1mile.time 500m.time
1       Adam       <NA> 2:01 2:02       <NA>      <NA>
2   Barry, D       <NA>      2:45       <NA>      <NA>
3   Barry, S        Sue 3:30 2:50       4:45      1:50
4    Charlie       <NA> 4:00            <NA>      <NA>
5        Don      Regan 2:00 1:50       4:35      0:50
6       Earl       Sage 2:50 2:30       4:50      1:30
7 Johnson, A       Adam 2:57 2:54       4:37      1:54
8 Johnson, T      Terry 3:15 3:10       4:50      2:10

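We use an index to check the names and then create a new id for the merge. Keep in mind that in your data not every person in training.csv has a row in racing.csv, so first_name and the race times are NA for those rows.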
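
The key-building step can also be wrapped in a small function. The following is a minimal sketch rather than part of the code above: make_key and the trimmed-down training and racing data frames are hypothetical names introduced for illustration, and it assumes that a first initial is enough to tell apart people who share a last name. It also lists racing keys that find no partner in the training data, since merge(..., all.x = TRUE) would drop those rows without warning.

```r
# Hypothetical helper (not from the original answer): build the merge key
# for the racing data so that it matches the naming used in the training data.
make_key <- function(last, first, known_keys) {
  last  <- trimws(last)
  first <- trimws(first)
  # Keep the plain last name when it appears verbatim among the known keys;
  # otherwise fall back to "Last, F" (last name, comma, first initial).
  ifelse(last %in% known_keys,
         last,
         paste0(last, ", ", substring(first, 1, 1)))
}

# Trimmed-down stand-ins for the two data sets (illustration only)
training <- data.frame(
  last_name = c("Adam", "Barry, S", "Barry, D", "Don", "Johnson, A", "Johnson, T"),
  ob1 = c("2:01", "3:30", "", "2:00", "2:57", "3:15"),
  stringsAsFactors = FALSE
)
racing <- data.frame(
  last_name  = c("Barry", "Don", "Johnson", "Johnson"),
  first_name = c("Sue", "Regan", "Adam", "Terry"),
  stringsAsFactors = FALSE
)

racing$last_name <- make_key(racing$last_name, racing$first_name,
                             training$last_name)

# Racing keys with no counterpart in the training data; a left join on
# training would silently drop these rows, so inspect them first.
unmatched <- setdiff(racing$last_name, training$last_name)

merged <- merge(training, racing, by = "last_name", all.x = TRUE)
```

If unmatched is not empty, the corresponding keys need fixing by hand (for example when two people share both a last name and a first initial) before the merge result can be trusted.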

