首页 > 解决方案 > Automatic matching of right column

问题描述

In R, having the following two dataframes:

df1 <- structure(list(id = 1:8, value = c(100L, 100L, 100L, 100L, 100L, 
100L, 100L, 100L)), class = "data.frame", row.names = c(NA, -8L))

df2 <- structure(list(randomColumnName = c(100L, 2L, 3L, 40L, 5L, 6L, 73L, 8L), OtherRandomColumn =    structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "'test'", class = "factor"), value = c(100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L)), class = "data.frame", row.names = c(NA, -8L))

print(df1)

  id value
1  1   100
2  2   100
3  3   100
4  4   100
5  5   100
6  6   100
7  7   100
8  8   100

print(df2)

  randomColumnName OtherRandomColumn value
1              100            'test'   100
2                2            'test'   100
3                3            'test'   100
4               40            'test'   100
5                5            'test'   100
6                6            'test'   100
7               73            'test'   100
8                8            'test'   100

Dataframe 1 has a unique id column called 'id'. I know that df2 also has a column which contains some but not all id's. Upfront, I do not know the column name of the corresponding column in dataframe 2.

Question:

Is there any handy function or mechanism to check the first few entries (e.g. 50) of each column in df2, compare them to the id column of df, and return the column name of df2 that is most likely to match the ID column?

In this case, it should return that column 'randomColumnName' is the most likely column to match df1's ID column.

标签: rdataframe

解决方案


You can try to loop over df2 and check each column against df1$id. The column with the biggest sum is your column to merge on. Note that the output of that is a named vector so you can use either the name or its index.

i1 <- which.max(sapply(df2, function(i) sum(df1$id == i)))
names(i1)
#[1] "randomColumnName"
i1
#randomColumnName 
#               1 

推荐阅读