Merging values from another data frame based on partial string matching when the matching columns are in a different order

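Problem description

I want to merge a column from df2 into df1 by matching df1$District_name against df2$Districts. However, the character values in df1$District_name and df2$Districts are in a different order, and df1 and df2 also differ in length. The values do not match exactly either. df1 has more rows than df2, so the extra district names should get a corresponding value of zero.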

 df1=data.frame(State_name=c("Maharashtra","Andhra Pradesh","Bihar","Bihar","West Bengal","Gujarat","Gujarat","Assam"),
           District_name=c("Nashik","Chittoor","Madhepura","Kishanganj","Howrah","Gandhinagar","Ahmadabad","Sivasagar"),
           Value1=c(5,3,6,4,4,3,2,4))

df2=data.frame(Districts=c("Nashik","Chitoor","Kishanganj","Madhepur","Sibhasagar","Ahmadabad"),
           FinanceIndex=c(0.20975,0.12187,0.37155,0.66128,0.10918,0.54730))


# df1
      State_name District_name Value1
1    Maharashtra        Nashik      5
2 Andhra Pradesh      Chittoor      3
3          Bihar     Madhepura      6
4          Bihar    Kishanganj      4
5    West Bengal        Howrah      4
6        Gujarat   Gandhinagar      3
7        Gujarat     Ahmadabad      2
8          Assam     Sivasagar      4

# df2
      Districts FinanceIndex
1        Nashik      0.20975
2       Chitoor      0.12187
3    Kishanganj      0.37155
4      Madhepur      0.66128
5    Sibhasagar      0.10918
6     Ahmadabad      0.54730

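I used the match function, but because of the spelling differences most of the values end up as zero.

# Exact matching: any district whose spelling differs gets NA
index <- match(df1$District_name, df2$Districts)
df1$finindex <- df2$FinanceIndex[index]
# Replace unmatched entries with zero
df1$finindex[is.na(df1$finindex)] <- 0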

For string matching, I found this function, which matches words that sound alike:

library(RecordLinkage)
soundex('Nellore')==soundex('Vellore')
#FALSE

The output should be:

# df1
     State_name District_name Value1 finindex
1    Maharashtra        Nashik      5  0.20975
2 Andhra Pradesh      Chittoor      3  0.12187
3          Bihar     Madhepura      6  0.66128
4          Bihar    Kishanganj      4  0.37155
5    West Bengal        Howrah      4  0.00000
6        Gujarat   Gandhinagar      3  0.00000
7        Gujarat     Ahmadabad      2  0.54730
8          Assam     Sivasagar      4  0.10918

Is there a way to use these two functions together to solve this? Or is there some other way to solve it?

Tags: r, dictionary, match

Solution


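One option is to do a partial match based on string distance, using stringdist_left_join from the fuzzyjoin package:

library(fuzzyjoin)
library(dplyr)  # provides %>% and select()
stringdist_left_join(df1, df2, by = c("District_name" = "Districts")) %>%
    select(-Districts)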
#      State_name District_name Value1 FinanceIndex
#1    Maharashtra        Nashik      5      0.20975
#2 Andhra Pradesh      Chittoor      3      0.12187
#3          Bihar     Madhepura      6      0.66128
#4          Bihar    Kishanganj      4      0.37155
#5    West Bengal        Howrah      4           NA
#6        Gujarat   Gandhinagar      3           NA
#7        Gujarat     Ahmadabad      2      0.54730
#8          Assam     Sivasagar      4      0.10918
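
The join above leaves NA for the unmatched districts, whereas the question asks for zeros. As a rough alternative that needs no extra packages, here is a minimal sketch using base R's adist (generalized Levenshtein distance). The threshold max_dist = 2 and the variable names d, best, best_dist are assumptions chosen for this illustration, not part of the original answer.

```r
# Sample data from the question
df1 <- data.frame(
  State_name = c("Maharashtra", "Andhra Pradesh", "Bihar", "Bihar",
                 "West Bengal", "Gujarat", "Gujarat", "Assam"),
  District_name = c("Nashik", "Chittoor", "Madhepura", "Kishanganj",
                    "Howrah", "Gandhinagar", "Ahmadabad", "Sivasagar"),
  Value1 = c(5, 3, 6, 4, 4, 3, 2, 4)
)
df2 <- data.frame(
  Districts = c("Nashik", "Chitoor", "Kishanganj", "Madhepur",
                "Sibhasagar", "Ahmadabad"),
  FinanceIndex = c(0.20975, 0.12187, 0.37155, 0.66128, 0.10918, 0.54730)
)

# Edit-distance matrix: one row per df1 district, one column per df2 district
d <- adist(df1$District_name, df2$Districts)

# For each df1 row, the position of the closest df2 name and its distance
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

# Assumed threshold: accept a match only if at most 2 edits apart
max_dist <- 2
df1$finindex <- ifelse(best_dist <= max_dist, df2$FinanceIndex[best], 0)

print(df1)
```

The threshold matters: too small and genuine variants such as Sivasagar / Sibhasagar are missed, too large and unrelated names start to match, so it is worth checking the matched pairs by eye on real data. If you stay with the fuzzyjoin result instead, the remaining NA values can be set to zero with the same is.na() replacement used in the question.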
