Joining on multiple columns with stringdist_join

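Problem description

I have two data frames in which column x may contain typos and column y is always correct. I don't understand why joining on multiple columns with stringdist_left_join gives these pairs: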

library(dplyr)
library(fuzzyjoin)
a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6"))

b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), y = c("1","2", "3", "2","6"))

c <- a %>%
  stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0))

      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   1   seson    2
3  season   1   seson    3
4  season   2   seson    2
5  season   3  season    1
6  season   3   seson    2
7  season   3   seson    3
8 package   1 package    2
9 package   6    <NA> <NA>

I would like to get

      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   2   seson    2
3  season   3   seson    3
4 package   1    <NA> <NA>
5 package   6 pakkage    6


Solution


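We can do this by creating a new column in both data sets that groups the values of "x" by similarity (here, by their phonetic code), and then performing a left_join on that column together with "y":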

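library(stringdist)
library(dplyr)

a %>%
  # phonetic() uses soundex by default: similar-sounding spellings share a code
  mutate(grp = phonetic(x)) %>%
  # copy b's y into y2 so it stays visible after y is used as a join key
  left_join(b %>% mutate(grp = phonetic(x), y2 = y), by = c("grp", "y")) %>%
  select(-grp)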

Output:

#      x.x y     x.y   y2
#1  season 1  season    1
#2  season 2   seson    2
#3  season 3   seson    3
#4 package 1    <NA> <NA>
#5 package 6 pakkage    6
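
To see why the join on grp pairs the misspelled rows with the correct ones, it can help to look at the codes themselves. This is a minimal sketch, assuming the stringdist package is installed and phonetic() is called with its default method (soundex). The vector name words is only illustrative.

```r
library(stringdist)

# Illustrative vector of the spellings that appear in a$x and b$x.
words <- c("season", "seson", "package", "pakkage")

# phonetic() returns one code per input string.
codes <- phonetic(words)

# Spellings that share a code end up in the same 'grp' in the join above.
data.frame(word = words, code = codes)
```

Note that this groups by sound rather than by edit distance, so two words that are spelled quite differently can still share a code and be joined.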

Another option is to change the method argument of stringdist_left_join from its default, osa (optimal string alignment, i.e. restricted Damerau-Levenshtein distance), to soundex (a distance based on soundex encoding):

library(fuzzyjoin)
a %>%
   stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0), 
            method = "soundex")
#      x.x y.x     x.y  y.y
#1  season   1  season    1
#2  season   2   seson    2
#3  season   3   seson    3
#4 package   1    <NA> <NA>
#5 package   6 pakkage    6
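
A quick way to check what the soundex method does with these particular values is to compute the distances directly. This sketch assumes only the stringdist package. The default osa method is included for contrast.

```r
library(stringdist)

# Default method ("osa") counts edits, so a single typo gives distance 1.
d_osa_season  <- stringdist("season", "seson")
d_osa_package <- stringdist("package", "pakkage")

# method = "soundex" compares soundex codes instead of raw characters:
# the distance is 0 when the codes agree and 1 when they differ.
d_sdx_same <- stringdist("season", "seson", method = "soundex")
d_sdx_diff <- stringdist("season", "package", method = "soundex")

c(d_osa_season, d_osa_package, d_sdx_same, d_sdx_diff)
```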

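According to ?"stringdist-metrics":

For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ascii characters are encountered.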
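
As for why the original call produced the unexpected pairs: as far as we can tell, stringdist_join documents max_dist as a single maximum distance, so a vector such as c(1, 0) is probably not applied per column. The output in the question is consistent with the vector being recycled across the compared value pairs instead. If the goal is "x within one edit, y exactly equal", a more explicit route is fuzzy_left_join() with one match function per column. This is a minimal sketch under that assumption; within_one_edit is a hypothetical helper name, not part of either package.

```r
library(fuzzyjoin)
library(stringdist)

a <- data.frame(x = c("season", "season", "season", "package", "package"),
                y = c("1", "2", "3", "1", "6"), stringsAsFactors = FALSE)
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"),
                y = c("1", "2", "3", "2", "6"), stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when two strings are at most one OSA edit apart.
within_one_edit <- function(u, v) stringdist(u, v, method = "osa") <= 1

# One match function per column in `by`: fuzzy match on x, exact match on y.
res <- fuzzy_left_join(a, b, by = c("x", "y"),
                       match_fun = list(within_one_edit, `==`))
res
```

This keeps edit-distance semantics for x and avoids applying a string distance to y at all.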

