R: Is there a way to find partial string matches between two string columns in two different data frames, while requiring the same first character?

Question

I have two string columns, df1$name and df2$name, in two different data frames, df1 and df2. df1 has more than 10,000 rows, while df2 has a little over 200 rows. For example:

df1 <- data.frame(name = c("Peter P", "Jim Gordon",  "Bruce Wayne", "Tony Stark","Mony Blake" ))

df2<- data.frame(name = c( "Jeter P", "Bruce Wayne", "Mony Blake" ))

Note: the real data frames are much larger than these.

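I first used the merge function. It matched the rows the two data frames have in common, but found nothing for "Jeter P". I then used amatch, the partial-matching function from the stringdist package, with method = "lv". It matched Peter P to Jeter P, who are two different people. I know amatch allows letters to be changed, shifted and so on, but I would like the function to search df1 while keeping the first character of the string the same.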
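The two attempts described above can be reproduced roughly as follows. This is only a sketch: the question does not state the maxDist value that was used, so maxDist = 2 is an assumption, as is stringsAsFactors = FALSE.

```r
library(stringdist)

df1 <- data.frame(name = c("Peter P", "Jim Gordon", "Bruce Wayne",
                           "Tony Stark", "Mony Blake"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Jeter P", "Bruce Wayne", "Mony Blake"),
                  stringsAsFactors = FALSE)

# Exact matching: only names present in both data frames survive,
# so "Jeter P" is dropped.
exact <- merge(df1, df2, by = "name")

# Fuzzy matching with Levenshtein distance. For each df2 name, amatch
# returns the index of the closest df1 name within maxDist (assumed: 2).
idx <- amatch(df2$name, df1$name, method = "lv", maxDist = 2)

# "Jeter P" is paired with "Peter P", which is the unwanted match.
fuzzy <- data.frame(df2_name = df2$name, df1_match = df1$name[idx])
fuzzy
```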

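Basically, when I do partial string matching for Jeter P in df2$name, it should only consider rows of df1$name that start with J as potential partial matches. Is that possible?

Thanks in advance.

Tags: r, string, string-matching

Answer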


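@RonakShah posted a version of this earlier today, but then deleted it because his solution didn't quite meet the requirements.

The idea is to use the fuzzyjoin package, which has a number of functions for fuzzy matching between two datasets. None of them does exactly what this question asks, but here is a longer answer that should.

The stringdist_inner_join function does regular fuzzy matching. It works by constructing a complicated function to pass to fuzzy_join. The package doesn't export that function, but you can write your own function (I'm calling it stringdist_match) that creates the matching function and returns it. Then combine it with a function that compares first letters, and use the combined function (custom_match) in fuzzy_join. Here is some code. Most of the stringdist_match function is copied from the fuzzyjoin package.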

library(fuzzyjoin)

stringdist_match <- function(max_dist = 2,
                            method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                                       "cosine", "jaccard", "jw", "soundex"),
                            mode = "inner",
                            ignore_case = FALSE,
                            distance_col = NULL, ...) {
  # It's a good idea to force evaluation of all the arguments
  # in case they get changed between when we call this function and 
  # when we use the function it returns.

  force(max_dist)
  force(mode)
  force(ignore_case)
  force(distance_col)
  forceotherargs <- list(...)

  method <- match.arg(method)

  if (method == "soundex") {
    # soundex always returns 0 or 1, so any other max_dist would
    # lead either to always matching or never matching
    max_dist <- .5
  }

  function(v1, v2) {
    if (ignore_case) {
      v1 <- stringr::str_to_lower(v1)
      v2 <- stringr::str_to_lower(v2)
    }

    # shortcut for Levenshtein-like methods: if the difference in
    # string length is greater than the maximum string distance, the
    # edit distance must be at least that large

    # length is much faster to compute than string distance
    if (method %in% c("osa", "lv", "dl")) {
      length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
      include <- length_diff <= max_dist

      dists <- rep(NA, length(v1))

      dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method, ...)
    } else {
      # have to compute them all
      dists <- stringdist::stringdist(v1, v2, method = method, ...)
    }
    ret <- tibble::tibble(include = (dists <= max_dist))
    if (!is.null(distance_col)) {
      ret[[distance_col]] <- dists
    }
    ret
  }
}

# Now the example.  First, create a matching function that
# just does the fuzzy part.
fuzzy_match <- stringdist_match()

# Next create a matching function that just compares first letters.
first_letter_match <- function(col1, col2) 
  sub("(^.).*", "\\1", col1) == sub("(^.).*", "\\1", col2)

# Now create one that requires both to match.
custom_match <- function(col1, col2) 
  first_letter_match(col1, col2) & fuzzy_match(col1, col2)

# Now run the example

df1 <- data.frame(name = c("Peter P", "Jim Gordon",  "Bruce Wayne", "Tony Stark","Mony Blake" ))

df2<- data.frame(name = c( "Jeter P", "Bruce Wayne", "Mony Blake" ))

fuzzy_inner_join(df1, df2, by = "name", match_fun = custom_match)
#>        name.x      name.y
#> 1 Bruce Wayne Bruce Wayne
#> 2  Mony Blake  Mony Blake

Created on 2020-02-21 by the reprex package (v0.3.0)

For documentation of all the arguments to stringdist_match, see ?fuzzyjoin::stringdist_join
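If the full set of stringdist_join options is not needed, the same idea can be sketched more compactly by calling stringdist::stringdist directly inside the matching function. This is an illustrative alternative, not the code from the fuzzyjoin package: first_letter_lv is a hypothetical helper name, and the "lv" method and max_dist = 2 default are assumptions.

```r
library(fuzzyjoin)
library(stringdist)

# Hypothetical helper (not part of fuzzyjoin): TRUE where both strings
# start with the same character AND are within max_dist edits of each other.
first_letter_lv <- function(x, y, max_dist = 2) {
  x <- as.character(x)
  y <- as.character(y)
  substr(x, 1, 1) == substr(y, 1, 1) &
    stringdist(x, y, method = "lv") <= max_dist
}

df1 <- data.frame(name = c("Peter P", "Jim Gordon", "Bruce Wayne",
                           "Tony Stark", "Mony Blake"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Jeter P", "Bruce Wayne", "Mony Blake"),
                  stringsAsFactors = FALSE)

# fuzzy_inner_join calls match_fun on pairs of values from the "name"
# columns and keeps the pairs for which it returns TRUE.
res <- fuzzy_inner_join(df1, df2, by = "name", match_fun = first_letter_lv)
res
```

This version omits the length-difference shortcut used in stringdist_match, so it computes a distance for every candidate pair. That is simpler to read, but may be slower on large inputs.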

