Fuzzy matching and replacing strings in a data frame using a custom dictionary

Question

I have this data frame of similar strings (strings that differ only by small spelling variations):

 place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  ")
 place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
 place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

 places2clean <- data.frame(place1, place2, place3)

Here is my custom dictionary:

  dictionnary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")

  dictionnary <- data.frame(dictionnary)

I want to match every string against the custom dictionary and replace it with the corresponding dictionary entry.

Expected result:

     place1     place2     place3
 Pondichéry    Lorient    Lorient
 Pondichéry Pondichéry Pondichéry
 Pondichéry    Lorient      Brest
 Port-Louis Port-Louis Port Louis
 Port-Louis Port-Louis     Nantes

How can I use string distance to match and replace the strings across the whole data frame?

Tags: r, fuzzy-search, fuzzy-comparison

Answer


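Either the base R function adist or the stringdist::amatch function will work here. There is no reason to turn your dictionary into a data.frame, so I have kept it as a plain character vector.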

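If you want to experiment, you can try the different methods offered by the stringdist package, but the defaults work fine here. Note that both functions pick the best match, but return NA when there is no close match (as defined by the maxDist argument).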
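To make the role of maxDist concrete, here is a small sketch that uses only base R's adist. The two input strings and the threshold of 5 are assumptions chosen for illustration: the first is one edit away from a dictionary entry, the second is far from all of them and therefore ends up as NA.

```r
# Illustrative sketch (base R only); the inputs are made up for this demo.
dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
inputs <- c("Lorent", "a string unlike any place name")

# One row per input, one column per dictionary entry (Levenshtein distances)
d <- adist(inputs, dictionary)
dimnames(d) <- list(inputs, dictionary)
print(d)

# Keep the closest entry only if it is within maxDist edits
maxDist <- 5
best <- apply(d, 1, which.min)
best[apply(d, 1, min) > maxDist] <- NA
cleaned <- unname(dictionary[best])
print(cleaned)
# [1] "Lorient" NA
```

Printing the distance matrix first is a convenient way to choose a sensible maxDist for your own data.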

library(stringdist)
# Using stringdist package
clean_places <- function(places, dictionary, maxDist = 5) {
  dictionary[amatch(places, dictionary, maxDist = maxDist)]
}

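# Using base R
clean_places2 <- function(places, dictionary, maxDist = 5) {
  sm <- adist(places, dictionary)
  sm[sm > maxDist] <- NA
  # which.min() returns integer(0) for an all-NA row, so map that case to NA
  idx <- apply(sm, 1, function(d) if (all(is.na(d))) NA_integer_ else which.min(d))
  dictionary[idx]
}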

dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

clean_places(place1, dictionary)
# [1] "Pondichéry" "Pondichéry" "Pondichéry" "Port-Louis" "Port-Louis"
clean_places(place2, dictionary)
# [1] "Lorient"    "Pondichéry" "Lorient"    "Port-Louis" "Port-Louis"
clean_places(place3, dictionary)
# [1] "Lorient"    "Pondichéry" "Brest"      "Port-Louis" "Nantes"    

clean_places2(place1, dictionary)
# [1] "Pondichéry" "Pondichéry" "Pondichéry" "Port-Louis" "Port-Louis"
clean_places2(place2, dictionary)
# [1] "Lorient"    "Pondichéry" "Lorient"    "Port-Louis" "Port-Louis"
clean_places2(place3, dictionary)
# [1] "Lorient"    "Pondichéry" "Brest"      "Port-Louis" "Nantes"    
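
The question asked about the whole data frame rather than single vectors. One possible way to handle that, sketched below with base R only, is to apply a cleaning function to every column with lapply. The helper name clean_column is hypothetical; it simply repeats the adist approach so that the snippet runs on its own.

```r
# Sketch: clean every column of the data frame (base R only).
dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
places2clean <- data.frame(
  place1 = c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis  "),
  place2 = c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis"),
  place3 = c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: nearest dictionary entry by Levenshtein distance,
# or NA when even the nearest entry is more than maxDist edits away
clean_column <- function(places, dictionary, maxDist = 5) {
  d <- adist(places, dictionary)
  idx <- apply(d, 1, which.min)
  idx[apply(d, 1, min) > maxDist] <- NA
  dictionary[idx]
}

# Assigning into cleaned[] keeps the data.frame structure and column names
cleaned <- places2clean
cleaned[] <- lapply(places2clean, clean_column, dictionary = dictionary)
print(cleaned)
```

If you prefer the stringdist version, the clean_places function defined above can be passed to lapply in the same way.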
