首页 > 解决方案 > R中字符串的模糊逻辑

问题描述

我有 2 个数据框:DF1

ID   Address
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR

并且,df2

    Name
    CHHAPPAR
    CHHAPAR
    KANSAL
    KANSIL
    KANSOL
    KHERK
    KHERKIA
    PAR
    UR
   WAR
   RIYA
   DAV
   LI

我想在 DF1 字符串中应用模糊逻辑。如果 DF1 中给出的名称与 DF2 匹配,请给我 DF2 名称

输出应该像

ID   Address                                                                 Name
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR                                         CHHAPPAR, CHHAPAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR                                   KANSAL, KANSIL, KANSOL
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR                           KHERK, KHERKIA

我尝试应用 FuzzywuzzyR 但出现错误

我也尝试了 agrep,但它给我的结果是 True/False。

请帮我解决这个问题。另外,如果我应该尝试其他包模糊

标签: rfuzzy-logicfuzzywuzzyagrep

解决方案


我会为此使用该软件包fuzzyjoin,它与以下逻辑一起使用tidytext

library(tidytext)
library(fuzzyjoin)
library(tidyverse)

df1 %>% 
  unnest_tokens(word, Address, to_lower = FALSE) %>% 
  fuzzyjoin::stringdist_left_join(df2, by = c("word" = "Name"), max_dist = 1) %>% 
  group_by(ID) %>% # collapse unnested tokens back to text if you want
  summarise(text = paste(word, collapse = " "),
            Name = toString(na.omit(Name)))
#> # A tibble: 4 x 3
#>   ID    text                                                 Name               
#>   <chr> <chr>                                                <chr>              
#> 1 AB1   VILL PO CHAPAR TAPUKADA ALWAR                        "CHHAPAR"          
#> 2 AB2   VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POS~ ""                 
#> 3 AB3   RAMKUMAR YADAV VILL KANSL KANSL KANSL 0 JAIPUR       "KANSAL, KANSIL, K~
#> 4 AB4   VILL KHERKI KHERKI MUKKER POSTPANIYA PUTLI JAIPUR    "KHERK, KHERKIA"

数据

df1 <- read.csv(text = "ID,Address
AB1,VILL +PO CHAPAR TAPUKADA  ALWAR
AB2,VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3,RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4,VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR", stringsAsFactors = FALSE)

df2 <- read.csv(text = "Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA", stringsAsFactors = FALSE)

推荐阅读