How to do fuzzy pattern matching with quanteda and kwic?

Question

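I have texts written by doctors, and I want to be able to highlight specific words in their context (the 5 words before and the 5 words after the word I search for). Say I want to search for the word "suicidal". I would then use the kwic function from the quanteda package: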

kwic(dataset, pattern = "suicidal", window = 5)

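So far, so good, but say I want to allow for the possibility of typos. In this case I want to allow for up to three deviating characters, with no restriction on where in the word they occur.

Is it possible to do this using quanteda's kwic function?

Example: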

dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office", 
                                  "On his first appointment, the patient was suicidaa when he showed up in my office",
                                  "On his first appointment, the patient was suiciaaa when he showed up in my office",
                                  "On his first appointment, the patient was suicaaal when he showed up in my office",
                                  "On his first appointment, the patient was suiaaaal when he showed up in my office",
                                  "On his first appointment, the patient was saacidal when he showed up in my office",
                                  "On his first appointment, the patient was suaaadal when he showed up in my office",
                                  "On his first appointment, the patient was icidal when he showed up in my office",
                                  "On his first appointment, the patient was uicida when he showed up in my office"))

dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)

This only gives me the first sentence, the one that is spelled correctly.

Tags: r, text-mining, quanteda

Answer


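Great question. We don't have approximate matching as a "valuetype", but that's an interesting idea for future development. In the meantime, I'd suggest generating a fixed list of fuzzy matches with base::agrep() and then matching on those. So this would look like: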

library("quanteda")
## Package version: 1.5.2

dataset <- data.frame(
  "patient" = 1:9, "text" = c(
    "On his first appointment, the patient was suicidal when he showed up in my office",
    "On his first appointment, the patient was suicidaa when he showed up in my office",
    "On his first appointment, the patient was suiciaaa when he showed up in my office",
    "On his first appointment, the patient was suicaaal when he showed up in my office",
    "On his first appointment, the patient was suiaaaal when he showed up in my office",
    "On his first appointment, the patient was saacidal when he showed up in my office",
    "On his first appointment, the patient was suaaadal when he showed up in my office",
    "On his first appointment, the patient was icidal when he showed up in my office",
    "On his first appointment, the patient was uicida when he showed up in my office"
  ),
  stringsAsFactors = FALSE
)
corp <- corpus(dataset)

# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  types()

Use agrep() to generate the closest fuzzy matches. Here I ran it a few times, slightly increasing max.distance each time from the default of 0.1.

# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal"   "uicida"
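
As a side note on why 0.3 lines up with the "three characters" in the question: agrep() treats a max.distance below 1 as a fraction of the pattern length and rounds it up, so for the 8-letter pattern "suicidal" a value of 0.3 becomes 3 allowed edits. The following is a minimal base-R sketch of that equivalence. The names words, allowed_edits, by_fraction and by_count are made up for illustration, and the word list is typed in by hand so the sketch runs without quanteda.

```r
# Words from the example sentences, typed in by hand (illustrative only)
words <- c("On", "his", "first", "appointment", "the", "patient", "was",
           "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
           "saacidal", "suaaadal", "icidal", "uicida",
           "when", "he", "showed", "up", "in", "my", "office")

pattern <- "suicidal"

# A fractional max.distance is scaled by the pattern length and rounded up:
# 8 characters * 0.3 = 2.4, which becomes 3 allowed edits
allowed_edits <- ceiling(nchar(pattern) * 0.3)

# Same call as in the answer, with the fractional distance
by_fraction <- agrep(pattern, words, max.distance = 0.3,
                     ignore.case = TRUE, fixed = TRUE, value = TRUE)

# Equivalent call with the edit budget given as a whole number
by_count <- agrep(pattern, words, max.distance = allowed_edits,
                  ignore.case = TRUE, fixed = TRUE, value = TRUE)

# Both calls should select the same words
identical(by_fraction, by_count)
```

Passing the whole number directly is probably the clearer choice when the requirement is stated as "up to N characters", since the fractional form changes meaning with the length of the pattern.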

Then use near_matches as the pattern argument to kwic():

# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##                                                        
##  [text1, 9] the patient was | suicidal | when he showed
##  [text2, 9] the patient was | suicidaa | when he showed
##  [text3, 9] the patient was | suiciaaa | when he showed
##  [text4, 9] the patient was | suicaaal | when he showed
##  [text5, 9] the patient was | suiaaaal | when he showed
##  [text6, 9] the patient was | saacidal | when he showed
##  [text7, 9] the patient was | suaaadal | when he showed
##  [text8, 9] the patient was |  icidal  | when he showed
##  [text9, 9] the patient was |  uicida  | when he showed

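There are other possibilities based on similar solutions, for instance the fuzzyjoin or stringdist packages, but this is a simple solution that uses only base R and should work pretty well.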
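
One more variant in the same spirit, not part of the original answer: utils::adist() (shipped with R) returns the edit distance between whole strings, so the "at most three characters" rule can be applied as an explicit threshold. One difference worth knowing is that agrep() looks for an approximate match anywhere inside each string, whereas adist() with its defaults compares the full strings. The sketch below is only an illustration under those assumptions. The names words, max_edits, distances and fuzzy_matches are hypothetical, and the word list is typed in by hand instead of being taken from types().

```r
# Words from the example sentences, typed in by hand (illustrative only)
words <- c("On", "his", "first", "appointment", "the", "patient", "was",
           "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
           "saacidal", "suaaadal", "icidal", "uicida",
           "when", "he", "showed", "up", "in", "my", "office")

pattern <- "suicidal"
max_edits <- 3  # the threshold asked for in the question

# adist() returns a matrix with one row per pattern and one column per word;
# as.vector() flattens the single row into a plain numeric vector
distances <- as.vector(adist(pattern, words, ignore.case = TRUE))

# Keep every word within max_edits insertions, deletions or substitutions
fuzzy_matches <- words[distances <= max_edits]
fuzzy_matches
```

The resulting character vector can be passed to kwic() as the pattern argument in the same way as the agrep() result.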

