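r - How to do fuzzy pattern matching with quanteda and kwic?
Problem description
I have texts written by doctors, and I want to be able to highlight specific words in their context (the 5 words before and the 5 words after the word I search for). Say I want to search for the word "suicidal". I would then use the kwic function from the quanteda package:
kwic(dataset, pattern = "suicidal", window = 5)
So far so good, but say I want to allow for the possibility of typos. In this case I want to allow up to three differing characters, with no restriction on where in the word they occur.
Is it possible to do this with quanteda's kwic function?
Example: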
dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office",
"On his first appointment, the patient was suicidaa when he showed up in my office",
"On his first appointment, the patient was suiciaaa when he showed up in my office",
"On his first appointment, the patient was suicaaal when he showed up in my office",
"On his first appointment, the patient was suiaaaal when he showed up in my office",
"On his first appointment, the patient was saacidal when he showed up in my office",
"On his first appointment, the patient was suaaadal when he showed up in my office",
"On his first appointment, the patient was icidal when he showed up in my office",
"On his first appointment, the patient was uicida when he showed up in my office"))
dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)
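This returns only the first sentence, the one that is spelled correctly.
Solution
Great question. We don't have approximate matching as a "valuetype", but that is an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches with base::agrep() and then matching on those. That would look like: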
library("quanteda")
## Package version: 1.5.2
dataset <- data.frame(
"patient" = 1:9, "text" = c(
"On his first appointment, the patient was suicidal when he showed up in my office",
"On his first appointment, the patient was suicidaa when he showed up in my office",
"On his first appointment, the patient was suiciaaa when he showed up in my office",
"On his first appointment, the patient was suicaaal when he showed up in my office",
"On his first appointment, the patient was suiaaaal when he showed up in my office",
"On his first appointment, the patient was saacidal when he showed up in my office",
"On his first appointment, the patient was suaaadal when he showed up in my office",
"On his first appointment, the patient was icidal when he showed up in my office",
"On his first appointment, the patient was uicida when he showed up in my office"
),
stringsAsFactors = FALSE
)
corp <- corpus(dataset)
# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
types()
Use agrep() to generate the closest fuzzy matches. Here I ran it a few times, slightly increasing max.distance each time from its default of 0.1.
# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
max.distance = 0.3,
ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal" "uicida"
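A side note on max.distance: according to the agrep() documentation, a fractional value is multiplied by the pattern length and rounded up, so 0.3 on the 8-character pattern "suicidal" should allow up to 3 edits. The minimal sketch below uses a small hand-made vocabulary (toy_vocab is an assumption for illustration, not the corpus above) and passes the integer 3 directly, which states the "three characters" limit from the question explicitly.

```r
# Hypothetical vocabulary for illustration (not derived from the corpus above)
toy_vocab <- c("suicidal", "suicidaa", "suiaaaal", "icidal", "patient", "office")

# Fractional limit, as in the answer: 0.3 * nchar("suicidal") = 2.4, rounded up to 3
by_fraction <- agrep("suicidal", toy_vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)

# The same limit written as an absolute number of edits
by_count <- agrep("suicidal", toy_vocab,
  max.distance = 3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)

# Check whether both ways of writing the limit select the same words
identical(by_fraction, by_count)
```

Writing the limit as a count keeps it from silently changing when the search pattern gets longer or shorter.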
Then use these as the pattern argument to kwic():
# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##
## [text1, 9] the patient was | suicidal | when he showed
## [text2, 9] the patient was | suicidaa | when he showed
## [text3, 9] the patient was | suiciaaa | when he showed
## [text4, 9] the patient was | suicaaal | when he showed
## [text5, 9] the patient was | suiaaaal | when he showed
## [text6, 9] the patient was | saacidal | when he showed
## [text7, 9] the patient was | suaaadal | when he showed
## [text8, 9] the patient was | icidal | when he showed
## [text9, 9] the patient was | uicida | when he showed
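There are other possibilities based on similar approaches, for example the fuzzyjoin or stringdist packages, but this is a simple solution from the base package that should work pretty well.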
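One more base-R variation, offered as a sketch and not as part of the approach above: utils::adist() computes the edit distance between whole strings, so the "at most three differing characters" rule can be applied as an explicit threshold. Unlike agrep(), which also accepts approximate matches inside longer strings, adist() compares complete words by default. The vocabulary below is typed in by hand from the example texts (an assumption for illustration); in practice it would come from types() as shown earlier.

```r
# Hypothetical vocabulary, copied by hand from the example texts
vocab <- c(
  "suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
  "saacidal", "suaaadal", "icidal", "uicida", "patient", "office"
)

# Edit distance (insertions, deletions, substitutions) from the target
# to every vocabulary entry; drop() turns the 1-row matrix into a vector
d <- drop(utils::adist("suicidal", vocab, ignore.case = TRUE))

# Keep only words within three edits of the target
within_three <- vocab[d <= 3]
within_three
```

The resulting character vector can be passed as the pattern argument to kwic() in the same way as near_matches.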