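r - Understanding the constraints in agrep fuzzy matching in R
Problem description
This seems simple, but for some reason I don't understand how agrep's fuzzy matching behaves when substitutions are involved. Two substitutions produce a match, as expected, when all=2 is specified, but not when substitutions=2 is specified. Why is that?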
# Finds a match, as expected
agrep("abcdeX", "abcdef", value = TRUE,
      max.distance = list(sub = 1, ins = 0, del = 0))
#> [1] "abcdef"
# Finds no match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub = 1, ins = 0, del = 0))
#> character(0)
# Finds a match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(all = 2))
#> [1] "abcdef"
# UNEXPECTEDLY finds no match
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub = 2, ins = 0, del = 0))
#> character(0)
Created on 2021-06-03 by the reprex package (v2.0.0)
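Solution
The all component is an upper bound that always applies, regardless of the other max.distance controls (except cost). It defaults to 10%. That is why sub = 2 on its own fails for the six-character pattern abcdXX: the default all permits only ceiling(6 * 0.1) = 1 change in total.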
# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
max.distance = list(sub = 2, ins = 0, del = 0, all = 0.1))
# character(0)
# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
max.distance = list(sub = 2, ins = 0, del = 0, all = 0.2))
# [1] "abcdef"
# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
max.distance = list(sub = 1, ins = 1, del = 0, all = 0.1))
# character(0)
# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
max.distance = list(sub = 1, ins = 1, del = 0, all = 0.2))
# [1] "abcdef"
There is a small gotcha when setting all: it switches from fractional mode to integer mode at 1.
# all allows up to 8 edits (ceiling(8 * (1 - 1e-9))), so ins = 2 is the binding limit
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
max.distance = list(sub = 0, ins = 2, del = 0, all = 1 - 1e-9))
# [1] "abcdef"
# all = 1 is read as an integer: only 1 edit allowed in total
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
max.distance = list(sub = 0, ins = 2, del = 0, all = 1))
# character(0)
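The comments in the examples above follow from how a fractional all is scaled. As a rough sketch of that rule, assuming a value below 1 is multiplied by the pattern length and rounded up while a value of 1 or more is taken as an absolute count (effective_all_limit is a hypothetical helper written for illustration, not part of base R):

```r
# Hypothetical helper, not part of base R: mirrors the scaling rule
# described above for the all component of max.distance.
effective_all_limit <- function(pattern, all = 0.1) {
  if (all < 1) {
    # fractional mode: fraction of the pattern length, rounded up
    ceiling(nchar(pattern) * all)
  } else {
    # integer mode: the value is used as an absolute count
    all
  }
}

lim_default <- effective_all_limit("abcdXX", 0.1)         # 1
lim_20pct   <- effective_all_limit("abcdXX", 0.2)         # 2
lim_almost1 <- effective_all_limit("abcdXXef", 1 - 1e-9)  # 8
lim_exact1  <- effective_all_limit("abcdXXef", 1)         # 1
```

The jump from 8 back down to 1 between the last two calls is the gotcha: crossing 1 changes the interpretation from a fraction to a count.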
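When you effectively disable all by setting it just below 1, so that it scales to the full pattern length, the limits on the individual edit types are what apply.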
# two substitutions allowed
agrep(pattern = "abcdXX",
x = c("abcdef", "abcXdef", "abcefg"), value = TRUE,
max.distance = list(sub = 2, ins = 0, del = 0, all = 1 - 1e-9))
# [1] "abcdef"
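The purpose of setting costs is to let you move through the mutation space at different rates in different directions. Whether that helps depends on your use case. For example, some dialects of a language may be more likely to add letters, so you might make a deletion cost twice as much as an insertion. By default all weights are equal: costs = NULL, i.e. costs = c(ins = 1, del = 1, sub = 1).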
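To get a feel for what unequal costs do, one option is adist, the base R function for generalized Levenshtein distance, which takes a costs argument of the same form. The sketch below is only an illustration: the weights (a deletion costing twice an insertion) and the strings are made up for the example.

```r
# Illustrative weights only: a deletion costs twice as much as an insertion.
w <- c(ins = 1, del = 2, sub = 1)

# adist(x, y) gives the weighted cost of transforming x into y.
# drop() turns the 1 x 1 result matrix into a plain number.
d_insert <- drop(adist("abc", "abcd", costs = w))  # one insertion: cost 1
d_delete <- drop(adist("abcd", "abc", costs = w))  # one deletion:  cost 2

d_insert
d_delete
```

With these weights the same one-letter difference is twice as expensive in one direction as in the other, which is the asymmetry the costs argument is meant to express.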
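Edit: regarding your comment about why some patterns match and others do not, the 10% refers to the number of characters in the pattern, rounded up.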
agrep(pattern = "01234567XX89", x = "0123456789", value = TRUE,
max.distance = list(sub = 0, ins = 2, del = 0))
# [1] "0123456789"
agrep(pattern = "01234567XX", x = "0123456789", value = TRUE,
max.distance = list(sub = 2, ins = 0, del = 0))
# character(0)
num_mutations <- nchar(c("01234567XX89", "01234567XX")) * 0.1
num_mutations
# [1] 1.2 1.0
ceiling(num_mutations)
# [1] 2 1
The second pattern is only 10 characters long, so only one substitution is allowed.