python - 快速搜索给定字符串的起始位置
问题描述
在这里,我想将给定的字符串匹配match_text
到更长的字符串text
。我想在 中找到最近match_text
的开始位置text
(您可以假设只有一个位置)。我当前版本的代码是for
循环遍历text
并计算 Levenshtein 距离。但是,有时文本很长(最多 90k 个字符)。我不确定是否有一种快速的方法来进行字符串搜索。这是我编写的代码段的当前版本:
import numpy as np
import Levenshtein as lev # pip install python-Levenshtein
def find_start_position(text, match_text):
lev_distances = []
for i in range(len(text) - len(match_text)):
match_len = len(match_text)
lev_distances.append(lev.distance(match_text, text[i: i + match_len]))
pos = np.argmin(lev_distances)
return pos
# example
find_start_position('I think this is really cool.', 'this iz')
>> 8
如果有人知道或进行快速字符串搜索,我将不胜感激。
解决方案
请注意:模式中的空格已标准化。这是你想要的?
import Levenshtein as lev # pip install python-Levenshtein
import sys
# author hry@solaris-it.com
def splitTextInWords(text):
retVal = text.split()
return retVal
def getBestFit(allLevenshteinValues):
bestFit = [sys.maxsize, '', 0]
for k, value in allLevenshteinValues.items():
if value[0] < bestFit[0]:
bestFit = value
bestFit.append(k + 1)
return bestFit
def catchAllCosts(text, matchText):
textAsWordList = splitTextInWords(text)
matchTextPattern = ' '.join(splitTextInWords(matchText))
maxIndx = len(textAsWordList)
allLevenshteinValues = {}
for i in range(0, maxIndx):
extCnt = 0
textPattern = textAsWordList[i]
while (len(textPattern) < len(matchTextPattern)
and i + extCnt + 1 < maxIndx):
if i + extCnt + 1 < maxIndx:
extCnt += 1
textPattern = ' '.join([textPattern, textAsWordList[i + extCnt]])
allLevenshteinValues[i] = [ lev.distance(
textPattern, matchTextPattern), textPattern ]
return allLevenshteinValues
def main():
# text: pattern you are crowling
text = '''x AlongLongLongWord and long long long long string
is going be here string I think really S is cXXXl.
x AlongLongLongWord 今x Go今天今 I think really this would is cxol.x
AlongLongLongWord I think this izreally this iz cool.'''
# matchText: pattern you want find the best match for
matchText = 'this is'
allLevenshteinValues = catchAllCosts(text, matchText)
bestFit = getBestFit(allLevenshteinValues)
costs, sequence, wordNr, = bestFit[0], bestFit[1], bestFit[2]
print("first best match starting by word nr.",
wordNr, "costs:", costs, "sequence: >>", sequence, "<<")
matchAnotherPattern = '今天 Go今x天今'
allLevenshteinValues = catchAllCosts(text, matchAnotherPattern)
bestFit = getBestFit(allLevenshteinValues)
costs, sequence, wordNr, = bestFit[0], bestFit[1], bestFit[2]
print("first best match starting by word nr.",
wordNr, "costs:", costs, "sequence: >>", sequence, "<<")
if __name__ == '__main__':
main()
推荐阅读
- python - 我如何从 django 中的函数设置值
- python - 比较列表python中元素的最快方法
- html - ngModel 未绑定到选择标记内的 ngFor 选项
- python - 如何检查numpy数组是否包含空列表
- java - 如何在 Java 中处理 BigDecimal 中的路由
- php - 换行符在我的 php 程序中不起作用?
- c - Fwrite()结构数组以在c中归档
- c++ - 无法在 SDL 2 中创建流畅的动画
- java - 使用多线程设置 Java 对象的两个或多个变量
- python - Azure Python App Function 不再在本地运行 - 模块“azure.functions_worker”没有属性“start_async”