python - Python NLP: How to autocorrect text and tokenize it into a set of words?
Problem description
Example:
token_list = ['Allen Bradley', 'Haas', 'Fanuc']
input_string = 'I use Alln Brdly machins but dont no how to use Has ones.'
output_tokens = ['Allen Bradley', 'Haas']
Solution
The textdistance library can help you measure the distance between two words, for example with the Hamming distance.
import textdistance as td

# Named keywords/sentence so the built-in list type is not shadowed
keywords = ['Allen', 'Bradley', 'Haas', 'Fanuc']
sentence = 'I use Alln Brdly machins but dont no how to use Has ones.'

# Weight function estimating how close two words are;
# here the normalized Hamming similarity and distance are used
def word_correlation(word1: str, word2: str):
    sim_norm = td.hamming.normalized_similarity(word1, word2)
    dist_norm = td.hamming.normalized_distance(word1, word2)
    return {"similarity": sim_norm,
            "distance": dist_norm
            }

# Split the sentence into single words
words = sentence.split(" ")

# Compute the Hamming distance and similarity of every word of the sentence
# against each of the chosen keywords
statistics = []
for keyword in keywords:
    entry = {"check": keyword,
             "with": {"words": [],
                      "cor": []
                      }
             }
    for word in words:
        entry["with"]["words"].append(word)
        entry["with"]["cor"].append(word_correlation(word, keyword))
    statistics.append(entry)

# Keep only the results with a high similarity
result = []
for res in statistics:
    correction = res["check"]
    for word, cor in zip(res["with"]["words"], res["with"]["cor"]):
        # Filter the proposed corrections by the normalized Hamming similarity
        if cor["similarity"] > 0.25:
            result.append({"correction": correction,
                           "word": word,
                           "likelyhood": cor["similarity"]
                           })
print(result)
This returns:
[{'correction': 'Allen', 'word': 'Alln', 'likelyhood': 0.6}, {'correction': 'Bradley', 'word': 'Brdly', 'likelyhood': 0.2857142857142857}, {'correction': 'Haas', 'word': 'Has', 'likelyhood': 0.5}]
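You should definitely look into the definition of the metric you use between two words, because the one I used here, the Hamming distance, can give quite different results for words of different lengths! Strictly speaking, the Hamming distance is only defined for words of the same length.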
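To make the length caveat concrete, here is a minimal standard-library sketch of a position-by-position comparison. The helper name `padded_hamming_similarity` and the choice to count every missing position as a mismatch are assumptions made for this illustration; it is not textdistance's own implementation, although it gives the same scores for the two words shown.

```python
from itertools import zip_longest


def padded_hamming_similarity(word1: str, word2: str) -> float:
    """Share of positions where both words hold the same character.

    The shorter word is padded with None, so every missing position
    counts as a mismatch (an assumption made for this sketch).
    """
    longest = max(len(word1), len(word2))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip_longest(word1, word2))
    return matches / longest


# A letter dropped late in the word leaves most positions aligned ...
print(padded_hamming_similarity("Alln", "Allen"))     # 3 of 5 positions match
# ... but letters dropped early shift every character after them.
print(padded_hamming_similarity("Brdly", "Bradley"))  # 2 of 7 positions match
```

Both misspellings are only a couple of deleted letters away from the correct word, yet the second one scores much lower because the comparison is purely positional.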
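My example uses the Hamming distance because the words are expected to be nearly equal: in most cases, a typo only changes the length by ±1. Using the Hamming distance or Hamming similarity from textdistance should therefore work in simple cases.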
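If insertions and deletions matter more, one possible alternative is a metric that tolerates shifted characters. The sketch below swaps in `difflib.SequenceMatcher` from the standard library instead of the Hamming distance, and compares sliding windows of words so that multi-word tokens such as 'Allen Bradley' from the question can be matched as a whole. The function name `find_tokens`, the punctuation stripping, and the 0.8 threshold are assumptions chosen for this example, not tuned values.

```python
import difflib

token_list = ['Allen Bradley', 'Haas', 'Fanuc']
input_string = 'I use Alln Brdly machins but dont no how to use Has ones.'


def find_tokens(text, tokens, threshold=0.8):
    """Return the tokens that approximately occur in text (hypothetical helper)."""
    # Strip common punctuation so 'ones.' is compared as 'ones'
    words = [w.strip('.,;:!?') for w in text.split()]
    found = []
    for token in tokens:
        n = len(token.split())
        # Slide a window of n words over the text and compare it to the token
        for start in range(len(words) - n + 1):
            window = ' '.join(words[start:start + n])
            ratio = difflib.SequenceMatcher(
                None, window.lower(), token.lower()).ratio()
            if ratio >= threshold:  # threshold is an assumed cut-off
                found.append(token)
                break
    return found


output_tokens = find_tokens(input_string, token_list)
print(output_tokens)  # ['Allen Bradley', 'Haas']
```

A lower threshold would catch heavier misspellings at the cost of more false positives, so it is worth checking against your own data.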