Python NLP: How to autocorrect text and tokenize it against a set of words?

Problem description

Example:

token_list = ['Allen Bradley', 'Haas', 'Fanuc']

input_string = 'I use Alln Brdly machins but dont no how to use Has ones.'

output_tokens = ['Allen Bradley', 'Haas']

Tags: python, nlp, autocorrect

Solution


The textdistance package can help you measure the distance between two words, for example with the Hamming distance.

import textdistance as td

# Named "keywords" and "sentence" so the built-in list type is not shadowed
keywords = ['Allen', 'Bradley', 'Haas', 'Fanuc']

sentence = 'I use Alln Brdly machins but dont no how to use Has ones.'

# Weight function that estimates how close two words are,
# using the normalized Hamming similarity and distance
def word_correlation(word1: str, word2: str) -> dict:
    sim_norm = td.hamming.normalized_similarity(word1, word2)
    dist_norm = td.hamming.normalized_distance(word1, word2)

    return {"similarity": sim_norm,
            "distance": dist_norm
            }

# Split the sentence into single words
words = sentence.split(" ")

# Compute the Hamming distance and similarity of every word in the sentence
# against each of the chosen keywords
statistics = []
for keyword in keywords:
    statistics.append({"check": keyword,
                       "with": {"words": list(words),
                                "cor": [word_correlation(word, keyword)
                                        for word in words]
                                }
                       }
                      )

# Keep only the results with a high similarity
result = []
for res in statistics:
    correction = res["check"]

    for word, cor in zip(res["with"]["words"], res["with"]["cor"]):

        # Filter the proposed corrections by the normalized Hamming similarity
        if cor["similarity"] > 0.25:
            result.append({"correction": correction,
                           "word": word,
                           "likelyhood": cor["similarity"]
                           }
                          )


print(result)

This returns:

[{'correction': 'Allen', 'word': 'Alln', 'likelyhood': 0.6}, {'correction': 'Bradley', 'word': 'Brdly', 'likelyhood': 0.2857142857142857}, {'correction': 'Haas', 'word': 'Has', 'likelyhood': 0.5}]
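
The result above lists single-word matches, while the question asks for the original multi-word tokens such as 'Allen Bradley'. One possible way to bridge that gap is sketched below. It rests on assumptions: `hamming_similarity` is a hand-written stand-in for a normalized Hamming similarity that pads the shorter word (which is what the similarities printed above imply), so the snippet runs without textdistance, and `match_tokens` is a hypothetical helper, not part of any library. A token counts as found only if every one of its words has a match above the threshold.

```python
from itertools import zip_longest


def hamming_similarity(a: str, b: str) -> float:
    """Normalized Hamming similarity; the shorter word is padded, so every
    missing position counts as a mismatch (assumed behavior, see lead-in)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    mismatches = sum(x != y for x, y in zip_longest(a, b))
    return 1 - mismatches / longest


def match_tokens(token_list, input_string, threshold=0.25):
    """Hypothetical helper: return the tokens whose every word resembles
    at least one word of the input string."""
    words = input_string.split()
    found = []
    for token in token_list:
        parts = token.split()
        if all(any(hamming_similarity(word, part) > threshold for word in words)
               for part in parts):
            found.append(token)
    return found


token_list = ['Allen Bradley', 'Haas', 'Fanuc']
input_string = 'I use Alln Brdly machins but dont no how to use Has ones.'

output_tokens = match_tokens(token_list, input_string)
print(output_tokens)  # ['Allen Bradley', 'Haas']
```

This sketch does not check that the matched words are adjacent or in the right order, and the 0.25 threshold is simply carried over from the code above; both would need tuning for real text.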

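You should definitely look into how the metric between two words is defined, because the one used in this solution, the Hamming distance, can give different results for words of different lengths! Strictly speaking, the Hamming distance is defined only for words of the same length.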

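My example uses the Hamming distance because the misspelled word is expected to be almost equal to the keyword: in most cases a spelling mistake changes the length by only ±1. So the Hamming distance or Hamming similarity from textdistance should work in simple cases.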
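
To make the length caveat concrete, the sketch below compares a padded Hamming similarity with a normalized edit-distance (Levenshtein) similarity on the same word pairs. Both functions are hand-written for illustration and are not taken from textdistance. The Levenshtein metric is a swapped-in alternative for comparison, not the approach used in the answer.

```python
from itertools import zip_longest


def hamming_similarity(a: str, b: str) -> float:
    """Position-by-position comparison; the shorter word is padded."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    mismatches = sum(x != y for x, y in zip_longest(a, b))
    return 1 - mismatches / longest


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance scaled to the range 0..1 by the longer word's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


pairs = [('Alln', 'Allen'), ('Brdly', 'Bradley'), ('Has', 'Haas')]
for typo, keyword in pairs:
    print(typo, keyword,
          round(hamming_similarity(typo, keyword), 3),
          round(levenshtein_similarity(typo, keyword), 3))
```

A letter dropped early in a word shifts every later character out of position, so the Hamming similarity for 'Brdly' against 'Bradley' falls to about 0.286, while the edit-distance similarity stays at about 0.714 because only two insertions are needed. textdistance also ships edit-distance algorithms (for example `td.levenshtein`) that could be swapped in for `td.hamming` if words of different lengths are common.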

