How to correct words in a pandas DataFrame?

Problem description

I am trying to correct spelling mistakes in a CSV file that contains sentences.

Input CSV:

id  text
0   my telephon not working
1   I have mobil in my bag
2   car is expensiv

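The code provided here, which uses enchant, corrects a word by offering suggestions:

I would like to use this spelling correction approach to correct the words in a pandas DataFrame. I have the following code, which first tokenizes each sentence, then checks the spelling and picks the best suggestion: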

import difflib

import enchant
import pandas as pd
from nltk.tokenize import word_tokenize

d = enchant.Dict("en_US")  # dictionary used for the suggestions

text = "telephon mobil"  # This is only a sample
token = word_tokenize(text)

for word in token:
    best_words = []
    best_ratio = 0
    a = set(d.suggest(word))
    # Keep the suggestion(s) most similar to the original word
    for b in a:
        tmp = difflib.SequenceMatcher(None, word, b).ratio()
        if tmp > best_ratio:
            best_words = [b]
            best_ratio = tmp
        elif tmp == best_ratio:
            best_words.append(b)
    print('word:[', word, '] -> best suggest:[', best_words[0],']')

word:[ telephon ] -> best suggest:[ telephone ]
word:[ mobil ] -> best suggest:[ mobile ]

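Now my question is: how can I apply this to my pandas DataFrame so that the spelling mistakes in every row are corrected, producing the following output?

Output CSV: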

id  text
0   my telephone not working
1   I have mobile in my bag
2   car is expensive

Tags: python, pandas

Solution

Put your code in a function, then call it on every row using apply:

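d = enchant.Dict("en_US")  # create the dictionary once, not on every call

def word_suggest(word):
    # Leave correctly spelled words untouched
    if d.check(word):
        return word
    best_words = []
    best_ratio = 0
    a = set(d.suggest(word))
    # Keep the suggestion(s) most similar to the original word
    for b in a:
        tmp = difflib.SequenceMatcher(None, word, b).ratio()
        if tmp > best_ratio:
            best_words = [b]
            best_ratio = tmp
        elif tmp == best_ratio:
            best_words.append(b)
    # Fall back to the original word if enchant offers no suggestions
    return best_words[0] if best_words else word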

>>> df["text"].apply(lambda x: " ".join(word_suggest(word) for word in word_tokenize(x)))
0    my telephone not working
1     I have mobile in my bag
2            car is expensive
Name: text, dtype: object
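
For experimenting without pyenchant or NLTK installed, the same per-word logic can be sketched with the standard library alone. This is only an illustration under stated assumptions. `VOCAB` is a tiny hypothetical word list standing in for the enchant dictionary, `suggest` is a hypothetical replacement for `d.suggest` built on `difflib.get_close_matches`, and `str.split` stands in for `word_tokenize`. Repeated words are cached with `functools.lru_cache`, which may help on large columns.

```python
import difflib
from functools import lru_cache

# Hypothetical stand-in vocabulary; with pyenchant, d.check and d.suggest
# on an enchant.Dict("en_US") would take its place.
VOCAB = ("my", "telephone", "not", "working", "i", "have",
         "mobile", "in", "bag", "car", "is", "expensive")


def suggest(word):
    """Hypothetical replacement for d.suggest(word): close matches from VOCAB."""
    return difflib.get_close_matches(word.lower(), VOCAB, n=5, cutoff=0.6)


@lru_cache(maxsize=None)
def best_suggestion(word):
    """Return the candidate most similar to `word`, or `word` itself."""
    if word.lower() in VOCAB:      # already spelled correctly
        return word
    candidates = suggest(word)
    if not candidates:             # nothing to suggest: keep the word as-is
        return word
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, word, c).ratio())


def correct_sentence(sentence):
    """Correct each whitespace-separated word and rejoin the sentence."""
    return " ".join(best_suggestion(w) for w in sentence.split())


rows = ["my telephon not working", "I have mobil in my bag", "car is expensiv"]
corrected = [correct_sentence(r) for r in rows]
print(corrected)
```

With pandas available, the same function could be applied to the column as df["text"].apply(correct_sentence), with the result assigned back to df["text"] before writing the CSV.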
