首页 > 解决方案 > NLTK Tokenizer 编码问题

问题描述

标记化后,我的句子包含许多奇怪的字符。我怎样才能删除它们?这是我的代码:

def summary(filename, method):
    list_names = glob.glob(filename)
    orginal_data = []
    topic_data = []
    print(list_names)
    for file_name in list_names:
        article = []
        article_temp = io.open(file_name,"r", encoding = "utf-8-sig").readlines()
        for line in article_temp:
            print(line)
            if (line.strip()):
                tokenizer =nltk.data.load('tokenizers/punkt/english.pickle')
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        orginal_data.append(article)
        topic_data.append(preprocess_data(article))
    if (method == "orig"):
        summary = generate_summary_origin(topic_data, 100, orginal_data)
    elif (method == "best-avg"):
        summary = generate_summary_best_avg(topic_data, 100, orginal_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, orginal_data)
    return summary

print(line)打印一行txt 。并print(sentences)在行中打印标记化的句子。

但有时句子经过 nltk 处理后会包含奇怪的字符。

Assaly, who is a fan of both Pusha T and Drake, said he and his friends 
wondered if people in the crowd might boo Pusha T during the show, but 
said he never imagined actual violence would take place.

[u'Assaly, who is a fan of both Pusha T and Drake, said he and his 
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during 
the show, but said he never imagined actual violence would take 
place.']

像上面的例子一样,\xa0和从哪里来\xa0T

标签: pythonnlpnltk

解决方案


x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

# method 1 
x.replace('\xa0', ' ')

# method 2
import unicodedata
unicodedata.normalize('NFKD', x)

print(x)

输出:

Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.

参考:unicodedata.normalize()


推荐阅读