python - NLTK tokenizer encoding issue
Question
After tokenization, my sentences contain many strange characters. How can I remove them? Here is my code:
import glob
import io

import nltk

# preprocess_data and the generate_summary_* functions are defined elsewhere.
def summary(filename, method):
    list_names = glob.glob(filename)
    orginal_data = []
    topic_data = []
    print(list_names)
    # Load the Punkt sentence tokenizer once instead of once per line.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    for file_name in list_names:
        article = []
        with io.open(file_name, "r", encoding="utf-8-sig") as f:
            article_temp = f.readlines()
        for line in article_temp:
            print(line)
            if line.strip():
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        orginal_data.append(article)
        topic_data.append(preprocess_data(article))
    if method == "orig":
        summary = generate_summary_origin(topic_data, 100, orginal_data)
    elif method == "best-avg":
        summary = generate_summary_best_avg(topic_data, 100, orginal_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, orginal_data)
    return summary
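print(line) prints one line of the text file, and print(sentences) prints the tokenized sentences for that line.
Sometimes, though, the sentences contain strange characters after NLTK has processed them: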
Assaly, who is a fan of both Pusha T and Drake, said he and his friends
wondered if people in the crowd might boo Pusha T during the show, but
said he never imagined actual violence would take place.
[u'Assaly, who is a fan of both Pusha T and Drake, said he and his
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during
the show, but said he never imagined actual violence would take
place.']
In the example above, where do the \xa0 in "in\xa0the" and "Pusha\xa0T" come from?
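Solution
\xa0 is a no-break space (U+00A0). NLTK does not add it; it is almost certainly already present in the input text. print(line) renders it as what looks like an ordinary space, whereas printing a list shows the repr of each string, which escapes the character as \xa0. Either of the two methods below turns it into a regular space: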
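import unicodedata

x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'
# method 1: replace the no-break space explicitly
# (str.replace returns a new string, so keep the result)
cleaned = x.replace(u'\xa0', u' ')
# method 2: compatibility normalization maps U+00A0 to a regular space
normalized = unicodedata.normalize('NFKD', x)
print(normalized)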
Output:
Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.
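To apply this inside the question's loop, the line can be cleaned before it reaches the tokenizer. The following is only a minimal sketch: clean_line is a hypothetical helper name, not part of NLTK or of the original code, and the sample string is a shortened version of the sentence above.

```python
import unicodedata


def clean_line(line):
    """Hypothetical helper: normalize a line before tokenizing it.

    NFKD maps the no-break space (U+00A0) to a regular space.
    """
    return unicodedata.normalize('NFKD', line)


raw = u'people in\xa0the crowd might boo Pusha\xa0T during the show'
cleaned = clean_line(raw)

# Printing a list shows the repr of its items, so the escape is visible
# for the raw string and gone for the cleaned one.
print([raw])
print([cleaned])
```

In the question's code this would mean calling tokenizer.tokenize(clean_line(line)) instead of tokenizer.tokenize(line). Note that NFKD also decomposes accented letters into a base letter plus a combining mark; if only the no-break space should change, the narrower line.replace(u'\xa0', u' ') may be the safer choice.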