When cleaning tokenized data, how can I use .isalpha() on a list of lists to return the values rather than booleans?

Problem description

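I am practicing NLP with the nltk library and want to build a dataset for it. I combine several documents into a list of lists and then preprocess them. First I tokenize the text, then lowercase it, and then I want to remove the punctuation. This works for a vector (a flat list of tokens), but not for a list of lists: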

Example with a vector:

from nltk.tokenize import word_tokenize

a = 'This is a Testsentence and it is beautiful times 10!**!.'
b = word_tokenize(a)
c = [x.lower() for x in b]
# c: ['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times', '10', '.']
d = [x for x in c if x.isalpha()]
# d: ['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times']

Now I want to do the same on a list of lists, but I cannot get the list comprehension in the last step right:

aa = 'This is a Testsentence and it is beautiful times 10.'
bb = 'It is a beautiful Testsentence?'
cc = 'Testsentence beautiful!'
dd = [aa, bb, cc]
ee = [word_tokenize(x) for x in dd]
ff = [[x.lower() for x in y] for y in ee]
# ff: [['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times', '10', '.'], ['it', 'is', 'a', 'beautiful', 'testsentence', '?'], ['testsentence', 'beautiful', '!']]

This is where my problem starts, because I cannot figure out how to write the list comprehension correctly.

gg = [[j.isalpha() for j in i] for i in ff]

This is the result:

[[True, True, True, True, True, True, True, True, True, False, False], [True, True, True, True, True, False], [True, True, False]]

But I want something like this:

[['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times'], ['it', 'is', 'a', 'beautiful', 'testsentence'], ['testsentence', 'beautiful']]

Thanks :)

Tags: nltk, list-comprehension

Solution


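Try the following. It keeps each token j as the output and uses j.isalpha() as the filter condition, instead of returning the result of j.isalpha() itself: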

gg = [[j for j in i if j.isalpha()] for i in ff]

This returns the expected result:

[['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times'],
['it', 'is', 'a', 'beautiful', 'testsentence'],
['testsentence', 'beautiful']]
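
To make the difference concrete, here is a minimal sketch that runs without nltk by starting from the already tokenized and lowercased lists shown in the question. The names flags, doc, tok and the helper clean_tokens are hypothetical and used only for illustration; the helper spells out the same filter as nested loops, which may be easier to read for someone new to comprehensions.

```python
# Tokens as they appear in the question after tokenizing and lowercasing.
ff = [
    ['this', 'is', 'a', 'testsentence', 'and', 'it', 'is', 'beautiful', 'times', '10', '.'],
    ['it', 'is', 'a', 'beautiful', 'testsentence', '?'],
    ['testsentence', 'beautiful', '!'],
]

# isalpha() in the expression position: each token is replaced by a boolean.
flags = [[tok.isalpha() for tok in doc] for doc in ff]

# isalpha() in the filter position: the token itself is kept when the test passes.
gg = [[tok for tok in doc if tok.isalpha()] for doc in ff]


def clean_tokens(docs):
    """Hypothetical helper: the same filter written as explicit nested loops."""
    cleaned = []
    for doc in docs:
        kept = []
        for tok in doc:
            if tok.isalpha():
                kept.append(tok)
        cleaned.append(kept)
    return cleaned


print(gg)
```

Note that isalpha() also drops tokens such as '10', so if numbers should survive, a different test (for example isalnum()) would likely be the better fit.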
