Word n-grams for a list of sentences in Python

Question

I want to generate char-n-grams of sizes 2 to 4. This is what I have so far:

from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much']

for i in range(len(sentence)):
    for n in range(2, 4):
        n_grams = ngrams(sentence[i].split(), n)
        for grams in n_grams:
            print(grams)

This gives me:

('i', 'have')
('have', 'an')
('an', 'apple')
('i', 'have', 'an')
('have', 'an', 'apple')
('i', 'like')
('like', 'apples')
('apples', 'so')
('so', 'much')
('i', 'like', 'apples')
('like', 'apples', 'so')
('apples', 'so', 'much')

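How can I do this in the most efficient way? I have a very large number of input entries, and my solution uses nested for loops, so the complexity is fairly high and the algorithm takes a long time to finish.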

Tags: python, nlp, nltk, n-gram

Answer


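(Assuming you mean word n-grams rather than char n-grams.) I'm not sure how likely duplicate sentences are in your data, but you could try applying set to the input sentences, possibly combined with a list comprehension. As a baseline, here is the original nested loop, collecting the n-grams into a list: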

%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = []
for i in range(len(sentence)):
    for n in range(2, 4):
        for item in ngrams(sentence[i].split(), n):
            n_grams.append(item)

Result:

1000 loops, best of 3: 228 µs per loop

Just switching to a list comprehension gives a small improvement:

%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]

Result:

1000 loops, best of 3: 214 µs per loop
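
One thing worth noting: range(2, 4) only covers n = 2 and n = 3, because range() excludes its end value, so 4-grams would need range(2, 5). The comprehension above could be wrapped in a small helper along these lines. This is only a sketch: the name word_ngrams and its parameters n_min and n_max are made up for illustration, and it uses zip in place of nltk.ngrams so that it runs without any dependency.

```python
def word_ngrams(sentences, n_min=2, n_max=4):
    """Return word n-grams (as tuples) for every n with n_min <= n <= n_max."""
    result = []
    for sent in sentences:
        tokens = sent.split()  # split once per sentence, not once per n
        for n in range(n_min, n_max + 1):  # +1 because range() excludes its end
            # zip over n shifted views of the token list; this yields nothing
            # when the sentence has fewer than n tokens
            result.extend(zip(*(tokens[i:] for i in range(n))))
    return result


for gram in word_ngrams(['i have an apple']):
    print(gram)
```

For 'i have an apple' this yields the three bigrams and two trigrams shown in the question, plus the single 4-gram covering the whole sentence.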

Another approach is to use set together with a list comprehension:

%%timeit
from nltk import ngrams
sentences = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
# use of set
sentence = set(sentences)
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]

Result:

10000 loops, best of 3: 23.5 µs per loop

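So if there are many duplicate sentences, deduplicating first can help. Keep in mind that the set version also returns fewer n-grams, since each distinct sentence contributes only once.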
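
If the n-gram frequencies matter, a possible variant is to process each distinct sentence once and weight its n-grams by how often that sentence occurs. A minimal sketch, assuming that is what you need: ngram_counts and its parameters are hypothetical names, and zip again stands in for nltk.ngrams to keep the example dependency-free.

```python
from collections import Counter


def ngram_counts(sentences, n_min=2, n_max=3):
    """Count word n-grams, tokenizing each distinct sentence only once."""
    sentence_counts = Counter(sentences)  # distinct sentence -> occurrences
    counts = Counter()
    for sent, freq in sentence_counts.items():
        tokens = sent.split()
        for n in range(n_min, n_max + 1):
            for gram in zip(*(tokens[i:] for i in range(n))):
                # every n-gram of a repeated sentence is repeated as often
                counts[gram] += freq
    return counts


counts = ngram_counts(['so much', 'so much', 'i like apples'])
print(counts[('so', 'much')])
```

This keeps the speed-up from skipping duplicates while giving the same totals as counting over the full, undeduplicated list.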

