python - Word n-grams for a list of sentences in Python
Problem description
I want to generate char-n-grams of sizes 2 to 4. This is what I have so far:
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much']
for i in range(len(sentence)):
    for n in range(2, 4):
        n_grams = ngrams(sentence[i].split(), n)
        for grams in n_grams:
            print(grams)
This gives me:
('i', 'have')
('have', 'an')
('an', 'apple')
('i', 'have', 'an')
('have', 'an', 'apple')
('i', 'like')
('like', 'apples')
('apples', 'so')
('so', 'much')
('i', 'like', 'apples')
('like', 'apples', 'so')
('apples', 'so', 'much')
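What is the best way to do this? My input data is very large, and my solution uses nested for loops, so the complexity is rather high and the algorithm takes a long time to finish.
Solution
(Assuming you mean word n-grams rather than char n-grams.) I'm not sure whether your input can contain duplicate sentences, but if it can, you could try applying set to the input sentences, possibly combined with a list comprehension. First, the original nested loops as a baseline: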
%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = []
for i in range(len(sentence)):
    for n in range(2, 4):
        for item in ngrams(sentence[i].split(), n):
            n_grams.append(item)
Result:
1000 loops, best of 3: 228 µs per loop
Using just a list comprehension gives a small improvement:
%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]
Result:
1000 loops, best of 3: 214 µs per loop
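Note that range(2, 4) only yields n = 2 and n = 3, which matches the output shown in the question; covering sizes 2 to 4 would need range(2, 5). Below is a minimal sketch of the same comprehension with an inclusive upper bound. The helpers word_ngrams and all_ngrams are hypothetical names, not part of nltk, and the n-grams are built with zip so the snippet runs without any dependency:

```python
def word_ngrams(tokens, n):
    """Yield consecutive n-token tuples from a list of tokens (hypothetical helper)."""
    # zip over n shifted views of the token list; it stops at the shortest view
    return zip(*(tokens[i:] for i in range(n)))


def all_ngrams(sentences, min_n=2, max_n=4):
    """Collect word n-grams of sizes min_n..max_n (inclusive) for every sentence."""
    return [
        gram
        for sent in sentences
        for tokens in [sent.split()]          # split each sentence only once
        for n in range(min_n, max_n + 1)      # +1 because range excludes its end
        for gram in word_ngrams(tokens, n)
    ]


sentences = ['i have an apple', 'i like apples so much']
grams = all_ngrams(sentences)
print(len(grams))
```

Splitting each sentence once, rather than once per value of n, avoids a little repeated work; whether that is measurable on your data is something to check with timeit.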
Another approach is to use set together with a list comprehension:
%%timeit
from nltk import ngrams
sentences = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
# use of set
sentence = set(sentences)
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]
Result:
10000 loops, best of 3: 23.5 µs per loop
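So if there are many duplicate sentences, this can help: here the 29 input sentences collapse to 3 unique ones, which accounts for most of the speedup.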
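If n-gram frequencies matter, deduplicating with set discards them. One possible way to keep most of the speedup is to count identical sentences first and weight each unique sentence's n-grams by that count. This is a sketch under that assumption; count_ngrams is an illustrative name, and the n-grams are built with zip rather than nltk so the snippet is self-contained:

```python
from collections import Counter


def count_ngrams(sentences, min_n=2, max_n=3):
    """Count word n-grams, tokenizing each distinct sentence only once."""
    sentence_counts = Counter(sentences)      # sentence -> number of occurrences
    ngram_counts = Counter()
    for sent, times in sentence_counts.items():
        tokens = sent.split()
        for n in range(min_n, max_n + 1):
            for gram in zip(*(tokens[i:] for i in range(n))):
                ngram_counts[gram] += times   # weight by how often the sentence appears
    return ngram_counts


sentences = ['i have an apple', 'i like apples so much',
             'i like apples so much', 'so much']
counts = count_ngrams(sentences)
print(counts[('so', 'much')])
```

The defaults min_n=2, max_n=3 mirror the range(2, 4) used above.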