Improving ranking efficiency in a bag-of-words model

Problem description

I am building a text summarizer using a basic bag-of-words model.
The code I am running uses the nltk library. The file being read is large, with over 2,500,000 words. Below is the loop I am working with, but it takes more than 2 hours to run to completion. Is there a way to optimize this code?

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
from heapq import nlargest

with open('Complaints.csv', 'r') as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)
freq = nltk.FreqDist(tokens)       # frequency distribution over all tokens
top_words = freq.most_common(100)  # list of (word, count) pairs
print(top_words)

sentences = sent_tokenize(raw)

# Score each sentence by the summed frequency of its words
ranking = defaultdict(int)
for i, sent in enumerate(sentences):  # iterate over sentences, not characters of raw
    for word in word_tokenize(sent.lower()):
        if word in freq:
            ranking[i] += freq[word]

top_sentences = nlargest(10, ranking, ranking.get)
print(top_sentences)

This is just one file; the actual deployment has 10-15 files of similar size. How can we improve this?
Note that these are texts from a chatbot and are actual sentences, so there is no need for whitespace removal, stemming, or other text preprocessing methods.

Tags: python, nlp, nltk

Solution


Firstly, you open the whole large file at once, so it all has to fit into your RAM. Unless you have a really powerful machine, this may be the first performance bottleneck. Read the file line by line instead, or use an IO buffer.

What CPU do you have? If you have enough cores, you can gain a lot of extra performance by parallelizing the program with an async Pool from the multiprocessing library, because then you really use the full power of all cores (choose the number of processes according to your thread count). With this method I reduced a model run on 2500 data sets from ~5 minutes to ~17 seconds on 12 threads. You would have to implement each process so that it returns a dict, and merge the dicts after the processes have finished.
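A minimal sketch of that parallel word-counting idea, assuming the text can be split into independent chunks; the `chunked` helper, the chunk count, and the whitespace tokenization are illustrative stand-ins (swap in `nltk.word_tokenize` for real tokenization):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # Count word frequencies in one chunk; each worker returns its own Counter.
    return Counter(chunk.lower().split())

def chunked(text, n_chunks):
    # Naive splitter: cut the text into roughly equal pieces on whitespace.
    words = text.split()
    step = max(1, len(words) // n_chunks)
    return [' '.join(words[i:i + step]) for i in range(0, len(words), step)]

if __name__ == '__main__':
    raw = "the cat sat on the mat the cat ran"  # stand-in for the file contents
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, chunked(raw, 4))
    # Merge the per-process counts into one frequency table
    freq = Counter()
    for part in partials:
        freq.update(part)
    print(freq.most_common(3))
```

Because each chunk is counted independently, the merge step is a simple `Counter.update`; the same pattern extends naturally to processing your 10-15 files, one chunk (or file) per task.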

Otherwise, there are machine learning approaches to text summarization (sequence-to-sequence RNNs). With a TensorFlow implementation, you can use a dedicated GPU on your local machine (even a decent GTX 10xx or an RTX 2060 from Nvidia will help) to speed up your model.

https://docs.python.org/2/library/multiprocessing.html
https://arxiv.org/abs/1602.06023

Hope this helps.

