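python - Improving the efficiency of ranking in a bag-of-words model
Question
I am building a text summarizer, starting with a basic model that uses the bag-of-words approach.
The code I am running uses the nltk library. The input is a large file of more than 2,500,000 words. Below is the loop I am working on, but it takes more than 2 hours to run to completion. Is there a way to optimize this code?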
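import nltk
from collections import defaultdict
from heapq import nlargest
from nltk.tokenize import sent_tokenize, word_tokenize

# Read the whole file into memory
with open('Complaints.csv', 'r') as f:
    raw = f.read()
len(raw)

tokens = nltk.word_tokenize(raw)
len(tokens)

# Word frequencies over all tokens
freq = nltk.FreqDist(tokens)
top_words = freq.most_common(100)  # list of (word, count) pairs
print(top_words)

sentences = sent_tokenize(raw)
print(raw)

# Score each sentence by the summed frequency of its words
ranking = defaultdict(int)
for i, sent in enumerate(sentences):  # iterate over sentences, not characters
    for word in word_tokenize(sent.lower()):
        if word in freq:
            ranking[i] += freq[word]

# Indices of the 10 highest-scoring sentences
top_sentences = nlargest(10, ranking, ranking.get)
print(top_sentences)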
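This is just one file; the actual deployment has more than 10-15 files of similar size. How can we improve this?
Please note that the text comes from a chatbot and consists of actual sentences, so whitespace removal, stemming, and other text preprocessing steps are not needed.
Answer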
Firstly, you read a large file in one go, so all of it has to fit into your RAM. Unless you have a really good computer, this might be the first performance bottleneck. Read each line separately, or try to use some IO buffer.
What CPU do you have? If you have enough cores, you can get a lot of extra performance by parallelizing the program with an async Pool from the multiprocessing library, because you then really use the full power of all cores. Choose the number of processes according to the number of threads. With this method, I reduced a model on 2500 data sets from ~5 minutes to ~17 seconds on 12 threads. You would have to implement the processes so that each returns a dict, and merge those dicts after the processes have finished.
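As a rough illustration of both suggestions, here is a minimal sketch that reads the file in chunks of lines and counts words in a process pool, with each worker returning a Counter that is merged in the parent. Several things here are assumptions and not part of the question: the helper names (`tokenize`, `count_chunk`, `read_chunks`, `word_frequencies`, `top_lines`) are made up for illustration, a simple regex stands in for `nltk.word_tokenize` so the sketch runs without NLTK data, each line is treated as one sentence, and the chunk size and process count are arbitrary defaults.

```python
import re
from collections import Counter
from heapq import nlargest
from multiprocessing import Pool

# Stand-in tokenizer (assumption): swap in nltk.word_tokenize if preferred.
TOKEN_RE = re.compile(r"\w+")


def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return TOKEN_RE.findall(line.lower())


def count_chunk(lines):
    """Worker: return the word counts for one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def read_chunks(path, chunk_size=10_000):
    """Yield lists of lines so the whole file is never held in memory."""
    chunk = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk


def word_frequencies(path, processes=4, chunk_size=10_000):
    """Count words in parallel and merge the per-chunk Counters."""
    total = Counter()
    with Pool(processes=processes) as pool:
        chunks = read_chunks(path, chunk_size)
        for partial in pool.imap_unordered(count_chunk, chunks):
            total.update(partial)
    return total


def top_lines(path, freq, n=10):
    """Second pass: score each line by summed word frequency, keep the n best."""
    def scored():
        with open(path, "r", encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield sum(freq[w] for w in tokenize(line)), i
    return [i for _, i in nlargest(n, scored())]
```

To use it, call `freq = word_frequencies('Complaints.csv')` and then `top_lines('Complaints.csv', freq)` from inside an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn new processes. Whether this beats a single process depends on how expensive tokenization is relative to the cost of sending chunks to the workers, so it is worth timing on a real file.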
Otherwise, there are machine learning approaches to text summarization (sequence-to-sequence RNNs). With a TensorFlow implementation, you can use a dedicated GPU on your local machine (even a decent 10xx or a 2060 from Nvidia will help) to speed up your model.
https://docs.python.org/2/library/multiprocessing.html https://arxiv.org/abs/1602.06023
hope this helps