python - Gensim LDA model: topic diff becomes nan
Question
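I am new to topic modeling and to Gensim, so I am still trying to understand many of the concepts. I am trying to run Gensim's LDA model on a corpus of about 25,446,114 tweets. I used Gensim to create a streamed corpus and an id2word dictionary. I am using num_topics = 100 and chunksize = 85000 (85,000 tweets loaded at a time).
I am using Gensim 3.5.0 and Numpy 1.15.3.
Here is a link to the corpus and the id2word dictionary: https://drive.google.com/drive/folders/1FrJ8gJbiDqp3VC5syOjRVcQPcESdYOYa?usp=sharing
I don't know what I am doing wrong or how to fix it. The topic diff first hits inf and then nan, after which I start getting identical topics. Please help!
Here is the code: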
import pprint
import logging
import gensim

logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.INFO)

corpus = gensim.corpora.MmCorpus('disasterTweets.mm')
id2word = gensim.corpora.Dictionary.load('disasterTweets.dict')
id2word.filter_tokens(bad_ids=[id2word.token2id['eofeofeof']])
print('eofeofeof' in id2word.token2id)

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       chunksize=85000,
                                       num_topics=100)
pprint.pprint(lda_model.print_topics())
Here is the error I get:
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:1023: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:690: RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:700: RuntimeWarning: invalid value encountered in multiply
  sstats *= self.expElogbeta
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:690: RuntimeWarning: overflow encountered in add
  sstats[:, ids] += np.outer(expElogthetad.T, cts / phinorm)
/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py:700: RuntimeWarning: invalid value encountered in multiply
  sstats *= self.expElogbeta
Process ForkPoolWorker-30:
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.0/lib/python3.7/multiprocessing/pool.py", line 105, in worker
    initializer(*initargs)
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamulticore.py", line 333, in worker_e_step
    worker_lda.do_estep(chunk)  # TODO: auto-tune alpha?
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 725, in do_estep
    gamma, sstats = self.inference(chunk, collect_sstats=True)
  File "/home/ec2-user/env/lib/python3.7/site-packages/gensim/models/ldamodel.py", line 662, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 287500 is out of bounds for axis 1 with size 287500
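Solution
As far as I can tell from reading the thread under issue 217 on the Gensim GitHub issues page, this appears to be a bug, and some people there report that changing certain parameters resolved it. Please check that thread first and see whether the suggestions there solve your problem.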
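Separately from that thread, the final IndexError in the traceback says that a document refers to token id 287500 while the model's topic-word matrix has only 287500 columns (valid ids 0 to 287499). One possible cause, assuming the corpus was serialized before the filter_tokens call, is that removing a token from the dictionary shrinks and renumbers it, so the ids stored in the corpus no longer match. The sketch below is a minimal, hypothetical check (the helper names max_token_id and check_corpus_against_vocab are made up for illustration, and the toy corpus stands in for the real one) that compares the largest token id in a bag-of-words corpus with the vocabulary size before training.

```python
def max_token_id(corpus):
    """Return the largest token id in a bag-of-words corpus, or -1 if it is empty.

    Each document is expected to be an iterable of (token_id, count) pairs,
    which is the format a streamed bag-of-words corpus yields.
    """
    largest = -1
    for doc in corpus:
        for token_id, _count in doc:
            if token_id > largest:
                largest = token_id
    return largest


def check_corpus_against_vocab(corpus, vocab_size):
    """Return (ok, largest_id); ok is False when some token id >= vocab_size."""
    largest = max_token_id(corpus)
    return largest < vocab_size, largest


# Toy stand-in for the real corpus: two documents, token ids 0..4.
toy_corpus = [[(0, 1), (2, 3)], [(1, 2), (4, 1)]]

# A vocabulary of size 4 only covers ids 0..3, so id 4 is out of range,
# which mirrors "index 287500 is out of bounds for axis 1 with size 287500".
ok, largest = check_corpus_against_vocab(toy_corpus, vocab_size=4)
print(ok, largest)
```

With the real data, the same check would be run as check_corpus_against_vocab(corpus, len(id2word)) after the filter_tokens call; note that this makes one full pass over the corpus. If it reports a mismatch, rebuilding the corpus from the filtered dictionary, or skipping the filter step, may be worth trying. Whether that also removes the inf and nan values in the topic diff is not something this check can confirm.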