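python-3.x - Topic modeling with Gensim in Python: getting each document's topics by a fixed ID or associated data
Question
I have a question about topic modeling with Python and the gensim library. When I run the code below, it runs fine and produces relevant topics. However, I want to see the topic for each document listed in the .csv file, and the topics come out shuffled. For example, topic 1 comes from document 2, topic 2 comes from document 1, and topic 3 comes from document 3. When I run the same code again, the order is shuffled again. How can I fix this and get the topic for each document, and/or link each topic directly to the document's ID or author, which could be listed in the first column?
Code:
Step 1: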
import nltk
import csv
import re
import nltk.corpus
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
Step 2: Load and process the data
# Each line of the CSV file is treated as one document
doc_complete = open('/home/erdal/Desktop/big_data/abstract1.csv', 'r').readlines()
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase and drop stop words, strip punctuation, then lemmatize
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]
print(doc_clean)
# Map each token to an integer ID and build the bag-of-words corpus
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
print(ldamodel.print_topics(num_topics=3, num_words=3))
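Since the question mentions an ID or author in the first column, the following is a minimal sketch of keeping that column separate from the text while loading. The layout of abstract1.csv is not shown in the question, so the two-column layout (ID first, abstract second) is an assumption, and `read_id_and_text` is a hypothetical helper name.

```python
import csv
import io

def read_id_and_text(lines):
    """Split CSV rows into parallel lists of IDs and texts.

    Assumes column 0 holds the document ID (or author) and
    column 1 holds the abstract; this layout is an assumption.
    """
    doc_ids, texts = [], []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        doc_ids.append(row[0])
        texts.append(row[1])
    return doc_ids, texts

# In-memory sample standing in for the real file (hypothetical data)
sample = io.StringIO(
    'a1,"Clickers and student motivation"\n'
    'a2,"Intergroup learning"\n'
)
doc_ids, texts = read_id_and_text(sample)
```

Because `doc_ids` and `texts` stay in the same order, the ID at position i still belongs to the document at position i after cleaning and after building the bag-of-words corpus.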
Step 3: Print out the topics
# print_topics returns (topic_id, description) pairs for the model as a whole
myData = ldamodel.print_topics(num_topics=3, num_words=3)
with open('/home/erdal/Desktop/big_data/all_data_1_topics.csv', 'w', newline='') as myFile:
    writer = csv.writer(myFile)
    writer.writerows(myData)
print("Writing complete")
Topics:
[(0, '0.036*"learning" + 0.036*"student" + 0.026*"intergroup"'), (1, '0.005*"abstract" + 0.005*"significant" + 0.005*"using"'), (2, '0.042*"clickers" + 0.027*"motivation" + 0.027*"student"')]
Solution
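One possible approach, sketched under assumptions: the rows written in step 3 describe the model's topics, not the documents, so they cannot be matched to documents by position. A per-document view can be built from each document's topic distribution, kept in corpus order and paired with its ID. With gensim, such distributions can be obtained from `ldamodel.get_document_topics(bow, minimum_probability=0.0)` for each `bow` in `doc_term_matrix`, and passing a fixed `random_state` to `LdaModel` should make the topic numbering repeatable between runs. The helper names and the sample values below are hypothetical and stand in for real model output.

```python
import csv
import io

def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

def write_doc_topics(out_file, doc_ids, doc_topic_probs):
    """Write one row per document: its ID, dominant topic, and that topic's weight."""
    writer = csv.writer(out_file)
    writer.writerow(["doc_id", "topic_id", "probability"])
    for doc_id, topic_probs in zip(doc_ids, doc_topic_probs):
        topic_id, prob = dominant_topic(topic_probs)
        writer.writerow([doc_id, topic_id, round(float(prob), 3)])

# Hypothetical values standing in for real model output.
# With gensim they could be built in corpus order, for example:
#   doc_topic_probs = [ldamodel.get_document_topics(bow, minimum_probability=0.0)
#                      for bow in doc_term_matrix]
doc_ids = ["a1", "a2", "a3"]
doc_topic_probs = [
    [(0, 0.10), (1, 0.05), (2, 0.85)],
    [(0, 0.90), (1, 0.05), (2, 0.05)],
    [(0, 0.20), (1, 0.70), (2, 0.10)],
]

buffer = io.StringIO()  # swap for open(path, "w", newline="") to write a real file
write_doc_topics(buffer, doc_ids, doc_topic_probs)
rows = buffer.getvalue().splitlines()
```

Each output row is keyed by the document ID, so the file reads the same way regardless of how the topics happen to be numbered in a given run.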