python - Getting the recommended number of topics from an LDA model using pyspark.ml
Question
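I trained an LDA model with pyspark to classify text by topic, trying different values of K. To validate the chosen K, I would like to use the approach described in evaluate-topic-model-in-python-latent-dirichlet-allocation-lda. However, I don't know how to get the spark.ml equivalent of gensim's CoherenceModel.
The DataFrame looks like this: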
tokenizedText.show(truncate=True, n=5)
+------------+--------------------+
| ID| Tokens|
+------------+--------------------+
|0000qaqdWUAQ|[limpieza, mala, ...|
|0000qaqe2UAA|[transporte, deja...|
|0000qasxUUAQ| [correcto]|
|0000qatEJUAY| [bien]|
|0000qaqwMUAQ|[experiencia, agr...|
+------------+--------------------+
The base model is this:
from pyspark.ml.feature import IDF, HashingTF, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA, LDAModel
# Build the vocabulary and term-count vectors (terms must appear in at least 5 documents)
counter = CountVectorizer(inputCol="Tokens", outputCol="term_frequency", minDF=5)
counterModel = counter.fit(tokenizedText)
# Note: trainingData is not defined in this excerpt
vectorizedLaw = counterModel.transform(trainingData)
# Reweight the term counts with inverse document frequency
idf = IDF(inputCol="term_frequency", outputCol="tf_idf")
tfidfLaw = idf.fit(vectorizedLaw).transform(vectorizedLaw)
# Fit LDA with K=7 topics
lda = LDA(k=7, maxIter=50, featuresCol="tf_idf", seed=1234)
model = lda.fit(tfidfLaw)
I get:
model.logLikelihood(tfidfLaw)
Out[295]: -17745244.739330653
model.logPerplexity(tfidfLaw)
Out[296]: 7.63661972904619
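For comparing several values of K with the metric above, one possible approach is to fit one model per candidate K and record `logPerplexity` on held-out data (Spark documents it as an upper bound on perplexity, where lower is better). The following is only a minimal sketch: the helper names `fit_lda`, `log_perplexity_by_k` and `best_k`, the `fit` parameter, and the train/hold-out split are assumptions for illustration and are not part of the original post.

```python
def fit_lda(k, train_df, features_col="tf_idf"):
    """Fit a pyspark.ml LDA model with k topics, reusing the settings from the post."""
    # Imported lazily so the other helpers can be used without a Spark installation.
    from pyspark.ml.clustering import LDA
    return LDA(k=k, maxIter=50, featuresCol=features_col, seed=1234).fit(train_df)


def log_perplexity_by_k(candidate_ks, train_df, holdout_df, fit=fit_lda):
    """Return {k: logPerplexity on holdout_df} for every candidate k.

    `fit` is a hypothetical hook: any callable (k, train_df) -> model whose
    model exposes logPerplexity(df), as pyspark.ml's LDAModel does.
    """
    scores = {}
    for k in candidate_ks:
        model = fit(k, train_df)
        scores[k] = model.logPerplexity(holdout_df)
    return scores


def best_k(scores):
    """Pick the k with the lowest log-perplexity bound (lower is better)."""
    return min(scores, key=scores.get)
```

A possible usage, assuming the `tfidfLaw` DataFrame from the post: `train_df, holdout_df = tfidfLaw.randomSplit([0.8, 0.2], seed=1234)` followed by `scores = log_perplexity_by_k(range(3, 16, 2), train_df, holdout_df)`. Perplexity does not always agree with how interpretable the topics are, so it may be worth inspecting the whole curve rather than relying on the minimum alone.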
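Using gensim and following the evaluate-topic-model-in-python-latent-dirichlet-allocation-lda example (computing model perplexity and coherence score, plus hyperparameter tuning) is not feasible because of the data size. After a long execution, I got this error: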
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.net.NoRouteToHostException: No route to host
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779)
at shaded.v9_4.org.eclipse.jetty.io.SelectorManager.doFinishConnect(SelectorManager.java:355)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector.processConnect(ManagedSelector.java:232)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector.access$1400(ManagedSelector.java:62)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:543)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:401)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at shaded.v9_4.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
at java.base/java.lang.Thread.run(Thread.java:834)
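I am running on Databricks Runtime 6.5 ML (includes Apache Spark 2.4.5, Scala 2.11); driver type: 15.3 GB memory, 2 cores, 1 DBU.
Do you know of a suitable option for getting the recommended number of topics for an LDA model using pyspark.ml? Or is there a way to use the Gensim score that avoids the execution problem?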
Solution
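The question asks for an equivalent of gensim's CoherenceModel. As far as the post shows, pyspark.ml exposes only log-likelihood and log-perplexity, so one option is to compute a coherence score by hand from each topic's top terms. The sketch below implements the UMass coherence as defined by Mimno et al. (2011), which is not guaranteed to match gensim's `u_mass` value exactly because gensim applies its own smoothing and aggregation. The function names `umass_coherence` and `top_words_per_topic` are hypothetical, and the document counts are computed in plain Python on the driver, so this assumes a sample of the corpus small enough to collect.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (Mimno et al., 2011).

    top_words: the topic's top terms, most probable first.
    documents: iterable of token lists (e.g. a collected sample of the Tokens column).
    Returns sum over m > l of log((D(w_m, w_l) + 1) / D(w_l)); higher is better.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # term absent from the sample: skip the pair
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


def top_words_per_topic(lda_model, vocabulary, n_terms=10):
    """Map the term indices from LDAModel.describeTopics back to vocabulary terms."""
    rows = lda_model.describeTopics(n_terms).collect()
    return [[vocabulary[i] for i in row["termIndices"]] for row in rows]
```

With the objects from the question, a possible usage would be `topics = top_words_per_topic(model, counterModel.vocabulary)` and then averaging `umass_coherence(t, sample_docs)` over `topics` for each candidate K, where `sample_docs` is an assumed list of token lists collected from a sample of `tokenizedText`.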