LDA consistency in PySpark

Problem description

I am using PySpark (version 2.3.1), and I am trying to get the same results on every run of the following code:

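from pyspark.ml.clustering import LDA

# rescaledData: DataFrame with a "features" vector column
lda = LDA(k=10, seed=5, optimizer="em", featuresCol="features")
ldamodel = lda.fit(rescaledData)
ldatopics = ldamodel.describeTopics()  # top terms per topic
ldatopics.show(10)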

Output 1:

+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[0, 199, 2, 35, 1...|[0.02179604286102...|
|    1|[267, 142, 76, 50...|[0.01640698273265...|
|    2|[14, 6, 12, 29, 7...|[0.01542644578135...|
|    3|[279, 193, 21, 74...|[0.01304181652577...|
|    4|[12, 70, 252, 151...|[0.01104580800704...|
|    5|[9, 75, 474, 255,...|[0.01606660426132...|
|    6|[13, 4, 88, 3, 27...|[0.02825736583107...|
|    7|[42, 146, 26, 700...|[0.01156411695149...|
|    8|[89, 2, 82, 403, ...|[0.01666772169015...|
|    9|[1, 303, 411, 83,...|[0.02547416776649...|
+-----+--------------------+--------------------+

Even though I set a seed, I get different results every time I restart the application (close and reopen the notebook). Here is the second output:

+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[403, 199, 414, 1...|[0.01236421045802...|
|    1|[75, 109, 251, 5,...|[0.01551907510059...|
|    2|[12, 188, 6, 314,...|[0.01206780033644...|
|    3|[91, 76, 23, 82, ...|[0.01244511461388...|
|    4|[162, 127, 12, 14...|[0.01380643020451...|
|    5|[4, 46, 7, 220, 2...|[0.01591219626409...|
|    6|[89, 71, 272, 279...|[0.02027028435250...|
|    7|[1, 3, 13, 57, 27...|[0.02192425215634...|
|    8|[2, 0, 35, 87, 65...|[0.02033711369900...|
|    9|[194, 15, 37, 42,...|[0.01436615776405...|
+-----+--------------------+--------------------+
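When comparing two runs like these, it can help to check whether the topics are merely reordered or actually different, since topic ids carry no fixed meaning across fits. The sketch below is only a diagnostic, not an explanation of why the seed is not honored. The helpers `jaccard` and `best_matches` are hypothetical names introduced here, and the sample data uses only the term indices that are fully visible in the two truncated outputs above.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of term indices."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def best_matches(run_a, run_b):
    """For each topic in run_a, find the most similar topic in run_b.

    run_a and run_b map topic id -> list of top term indices.
    Returns {topic_a: (topic_b, overlap)}.
    """
    matches = {}
    for ta, terms_a in run_a.items():
        tb, score = max(
            ((tb, jaccard(terms_a, terms_b)) for tb, terms_b in run_b.items()),
            key=lambda pair: pair[1],
        )
        matches[ta] = (tb, score)
    return matches


# Assumption: only the indices fully visible in the truncated outputs are used.
run1 = {0: [0, 199, 2, 35], 6: [13, 4, 88, 3]}
run2 = {7: [1, 3, 13, 57], 8: [2, 0, 35, 87]}

matches = best_matches(run1, run2)
for topic, (other, score) in sorted(matches.items()):
    print(f"run 1 topic {topic} ~ run 2 topic {other} (Jaccard {score:.2f})")
```

On this partial data, topic 0 of the first run overlaps most with topic 8 of the second, which suggests that at least part of the difference is a reordering of topics. To compare the complete lists, the rows of `ldamodel.describeTopics()` could be collected from each run and passed in as the two dictionaries.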

Note that I run into the same problem at the .transform stage (even with a seed). The code I use is:

paramMap = {ldamodel.seed: 5}
ldaResults = ldamodel.transform(rescaledData, params=paramMap)

Do you have any hints that could help me?

Many thanks, Lorenzo

Tags: apache-spark, pyspark, apache-spark-mllib, text-mining, lda

Solution
