How to get feature names from PolynomialExpansion in pyspark 2.4.0

Problem description

I am using pyspark 2.4.0.

Here is the code:

from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors

df = spark\
    .createDataFrame([(Vectors.dense([-2.0, 2.3]),),
                      (Vectors.dense([0.0, 0.0]),),
                      (Vectors.dense([0.6, -1.1]),)],
                     ["features"])
px = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")
polyDF = px.transform(df)

spark.createDataFrame(polyDF.rdd.map(lambda row: [row['features']]+row.polyFeatures.tolist())).show()

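I need meaningful column names in place of _2, _3, and so on: something like x, x*x, x*y, y, y*y, ...

I need to apply this up to degree 3 and know which combination of inputs each output column corresponds to in pyspark. Can this be done in a simple way?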

Tags: python-3.x, pyspark

Solution


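I ran into the same problem, so I decided to try to work out the recursion from the output, without actually computing the recursive function.

To figure out which inputs are multiplied together in each term, I used prime numbers as the feature values, so that every product factors uniquely: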

from pyspark.ml.linalg import Vectors

ex = spark.createDataFrame([(1.0, Vectors.dense(2,3,5,7,11))], ["label", "features"])

Then I created a polynomial expansion of degree 3:

from pyspark.ml.feature import PolynomialExpansion

px = PolynomialExpansion(degree=3, inputCol="features", outputCol="polyFeatures")

Applying it and displaying the result gives:

ex = px.transform(ex)

ex.show(truncate=False)

+-----+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features              |polyFeatures                                                                                                                                                                                                                                                                                    |
+-----+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1.0  |[2.0,3.0,5.0,7.0,11.0]|[2.0,4.0,8.0,3.0,6.0,12.0,9.0,18.0,27.0,5.0,10.0,20.0,15.0,30.0,45.0,25.0,50.0,75.0,125.0,7.0,14.0,28.0,21.0,42.0,63.0,35.0,70.0,105.0,175.0,49.0,98.0,147.0,245.0,343.0,11.0,22.0,44.0,33.0,66.0,99.0,55.0,110.0,165.0,275.0,77.0,154.0,231.0,385.0,539.0,121.0,242.0,363.0,605.0,847.0,1331.0]|
+-----+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

With a=2, b=3, c=5, d=7, e=11, we see the following:

  • 2.0 = a
  • 4.0 = a*a
  • 8.0 = a*a*a
  • 3.0 = b
  • 6.0 = b*a
  • 12.0 = b*a*a
  • 9.0 = b*b
  • 18.0 = b*b*a
  • 27.0 = b*b*b
  • 5.0 = c
  • 10.0 = c*a
  • 20.0 = c*a*a
  • 15.0 = c*b
  • 30.0 = c*b*a
  • 45.0 = c*b*b
  • 25.0 = c*c
  • 50.0 = c*c*a
  • 75.0 = c*c*b
  • 125.0 = c*c*c ...
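
One way to double-check this reading of the output is to factor each value back into the primes. The sketch below is plain Python and does not need Spark. `decode_term` is a hypothetical helper name, not part of pyspark, and it assumes every value is a positive product of the chosen primes.

```python
def decode_term(value, primes, names):
    """Factor `value` over `primes` and return the matching names, largest prime first."""
    remaining = int(value)
    factors = []
    # Visit the primes from largest to smallest so that 6.0 reads as b*a, as in the list above.
    for prime, name in sorted(zip(primes, names), reverse=True):
        while remaining % prime == 0:
            remaining //= prime
            factors.append(name)
    return "*".join(factors)


primes = [2, 3, 5, 7, 11]
names = ["a", "b", "c", "d", "e"]

# First nine entries of the polyFeatures vector shown above.
first_values = [2.0, 4.0, 8.0, 3.0, 6.0, 12.0, 9.0, 18.0, 27.0]
decoded = [decode_term(v, primes, names) for v in first_values]
print(decoded)
# ['a', 'a*a', 'a*a*a', 'b', 'b*a', 'b*a*a', 'b*b', 'b*b*a', 'b*b*b']
```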

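So, for the recursion, you have three indices. The first index iterates over all the values. The second index takes values from 1 up to the first index, and also the "empty" index, which means no value is chosen for the second factor. The third index takes values from 1 up to the second index, again including the "empty" index. The code looks like this: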

l = [1, 2.0,3.0,5.0,7.0,11.0]
for i in range(1, 6):
    for j in range(i+1):
        for k in range(j+1):
            print(l[i]*l[j]*l[k])

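Note that I had to start the list with 1, which stands for the "empty" index case: multiplying by 1 leaves the product unchanged.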
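
The same triple loop can produce names instead of products, which is what the question asks for. This is a minimal sketch for degree 3 only. `poly_feature_names_degree3` is a hypothetical helper name, and the ordering is inferred from the output shown above rather than taken from the Spark source, so it is worth checking against your own data before relying on it.

```python
def poly_feature_names_degree3(names):
    """Return term names in the same order as the triple loop above (degree 3)."""
    padded = [""] + list(names)  # index 0 plays the role of the "empty" index
    result = []
    for i in range(1, len(padded)):
        for j in range(i + 1):
            for k in range(j + 1):
                # Skip the "empty" index so that lower-degree terms get shorter names.
                factors = [padded[idx] for idx in (i, j, k) if idx != 0]
                result.append("*".join(factors))
    return result


term_names = poly_feature_names_degree3(["a", "b", "c", "d", "e"])
print(len(term_names))   # 55
print(term_names[:9])
# ['a', 'a*a', 'a*a*a', 'b', 'b*a', 'b*a*a', 'b*b', 'b*b*a', 'b*b*b']
```

For the two-feature DataFrame in the question, calling the helper with ["x", "y"] gives nine names. Assuming the expanded vector has been split into one column per entry as in the question, a list such as ["features"] + term_names could then be passed to the DataFrame's toDF method to rename the columns.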

