Pyspark: convert a dense vector into columns

Problem description

I have a dataframe with four columns, one of which is a dense vector.

cust_id  label  prediction  probability
1        0      0           {"vectorType":"dense","length":2,"values":[0.5745528913772013,0.4254471086227987]}
2        0      0           {"vectorType":"dense","length":2,"values":[0.5185219003114524,0.4814780996885476]}
3        0      1           {"vectorType":"dense","length":2,"values":[0.37871114732242217,0.6212888526775778]}
4        0      1           {"vectorType":"dense","length":2,"values":[0.4352110724347864,0.5647889275652135]}
5        1      1           {"vectorType":"dense","length":2,"values":[0.49476519185173606,0.505234808148264]}

I want to split the dense vector into separate columns and store the output together with the remaining columns.

cust_id  label  prediction  split_int[0]  split_int[1]
1        0      0           0.574552891   0.425447109
2        0      0           0.5185219     0.4814781
3        0      1           0.378711147   0.621288853
4        0      1           0.435211072   0.564788928
5        1      1           0.494765192   0.505234808

I found some code online and was able to split the dense vector:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

def split_array_to_list(col):
    # Convert the ML dense vector into a plain Python list so it can be indexed.
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

df3 = (
    selected
    .select(split_array_to_list(F.col("probability")).alias("split_int"))
    .select([F.col("split_int")[i] for i in range(2)])
)
df3.show()

How do I add the other columns back? I tried this, but got TypeError: 'Column' object is not callable:

df3 = selected.select(F.col("cust_id") + ((split_array_to_list(F.col("probability")).alias("split_int")).select([F.col("split_int")[i] for i in range(2)])))

Tags: python, pyspark

Solution

Try withColumn with your udf. The attempt above fails because a Column has no .select method: PySpark resolves unknown attributes on a Column as field accesses that return another Column, and calling that returned Column raises TypeError: 'Column' object is not callable.

df3 = (
    selected
    .withColumn("split_int", split_array_to_list(F.col("probability")))
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])
)
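
For completeness, a minimal end-to-end sketch, assuming selected is built from pyspark.ml.linalg dense vectors (the sample values below are truncated from the table in the question, and the trailing drop is optional, only to match the desired output):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Sample rows shaped like the question's dataframe (values truncated).
selected = spark.createDataFrame(
    [
        (1, 0, 0, Vectors.dense([0.5745, 0.4254])),
        (2, 0, 0, Vectors.dense([0.5185, 0.4815])),
        (3, 0, 1, Vectors.dense([0.3787, 0.6213])),
    ],
    ["cust_id", "label", "prediction", "probability"],
)

def split_array_to_list(col):
    # Convert the ML dense vector into a plain Python list so it can be indexed.
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

df3 = (
    selected
    .withColumn("split_int", split_array_to_list(F.col("probability")))
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])
    .drop("split_int", "probability")  # optional: keep only the flattened columns
)
df3.show()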

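If you are on Spark 3.0 or later, the built-in pyspark.ml.functions.vector_to_array can replace the Python UDF entirely, which avoids the UDF serialization overhead; a sketch under that assumption:

from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

# vector_to_array casts the ML vector column to a plain array<double> natively.
df3 = (
    selected
    .withColumn("split_int", vector_to_array(F.col("probability")))
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])
)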