Expanding a vector column into ordinary columns in a DataFrame

Problem description

I want to expand a vector column into ordinary columns in a DataFrame. .transform creates the separate column, but when I try .show I get an error about the data type or "nullable" (see the example code below). How can I fix this?

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf

spark = SparkSession\
        .builder\
        .config("spark.driver.maxResultSize", "40g") \
        .config('spark.sql.shuffle.partitions', '2001') \
        .getOrCreate()

data = [(0.2, 53.3, 0.2, 53.3),
        (1.1, 43.3, 0.3, 51.3),
        (2.6, 22.4, 0.4, 43.3),
        (3.7, 25.6, 0.2, 23.4)]     
df = spark.createDataFrame(data, ['A','B','C','D'])
df.show(3)
df.printSchema() 

vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.printSchema()
new_df.show(3)

split1_udf = udf(lambda value: value[0], DoubleType())
split2_udf = udf(lambda value: value[1], DoubleType())
new_df = new_df.withColumn('c1', split1_udf('features')) \
               .withColumn('c2', split2_udf('features'))
new_df.printSchema()
new_df.show(3)  # the data type / nullable error appears here, when the UDFs actually run

Tags: apache-spark, pyspark

Solution

I don't know what is wrong with the UDF, but I found the alternative solution below: map over the RDD, unpack the vector into a tuple, and rebuild the DataFrame.

data = [(0.2, 53.3, 0.2, 53.3),
        (1.1, 43.3, 0.3, 51.3),
        (2.6, 22.4, 0.4, 43.3),
        (3.7, 25.6, 0.2, 23.4)]      
df = spark.createDataFrame(data, ['A','B','C','D'])  

vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)

def extract(row):
    # Keep the original columns and append the vector's elements as Python floats
    return (row.A, row.B, row.C, row.D) + tuple(row.features.toArray().tolist())

extracted_df = new_df.rdd.map(extract).toDF(['A','B','C','D', 'col1', 'col2'])
extracted_df.show()
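
For what it's worth, the usual cause of the UDF failure in the question is that indexing a DenseVector returns a numpy.float64 rather than a Python float, which Spark cannot serialize under the declared DoubleType. A minimal sketch of that fix, assuming this is indeed the cause here (fixed_df is a hypothetical name):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Cast each numpy.float64 element to a plain Python float so the
# return value matches the declared DoubleType
split1_udf = udf(lambda v: float(v[0]), DoubleType())
split2_udf = udf(lambda v: float(v[1]), DoubleType())

fixed_df = new_df.withColumn('c1', split1_udf('features')) \
                 .withColumn('c2', split2_udf('features'))
fixed_df.show(3)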
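
On Spark 3.0 and later, the same expansion can also be done without a Python UDF or an RDD round-trip, using pyspark.ml.functions.vector_to_array; a short sketch under that version assumption:

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Convert the vector column to an array column, then index into it
arr_df = new_df.withColumn('arr', vector_to_array('features'))
extracted_df = arr_df.select('A', 'B', 'C', 'D',
                             col('arr')[0].alias('col1'),
                             col('arr')[1].alias('col2'))
extracted_df.show()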
