How do I append DataFrames together in PySpark?

Question

I have a PySpark DataFrame that is the output of a machine learning prediction, as shown below:

predictions = model.transform(test_data)
+-----------------+-----------------+-----+------------------+-------+--------------------+--------------------+----------+
|col1_imputed     |col2_imputed     |label|          features|row_num|       rawPrediction|         probability|prediction|
+-----------------+-----------------+-----+------------------+-------+--------------------+--------------------+----------+
|        -0.002353|           0.9762|    0|[-0.002353,0.9762]|      1|[-0.8726465863653...|[0.29470390100153...|       1.0|
|         -0.08637|          0.06524|    0|[-0.08637,0.06524]|      3|[-0.6029409441836...|[0.35367114067727...|



root
 |-- col1_imputed: double (nullable = true)
 |-- col2_imputed: double (nullable = true)
 |-- label: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- row_num: integer (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

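I transformed the probability column to select only the positive prediction from its vector, but I want to append this new column to the DataFrame above (or replace the current probability column with this positive-only probability). I get an error when I try to do this.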

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

secondelement=udf(lambda v:float(v[1]),FloatType())
pos_prob = predictions.select(secondelement('probability')) #selects second element in probability column

#trying to add the new pos_prob column and naming it 'prob' to the dataframe:
df = predictions.withColumn('prob', predictions.select(secondelement('probability'))).collect()

AssertionError: col should be Column

After reading similar questions, I also tried to solve it with lit(), but that gives a different error:

df = all_preds.withColumn('prob', lit(all_preds.select(secondelement('probability')))).collect()

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

Tags: python, apache-spark, pyspark

Answer


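withColumn expects a Column expression as its second argument, not a DataFrame, which is why passing the result of select(...) (with or without lit()) fails. Calling the UDF directly on the column name returns a Column, so you can use the UDF with withColumn, for example: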

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

secondelement = udf(lambda v: float(v[1]), FloatType())
df = predictions.withColumn('prob', secondelement('probability'))
