Passing two lists to a pandas_udf in PySpark?

Problem description

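I am trying to compute the Euclidean distance between corresponding pairs. I tried a regular udf and it works fine. I would like to use a pandas_udf to make it faster.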

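from pyspark.sql import functions as F, types as T
from pyspark.sql.functions import array, pandas_udf, PandasUDFType

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)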
def calculate_euclidean_distance(feature1, feature2):
    from scipy.spatial import distance
    dist = float(distance.euclidean(feature1, feature2))
    return float(dist)

This is what the data looks like. The columns feature1 and feature2 each hold a list, and the two lists have the same dimension.

all_pairs_remove_same_pair_df.select("feature1", "feature2").show()

+--------------------+--------------------+
|            feature1|            feature2|
+--------------------+--------------------+
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|

all_pairs_remove_same_pair_df.withColumn("distance", calculate_euclidean_distance(array(F.col("feature1"), F.col("feature2"))))

This is the error I get:

TypeError: calculate_euclidean_distance() missing 1 required positional argument: 'feature2'

Tags: python, apache-spark, pyspark, user-defined-functions

Solution
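
The TypeError most likely comes from the call site: wrapping both columns in array(...) merges them into one column, so the UDF receives a single argument and 'feature2' is reported as missing. Passing the two columns as separate arguments should resolve that. A second issue is that a scalar pandas_udf is called with whole pandas Series (one batch of rows per call) and is expected to return a Series of the same length, so calling distance.euclidean on the two Series would not produce one distance per row. Below is a minimal sketch of one way to write it, not a definitive implementation. The names euclidean_rows and add_distance are hypothetical, NumPy is swapped in for scipy to vectorize over the batch, and DoubleType is used instead of FloatType on the assumption that 64-bit results are acceptable.

```python
import numpy as np
import pandas as pd


def euclidean_rows(feature1: pd.Series, feature2: pd.Series) -> pd.Series:
    """Row-wise Euclidean distance between two Series of equal-length vectors."""
    if len(feature1) == 0:
        # Guard for an empty batch: return an empty float Series.
        return pd.Series([], dtype=float)
    # Each element is one vector; stack the batch into (n_rows, n_dims) arrays.
    a = np.array(feature1.tolist(), dtype=float)
    b = np.array(feature2.tolist(), dtype=float)
    # One norm per row, so the output has the same length as the input batch.
    return pd.Series(np.linalg.norm(a - b, axis=1), index=feature1.index)


def add_distance(df):
    """Hypothetical helper: append a 'distance' column to a Spark DataFrame
    that has array columns 'feature1' and 'feature2'."""
    # pyspark is imported here so euclidean_rows can be tried with pandas alone.
    from pyspark.sql import functions as F, types as T
    from pyspark.sql.functions import pandas_udf

    distance_udf = pandas_udf(euclidean_rows, returnType=T.DoubleType())
    # Pass the two columns as two separate arguments, not wrapped in array().
    return df.withColumn(
        "distance", distance_udf(F.col("feature1"), F.col("feature2"))
    )
```

With the DataFrame from the question, the call would then be add_distance(all_pairs_remove_same_pair_df). The pandas part can be checked locally without a Spark session by calling euclidean_rows on two small Series of lists.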

