Euclidean distance between two DataFrames

Problem description

I have two DataFrames. For simplicity, assume each of them has only one entry:

+--------------------+                                                          
|        entry       |    
+--------------------+
|[0.34, 0.56, 0.87]  |
+--------------------+

+--------------------+                                                          
|        entry       |    
+--------------------+
|[0.12, 0.82, 0.98]  |
+--------------------+

How can I compute the Euclidean distance between the entries of these two DataFrames? This is the code I have so far:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

inference = udf(lambda x, y: float(distance.euclidean(x, y)), DoubleType())

inference_result = inference(a, b)

But I get the following error:

 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/lib/spark/python/pyspark/sql/udf.py", line 197, in wrapper
 return self(*args)
 File "/usr/lib/spark/python/pyspark/sql/udf.py", line 177, in __call__
 return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
 File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in _to_seq
 cols = [converter(c) for c in cols]
 File "/usr/lib/spark/python/pyspark/sql/column.py", line 68, in <listcomp>
 cols = [converter(c) for c in cols]
 File "/usr/lib/spark/python/pyspark/sql/column.py", line 56, in _to_java_column
 "function.".format(col, type(col)))
 TypeError: Invalid argument, not a string or column: DataFrame[embedding: 
 array<float>] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column 
 literals, use 'lit', 'array', 'struct' or 'create_map' function.

Tags: dataframe, pyspark, user-defined-functions, euclidean-distance

Solution
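
The traceback says the UDF was called with whole DataFrames, while a UDF only accepts columns (or column names) of a single DataFrame. One possible approach, sketched below, is to bring both arrays into the same row with a cross join and then apply the UDF to the two resulting columns. This is a minimal sketch under assumptions, not a definitive answer: the column name `entry` is taken from the tables in the question (the traceback suggests the real column may be called `embedding`), and the names `euclidean`, `pairwise_distances`, `entry_a`, `entry_b` and `distance` are hypothetical. The distance is computed with the standard `math` module rather than `scipy.spatial.distance.euclidean`, so the executors do not need SciPy.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pairwise_distances(a, b):
    """Cross join DataFrames a and b and add a 'distance' column.

    Assumes each DataFrame has a single array column named 'entry'
    (hypothetical; adjust to the real column name, e.g. 'embedding').
    """
    # Imported here so that euclidean() can be used and tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType

    distance_udf = udf(lambda x, y: float(euclidean(x, y)), DoubleType())

    # A UDF operates on columns of one DataFrame, so first put both
    # arrays into the same row by cross joining the two DataFrames.
    pairs = a.select(col("entry").alias("entry_a")).crossJoin(
        b.select(col("entry").alias("entry_b"))
    )
    return pairs.withColumn(
        "distance", distance_udf(col("entry_a"), col("entry_b"))
    )
```

With the DataFrames `a` and `b` from the question, `pairwise_distances(a, b).show()` should print one row per pair of entries. For the two sample vectors the distance is about 0.358. Note that a cross join produces every combination of rows, so for large DataFrames a join on a key column is likely a better fit.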

