Computing a sliding average of vectors

Problem description

I want to compute the average of several vectors using Spark. I have the following representation:

|vectors                                                     |
------------------------------------------------------------
|[6.08705997467041, 49.47844314575195, 0.09487666034155598]  |
|[6.059467792510986, 49.49903869628906, 0.05688282138794084] |
|[6.11596155166626, 49.48028564453125, 0.1072961373390558]   |
|[6.11596155166626, 49.48028564453125, 0.1072961373390558]   |
|[6.090848445892334, 49.500823974609375, 0.15015015015015015]|

Then I create a window to group the vectors 5 at a time:

|list_vector                                                                                                                                                                                                                                                                                                   |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|[[6.044274806976318, 49.50155258178711, 0.21052631578947367]]                                                                                                                                                                                                                                                 |
|[[6.044274806976318, 49.50155258178711, 0.21052631578947367], [6.050730228424072, 49.49321746826172, 0.1721170395869191]]                                                                                                                                                                                     |
|[[6.044274806976318, 49.50155258178711, 0.21052631578947367], [6.050730228424072, 49.49321746826172, 0.1721170395869191], [6.040494441986084, 49.5018310546875, 0.15313935681470137]]                                                                                                                         |
|[[6.044274806976318, 49.50155258178711, 0.21052631578947367], [6.050730228424072, 49.49321746826172, 0.1721170395869191], [6.040494441986084, 49.5018310546875, 0.15313935681470137], [6.056500434875488, 49.504425048828125, 0.1297016861219196]]                                                            |
|[[6.044274806976318, 49.50155258178711, 0.21052631578947367], [6.050730228424072, 49.49321746826172, 0.1721170395869191], [6.040494441986084, 49.5018310546875, 0.15313935681470137], [6.056500434875488, 49.504425048828125, 0.1297016861219196], [6.081665515899658, 49.50476837158203, 0.2849002849002849]]|

Now, for each row, I want to compute the weighted mean of these vectors, using the third value in each list as the weight. I wrote a udf-based solution, but I suspect there must be a way to do this with Spark built-ins. My udf:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType


def vector_mean(vectors):
    np_v = np.array(vectors)
    np_vectors = np_v[:, 0:-1]        # x and y components
    weights = np_v[:, -1:].flatten()  # third value is the weight

    try:
        return np.average(np_vectors, weights=weights, axis=0).tolist()
    except ZeroDivisionError:         # all weights are zero
        return [0.0, 0.0]


vector_mean_udf = udf(vector_mean, ArrayType(FloatType()))
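As a quick sanity check outside Spark, the same logic runs on plain Python lists; the two sample vectors below are taken from the `list_vector` table above.

```python
# Plain-NumPy check of the udf logic, using two rows from the question's table.
import numpy as np

def vector_mean(vectors):
    np_v = np.array(vectors)
    np_vectors = np_v[:, 0:-1]        # x and y components
    weights = np_v[:, -1:].flatten()  # third value is the weight
    try:
        return np.average(np_vectors, weights=weights, axis=0).tolist()
    except ZeroDivisionError:         # all weights are zero
        return [0.0, 0.0]

result = vector_mean([
    [6.044274806976318, 49.50155258178711, 0.21052631578947367],
    [6.050730228424072, 49.49321746826172, 0.1721170395869191],
])
print(result)  # roughly [6.047, 49.498], matching the second output row below
```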

The result is correct, but slow:

|vector_mean           |
----------------------
|[6.044275, 49.501553] |
|[6.0471787, 49.497803]|
|[6.045268, 49.498955] |
|[6.047457, 49.50002]  |
|[6.057712, 49.501446] |

I tried using:

from pyspark.ml.feature import VectorAssembler

vec_assembler = VectorAssembler(inputCols=["X", "Y", "W"], outputCol="vectors")

However, I could not find in the documentation how to compute the mean from there. Is there a way to compute a vector average using only Spark 2.3 built-in functions?

Documentation: https://spark.apache.org/docs/2.3.0/ml-features.html#vectorassembler

I am on Spark 2.3 and cannot upgrade.

Tags: python, apache-spark, pyspark
