Spark (Scala) udf to modify a map in a dataframe column

Problem description

I have a dataframe that looks like this. The tfs column is a map from String to Long, and weight is a floating-point number:

+----------+---+------------------------+------------------+
|term      |df |tfs                     |weight            |
+----------+---+------------------------+------------------+
|keyword1  |2  |{2.txt -> 2, 1.txt -> 2}|1.3               |
|keyword2  |1  |{2.txt -> 1}            |0.6931471805599453|
|keyword3  |2  |{2.txt -> 1, 1.txt -> 2}|0.52343473        |
+----------+---+------------------------+------------------+

I want to combine the last two columns by multiplying each value in the tfs map by its respective weight, to get something like:

+----------+---+------------------------------------------+
|term      |df |weighted-tfs                              |
+----------+---+------------------------------------------+
|keyword1  |2  |{2.txt -> 2.6, 1.txt -> 2.6}              |
|keyword2  |1  |{2.txt -> 0.6931471805599453}             |
|keyword3  |2  |{2.txt -> 0.52343473, 1.txt -> 1.04686946}|
+----------+---+------------------------------------------+

My guess is that writing a udf for this would be simple, but I'm fairly inexperienced in both Spark and Scala, so I don't know exactly how to do it.

Tags: scala, apache-spark

Solution


Use the map_from_arrays, map_keys and map_values functions together with the transform higher-order function (map_from_arrays and transform require Spark 2.4+): keep the original keys of tfs and multiply each of its values by weight.

Try the code below.

import org.apache.spark.sql.functions.{expr, map_from_arrays, map_keys}
import spark.implicits._ // for the $"col" syntax; assumes a SparkSession named `spark`

val finalDF = df
  .withColumn(
    "weighted-tfs",
    map_from_arrays(
      map_keys($"tfs"),                                   // keep the original keys
      expr("transform(map_values(tfs), i -> i * weight)") // scale each value by `weight`
    )
  )

Output

finalDF.show(false)

+--------+---+------------------------+------------------+------------------------------------------+
|term    |df |tfs                     |weight            |weighted-tfs                              |
+--------+---+------------------------+------------------+------------------------------------------+
|keyword1|2  |[2.txt -> 2, 1.txt -> 2]|1.3               |[2.txt -> 2.6, 1.txt -> 2.6]              |
|keyword2|1  |[2.txt -> 1]            |0.6931471805599453|[2.txt -> 0.6931471805599453]             |
|keyword3|2  |[2.txt -> 1, 1.txt -> 2]|0.52343473        |[2.txt -> 0.52343473, 1.txt -> 1.04686946]|
+--------+---+------------------------+------------------+------------------------------------------+
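
To get exactly the three columns from the desired output, drop the source columns afterwards:

finalDF.drop("tfs", "weight").show(false)

Since the question mentions a udf, here is a minimal udf-based sketch of the same computation; it also works on Spark versions before 2.4, where transform is unavailable. The name weightTfs is just illustrative:

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Multiply every value of the tfs map by the row's weight.
val weightTfs = udf { (tfs: Map[String, Long], weight: Double) =>
  tfs.map { case (k, v) => k -> v * weight }
}

val weightedDF = df
  .withColumn("weighted-tfs", weightTfs($"tfs", $"weight"))
  .drop("tfs", "weight")

Prefer the built-in functions above when your Spark version supports them: udfs are opaque to the Catalyst optimizer, so they generally perform worse.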
