Transposing a Spark DataFrame based on its sub-columns

Problem description

I have a Spark DataFrame that looks like this:

root
|-- 0000154d-7585-5eb283ff985c: struct (nullable = true)
|    |-- collaborative_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- content_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- curated_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- discovery_score: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- original_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- recipe_id: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|-- 00005426-2675-68085cd359c7: struct (nullable = true)
|    |-- collaborative_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- content_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- curated_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- discovery_score: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- original_rank: array (nullable = true)
|    |    |-- element: long (containsNull = true)
|    |-- recipe_id: array (nullable = true)
|    |    |-- element: long (containsNull = true)

Each column is a user ID, e.g. 0000154d-7585-5eb283ff985c, and each row consists of 15,000 users (the rows come from JSON files that each contain 15,000 users).

I want to transpose it so that each user ID becomes a row and each sub-column (collaborative_rank, content_rank, curated_rank, discovery_score, original_rank, and recipe_id) becomes a column, with the arrays as the values. I'm new to Spark; is there a painless way to do this?

Edit:

For reference, the input .json file I am reading looks like this:

{"0000154d-7585-4096-a71a-5eb283ff985c": {"recipe_id": [1, 2, 3], "collaborative_rank": [1, 2, 3], "curated_rank": [1, 2, 3], "discovery_score": [1]}, "00005426-2675-4940-8394-e8085cd359c7": {"recipe_id": [] ... }

and so on.

Tags: dataframe, apache-spark, pyspark

Solution


If you don't want to convert it to an RDD and write a UDF, you can consider stacking the DataFrame instead.

from pyspark.sql.functions import expr

df = spark.read.json(r'C:\stackoverflow\samples\inp.json')

# Build the arguments for stack(): the number of columns, followed by a
# ('column name', column value) pair for each user-id column. The backticks
# are needed because the column names contain hyphens.
stack_characteristics = str(len(df.columns)) + ',' + ','.join(f"'{v}',`{v}`" for v in df.columns)

df.select(expr(f'stack({stack_characteristics})').alias('userId', 'vals')) \
  .select('userId', 'vals.*').show()

+--------------------+------------------+------------+---------------+---------+
|              userId|collaborative_rank|curated_rank|discovery_score|recipe_id|
+--------------------+------------------+------------+---------------+---------+
|0000154d-7585-409...|         [1, 2, 3]|   [1, 2, 3]|            [1]|[1, 2, 3]|
|00005426-2675-494...|         [1, 2, 3]|   [1, 2, 3]|            [1]|[1, 2, 3]|
+--------------------+------------------+------------+---------------+---------+
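To see what actually gets passed to stack(), here is a minimal sketch in plain Python (no Spark needed) that builds the same argument string for the two user-id columns from the sample input:

```python
# Build the stack() argument string the same way the answer does,
# using the two user-id columns from the sample JSON.
columns = ['0000154d-7585-4096-a71a-5eb283ff985c',
           '00005426-2675-4940-8394-e8085cd359c7']

# Format: "<n>,'name1',`name1`,'name2',`name2`,..."
# The quoted name becomes the userId value; the backticked name is the
# struct column whose fields end up in 'vals'.
stack_args = str(len(columns)) + ',' + ','.join(f"'{v}',`{v}`" for v in columns)
print(stack_args)
```

The resulting expression, `stack(2, 'uuid1', `uuid1`, 'uuid2', `uuid2`)`, pivots the two struct columns into two rows of (userId, vals).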

