首页 > 解决方案 > pyspark 嵌套数组爆炸

问题描述

我有一个数据框如下:

df = spark.createDataFrame(
    [
        (
            1, 
            "2017-12-03", 
            """{"1":[{"john":[12443,12441],"james":[14380,14379,13463],"mike":[15284,15280]}],"2":[{"brian":[15284,15280],"julio":[15284],"org":[]}]}"""
        ),
        (
            2, 
            "2017-12-04", 
            """{"1":[{"john":[12443,12441],"james":[14380,14379,13463],"mike":[15284,15280]}],"2":[{"brian":[15284,15280]}]}"""
        )
    ],
    ("id", "date", "users")
)

我有一个加载为 json 的函数:

@udf("map<string, array<string>>")
def parse(s):
    try:
        return json.loads(s)
    except:
        pass

当我选择顶层时,它看起来不错,但会为用户删除双引号:

df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list")).show()

+---+----------+-------+--------------------+
| id|      date|tier_id|           user_list|
+---+----------+-------+--------------------+
|  1|2017-12-03|      1|[{john=[12443, 12...|
|  1|2017-12-03|      2|[{julio=[15284], ...|
|  2|2017-12-04|      1|[{john=[12443, 12...|
|  2|2017-12-04|      2|[{brian=[15284, 1...|
+---+----------+-------+--------------------+

当我尝试爆炸用户时,我收到以下错误消息:

df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list"))\
    .withColumn("user_list", explode("user_list")).alias("user", "drill").show()

TypeError: alias() takes exactly 2 arguments (3 given)

我认为它不会爆炸 user_list 因为所有的双引号都被删除了。请问我怎样才能让这个工作?

标签: apache-sparkpyspark

解决方案


问题是您的 udf 返回的是字符串数组而不是地图数组。您可以再次使用 json 库解析字符串,或者您可以更改 udf 以返回正确的类型:

@udf("map<string, array<map<string, array<int>>>>")
def parse(s):
    try:
        return json.loads(s)
    except:
        pass

这将返回预期的类型(字典数组,带有字符串键和整数数组的值)。

df.select("id", "date", explode(parse("users")).alias("tier_id", "user_list"))\
    .withColumn("user_list", explode("user_list"))\
    .select("id", "date", "tier_id", explode("user_list").alias("user", "drill")).show()

# +---+----------+-------+-----+--------------------+
# | id|      date|tier_id| user|               drill|
# +---+----------+-------+-----+--------------------+
# |  1|2017-12-03|      1| john|      [12443, 12441]|
# |  1|2017-12-03|      1| mike|      [15284, 15280]|
# |  1|2017-12-03|      1|james|[14380, 14379, 13...|
# |  1|2017-12-03|      2|julio|             [15284]|
# |  1|2017-12-03|      2|brian|      [15284, 15280]|
# |  1|2017-12-03|      2|  org|                  []|
# |  2|2017-12-04|      1| john|      [12443, 12441]|
# |  2|2017-12-04|      1| mike|      [15284, 15280]|
# |  2|2017-12-04|      1|james|[14380, 14379, 13...|
# |  2|2017-12-04|      2|brian|      [15284, 15280]|
# +---+----------+-------+-----+--------------------+

请注意,您需要爆炸user_list两次,因为它以地图数组而不是地图类型开始。


推荐阅读