首页 > 解决方案 > pyspark 保存 json 处理结构的空值

问题描述

在这里使用 Pyspark 和 Spark 2.4,Python3。在将数据帧写为 json 文件时,如果 struct 列为 null,我希望它被写为{},如果 struct 字段为 null,我希望它为"". 例如:

    >>> df.printSchema()
    root
     |-- id: string (nullable = true)
     |-- child1: struct (nullable = true)
     |    |-- f_name: string (nullable = true)
     |    |-- l_name: string (nullable = true)
     |-- child2: struct (nullable = true)
     |    |-- f_name: string (nullable = true)
     |    |-- l_name: string (nullable = true)

     >>> df.show()
    +---+------------+------------+
    | id|      child1|      child2|
    +---+------------+------------+
    |123|[John, Matt]|[Paul, Matt]|
    |111|[Jack, null]|        null|
    |101|        null|        null|
    +---+------------+------------+
    df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')

结果:


    {"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
    {"id":"111","child1":{"f_name":"jack","l_name":""}}
    {"id":"111"}

输出要求:


    {"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
    {"id":"111","child1":{"f_name":"jack","l_name":""},"child2": {}}
    {"id":"111","child1":{},"child2": {}}

我尝试了一些地图和 udf,但无法达到我所需要的。在这里感谢您的帮助。

标签: python-3.xapache-sparkpysparkapache-spark-sql

解决方案


火花 3.x

如果您将选项传递ignoreNullFields到您的代码中,您将有这样的输出。不完全是您要求的空结构,但架构仍然正确。

df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')
{"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
{"child1":null,"child2":null,"id":"101"}

火花 2.x

由于上面的那个选项不存在,我认为有一个“肮脏的修复”,它模仿 JSON 结构并绕过空检查。同样,结果与您要求的不完全一样,但架构是正确的。

(df
    .select(F.struct(
        F.col('id'),
        F.coalesce(F.col('child1'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child1'),
        F.coalesce(F.col('child2'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child2')
    ).alias('json'))
    .coalesce(1).write.mode("overwrite").format('json').save('/home/test')
)
{"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
{"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
{"json":{"id":"101","child1":{},"child2":{}}}

推荐阅读