Spark: a clean way to fill nulls based on a schema

Problem description

I have Avro files in the following format:

|Some col|Some other col|          body         |
|--------|--------------|-----------------------|
|some val|   some val   |   some json string    |
|  ...   |     ...      |         ...           |

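Question: is there a clean way to select every column found in the JSON string, plus every column that is defined in the schema but missing from the JSON string, with None inserted as the value for the missing ones?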

Tags: python, apache-spark, pyspark, databricks

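Solution

Parse the body column with from_json against the full schema, then expand the resulting struct with select("data.*"). Fields that are defined in the schema but absent from a given JSON string come back as null:
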
In [1]: from pyspark.sql.types import StructField, StructType, StringType
   ...: from pyspark.sql.functions import col, from_json

In [2]: schema = StructType([
   ...:     StructField("a", StringType()),
   ...:     StructField("b", StringType()),
   ...:     StructField("c", StringType()),
   ...:     StructField("d", StringType()),
   ...: ])

In [3]: df = spark.createDataFrame([("1", '{"a": 1, "b": 2}'),
   ...:                             ("2", '{"a": 3, "c": 4}')],
   ...:                            schema=["Some col", "body"])

In [4]: df.show()
+--------+----------------+
|Some col|            body|
+--------+----------------+
|       1|{"a": 1, "b": 2}|
|       2|{"a": 3, "c": 4}|
+--------+----------------+

In [5]: df.select(from_json(col("body"), schema).alias("data")).select("data.*").show()
+---+----+----+----+
|  a|   b|   c|   d|
+---+----+----+----+
|  1|   2|null|null|
|  3|null|   4|null|
+---+----+----+----+
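
To make the per-row behaviour concrete, here is a minimal plain-Python sketch of the same idea: take the field names from the schema, look each one up in the parsed JSON, and use None where the key is missing. The helper name fill_from_schema and the SCHEMA_FIELDS list are hypothetical and exist only for illustration. This is not how Spark implements from_json, and it does not cast values to the schema's types.

```python
import json

# Field names copied from the schema in the session above.
SCHEMA_FIELDS = ["a", "b", "c", "d"]


def fill_from_schema(body, fields=SCHEMA_FIELDS):
    """Hypothetical helper: one entry per schema field, None where the JSON lacks the key."""
    try:
        parsed = json.loads(body)
    except (TypeError, ValueError):
        # Unparseable input: treat every field as missing.
        parsed = None
    if not isinstance(parsed, dict):
        return {name: None for name in fields}
    # dict.get returns None for absent keys; keys outside the schema are dropped.
    return {name: parsed.get(name) for name in fields}


# The two body strings from the example DataFrame.
rows = [fill_from_schema(b) for b in ('{"a": 1, "b": 2}', '{"a": 3, "c": 4}')]
for row in rows:
    print(row)
```

As in the Spark output above, the schema alone decides which columns appear, so a field such as d shows up as None in every row even though no JSON string contains it.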
