apache-spark - Spark Structed Streaming 从 kafka 读取嵌套的 json 并将其展平
问题描述
一个 json 类型的数据:
{
"id": "34cx34fs987",
"time_series": [
{
"time": "2020090300: 00: 00",
"value": 342342.12
},
{
"time": "2020090300: 00: 05",
"value": 342421.88
},
{
"time": "2020090300: 00: 10",
"value": 351232.92
}
]
}
我从kafka得到了json:
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
df = spark.readStream.format("kafka")...
如何操作 df 以获取 DataFrame,如下所示:
id time value
34cx34fs987 20200903 00:00:00 342342.12
34cx34fs987 20200903 00:00:05 342421.88
34cx34fs987 20200903 00:00:10 351232.92
解决方案
pyspark 中的示例代码
df2 = df.select("id", f.explode("time_series").alias("col"))
df2.select("id", "col.time", "col.value").show()