首页 > 解决方案 > 如何使用 spark sql scala 拆分列中的 geojson 数据

问题描述

我有 GeoJSON 数据作为结构类型,如下所示:

root
 |-- features: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- geometry: struct (nullable = true)
 |    |    |    |-- coordinates: array (nullable = true)
 |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- properties: struct (nullable = true)
 |    |    |    |-- auswertezeit: string (nullable = true)
 |    |    |    |-- geschwindigkeit: long (nullable = true)
 |    |    |    |-- strecke_id: long (nullable = true)
 |    |    |    |-- verkehrsstatus: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)

我将在列中拆分数据:strecke_id、auswertezeit、strecke_id、verkehrsstatus、geschwindigkeit 和坐标。谢谢您的帮助

标签: scalaapache-spark-sql

解决方案


这是一个稍微简化的示例,但这将为您提供根据自己的规范调整逻辑的方向:

import sparkSession.implicits._

val geoDF = sparkSession.read.json("./src/test/resources/geo.json")

val resultDf = geoDF.withColumn("exploaded", functions.explode($"features"))
  .select("exploaded.properties.auswertezeit", "exploaded.properties.geschwindigkeit",
    "exploaded.properties.strecke_id", "exploaded.properties.verkehrsstatus")

resultDf.show()
resultDf.printSchema()

输入数据(格式化):

{
    "features": [
        {
            "properties": {
                "auswertezeit": "x",
                "geschwindigkeit": 1,
                "strecke_id": 11,
                "verkehrsstatus": "xx"
            }
        },
        {
            "properties": {
                "auswertezeit": "y",
                "geschwindigkeit": 2,
                "strecke_id": 22,
                "verkehrsstatus": "yy"
            }
        }
    ],
    "type": "xyz"
}

结果:

+------------+---------------+----------+--------------+
|auswertezeit|geschwindigkeit|strecke_id|verkehrsstatus|
+------------+---------------+----------+--------------+
|           x|              1|        11|            xx|
|           y|              2|        22|            yy|
+------------+---------------+----------+--------------+

root
 |-- auswertezeit: string (nullable = true)
 |-- geschwindigkeit: long (nullable = true)
 |-- strecke_id: long (nullable = true)
 |-- verkehrsstatus: string (nullable = true)

推荐阅读