首页 > 解决方案 > 替换 Spark 数据框中嵌套 json 字符串列中的值

问题描述

我有一个嵌套的 JSON 字符串作为列的一部分。我需要用地图中的一些东西替换更深的 JSON 中的键和值。

我之前使用过 typedLit 和一个列,但是如何在这里使用它。

    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
      ("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
      ("3",
        """{"nstKey":[ {"fn":"name1","v1":false,"v2":true},
          |{"fn":"name2","v1":"100","v2":"200"}
          |]}""".stripMargin, "some_other_field_3")

    ).toDF("id", "json", "other")

    df.show(truncate = false)

    val kMap=Map("fn"->"rfn","v1"->"rv1","v2"->"rv2")
    val vmap=Map("name1"->"rename1","name2"->"rename2")

    val df1= df.withColumn("id1",col("id"))
    .withColumn("other1",col("other"))
    .withColumn("k",get_json_object(col("json"),"$.k"))
    .withColumn("v",get_json_object(col("json"),"$.v"))
    .withColumn("p",get_json_object(col("json"),"$.p"))
    .withColumn("q",get_json_object(col("json"),"$.q"))
    .withColumn("nestedKey",get_json_object(col("json"),"$.nstKey"))
    .select("id1","other1","k","v","p","q","nestedKey")
  
  df1.show(truncate = false)

结果 :

+---+------------------+----+----+----+----+--------------------------------------------------------------------------+
|id1|other1            |k   |v   |p   |q   |nestedKey                                                                 |
+---+------------------+----+----+----+----+--------------------------------------------------------------------------+
|1  |some_other_field_1|foo |1.0 |null|null|null                                                                      |
|2  |some_other_field_2|null|null|bar |3.0 |null                                                                      |
|3  |some_other_field_3|null|null|null|null|[{"fn":"name1","v1":500,"v2":true},{"fn":"name2","v1":"abc","v2":"A-1"}]|
+---+------------------+----+----+----+----+--------------------------------------------------------------------------+


期待:

+---+------------------+----+----+----+----+--------------------------------------------------------------------------+
|id1|other1            |k   |v   |p   |q   |nestedKey                                                                 |
+---+------------------+----+----+----+----+--------------------------------------------------------------------------+
|1  |some_other_field_1|foo |1.0 |null|null|null                                                                      |
|2  |some_other_field_2|null|null|bar |3.0 |null                                                                      |
|3  |some_other_field_3|null|null|null|null|[{"rfn":"rename1","rv1":500,"rv2":true},{"rfn":"rename2","rv1":"abc","rv2":"A-1"}]|
+---+------------------+----+----+----+----+--------------------------------------------------------------------------+

标签: jsonscaladataframeapache-spark

解决方案


您可以使用regexp_replace函数来替换 JSON 字符串值。

  val keyChangedDF = kMap.keySet.foldLeft(df1)((df,c)=>
    df.withColumn("nestedKey", regexp_replace('nestedkey, "\""+c+"\":", "\""+kMap(c)+"\":")))

  val valueChangedDF = vMap.keySet.foldLeft(keyChangedDF)((df,c)=>
    df.withColumn("nestedKey", regexp_replace('nestedkey, "\""+c+"\",", "\""+vMap(c)+"\",")))

推荐阅读