How to check whether a required key exists in the JSON of a Spark Scala DataFrame

Problem description

I have a DataFrame like the one below.

ID,       details_Json
1         {"name":"Anne","Age":"12","country":"Denmark"}
2         {"name":"Zen","Age":"24"}
3         {"name":"Fred","Age":"20","country":"France"}
4         {"name":"Mona","Age":"18","country":"Denmark"}

As you can see, the fields in the JSON are not fixed. A row can contain any subset of the given fields; I mean it may sometimes be `name, Age, country`, sometimes `name, Age, country, university`, and sometimes `name, Age, university`.

I want to filter the rows whose JSON contains `country` and where the country equals Denmark.

My output should look like the following.

ID,       details_Json
1         {"name":"Anne","Age":"12","country":"Denmark"}
4         {"name":"Mona","Age":"18","country":"Denmark"}

Is there a way to do this?

Thanks :)

Tags: sql, scala, apache-spark, apache-spark-sql

Solution


Here is one approach:

// Construct the sample DataFrame (toDF requires the implicits import)
import spark.implicits._

val df = sc.parallelize(Seq(
  (1, """{"name":"Anne","Age":"12","country":"Denmark"}"""),
  (2, """{"name":"Zen","Age":"24"}"""),
  (3, """{"name":"Fred","Age":"20","country":"France"}"""),
  (4, """{"name":"Mona","Age":"18","country":"Denmark"}"""))).toDF("ID", "details_Json")

df.show

+---+--------------------+
| ID|        details_Json|
+---+--------------------+
|  1|{"name":"Anne","A...|
|  2|{"name":"Zen","Ag...|
|  3|{"name":"Fred","A...|
|  4|{"name":"Mona","A...|
+---+--------------------+

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Declare a schema covering the keys we care about; missing keys parse as null
val struct =
  StructType(
    StructField("name", StringType, true) ::
    StructField("Age", StringType, true) ::
    StructField("country", StringType, true) :: Nil)

val df2 = df
  .withColumn("details_Struct", from_json($"details_Json", struct))
  .withColumn("country", $"details_Struct".getField("country"))
  .filter($"country".equalTo("Denmark"))
  .drop("country", "details_Struct")

df2.show
+---+--------------------+
| ID|        details_Json|
+---+--------------------+
|  1|{"name":"Anne","A...|
|  4|{"name":"Mona","A...|
+---+--------------------+
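As a side note, a shorter route is a sketch using Spark's built-in `get_json_object` (available since 1.6), which skips the explicit schema entirely: the JSONPath `$.country` yields null when the key is absent, so the equality filter drops the missing-key rows along with the non-Denmark ones.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder.master("local[*]").appName("json-filter").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, """{"name":"Anne","Age":"12","country":"Denmark"}"""),
  (2, """{"name":"Zen","Age":"24"}"""),
  (3, """{"name":"Fred","Age":"20","country":"France"}"""),
  (4, """{"name":"Mona","Age":"18","country":"Denmark"}""")
).toDF("ID", "details_Json")

// get_json_object returns null when the path does not exist,
// so rows without a "country" key are filtered out as well
val df3 = df.filter(get_json_object($"details_Json", "$.country") === "Denmark")
df3.show(false)
```

This keeps the original `details_Json` column untouched, so no `drop` is needed afterwards.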

The answer above was written against Apache Spark 2.3.1. Which version are you using? Starting with 2.4, there is a `schema_of_json` function that infers the schema automatically. You may want to check it out as well: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@schema_of_json(json:String):org.apache.spark.sql.Column
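For reference, here is a minimal sketch of how `schema_of_json` could be combined with `from_json` on Spark 2.4+. The JSON literal passed to `schema_of_json` is an assumption: it must be a representative row that contains all the keys you want in the inferred schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

val spark = SparkSession.builder.master("local[*]").appName("schema-of-json").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, """{"name":"Anne","Age":"12","country":"Denmark"}"""),
  (2, """{"name":"Zen","Age":"24"}"""),
  (3, """{"name":"Fred","Age":"20","country":"France"}"""),
  (4, """{"name":"Mona","Age":"18","country":"Denmark"}""")
).toDF("ID", "details_Json")

// Infer the struct type from one representative row (an assumed sample),
// then parse every row with that inferred schema
val schema = schema_of_json(lit("""{"name":"Anne","Age":"12","country":"Denmark"}"""))
val df4 = df
  .withColumn("details_Struct", from_json($"details_Json", schema))
  .filter($"details_Struct".getField("country") === "Denmark")
  .drop("details_Struct")
df4.show(false)
```

Rows whose JSON lacks the `country` key parse that field as null, so they fail the equality filter just as in the explicit-schema version.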

