scala - 如何在不知道 spark scala 架构的情况下动态解析数据框中的 json 列
问题描述
给定一个 spark 数据框,该数据框有一列可能包含也可能不包含嵌套的 json 。这个嵌套的 json 是动态的。最终要求是打破 json 并为嵌套 json 中存在的每个键生成一个新的数据框。
json 是动态的,因此生成的表将是动态的。另请考虑数据框包含超过 1 亿条记录。
例如-
输入
------------------------------------------------------------------------
|id |key |type |value
|f9f |BUSI |off |false
|f96 |NAME |50 |true
|f9z |BANK |off |{"Name":"United School","admNumber":"197108","details":{"code":"WEREFFW32","studentName":"Abhishek kumar","doc":"certificate","admId":"3424325328","stat":0,"studentDetails":false} }|
输出:-
--------------------------------------------------------------------------------------------------------------------------
|id |key |type |value |Name | admNumber |code | studentName | doc |admId |stat | studentDetails
+------------------------------------+-----------------+-------------+----------------------------------------------------
|f9f |BUSI |off |false |NULL |NULL |NULL |NULL |NULL |NULL |NULL |NULL |
|f96 |NAME |50 |true |NULL |NULL |NULL |NULL |NULL |NULL |NULL |NULL |
|f9z |BANK |off |NULL |United School |197108 |WEREFFW32 |Abhishek kumar |certificate |3424325328 |0 |false |
解决方案
val data = Seq(
(77, "email1", """{"key1":38,"key3":39}"""),
(78, "email2", """{"key1":38,"key4":39}"""),
(178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }"""),
(179, "email8", """{"sub1":"qwerty","sub2":["42"]}"""),
(180, "email8", """{"sub1":"qwerty","sub2":["42", "56", "test"]}""")
).toDF("id", "name", "colJson")
data.show(false)
// +---+-------+---------------------------------------------------------------+
// |id |name |colJson |
// +---+-------+---------------------------------------------------------------+
// |77 |email1 |{"key1":38,"key3":39} |
// |78 |email2 |{"key1":38,"key4":39} |
// |178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|
// |178|email8 |{"sub1":"qwerty","sub2":"42"} |
// +---+-------+---------------------------------------------------------------+
val schema = spark.read.json(data.select("colJson").as[String]).schema
val res = data.select($"id", $"name", from_json($"colJson", schema).as("s")).select("id", "name", "s.*")
res.show(false)
// +---+-------+-----------+-----+----+----+----+------+----+
// |id |name |key1 |key10|key3|key4|key6|sub1 |sub2|
// +---+-------+-----------+-----+----+----+----+------+----+
// |77 |email1 |38 |null |39 |null|null|null |null|
// |78 |email2 |38 |null |null|39 |null|null |null|
// |178|email21|when string|false|null|36 |test|null |null|
// |178|email8 |null |null |null|null|null|qwerty|42 |
// +---+-------+-----------+-----+----+----+----+------+----+
val df1 = res.filter('sub1.equalTo("qwerty"))
df1.show(false)
// +---+------+----+-----+----+----+----+------+----+
// |id |name |key1|key10|key3|key4|key6|sub1 |sub2|
// +---+------+----+-----+----+----+----+------+----+
// |178|email8|null|null |null|null|null|qwerty|42 |
// +---+------+----+-----+----+----+----+------+----+
推荐阅读
- node.js - 强制从私有注册表下载 npm
- ruby-on-rails - Rails API ActiveStorage:获取公共 URL 以显示来自 AWS S3 存储桶的图像?
- java - 检查数据库数据
- c++ - C++ - 我应该使用 Dijkstra 的哪些数据结构?
- r - 如何为不同面板中的不同分组变量在 lattice 中的 xyplot 中获得阴影背景
- java - Java Swing:“java.awt.AWTEventMulticaster.mouseMoved”错误
- linux - Linux libc++ seems to be broken
- sql-server - 在 WHERE 子句中获取今天的日期
- powershell - 带有文本“$__VAR__”的 PowerShell 转义字符串
- url - 如何使用 JasperSoft Studio 创建超链接作为静态文本的一部分?