scala - 展平并从 Spark 中的 Struct 类型数据框列中读取值
问题描述
我有一个这样的镶木地板格式数据集:
parquetFile.toDF().registerTempTable("tempTable")
val PDataFrame = sqlContext.sql("SELECT * FROM tempTable")
PDataFrame.show()
+--------------------+--------------------+-------------------+-----+--------+-------------------+--------------------+
| _id| VehicleDetailId| PlanID| Type| SubType| CreatedOn| Date|
+--------------------+--------------------+-------------------+-----+--------+-------------------+--------------------+
|[($oid,5cc8e1a72f...|[($numberLong,219...|[($numberLong,164)]|Quote|Response|5/1/2019 5:30:39 AM|[($date,155666883...|
|[($oid,5cc8e1a72f...|[($numberLong,219...|[($numberLong,168)]|Quote|Response|5/1/2019 5:30:39 AM|[($date,155666883...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,102)]| IDV| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,105)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,112)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,134)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,114)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,115)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,113)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,185)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,108)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,149)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,135)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,167)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,116)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,156)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,125)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,102)]| IDV|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,144)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,171)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
+--------------------+--------------------+-------------------+-----+--------+--------------------+-------------------+--------------------+
only showing top 20 rows
该数据集的架构是:
PDataFrame.printSchema()
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- VehicleDetailId: struct (nullable = true)
| |-- $numberLong: string (nullable = true)
|-- PlanID: struct (nullable = true)
| |-- $numberLong: string (nullable = true)
|-- Type: string (nullable = true)
|-- SubType: string (nullable = true)
|-- CreatedOn: string (nullable = true)
|-- Date: struct (nullable = true)
| |-- $date: string (nullable = true)
我正在尝试使用 Scala 编写 SparkSQL 代码,以PlanID
在 where 子句中按值读取数据。这就是为什么我想使用 SparkSQL 的 SQL 查询。
这是我预期的输出结构(10 行的示例视图)
+-----------------------+--------------------+-------+-----+--------+-------------------+--------+
| _id| VehicleDetailId| PlanID| Type| SubType| CreatedOn| Date|
+-----------------------+--------------------+-------+-----+--------+-------------------+--------+
5ae7ae00b07ccf35c020e5ba|10220998|135|Quote|Response|5/1/2018 5:30:00 AM|1525132800096
5ae7ae00b07ccf35c020e5bb|10220998|134|Quote|Response|5/1/2018 5:30:00 AM|1525132800139
5ae7ae00b07ccf35c020e5bc|10220998|104|Quote|Response|5/1/2018 5:30:00 AM|1525132800516
5ae7ae00b07ccf35c020e5bd|10220998|104|Quote|Response|5/1/2018 5:30:00 AM|1525132800519
5ae7ae00b07ccf35c020e5be|10220998|101|Quote|Response|5/1/2018 5:30:00 AM|1525132800539
5ae7ae00b07ccf35c020e5bf|10220998|103|IDV|Request|5/1/2018 5:30:00 AM|1525132800546
5ae7ae00b07ccf35c020e5c0|10220998|105|Quote|Response|5/1/2018 5:30:00 AM|1525132800577
5ae7ae00b07ccf35c020e5c1|10220998|103|IDV|Request|5/1/2018 5:30:00 AM|1525132800581
5ae7ae00b07ccf35c020e5c2|10220998|103|IDV|Response|5/1/2018 5:30:00 AM|1525132800702
5ae7ae00b07ccf35c020e5c3|10220998|128|Quote|Response|5/1/2018 5:30:00 AM|1525132800709
现在,我尝试了各种方法来获得预期的输出,例如:
PDataFrame.withColumn("first", $"PlanID.$$numberLong").show
或者
sqlContext.sql(s""" select _id["$$oid"] as col1, PlanID["$numberLong"] as col2 from tempTable """)
不幸的是,我无法达到预期的输出。任何帮助将不胜感激。
解决方案
从您的 DataFrame 架构中,
|-- PlanID: struct (nullable = true)
| |-- $numberLong: string (nullable = true)
$"PlanID.$$numberLong"
is的值,($numberLong,164)
这是一个字符串。所以你必须拆分并选择你想要的。
PDataFrame.withColumn("first", split($"PlanID.$$numberLong", ",")(1)).show
推荐阅读
- c# - 将html标签之间的文本和html标签之间的非文本获取到List中
- awk - 如果使用 sed 或匹配长度,如何移动列
- python - 合并熊猫中的两个字符串列表
- c# - 在 JsonConverter.ReadJson() 中有效地获取完整的 json 字符串
- ansible - Ansible - 比较 2 个文件列表的校验和
- javascript - 在jquery中搜索日期?
- c++ - VC++等效于LLVM`复制类型'std :: atomic的成员子对象
'调用已删除的构造函数` - node.js - app.scss 如何知道 ~foo 是 node_modules 中的 foo 目录?
- python - AssertionError: 图像必须是 rgb 224x224 ((224, 224, 4))
- php - PHP 7.3 strpos() 针问题?