apache-spark - Modifying the structure of a Spark DataFrame
Question
Suppose I have a Spark DataFrame with the following columns:
| header1 | location | precision | header2 | velocity | data |
(The df also contains some data.)
Now I want to convert the df to a new structure with 2 columns, each holding a complex (struct) field, something like:
| gps | velocity |
| header1 | location | precision | header2 | velocity | data |
Ideally I could call a method like this:
df1 = createStructure(df, "gps", ["header1", "location", "precision"])
df2 = createStructure(df1, "velocity", ["header2", "velocity", "data"])
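I have been trying withColumn, but without luck.
Answer
Try the following: build each nested column with struct from org.apache.spark.sql.functions, then select only the two new columns.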
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df1 = Seq(("h1-4", "loc4", "prec4", "h2-4", "vel4", "d4"), ("h1-5", "loc5", "prec5", "h2-5", "vel5", "d5")).toDF("header1", "location", "precision", "header2", "velocity", "data")
df1: org.apache.spark.sql.DataFrame = [header1: string, location: string ... 4 more fields]
scala> df1.show(false)
+-------+--------+---------+-------+--------+----+
|header1|location|precision|header2|velocity|data|
+-------+--------+---------+-------+--------+----+
|h1-4   |loc4    |prec4    |h2-4   |vel4    |d4  |
|h1-5   |loc5    |prec5    |h2-5   |vel5    |d5  |
+-------+--------+---------+-------+--------+----+
scala> val outputDF = df1.withColumn("gps", struct($"header1", $"location", $"precision")).withColumn("velocity", struct($"header2", $"velocity", $"data")).select("gps", "velocity")
outputDF: org.apache.spark.sql.DataFrame = [gps: struct<header1: string, location: string ... 1 more field>, velocity: struct<header2: string, velocity: string ... 1 more field>]
scala> outputDF.printSchema
root
 |-- gps: struct (nullable = false)
 |    |-- header1: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- precision: string (nullable = true)
 |-- velocity: struct (nullable = false)
 |    |-- header2: string (nullable = true)
 |    |-- velocity: string (nullable = true)
 |    |-- data: string (nullable = true)
scala> outputDF.show(false)
+-------------------+----------------+
|gps                |velocity        |
+-------------------+----------------+
|[h1-4, loc4, prec4]|[h2-4, vel4, d4]|
|[h1-5, loc5, prec5]|[h2-5, vel5, d5]|
+-------------------+----------------+
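The question asks for a reusable createStructure method. The answer above does not define one, so the following is only a minimal sketch of how the same struct approach might be wrapped in a helper. The name createStructure and its parameters come from the question's pseudocode; the implementation is an assumption, not part of the original answer. It is meant to be pasted into spark-shell (for example with :paste), and it assumes column names without dots, since col would treat a dot as nested-field access.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("createStructure-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper (name taken from the question): packs `fields` into a
// single struct column called `name` and drops the original top-level columns.
def createStructure(df: DataFrame, name: String, fields: Seq[String]): DataFrame = {
  val packed: Column = struct(fields.map(col): _*).as(name)
  // Keep every column that is not being packed, then append the new struct.
  val untouched = df.columns.filterNot(fields.contains).map(col)
  df.select(untouched :+ packed: _*)
}

// Same sample shape as in the answer, reduced to one row.
val df = Seq(("h1-4", "loc4", "prec4", "h2-4", "vel4", "d4"))
  .toDF("header1", "location", "precision", "header2", "velocity", "data")

val df1 = createStructure(df, "gps", Seq("header1", "location", "precision"))
val df2 = createStructure(df1, "velocity", Seq("header2", "velocity", "data"))

df2.printSchema()
```

Because the helper uses select rather than withColumn, the packed source columns disappear from the top level in the same step, so no trailing select("gps", "velocity") is needed. It does not guard against `name` colliding with a column that is left untouched.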