pyspark - 如何在pyspark中为多个不同的列转置数据
解决方案
用于stack
转置如下 ( spark>=2.4
)-
加载测试数据
val data =
"""
|PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
| 1 | xyz | MS | abc | Phd | pqr | BS
| 2 | POR | MS | ABC | Phd | null | null
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df1 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |1 |xyz |MS |abc |Phd |pqr |BS |
* |2 |POR |MS |ABC |Phd |null |null |
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
*
* root
* |-- PersonId: integer (nullable = true)
* |-- Education1CollegeName: string (nullable = true)
* |-- Education1Degree: string (nullable = true)
* |-- Education2CollegeName: string (nullable = true)
* |-- Education2Degree: string (nullable = true)
* |-- Education3CollegeName: string (nullable = true)
* |-- Education3Degree: string (nullable = true)
*/
使用堆栈取消透视表
df1.selectExpr("PersonId",
"stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
"Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
.where("CollegeName is not null and EducationDegree is not null")
.show(false)
/**
* +--------+-----------+---------------+
* |PersonId|CollegeName|EducationDegree|
* +--------+-----------+---------------+
* |1 |xyz |MS |
* |1 |abc |Phd |
* |1 |pqr |BS |
* |2 |POR |MS |
* |2 |ABC |Phd |
* +--------+-----------+---------------+
*/
推荐阅读
- java - 无法为事务打开 JPA EntityManager(使用 LocalContainerEntityManagerFactoryBean)
- excel - 具有可变固定单元格的 R1C1 公式格式
- java - 使用 apache poi 从 pptx 获取幻灯片 id
- javascript - 如何将开槽输入字段的表单属性设置为 shadow dom 内的表单?
- dart - 如何检查 Sqflite DB 是否包含特定的 id?-扑
- macos - 不同SVN GUI的兼容性
- sql - 来自同一个 SQL Server 表的水平和垂直输出
- objective-c - 如何在 Swift 中更改覆盖函数返回类型
- qlikview - 设置具有多个排除项的分析表达式不同计数
- python - Netmiko FileTransfer 与 python