Selecting all columns in Spark SQL at runtime, without a predefined schema

Problem description

I have a DataFrame whose values are in the format

|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234      |cM-type     |883838  |8838832 |
|1235      |cM-type     |883838  |8838832 |
|1236      |cM-type     |883838  |8838832 |
|1237      |CNN-type    |883838  |8838832 |
|1238      |cM-type     |883838  |8838832 |
+----------+------------+--------+--------+

I want to transform the DataFrame into this format:

+----------+----------------------------------------------------------------------------------------+
|resourceId|value                                                                                   |
+----------+----------------------------------------------------------------------------------------+
|1234      |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235      |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236      |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237      |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238      |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+

I know I can get the desired output by supplying the fields manually, like this:

val jsonformated=df.select($"resourceId",to_json(struct($"resourceId", $"resourceType", $"seasonId",$"seriesId")).alias("value"))

However, I am trying to pass the column names to struct programmatically, using

val cols = df.columns.toSeq
val jsonformatted=df.select($"resourceId",to_json(struct("colval",cols)).alias("value"))

For some reason the struct function does not accept the sequence, even though the API appears to have a method signature that accepts one:

struct(String colName, scala.collection.Seq<String> colNames)

Is there a better way to solve this problem?

Update:

As the answer pointed out, the exact syntax to get the output is:

val colsList = df.columns.toList
val column: List[Column] = colsList.map(col(_))
val jsonformatted = df.select($"resourceId", to_json(struct(column: _*)).alias("value"))

Tags: apache-spark, apache-spark-sql

Solution


struct does take a sequence. You're just looking at the wrong variant. Use

def struct(cols: Column*): Column 

import org.apache.spark.sql.functions._

val cols: Seq[String] = ???

struct(cols map col: _*)
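
Putting it all together, here is a minimal end-to-end sketch (assuming a local SparkSession and the sample data from the question; run it in spark-shell or a project with the spark-sql dependency):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("to-json-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("1234", "cM-type", "883838", "8838832"),
  ("1237", "CNN-type", "883838", "8838832")
).toDF("resourceId", "resourceType", "seasonId", "seriesId")

// Map each column name (String) to a Column, then splat the
// resulting Seq[Column] into the Column* varargs of struct.
val jsonformatted = df.select(
  $"resourceId",
  to_json(struct(df.columns.map(col): _*)).alias("value")
)

jsonformatted.show(false)

The key detail is `: _*`, Scala's varargs expansion: struct is declared as `struct(cols: Column*)`, so a `Seq[Column]` must be expanded element-by-element rather than passed as a single argument.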
