apache-spark - 在运行时选择所有列 spark sql,没有预定义的模式
问题描述
我有一个数据框,其值的格式为
|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234 |cM-type |883838 |8838832 |
|1235 |cM-type |883838 |8838832 |
|1236 |cM-type |883838 |8838832 |
|1237 |CNN-type |883838 |8838832 |
|1238 |cM-type |883838 |8838832 |
+----------+------------+--------+--------+
我想将数据框转换成这种格式
+----------+----------------------------------------------------------------------------------------+
|resourceId|value |
+----------+----------------------------------------------------------------------------------------+
|1234 |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235 |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236 |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237 |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238 |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+
我知道我可以通过像这样手动提供字段来获得所需的输出
val jsonformated=df.select($"resourceId",to_json(struct($"resourceId", $"resourceType", $"seasonId",$"seriesId")).alias("value"))
但是,我正在尝试将列值传递给 struct programmatic,使用
val cols = df.columns.toSeq
val jsonformatted=df.select($"resourceId",to_json(struct("colval",cols)).alias("value"))
结构函数不接受序列的某种原因,从 api 看起来有一个方法签名来接受序列,
struct(String colName, scala.collection.Seq<String> colNames)
有没有更好的解决方案来解决这个问题。
更新:
正如答案指出了获取输出的确切语法
val colsList = df.columns.toList
val column: List[Column] = colsList.map(dftrim(_))
val jsonformatted=df.select($"resourceId",to_json(struct(column:_*)).alias("value"))
解决方案
struct
需要一个序列。你只是在看一个错误的变体。利用
def struct(cols: Column*): Column
如
import org.apache.spark.sql.functions._
val cols: Seq[String] = ???
struct(cols map col: _*)
推荐阅读
- magento2 - After installing Admin Dashboard stuck at loader
- ssh - ssh weak cipher removal... just benefits or drawbacks as well?
- c# - C# class string used on a another Form
- python - Django-taggit - 如何过滤所有标记的对象,为每个标记重复它们?
- python - Using sklearn with complex values
- python - Matplotlib 在 for 循环中显示空图
- python - Why can't I install these specific requirements?
- javascript - Javascript map() forEach() 方法不能连续工作
- javascript - 关闭一个弹出窗口,打开另一个
- python - “没有名为 google.auth.transport.grpc 的模块”,即使 google-auth 已使用 pip 安装