How to unpivot a DataFrame in Spark?

Problem description

I have a DataFrame with the following schema:

subjectID, feature001, feature002, feature003, ..., feature299

Suppose my DataFrame looks like this:

123,0.23,0.54,0.35,...,0.26   
234,0.17,0.49,0.47,...,0.69

Now, what I want is:

subjectID, featureID, featureValue

so that the DataFrame above becomes:

123,001,0.23
123,002,0.54
123,003,0.35
...
123,299,0.26
234,001,0.17
234,002,0.49
234,003,0.47
...
234,299,0.69

If I only had a few columns, I would know how to do it:

val newDF = df.select($"subjectID", expr("stack(3, '001', feature001, '002', feature002, '003', feature003) as (featureID, featureValue)"))

However, I am looking for a way to handle all 300 columns.

Tags: scala, apache-spark

Solution

You can build an array of struct columns from the feature columns, then use explode to turn them into rows:

import org.apache.spark.sql.functions.{explode, struct, lit, array, col}
import spark.implicits._ // for the $"..." column syntax (spark is your SparkSession)

// build an array of struct expressions from the feature columns
val columnExprs = df.columns
   .filter(_.startsWith("feature"))
   .map(name => struct(lit(name.replace("feature","")) as "id", col(name) as "value"))

// unpivot the DataFrame
val newDF = df.select($"subjectID", explode(array(columnExprs:_*)) as "feature")
              .select($"subjectID", 
                      $"feature.id" as "featureID", 
                      $"feature.value" as "featureValue") 

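Alternatively, if you prefer the stack approach from the question, the SQL string can be generated from the column names instead of being written out by hand. A sketch under the same assumptions (featureCols and stackExpr are names introduced here for illustration):

import org.apache.spark.sql.functions.expr

// builds "stack(299, '001', feature001, '002', feature002, ...) as (featureID, featureValue)"
val featureCols = df.columns.filter(_.startsWith("feature"))
val stackExpr = featureCols
  .map(c => s"'${c.stripPrefix("feature")}', $c")
  .mkString(s"stack(${featureCols.length}, ", ", ", ") as (featureID, featureValue)")

val unpivoted = df.select($"subjectID", expr(stackExpr))

On Spark 3.4 and later there is also a built-in Dataset.unpivot (alias melt) that covers this use case directly.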