首页 > 解决方案 > How to explode a struct column with a prefix?

问题描述

My goal is to explode (ie, take them from inside the struct and expose them as the remaining columns of the dataset) a Spark struct column (already done) but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it - therefore, I need a way to differentiate them easily. Of course, I do not know beforehand what are the columns inside my struct.

Here is what I have so far:

  implicit class Implicit(df: DataFrame) {
    def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
  }

This does the job alright - I use this writing:

  df.explodeStruct("myColumn")

It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.

As for prepending the prefix, my idea is to take the column and find out what are its inner columns. I browsed the documentation and could not find any method on the Column class that does that. Then, I changed my approach to taking the schema of the DataFrame, then filtering the result by the name of the column, and extracting the column found from the resulting array. The problem is that this element I find has the type StructField - which, again, presents no option to extract its inner field - whereas what I would really like is to get handled a StructType element - which has the .getFields method, that does exactly what I want (that is, showing me the name of the inner columns, so I can iterate over them and use them on my select, prepending the prefix I want to them). I know no way to convert a StructField to a StructType.

My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.

Any elegant solution to this problem?

标签: scalaapache-sparkstruct

解决方案


Well, after reading my own question again, I figured out an elegant solution to the problem - I just needed to select all the columns the way I was doing, and then compare it back to the original dataframe in order to figure out what were the new columns. Here is the final result - I also made this so that the exploded columns would show up in the same place as the original struct one, so not to break the flow of information:

  implicit class Implicit(df: DataFrame) {
    def explodeStruct(column: String) = {
      val prefix = column + "_"
      val originalPosition = df.columns.indexOf(column)

      val dfWithAllColumns = df.select("*", column + ".*")

      val explodedColumns = dfWithAllColumns.columns diff df.columns
      val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)

      val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)

      df.select(finalColumnsList: _*)
    }
  }

Of course, you can customize the prefix, the separator, and etc - but that is simple, anyone could tweak the parameters and such. The usage remains the same.


推荐阅读