xml - Duplicate field in the inferred schema when loading XML with Spark

Problem description

I want to create a schema with this structure:

|    |-- Features: struct (nullable = true)
|    |    |-- Feature: array (nullable = true)
|    |    |    |-- element: string (containsNull = true)

This is my code:

StructField( "Features", StructType(
        Array(
          StructField( "Feature", ArrayType(
            StructType(
              Array(
                StructField( "element", StringType, true )
              )
            )
          ) )
        )
      ), true )

Result:

|    |-- Features: struct (nullable = true)
|    |    |-- Feature: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- element: string (nullable = true)

Any ideas?

Tags: xml, scala, apache-spark, dataframe

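Solution

You should omit the innermost struct. The name element is not a field you declare yourself; it is the label printSchema uses for the element type of an array, so the array should hold StringType directly: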

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(Seq(StructField("Features", StructType(Seq(
  StructField("Feature", ArrayType(StringType))
)))))

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema).printSchema
// root
//  |-- Features: struct (nullable = true)
//  |    |-- Feature: array (nullable = true)
//  |    |    |-- element: string (containsNull = true)
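
To make the difference concrete, here is a minimal sketch that compares the two array types without creating a DataFrame. It assumes only that the Spark SQL types are on the classpath (for example in spark-shell); the names nested, flat, features and featureType are illustrative and not part of the original question.

```scala
import org.apache.spark.sql.types._

// What the question built: an array whose elements are structs,
// each carrying a string field that happens to be named "element".
val nested = ArrayType(
  StructType(Seq(StructField("element", StringType, nullable = true)))
)

// What was intended: an array whose elements are plain strings.
val flat = ArrayType(StringType)

val schema = StructType(Seq(
  StructField("Features", StructType(Seq(
    StructField("Feature", flat)
  )))
))

// treeString returns the text that printSchema prints.
println(schema.treeString)

// Drill down to check the type of Features.Feature programmatically.
val features = schema("Features").dataType.asInstanceOf[StructType]
val featureType = features("Feature").dataType

println(featureType == flat)   // array of strings
println(featureType == nested) // not an array of structs
```

Because ArrayType and StructType are case classes, comparing with == is enough to tell a hand-written schema apart from the one you expect before passing it to a reader.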
