spark-xml: reading from S3 with an explicit schema at read time; problem with array types in XML

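Problem description

I am trying to use the spark-xml library ( https://github.com/databricks/spark-xml ) through the Scala Spark API to read a large number of XML files from S3.

Any feedback here is appreciated!

The following code example extracts only the _ProgramInfoID field: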

import org.apache.spark.sql.types._

val schema = StructType(Array(
  // BROADCAST METADATA
  StructField("BroadcastMetadata", StructType(Array(
    // PROGRAM INFO
    StructField("ProgramInfo", StructType(Array(
      StructField("_ProgramInfoID", StringType, nullable = true)
    )))
  )))
))

The following attempts to read both _ProgramInfoID and _VALUE, but I hit an error when trying to define the schema object:


val schema = StructType(
                // BROADCAST METADATA
                Array(StructField("BroadcastMetadata",StructType(
                    // PROGRAM INFO
                    Array(StructField("ProgramInfo", StructType(
                        Array(StructField("_ProgramInfoID", StringType, nullable = true))
                        )))

                    ))),
                
                // LINES
                Array(StructField("Lines", StructType(
                    // Line 
                      ArrayType(StructField("Line", StructType(
                          Array(StructField("element", StructType(
                            Array(StructField("_VALUE", StringType, nullable = true))
                                ))))))
                    )))
                                 
                                    
                
                )

Error:

<console>:45: error: type mismatch;
 found   : org.apache.spark.sql.types.StructField
 required: org.apache.spark.sql.types.DataType
                             ArrayType(StructField("Line", StructType(

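I realize this is a syntax error, but I have not been able to find good documentation on how to translate the schema shown below into one built from Spark types such as ArrayType, StructField, and StructType.

A related question involving array-type objects in XML: "Complex custom schema for xml processing in spark"

However, I was unable to solve this problem using the solution given there.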

Schema of the sample XML data

root
 |-- BroadcastMetadata: struct (nullable = true)
 |    |-- ExtendedProgramInfo: struct (nullable = true)
 |    |    |-- Schedule: struct (nullable = true)
 |    |    |    |-- AiringType: string (nullable = true)
 |    |    |    |-- PartNumber: long (nullable = true)
 |    |    |    |-- Program: struct (nullable = true)
 |    |    |    |    |-- AdditionalProgramURL: string (nullable = true)
 |    |    |    |    |-- AliasTitle: string (nullable = true)
 |    |    |    |    |-- Delta: string (nullable = true)
 |    |    |    |    |-- Descriptions: struct (nullable = true)
 |    |    |    |    |    |-- ProgramDescription: struct (nullable = true)
 |    |    |    |    |    |    |-- Delta: string (nullable = true)
 |    |    |    |    |    |    |-- _ProgramID: long (nullable = true)
 |    |    |    |    |    |    |-- _RoviRemotePath: string (nullable = true)
 |    |    |    |    |-- EpisodeNumber: string (nullable = true)
 |    |    |    |    |-- EpisodeTitle: string (nullable = true)
 |    |    |    |    |-- EventDate: string (nullable = true)
 |    |    |    |    |-- Genres: struct (nullable = true)
 |    |    |    |    |    |-- ProgramGenre: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- Delta: string (nullable = true)
 |    |    |    |    |    |    |    |-- Genre: string (nullable = true)
 |    |    |    |    |    |    |    |-- _RoviRemotePath: string (nullable = true)
 |    |    |    |    |-- Grid2Title: string (nullable = true)
 |    |    |    |    |-- GridTitle: string (nullable = true)
 |    |    |    |    |-- ProgramOriginalCountry: struct (nullable = true)
 |    |    |    |    |    |-- Delta: string (nullable = true)
 |    |    |    |    |    |-- _RoviRemotePath: string (nullable = true)
 |    |    |    |    |-- ProgramOriginalLanguage: struct (nullable = true)
 |    |    |    |    |    |-- Delta: string (nullable = true)
 |    |    |    |    |    |-- _RoviRemotePath: string (nullable = true)
 |    |    |    |    |-- RecordDateTime: string (nullable = true)
 |    |    |    |    |-- Syndicated: string (nullable = true)
 |    |    |    |    |-- TVRatings: struct (nullable = true)
 |    |    |    |    |    |-- ProgramTVRating: struct (nullable = true)
 |    |    |    |    |    |    |-- Delta: string (nullable = true)
 |    |    |    |    |    |    |-- _RoviRemotePath: string (nullable = true)
 |    |    |    |    |-- ThreeDLevel: string (nullable = true)
 |    |    |    |    |-- TitleParentID: long (nullable = true)
 |    |    |    |    |-- _RoviRemotePath: string (nullable = true)
 |    |    |    |-- ProgramID: long (nullable = true)
 |    |    |    |-- ProgramShowingType: string (nullable = true)
 |    |    |    |-- RecordDateTime: string (nullable = true)
 |    |    |    |-- _ScheduleID: long (nullable = true)
 |    |-- Market: struct (nullable = true)
 |    |    |-- Country: string (nullable = true)
 |    |    |-- _MarketName: string (nullable = true)
 |    |-- ProgramInfo: struct (nullable = true)
 |    |    |-- CC: string (nullable = true)
 |    |    |-- Category: string (nullable = true)
 |    |    |-- _ProgramInfoID: long (nullable = true)
 |    |-- Station: struct (nullable = true)
 |    |    |-- Active: long (nullable = true)
 |    |    |-- _UniqueIdentifier: string (nullable = true)
 |    |-- TranscriptUrl: string (nullable = true)
 |    |-- ViewershipData: string (nullable = true)
 |-- Lines: struct (nullable = true)
 |    |-- Line: array (nullable = true) --> SEE ARRAY TYPE HERE
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _LineDateTime: timestamp (nullable = true)
 |    |    |    |-- _StationGUID: string (nullable = true)
 |    |    |    |-- _StationID: long (nullable = true)
 |    |    |    |-- _UTCDelta: long (nullable = true)
 |    |    |    |-- _UTCLineDateTime: string (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |-- _BreakType: string (nullable = true)
 |-- _Duration: double (nullable = true)
 |-- _PageID: string (nullable = true)
 |-- _StationGUID: string (nullable = true)
 |-- _StationID: long (nullable = true)

I would appreciate any help here, thanks!

Tags: xml, scala, apache-spark, apache-spark-xml

Solution
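
The type mismatch comes from how ArrayType is used: its first argument is the element's DataType, not a StructField. The "element" line in the printSchema output is not a field either; it is only how Spark displays the element type of an array. In addition, both top-level fields (BroadcastMetadata and Lines) need to sit in a single Array passed to the outer StructType. A minimal sketch of a schema that compiles under those rules, keeping only the two fields from the question (the name lineType is introduced here for illustration):

```scala
import org.apache.spark.sql.types._

// Element type of the Line array: each <Line> element becomes one struct.
// lineType is an illustrative name, not something spark-xml requires.
val lineType = StructType(Array(
  StructField("_VALUE", StringType, nullable = true)
))

val schema = StructType(Array(
  // BROADCAST METADATA
  StructField("BroadcastMetadata", StructType(Array(
    // PROGRAM INFO
    StructField("ProgramInfo", StructType(Array(
      StructField("_ProgramInfoID", StringType, nullable = true)
    )))
  ))),
  // LINES
  StructField("Lines", StructType(Array(
    // ArrayType wraps a DataType (the element type), not a StructField,
    // and there is no field called "element".
    StructField("Line", ArrayType(lineType, containsNull = true), nullable = true)
  )))
))
```

Calling schema.printTreeString() on this value should show Line as an array whose element is a struct, which is the same shape as the Lines branch of the inferred schema in the question.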

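With an explicit schema of that shape, the read and the flattening of the Line array might look like the sketch below. It assumes spark-xml is on the classpath and that S3 access is already configured. The helper names readXml and flattenLines, the output column names, and the path and rowTag parameters are all placeholders introduced for illustration; the question does not say which element is the row tag.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: path (e.g. an s3a:// location) and rowTag are
// placeholders that must be replaced with your own values.
def readXml(spark: SparkSession, schema: StructType,
            path: String, rowTag: String): DataFrame =
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", rowTag)
    .schema(schema) // explicit schema, so Spark does not infer one from the files
    .load(path)

// Hypothetical helper: one output row per element of Lines.Line,
// keeping the program id next to the line text.
def flattenLines(df: DataFrame): DataFrame =
  df.select(
      col("BroadcastMetadata.ProgramInfo._ProgramInfoID").as("programInfoId"),
      explode(col("Lines.Line")).as("line"))
    .select(col("programInfoId"), col("line._VALUE").as("text"))
```

Note that explode drops rows whose Line array is null or empty; explode_outer keeps them with a null line if those rows matter.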
