Reading XML into a DataFrame using Spark


I am trying to read an XML file into a DataFrame using Spark.

I am working from this guide on GitHub.


I am testing my code on this XML file.

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

AWS_ACCESS_KEY_ID = "*********************"
AWS_SECRET_ACCESS_KEY = "*************************"

# `sc` is the existing SparkContext (for example, the one the pyspark shell provides).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)

sqlContext = SQLContext(sc)

# Explicit schema for each <book> row; the description field is commented out.
customSchema = StructType([
    StructField("_id", StringType(), True),
    StructField("author", StringType(), True),
    # StructField("description", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("publish_date", StringType(), True),
    StructField("title", StringType(), True)])

# Read every <book> element as one row through the spark-xml data source.
df = sqlContext.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='book') \
    .load('s3n://######/###/######/books.xml', schema=customSchema)

# Print the first rows (the table below).
df.show()


| _id|              author|          genre|price|publish_date|               title|
|null|Gambardella, Matthew|       Computer|44.95|  2000-10-01|XML Developer's G...|
|null|          Ralls, Kim|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
|null|         Corets, Eva|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
|null|    Randall, Cynthia|        Romance| 4.95|  2000-09-02|         Lover Birds|
|null|      Thurman, Paula|        Romance| 4.95|  2000-11-02|       Splish Splash|
|null|       Knorr, Stefan|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
|null|        Kress, Peter|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
|null|         Galos, Mike|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|

Here is part of the XML file:

<?xml version="1.0"?>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
         An in-depth look at creating applications
         with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
         and query XML data in the database.
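
As a sanity check outside Spark, here is a minimal sketch that parses a small, well-formed sample with Python's standard `xml.etree.ElementTree` and lists what each `book` element carries: the `id` attribute plus its child elements. The sample string is hypothetical: its values are taken from the first row of the output above, the `<catalog>` root element is an assumption, and `book_rows` is just an illustrative helper name, not part of spark-xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the first row of the output above.
# The <catalog> root element is an assumption.
SAMPLE = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
   </book>
</catalog>
"""


def book_rows(xml_text):
    """Return one dict per <book>: the id attribute plus the text of each child element."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("book"):
        # `id` is an XML attribute, not a child element.
        row = {"id": book.get("id")}
        for child in book:
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


rows = book_rows(SAMPLE)
for row in rows:
    print(row)
```

This only shows which values live in attributes and which in child elements for a file shaped like the sample; it does not reproduce how spark-xml itself maps them to columns.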


Tags: xml, apache-spark, dataframe, pyspark

