
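Problem description

I am trying to read an XML file into a DataFrame using Spark.

I am following this guide on GitHub.

For some reason, the column for the id attribute is null in every row.

I am testing my code on this XML file.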

%pyspark
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

AWS_ACCESS_KEY_ID = "*********************"
AWS_SECRET_ACCESS_KEY = "*************************"

sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)

sqlContext = SQLContext(sc)
# Schema for one <book> row; the id attribute is expected in the "_id" column.
customSchema = StructType([
    StructField("_id", StringType(), True),
    StructField("author", StringType(), True),
    # StructField("description", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("publish_date", StringType(), True),
    StructField("title", StringType(), True),
])

# Read each <book> element as one row, applying the schema above.
df = sqlContext.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='book') \
    .load('s3n://######/###/######/books.xml', schema=customSchema)

df.show()

+----+--------------------+---------------+-----+------------+--------------------+
| _id|              author|          genre|price|publish_date|               title|
+----+--------------------+---------------+-----+------------+--------------------+
|null|Gambardella, Matthew|       Computer|44.95|  2000-10-01|XML Developer's G...|
|null|          Ralls, Kim|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
|null|         Corets, Eva|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
|null|    Randall, Cynthia|        Romance| 4.95|  2000-09-02|         Lover Birds|
|null|      Thurman, Paula|        Romance| 4.95|  2000-11-02|       Splish Splash|
|null|       Knorr, Stefan|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
|null|        Kress, Peter|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
|null|         Galos, Mike|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
+----+--------------------+---------------+-----+------------+--------------------+

Here is part of the XML file:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>
         An in-depth look at creating applications
         with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
         and query XML data in the database.
       </description>
   </book>

</catalog>

Tags: xml, apache-spark, dataframe, pyspark

Solution
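One possible cause, stated here as an assumption rather than a confirmed diagnosis: the parser may name attribute columns with a different prefix than the "_" used in the schema, in which case "_id" matches nothing and is filled with null while the element columns still load. The sketch below is plain Python using only the standard library. It is not spark-xml's implementation; flatten_row, project, and the "@" prefix are hypothetical names and values chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample document from the question.
SAMPLE = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <price>44.95</price>
   </book>
</catalog>"""


def flatten_row(elem, attribute_prefix):
    """Turn one row element into a dict.

    Attributes are stored under prefix + name; child elements keep their tag.
    """
    row = {attribute_prefix + name: value for name, value in elem.attrib.items()}
    for child in elem:
        row[child.tag] = (child.text or "").strip()
    return row


def project(row, schema_fields):
    """Keep only the schema's fields; names absent from the row become None."""
    return {field: row.get(field) for field in schema_fields}


schema_fields = ["_id", "author", "title", "price"]
book = ET.fromstring(SAMPLE).find("book")

# Assumed mismatch: the parser prefixes attributes with "@", the schema asks for "_id".
mismatched = project(flatten_row(book, "@"), schema_fields)
# Prefix and schema agree.
matched = project(flatten_row(book, "_"), schema_fields)

print(mismatched["_id"])  # None
print(matched["_id"])     # bk101
```

In spark-xml the corresponding reader option is attributePrefix. Passing it explicitly, for example .options(rowTag='book', attributePrefix='_'), keeps the parser and the schema in agreement regardless of the library's default. Whether this resolves the nulls in the question has not been verified here.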

