Schema issue with XML files in PySpark

Problem description

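I am new to creating schemas for XML. Until now I have used XSD to parse XML data.

I am trying to use the Spark read format method, but I do not see the seller ID in the resulting schema. Is there a way to get both the seller ID and the trade ID into my data?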

df_trade_loan = spark.read.format("com.databricks.spark.xml").option("rowTag","trade").option("rootTag","loan").load("dbfs:/FileStore/shared_uploads/trades/*")

My XML file looks like this.

<loan>
    <seller>
        <id>11</id>
    </seller>
    <trade id="67" type="Standard">
        <advance>
            <date>2011-03-09</date>
            <amount>16466.76</amount>
            <amount_gbp>16466.76</amount_gbp>
            <percentage>90.0</percentage>
        </advance>
        <discount>
            <percentage>1.0</percentage>
            <on>Facevalue</on>
        </discount>
        <expected_payment_date>2011-03-18 00:00:00 +0000</expected_payment_date>
        <settlement_date>2011-03-25</settlement_date>
        <arrears>
            <in_arrears>No</in_arrears>
            <in_arrears_on_date>nan</in_arrears_on_date>
        </arrears>
        <payment>
            <state>Paid</state>
        </payment>
        <price_grade>6</price_grade>
        <currency>GBP</currency>
        <face_value>
            <amount>18296.4</amount>
            <amount_gbp>18296.4</amount_gbp>
        </face_value>
        <outstanding_principal>
            <amount>0.0</amount>
            <amount_gbp>0.0</amount_gbp>
        </outstanding_principal>
        <crystalised_loss>
            <amount>nan</amount>
            <date>nan</date>
        </crystalised_loss>
        <gross_yield>
            <annualised>14.164038846995776</annualised>
        </gross_yield>
    </trade>
</loan>

The current schema looks like this:

root
 |-- _id: long (nullable = true)
 |-- _type: string (nullable = true)
 |-- advance: struct (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- amount_gbp: double (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- percentage: double (nullable = true)
 |-- arrears: struct (nullable = true)
 |    |-- in_arrears: string (nullable = true)
 |    |-- in_arrears_on_date: string (nullable = true)
 |-- crystalised_loss: struct (nullable = true)
 |    |-- amount: string (nullable = true)
 |    |-- date: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- discount: struct (nullable = true)
 |    |-- on: string (nullable = true)
 |    |-- percentage: double (nullable = true)
 |-- expected_payment_date: string (nullable = true)
 |-- face_value: struct (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- amount_gbp: double (nullable = true)
 |-- gross_yield: struct (nullable = true)
 |    |-- annualised: double (nullable = true)
 |-- outstanding_principal: struct (nullable = true)
 |    |-- amount: double (nullable = true)
 |    |-- amount_gbp: double (nullable = true)
 |-- payment: struct (nullable = true)
 |    |-- state: string (nullable = true)
 |-- price_grade: long (nullable = true)
 |-- settlement_date: string (nullable = true)

Tags: xml, apache-spark, parsing, pyspark, schema

Solution


df_trade_seller = spark.read.format("com.databricks.spark.xml").option("rowTag","loan").option("rootTag","seller").load("adl://haaldatalake.azuredatalakestore.net/use_cases/recommendation/tempsubas/tempsubas/trades/*")

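This code works. With rowTag set to "loan", each <loan> element becomes one row, so seller and trade both appear in the schema as nested columns. With rowTag set to "trade", only the contents of <trade> are read, and <seller> is left out because it is a sibling of <trade>, not a child of it.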
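
The effect of the row tag can be reproduced without Spark. The following is a minimal sketch in plain Python using xml.etree.ElementTree, not spark-xml itself. The helper names rows_for_tag and element_to_row are hypothetical, the sample is a trimmed copy of the XML from the question, and attributes are prefixed with "_" only to mirror the _id and _type columns in the schema above. Values stay as strings because no type inference is attempted.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question.
SAMPLE = """
<loan>
    <seller>
        <id>11</id>
    </seller>
    <trade id="67" type="Standard">
        <currency>GBP</currency>
        <price_grade>6</price_grade>
    </trade>
</loan>
""".strip()


def element_to_row(elem):
    """Convert an element to a nested dict; leaf elements become their text."""
    if len(elem) == 0 and not elem.attrib:
        return (elem.text or "").strip()
    # Attributes get a "_" prefix, similar to the _id/_type columns above.
    row = {"_" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        # Simplification: a repeated child tag would overwrite the earlier one.
        row[child.tag] = element_to_row(child)
    return row


def rows_for_tag(xml_text, row_tag):
    """Return one row per <row_tag> element, keeping only what is inside it."""
    root = ET.fromstring(xml_text)
    return [element_to_row(elem) for elem in root.iter(row_tag)]


trade_rows = rows_for_tag(SAMPLE, "trade")
loan_rows = rows_for_tag(SAMPLE, "loan")

print("seller" in trade_rows[0])      # seller is outside <trade>
print(loan_rows[0]["seller"]["id"])   # seller id, reachable from the loan row
print(loan_rows[0]["trade"]["_id"])   # trade id, reachable from the same row
```

In the Spark DataFrame read with rowTag "loan", the same two values should be reachable as the nested fields seller.id and trade._id, assuming each loan contains a single trade as in the sample.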

