scala - 将 XML 列解析为多列并根据 Spark DataFrame 中的计数转置为行
问题描述
我有一个场景,其中 XML 列response_output
具有相应的订单详细信息ordercount
。orders
例如 xml 如下所示,计数OrderCount
为 4 ,下面 orders
我们有 4 个order
详细信息
<USR_ORD><OrderResponse><OrderResult>
<OrderCount>4</OrderCount>
<ORDTime>2021-02-02 21:13:12</ORDTime><ORDStatus>COMPLETE</ORDStatus>
<ORDValue>
<USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc>
<orders>
<order><name>MR RITA SOMA</name><address>606 JAL TXS</address><tracknumber>7825225</tracknumber><status>UNK</status></order>
<order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825226</tracknumber><status>FAIL</status></order>
<order><name>MR RODREX SOMA</name><address>18, GHC,BAN</address><tracknumber>7825224</tracknumber><status>SUC</status></order>
<order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825223</tracknumber><status>SUC</status></order>
</orders>
<USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD>
</ORDValue>
</OrderResult></OrderResponse></USR_ORD>
我需要根据 检索记录ordercount
,如果ordercount
是 4,那么我需要迭代 4 次orders
并获取 4 个包含所有order
详细信息的记录,如果ordercount
是 1,那么我需要order
分别获取 1 个包含详细信息的记录。
任何人都可以用 spark2,scala 解决方案帮助我吗?
源数据:
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output |
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |1 |<USR_ORD><OrderResponse><OrderResult><OrderCount>1</OrderCount><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><orders><order><name>MRS MITA PERS</name><address>17 MAXI RD CHN</address><tracknumber>7825222</tracknumber><status>FAIL</status><amount>4500</amount><orderdate>2019-10-18</orderdate></order></orders><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD> |
|200 |1 |<USR_ORD><OrderResponse><OrderResult><OrderCount>4</OrderCount><ORDTime>2021-02-02 21:13:12</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><orders><order><name>MR RITA SOMA</name><address>606 JAL TXS</address><tracknumber>7825225</tracknumber><status>UNK</status><amount>1030</amount><orderdate>2020-11-16</orderdate></order><order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825226</tracknumber><status>FAIL</status><amount>8000</amount><orderdate>2018-07-17</orderdate></order><order><name>MR RODREX SOMA</name><address>18, GHC, BAN</address><tracknumber>7825224</tracknumber><status>SUC</status><amount>2500</amount><orderdate>2017-09-16</orderdate></order><order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825223</tracknumber><status>SUC</status><amount>2700</amount><orderdate>2017-04-22</orderdate></order></orders><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
当我尝试运行以下 sql 时,我得到如下结果,但我需要为 customer_id 200 获取 4 条记录,因为计数为 4,并带有相应的 oder 详细信息。
spark.sql("""select
| customer_id,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/OrderCount') as OrderCount,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name') as name,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address') as address,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber') as tracknumber,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status') as status
| from cust_tbl""").show()
我得到的结果:
+-----------+----------+-------------+--------------+-----------+------+
|customer_id|OrderCount| name| address|tracknumber|status|
+-----------+----------+-------------+--------------+-----------+------+
| 100| 1|MRS MITA PERS|17 MAXI RD CHN| 7825222| FAIL|
| 200| 4| MR RITA SOMA| 606 JAL TXS| 7825225| UNK|
+-----------+----------+-------------+--------------+-----------+------+
期望输出:
+-----------+----------+------------+-----------+-----------+------+
|customer_id|OrderCount|name |address |tracknumber|status|
+-----------+----------+------------+-----------+-----------+------+
|200 |4 |MRRITASOMA |606JALTXS |7825225 |UNK |
|200 |4 |MRRITASOMA |1BAL HAL |7825226 |FAIL |
|200 |4 |MRRODREXSOMA|18 GHC BAN |7825224 |SUC |
|200 |4 |MRRITASOMA |1 BAL HAL |7825223 |SUC |
|100 |1 |MRSMITAPERS |17MAXIRDCHN|7825222 |FAIL |
+-----------+----------+------------+-----------+-----------+------+
解决方案
该函数xpath_string
为给定的 XPath 表达式提取一个字符串值。对于您的情况,您需要使用xpath
获取每个订单详细信息(、、...)的节点值数组,name
并使用以下命令status
将它们全部压缩在一起arrays_zip
:
val df1 = df.withColumn(
"OrderCount",
expr("xpath_string(response_output, 'USR_ORD/OrderResponse/OrderResult/OrderCount')")
).withColumn(
"orders",
explode(
arrays_zip(
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name/text()')"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address/text()')"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber/text()')"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status/text()')")
).cast("array<struct<name:string,address:string,tracknumber:string,status:string>>")
)
).select("customer_id", "OrderCount", "orders.*")
df1.show(false)
//+-----------+----------+--------------+--------------+-----------+------+
//|customer_id|OrderCount|name |address |tracknumber|status|
//+-----------+----------+--------------+--------------+-----------+------+
//|100 |1 |MRS MITA PERS |17 MAXI RD CHN|7825222 |FAIL |
//|200 |4 |MR RITA SOMA |606 JAL TXS |7825225 |UNK |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825226 |FAIL |
//|200 |4 |MR RODREX SOMA|18, GHC, BAN |7825224 |SUC |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825223 |SUC |
//+-----------+----------+--------------+--------------+-----------+------+
更新
对于Spark < 2.4,您可以在posexplode
每个数组列上加入索引:
val df1 = df.withColumn(
"OrderCount",
expr("xpath_string(response_output, 'USR_ORD/OrderResponse/OrderResult/OrderCount')")
).select(
col("customer_id"),
col("OrderCount"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name/text()')").as("name"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address/text()')").as("address"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber/text()')").as("tracknumber"),
expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status/text()')").as("status")
)
val result = df1.selectExpr("customer_id", "OrderCount", "posexplode(name) as (idx, name)")
.join(
df1.selectExpr("customer_id", "posexplode(address) as (idx, address)"),
Seq("idx", "customer_id")
).join(
df1.selectExpr("customer_id","posexplode(tracknumber) as (idx, tracknumber)"),
Seq("idx", "customer_id")
).join(
df1.selectExpr("customer_id", "posexplode(status) as (idx, status)"),
Seq("idx", "customer_id")
).drop("idx")
result.show(false)
//+-----------+----------+--------------+--------------+-----------+------+
//|customer_id|OrderCount|name |address |tracknumber|status|
//+-----------+----------+--------------+--------------+-----------+------+
//|100 |1 |MRS MITA PERS |17 MAXI RD CHN|7825222 |FAIL |
//|200 |4 |MR RITA SOMA |606 JAL TXS |7825225 |UNK |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825226 |FAIL |
//|200 |4 |MR RODREX SOMA|18, GHC, BAN |7825224 |SUC |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825223 |SUC |
//+-----------+----------+--------------+--------------+-----------+------+
推荐阅读
- php - if(!isset($_GET['id']) 具有相同变量的不同选项
- android - retrieve data from firebase returning null
- azure-iot-central - Azure IOT Central 设备的示例 C 代码
- recursion - 如何找到这个递归函数的时间复杂度?
- fadein - Scilab 上的淡入淡出功能
- delphi - CEF4Delphi 单元测试不
- html - 为什么我的导航栏将我的整体内容推到一边?
- python - 创建一个以变量为名称的文件目录
- python - SQLAlchemy 接受不同的数据类型、空字符串、更大长度的字符串等所有内容
- angular - 使用 ionic 2/3 面临无限滚动的问题