Removing XML nodes that lack a specific value using Spark

Problem description

How can I remove the nodes that do not have Zcode=XYZ from the XML output? The SQL output is:

[[00000016,, 04,, XYZ], [00000016,, 04,,]]
      

The code that generates the XML using JAXB logic:

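// Read the nested rows directly as Seq[Row] instead of casting a Row.
val abclist: Seq[Row] = x.getAs[Seq[Row]]("TVchannels")

if (abclist != null) {
    abclist.foreach { gap =>
        // One JAXB object per row, populated field by field.
        val abcobj: abc = new abc()
        abcobj.setChanneltune(checknull(gap.getAs[String]("_channeltune")))
        abcobj.setZCode(checknull(gap.getAs[String]("Zcode")))
        abcobj.setFCode(checknull(gap.getAs[String]("Fcode")))
        // (the rest of the snippet was not included in the question)
    }
}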

In the XML output, the second node should be removed: it has no Zcode, so it should not appear in the output.

<abc Zcode="XYZ" FCode="04" SMTPESPcode="00000016"/>
<abc FCode="04" SMTPESPcode="00000016"/>

Tags: scala, apache-spark, apache-spark-sql

Solution


IIUC, the following approach should cover your use case. Contents of the input XML file:

<root>
    <tvchannels>
        <tvchannel>
            <SMTPESPcode>00000016</SMTPESPcode>
            <FCode>04</FCode>
            <Zcode>XYZ</Zcode>
        </tvchannel>
        <tvchannel>
            <SMTPESPcode>00000016</SMTPESPcode>
            <FCode>04</FCode>
            <Zcode></Zcode>
        </tvchannel>
    </tvchannels>
</root>

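Code to read your input as an XML file, following https://github.com/databricks/spark-xml:

// Requires the spark-xml package on the classpath; each <tvchannel> element becomes one row.
val df = spark.read.format("xml").option("rowTag", "tvchannel").load("file:///home/ubuntu/input/tvchannles.xml")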
df.show()

/*
+-----------+-----+-----+
|SMTPESPcode|FCode|Zcode|
+-----------+-----+-----+
|   00000016|   04|  XYZ|
|   00000016|   04|     |
+-----------+-----+-----+
*/

df.filter("Zcode != ''").show()

/*
+-----------+-----+-----+
|SMTPESPcode|FCode|Zcode|
+-----------+-----+-----+
|   00000016|   04|  XYZ|
+-----------+-----+-----+
*/
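
The string filter above drops rows whose `Zcode` is empty, and it also drops rows where `Zcode` is null, because a comparison against null never evaluates to true. Whether an empty `<Zcode>` element arrives as an empty string or as null may depend on the reader options, so a more explicit variant can be easier to reason about. The following is a minimal sketch using the Column API; `ZcodeFilter` and `keepRowsWithZcode` are hypothetical names introduced only for illustration, and the sample rows are made up to mirror the data above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, trim}

object ZcodeFilter {
  // Hypothetical helper: keep only rows whose Zcode is non-null and non-blank.
  def keepRowsWithZcode(df: DataFrame): DataFrame =
    df.filter(col("Zcode").isNotNull && trim(col("Zcode")) =!= "")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; reuse your existing session in practice.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("zcode-filter")
      .getOrCreate()
    import spark.implicits._

    // Sample rows (assumed): one with a Zcode, one empty, one null.
    val df = Seq(
      ("00000016", "04", Option("XYZ")),
      ("00000016", "04", Option("")),
      ("00000016", "04", Option.empty[String])
    ).toDF("SMTPESPcode", "FCode", "Zcode")

    // Only the row with Zcode = XYZ survives the filter.
    keepRowsWithZcode(df).show()

    spark.stop()
  }
}
```

If only the exact value matters, as the question's `Zcode=XYZ` suggests, `df.filter(col("Zcode") === "XYZ")` would be the stricter alternative.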

// your remaining spark logic.

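Or build the DataFrame from a Seq:

// toDF on a Seq needs the session's implicits in scope.
import spark.implicits._

// The codes are kept as strings so the leading zeros survive.
val df = Seq(("00000016", "04", "XYZ"), ("00000016", "04", "")).toDF("SMTPESPcode", "FCode", "Zcode")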
df.show()
/*
+-----------+-----+-----+
|SMTPESPcode|FCode|Zcode|
+-----------+-----+-----+
|   00000016|   04|  XYZ|
|   00000016|   04|     |
+-----------+-----+-----+
*/

df.filter("Zcode != ''").show()

/*
+-----------+-----+-----+
|SMTPESPcode|FCode|Zcode|
+-----------+-----+-----+
|   00000016|   04|  XYZ|
+-----------+-----+-----+
*/

// your remaining spark logic.
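
If the filtering has to happen inside the JAXB loop from the question rather than on the DataFrame, the same idea can be applied to the row sequence before any object is built. Below is a minimal sketch in plain Scala. The `Channel` case class is a hypothetical stand-in for the `Row` values and the `abc` JAXB class from the question, and `hasValue` and `withZcode` are illustrative helper names, not part of any library.

```scala
object FilterBeforeMarshalling {
  // Hypothetical stand-in for one element of the TVchannels Seq[Row].
  final case class Channel(smtpEspCode: String, fCode: String, zCode: String)

  // True when the value is neither null nor blank.
  def hasValue(s: String): Boolean = s != null && s.trim.nonEmpty

  // Keep only the channels that carry a Zcode; a null input yields an empty result.
  def withZcode(channels: Seq[Channel]): Seq[Channel] =
    Option(channels).getOrElse(Seq.empty).filter(c => hasValue(c.zCode))

  def main(args: Array[String]): Unit = {
    val channels = Seq(
      Channel("00000016", "04", "XYZ"),
      Channel("00000016", "04", null)
    )
    // Only the surviving channels would be turned into JAXB objects.
    withZcode(channels).foreach(println)
  }
}
```

Applied to the code in the question, this would presumably amount to something like `abclist.filter(gap => hasValue(gap.getAs[String]("Zcode"))).foreach { ... }`, so that no `abc` object is created for a row without a Zcode.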
