首页 > 解决方案 > 如何在 Hive 中使用横向视图分解 XML 数据格式?

问题描述

我正在尝试将 XML 格式的销售数据加载到 Hive 表中。下面是数据的一个小样本。

我知道如果我将下面的数据分成几个表,然后根据需要加入它们,我可以将其加载到 Hive。但只是想知道我是否可以将它们加载到一个表中,并且预期的输出应该看起来像所附的屏幕截图。

请帮助我了解我应该使用的表格结构以及如何有效地使用横向视图分解选项来实现这一点。

样本数据:

  <Store>
    <Version>1.1</Version>
    <StoreId>16695</StoreId>    
    <Bskt>
      <TillNo>4</TillNo>
      <BsktNo>1753</BsktNo>
      <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
      <OpID>50056</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>29559</GTIN>
        <ItmDsc>CHOCALATE</ItmDsc>
      <ItmProm>
          <PromCD>CM</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>59653</GTIN>
        <ItmDsc>CORN FLAKES</ItmDsc>
      </Itm>
        <Itm>
        <ItmSeq>3</ItmSeq>
        <GTIN>42260</GTIN>
        <ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
        <ItmProm>
          <PromCD>MTSRO</PromCD>
          <OfferID>11766</OfferID>
        </ItmProm>
      </Itm>
    </Bskt>
    <Bskt>
      <TillNo>5</TillNo>
      <BsktNo>1947</BsktNo>
      <DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
      <OpID>50063</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>24064</GTIN>
        <ItmDsc>TOMATOES 2KG</ItmDsc>
        <ItmProm>
          <PromCD>INSTORE</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>81287</GTIN>
        <ItmDsc>ROTHMANS BLUE</ItmDsc>
        <ItmProm>
          <PromCD>TF</PromCD>
        </ItmProm>
      </Itm>
    </Bskt>
  </Store>  

期望的输出

在此处输入图像描述

表结构:

CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
    TBLPROPERTIES (
    "xmlinput.start"="<Store","xmlinput.end"="</Store>"
);

输出: 在此处输入图像描述

尝试下面的查询来读取数据,它没有以我想要的方式显示结果。

select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
    LATERAL VIEW explode(DateTime) table1 as basket_dtm 
    LATERAL VIEW explode(BsktNo) table2 as basket_number
    LATERAL VIEW explode(TillNo) table3 as till_number;

结果:

在此处输入图像描述

标签: xmlparsinghadoophiveexplode

解决方案


感谢您的详细解决方案。我测试了它,它工作得非常好。我尝试了一种类似的方法来直接使用 XML serde 从 XML 中读取数据。

我的挑战:

1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.

下面是 Hive 表结构和我用来读取数据的查询。我能够成功地爆炸第一级阵列(Bskt)而没有任何问题。

但是当我尝试分解第二级数组(Itm)时,它会为“Itm”中的所有字段返回 NULL 结果。

我的查询或表结构本身有什么问题吗?

create external table nanda_scan_xml  (
  Version string,
  StoreId string,
  Bskt array<struct<
                    Bskt:struct<
                                DateTime:string,
                                TillNo:string,
                                BsktNo:string,
                                Itm:array<struct<
                                                Itm:struct<
                                                    ItmSeq:string,      
                                                    GTIN:string,        
                                                    ItmDsc:string,      
                                                    DeptCD:string,      
                                                    ItmCD:string,       
                                                    SalesQTY:string,        
                                                    SalesExGST:string,      
                                                    Points:string,      
                                                    CostExGST:string,       
                                                    GSTRate:string,     
                                                    DiscAmtExGST:string,        
                                                    ItmProm:struct<     
                                                                    PromCD:string,      
                                                                    OfferID:string      
                                                                  >
                                                              >
                                                     >
                                            >
                                >
                    >
            >
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties 
(
    "column.xpath.Version"       = "/Store/Version/text()",
    "column.xpath.StoreId"       = "/Store/StoreId/text()",
    "column.xpath.Bskt"  = "/Store/Bskt"

)
stored as 
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
tblproperties 
(
    "xmlinput.start"    = "<Store>",
    "xmlinput.end"      = "</Store>"
);

询问:

1)对于工作正常的 Bskt:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket;

结果:

在此处输入图像描述 2)尝试在单个查询中分解两个横向视图时:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno,
        item.Itm.ItmSeq,
        item.Itm.ItmDsc,
        item.Itm.GTIN,
        item.Itm.itmprom.OfferID,
        item.Itm.itmprom.PromCD 
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;

结果:

在此处输入图像描述

3) 查询:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno,
        item.Itm.ItmSeq,
        item.Itm.ItmDsc,
        item.Itm.GTIN,
        item.Itm.itmprom.OfferID,
        item.Itm.itmprom.PromCD 
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;

错误:

在此处输入图像描述


推荐阅读