xml - 如何在 Hive 中使用横向视图分解 XML 数据格式?
问题描述
我正在尝试将 XML 格式的销售数据加载到 Hive 表中。下面是数据的一个小样本。
我知道如果我将下面的数据分成几个表,然后根据需要加入它们,我可以将其加载到 Hive。但只是想知道我是否可以将它们加载到一个表中,并且预期的输出应该看起来像所附的屏幕截图。
请帮助我了解我应该使用的表格结构以及如何有效地使用横向视图分解选项来实现这一点。
样本数据:
<Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>
期望的输出
表结构:
CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);
输出: 在此处输入图像描述
尝试下面的查询来读取数据,它没有以我想要的方式显示结果。
select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;
结果:
解决方案
感谢您的详细解决方案。我测试了它,它工作得非常好。我尝试了一种类似的方法来直接使用 XML serde 从 XML 中读取数据。
我的挑战:
1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.
下面是 Hive 表结构和我用来读取数据的查询。我能够成功地爆炸第一级阵列(Bskt)而没有任何问题。
但是当我尝试分解第二级数组(Itm)时,它会为“Itm”中的所有字段返回 NULL 结果。
我的查询或表结构本身有什么问题吗?
create external table nanda_scan_xml (
Version string,
StoreId string,
Bskt array<struct<
Bskt:struct<
DateTime:string,
TillNo:string,
BsktNo:string,
Itm:array<struct<
Itm:struct<
ItmSeq:string,
GTIN:string,
ItmDsc:string,
DeptCD:string,
ItmCD:string,
SalesQTY:string,
SalesExGST:string,
Points:string,
CostExGST:string,
GSTRate:string,
DiscAmtExGST:string,
ItmProm:struct<
PromCD:string,
OfferID:string
>
>
>
>
>
>
>
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties
(
"column.xpath.Version" = "/Store/Version/text()",
"column.xpath.StoreId" = "/Store/StoreId/text()",
"column.xpath.Bskt" = "/Store/Bskt"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
tblproperties
(
"xmlinput.start" = "<Store>",
"xmlinput.end" = "</Store>"
);
询问:
1)对于工作正常的 Bskt:
SELECT Version,
StoreId,
basket.Bskt.DateTime,
basket.Bskt.bsktno,
basket.Bskt.tillno
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket;
结果:
在此处输入图像描述 2)尝试在单个查询中分解两个横向视图时:
SELECT Version,
StoreId,
basket.Bskt.DateTime,
basket.Bskt.bsktno,
basket.Bskt.tillno,
item.Itm.ItmSeq,
item.Itm.ItmDsc,
item.Itm.GTIN,
item.Itm.itmprom.OfferID,
item.Itm.itmprom.PromCD
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;
结果:
3) 查询:
SELECT Version,
StoreId,
basket.Bskt.DateTime,
basket.Bskt.bsktno,
basket.Bskt.tillno,
item.Itm.ItmSeq,
item.Itm.ItmDsc,
item.Itm.GTIN,
item.Itm.itmprom.OfferID,
item.Itm.itmprom.PromCD
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;
错误:
推荐阅读
- web-scraping - Is there any ways to extract the number of views a webpage recieved?
- python - Python extracting specific text from tweets
- python - AttributeError:“_pygit2.Reference”对象没有属性“get_object”
- permissions - 非 root 用户不存在 Pip 安装的命令
- python - 是否可以从 HTTP 请求标头中获取客户端的 IP 地址?
- django - 如何在管理员中使用 Django import export 在单击导入之前删除列
- excel - VBA Loop one sheet to the next
- typescript - 扩展第三方库的接口,然后重用该定义
- android - 我无法从 Firestore 读取数据
- python - Django ManifestStaticFilesStorage + DEBUG = False 在上传的文件中给出错误:他们不显示