How to create a table from each nested node of an XML file using PySpark

Problem description

I have a nested XML structure like this:

<parent>
<root1 detail = "something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root1>

<root2 detail = "something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root2>
</parent>

I want to create 2 tables, one per nested node, each with the following columns and data:

Schema:

detail string
ID string
type string

Records:

detail        ID     type
something     id1   typeA
something     id2   typeB
something     id3   typeC

I tried using

   spark.read.format(file_type) \
      .option("rootTag", "root1") \
      .option("rowTag", "ID") \
      .load(file_location)

But this only produces detail (string) and ID (array) as columns.

Thanks in advance!

Tags: xml, scala, pyspark, databricks

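Solution

Reading the XML file produces a column named "ID" that holds an array of structs. The trick appears to be exploding that array and then extracting ID and type from each struct by their field names (_VALUE and _TYPE):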

from pyspark.sql.functions import explode, col

dfs = []

# Number of tables to build (root1 ... rootN)
n = 2

for i in range(1, n + 1):

    # Read each <rootN> element as one row
    df = spark.read.format('xml') \
              .option("rowTag", "root{}".format(i)) \
              .load('file.xml')

    # explode() yields one row per <ID> element, in a column named 'col'
    df = df.select([explode('ID'), '_detail']) \
           .withColumn('ID', col('col').getItem('_VALUE')) \
           .withColumn('type', col('col').getItem('_TYPE')) \
           .drop('col') \
           .withColumnRenamed('_detail', 'detail')

    dfs.append(df)

    df.show()

# +---------+---+-----+
# |   detail| ID| type|
# +---------+---+-----+
# |something|id1|typeA|
# |something|id2|typeB|
# |something|id3|typeC|
# +---------+---+-----+
# 
# +---------+---+-----+
# |   detail| ID| type|
# +---------+---+-----+
# |something|id1|typeA|
# |something|id2|typeB|
# |something|id3|typeC|
# +---------+---+-----+
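
To sanity-check the expected rows without a Spark session, the same flattening can be sketched with the standard library alone. This is only an illustration of what the explode-and-extract step does, not part of the answer above: SAMPLE_XML and flatten are hypothetical names, and the sample assumes the document is closed with </parent>.

```python
from xml.etree import ElementTree

# Hypothetical sample mirroring the XML in the question (closed with </parent>).
SAMPLE_XML = """<parent>
<root1 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root1>
<root2 detail="something">
    <ID type="typeA">id1</ID>
    <ID type="typeB">id2</ID>
    <ID type="typeC">id3</ID>
</root2>
</parent>"""


def flatten(xml_text, row_tag):
    """Return one (detail, ID, type) tuple per <ID> child of each row_tag element."""
    root = ElementTree.fromstring(xml_text)
    rows = []
    for node in root.iter(row_tag):
        detail = node.get("detail")  # attribute on the <rootN> element
        for id_node in node.findall("ID"):
            # Element text plays the role of _VALUE, the attribute that of _TYPE
            rows.append((detail, id_node.text, id_node.get("type")))
    return rows


for row in flatten(SAMPLE_XML, "root1"):
    print(row)
```

Each tuple corresponds to one row of the table shown by df.show().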

If you don't want to specify the number of tables manually (controlled by the variable n in the code above), you can run this code first:

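from xml.etree import ElementTree

tree = ElementTree.parse("file.xml")
root = tree.getroot()

# Element.getchildren() was removed in Python 3.9; list(root) returns the direct children
children = list(root)

n = 0

# Print each child element and count them
for child in children:
    ElementTree.dump(child)
    n += 1

print("n = {}".format(n))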

# <root1 detail="something">
#     <ID type="typeA">id1</ID>
#     <ID type="typeB">id2</ID>
#     <ID type="typeC">id3</ID>
# </root1>
# 
# <root2 detail="something">
#     <ID type="typeA">id1</ID>
#     <ID type="typeB">id2</ID>
#     <ID type="typeC">id3</ID>
# </root2>
# n = 2
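
Counting the children still assumes they are named root1 through rootN. A minimal sketch that collects the actual tag names instead could look like the following; SAMPLE_XML and child_tags are hypothetical names used only for illustration, and the sample assumes the document is closed with </parent>.

```python
from xml.etree import ElementTree

# Hypothetical sample mirroring the XML in the question (closed with </parent>).
SAMPLE_XML = """<parent>
<root1 detail="something">
    <ID type="typeA">id1</ID>
</root1>
<root2 detail="something">
    <ID type="typeA">id1</ID>
</root2>
</parent>"""


def child_tags(xml_text):
    """Return the distinct tag names of the root's direct children, in document order."""
    root = ElementTree.fromstring(xml_text)
    tags = []
    for child in root:
        if child.tag not in tags:
            tags.append(child.tag)
    return tags


row_tags = child_tags(SAMPLE_XML)
print(row_tags)
```

Each name in row_tags could then be passed to .option("rowTag", ...) in place of "root{}".format(i), so the loop would not depend on the numbering scheme.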
