Python: Flatten XML to CSV with nested child tags

Question

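I want to flatten several XML files and am looking for a generic function or logic to convert XML into a flat file. Most answers I found hard-code the tag names. The closest is "Python: Flatten xml to csv with parent tag repeated in child", but that solution still relies on hard-coded tags. For the following input XML: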

<root> 
    <child> child-val </child>
    <child2> child2-val2 </child2>
    <anotherchild>
        <childid> another child 45</childid>
        <childname> another child name </childname>
    </anotherchild>
    <group> 
        <groupid> groupid-123</groupid>
        <grouplist>
            <groupzone>
                <groupname>first </groupname>
                <groupsize> 4</groupsize>
            </groupzone>
            <groupzone>
                <groupname>second </groupname>
                <groupsize> 6</groupsize>
            </groupzone>
            <groupzone>
                <groupname> third </groupname>
                <groupsize> 8 </groupsize>
            </groupzone>
        </grouplist>
    </group>
    <secondgroup> 
        <secondgroupid> secondgroupid-42 </secondgroupid>
        <secondgrouptitle> second group title </secondgrouptitle>
        <secondgrouplist>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v1</secondsub>
                    <secondsubid>12</secondsubid>
                </secondgroupsub>
                <secondgroupname> third </secondgroupname>
                <secondgroupsize> 4</secondgroupsize>
            </secondgroupzone>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v2</secondsub>
                    <secondsubid>1</secondsubid>
                </secondgroupsub>
                <secondgroupname>fourth </secondgroupname>
                <secondgroupsize> 6</secondgroupsize>
            </secondgroupzone>
            <secondgroupzone>
                <secondgroupsub>
                    <secondsub>v3</secondsub>
                    <secondsubid>45</secondsubid>
                </secondgroupsub>
                <secondgroupname> tenth </secondgroupname>
                <secondgroupsize> 10 </secondgroupsize>
            </secondgroupzone>
        </secondgrouplist>
    </secondgroup>
    <child3> val3 </child3>
</root>

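I tried the pandas-read-xml package and got most of the values, but the values under the anotherchild tag end up in a single column (anotherchild) instead of in anotherchild|childid and anotherchild|childname. If possible, please suggest generic logic for converting XML to a flat file.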

import pandas_read_xml as pdx

df = pdx.read_xml(xml_content, ['root'])
fully_flattened_df = pdx.fully_flatten(df)
fully_flattened_df.to_csv("stack.csv", index=False)

Output CSV:

anotherchild,child,child2,child3,group|groupzone|groupname,group|groupzone|groupsize,secondgroup|secondgroupzone|secondgroupname,secondgroup|secondgroupzone|secondgroupsize,secondgroup|secondgroupzone|secondgroupsub|secondsub,secondgroup|secondgroupzone|secondgroupsub|secondsubid
,child-val,child2-val2,val3,,,third,4,v1,12
,child-val,child2-val2,val3,,,fourth,6,v2,1
,child-val,child2-val2,val3,,,tenth,10,v3,45
,child-val,child2-val2,val3,first,4,,,,
,child-val,child2-val2,val3,second,6,,,,
,child-val,child2-val2,val3,third,8,,,,
another child 45,child-val,child2-val2,val3,,,,,,
another child name,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,

Tags: python-3.x, xml, pandas, logic

Answer


In general, the XML nodes that hold values should become the columns. In your sample XML, that means "child", "child2", "childid", and so on.
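To make "nodes that hold values" concrete, here is a minimal sketch that lists the leaf elements of a document as '|'-joined paths. It only illustrates the idea and is not the code from this answer; `leaf_columns` and `SAMPLE` are hypothetical names, and the sample is a trimmed copy of the XML in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (assumption: same structure, fewer nodes).
SAMPLE = """<root>
    <child> child-val </child>
    <anotherchild>
        <childid> another child 45</childid>
        <childname> another child name </childname>
    </anotherchild>
</root>"""

def leaf_columns(element, prefix=""):
    """Return the '|'-joined paths of all leaf elements, in document order, without duplicates."""
    path = f"{prefix}|{element.tag}" if prefix else element.tag
    if len(element) == 0:  # no children: this element holds a value
        return [path]
    columns = []
    for child in element:
        for col in leaf_columns(child, path):
            if col not in columns:
                columns.append(col)
    return columns

print(leaf_columns(ET.fromstring(SAMPLE)))
# ['root|child', 'root|anotherchild|childid', 'root|anotherchild|childname']
```

Each path becomes one candidate column, so anotherchild contributes two columns rather than one.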

Based on the XML above, I wrote this code, which should be generic enough to handle similar documents.

import pandas as pd
import xml.etree.ElementTree as Xet

def getData(root, rows, columns, rowcount, name=None):
    # Build a path-style prefix (e.g. root|group|grouplist) so that equally named tags under
    # different parents, such as "name" under group and secondgroup, end up in separate columns.
    if name is not None:
        name = "{0}|{1}".format(name, root.tag)
    else:
        name = root.tag

    for item in root:
        if len(item) == 0:  # leaf node: it holds a value, so it becomes a column
            colName = "{0}|{1}".format(name, item.tag)
            # colName = item.tag  # uncomment to use the short tag name instead of the full path, e.g. groupsize instead of root|group|grouplist|groupzone|groupsize
            value = (item.text or "").strip()  # item.text is None for empty elements
            if colName not in columns:
                columns.append(colName)  # remember the new column
                rowcount.append(0)  # the first value of a column always goes on row 0
                rows[0].update({colName: value})
            else:
                repeatPosition = columns.index(colName)  # position of the repeated column
                rowcount[repeatPosition] += 1  # the repeated value goes on the next row
                if len(rows) <= max(rowcount):
                    rows.append({})  # add a new row when the row count requires it
                rows[rowcount[repeatPosition]].update({colName: value})

        getData(item, rows, columns, rowcount, name)  # recursive call to walk through the children of each element


xmlParse = Xet.parse('example.xml')
root = xmlParse.getroot()

rows = [{}]  # start with one row; more rows are appended as repeated elements are found
columns = []  # column names, in order of first appearance
rowcount = []  # for each column, the index of the row that received its latest value
getData(root, rows, columns, rowcount)

df = pd.DataFrame(rows, columns=columns)
print(df)
df.to_csv('parse.csv')

Running this code produces the following CSV:

,root|child,root|child2,root|anotherchild|childid,root|anotherchild|childname,root|group|groupid,root|group|grouplist|groupzone|groupname,root|group|grouplist|groupzone|groupsize,root|secondgroup|secondgroupid,root|secondgroup|secondgrouptitle,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsub,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsubid,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupname,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsize,root|child3
0,child-val,child2-val2,another child 45,another child name,groupid-123,first,4,secondgroupid-42,second group title,v1,12,third,4,val3
1,,,,,,second,6,,,v2,1,fourth,6,
2,,,,,,third,8,,,v3,45,tenth,10,
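The output above leaves the single-valued columns (root|child, root|group|groupid, and so on) empty on rows 1 and 2. If those parent values should repeat on every row, one option is to copy any column that holds exactly one value onto all rows before writing the file. The sketch below uses only the standard library and assumes `rows` and `columns` are shaped like the ones built by getData; `fill_single_valued` is a hypothetical helper name, and the data is a shortened stand-in for the real result.

```python
import csv
import io

def fill_single_valued(rows, columns):
    """Copy the value of every column that occurs on exactly one row onto all rows."""
    filled = [dict(row) for row in rows]  # leave the input rows untouched
    for col in columns:
        values = [row[col] for row in rows if col in row]
        if len(values) == 1:  # the column never repeats, so treat it as a parent value
            for row in filled:
                row[col] = values[0]
    return filled

# Shortened stand-in for the rows/columns produced by the answer's code (assumption).
columns = ["root|child", "root|group|groupid", "root|group|grouplist|groupzone|groupname"]
rows = [
    {"root|child": "child-val", "root|group|groupid": "groupid-123",
     "root|group|grouplist|groupzone|groupname": "first"},
    {"root|group|grouplist|groupzone|groupname": "second"},
    {"root|group|grouplist|groupzone|groupname": "third"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(fill_single_valued(rows, columns))
print(buffer.getvalue())
```

With this input, every data row starts with child-val,groupid-123 followed by first, second, or third. Columns that repeat are left alone, so lists of different lengths still produce blanks in the shorter list.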

I hope this gets you started in the right direction.

