XML to multiple pandas DataFrames

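Problem description

I want to extract data from an XML file and convert it into multiple pandas DataFrames. I tried parsing it with ElementTree and printing the tags and text (only 2 columns), but I can't figure out how to split the result into multiple DataFrames.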

<?xml version="1.0" encoding="ISO-8859-1"?>
<spec:zzz>
<xxx>
    <class>
        <table_name>
            <attributes>
                <aaa>0</aaa>
                <bbb>1</bbb>
                <ccc>
                    <element>
                        <ccc1>0</ccc1>
                        <ccc2>0</ccc2>
                        <ccc3>3</ccc3>
                    </element>
                </ccc>
            </attributes>
        </table_name>
        <table_name>
            <attributes>
                <aaa>0</aaa>
                <bbb>0</bbb>
                <ccc>
                    <element>
                        <ccc1>0</ccc1>
                        <ccc2>0</ccc2>
                        <ccc3>3</ccc3>
                    </element>
                </ccc>
                <ddd>4</ddd>
            </attributes>
        </table_name>
    </class>
    <class>
        <table_name1>
            <attributes>
            </attributes>
        </table_name1>
    </class>
    <class>
        <table_name2>
            <attributes>
                <eee>0</eee>
                <fff></fff>
                <ggg></ggg>
            </attributes>
        </table_name2>
    </class>
</xxx>
</spec:zzz>

Sample tables:

table_name:

| aaa | bbb | ccc     | ddd |
| 0   | 1   | (0,0,3) |     |
| 0   | 0   | (0,0,3) | 4   |

table_name1: (no columns)

table_name2:

| eee | fff | ggg |
| 0   |     |     |

Tags: python-3.x

Solution


Try this.

from simplified_scrapy import SimplifiedDoc
xml = '''
your xml
'''
doc = SimplifiedDoc(xml)
# One entry per <class>: the table elements it contains
tablenames = doc.selects('class').children

for tablename in tablenames:
    # One entry per table element: the fields inside its <attributes>
    table = tablename.child.children
    rows = []
    for attributes in table:
        # rows.append([attr.text for attr in attributes])
        row = []
        for attr in attributes:
            if attr.child:
                # Nested field such as <ccc>: join the values under <element>
                row.append(','.join(attr.child.children.text))
            else:
                row.append(attr.text)
        rows.append(row)
    print(tablename[0].tag, rows)

Result:

table_name [['0', '1', '0,0,3'], ['0', '0', '0,0,3', '4']]
table_name1 [[]]
table_name2 [['0', '', '']]
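
The result above is plain lists without column names, and the two table_name rows have different lengths, so it does not map directly onto DataFrames. One way around that is to collect each row as a dict keyed by tag name. Below is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than simplified_scrapy. The names parse_tables and to_dataframes are made up for illustration, and the sample is shortened and starts at <xxx>, because ElementTree rejects the undeclared spec: prefix in the XML as posted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened sample; it starts at <xxx> because the "spec:" prefix in the
# posted XML is not declared and ElementTree would refuse to parse it.
SAMPLE = """
<xxx>
    <class>
        <table_name>
            <attributes>
                <aaa>0</aaa>
                <bbb>1</bbb>
                <ccc><element><ccc1>0</ccc1><ccc2>0</ccc2><ccc3>3</ccc3></element></ccc>
            </attributes>
        </table_name>
        <table_name>
            <attributes>
                <aaa>0</aaa>
                <bbb>0</bbb>
                <ccc><element><ccc1>0</ccc1><ccc2>0</ccc2><ccc3>3</ccc3></element></ccc>
                <ddd>4</ddd>
            </attributes>
        </table_name>
    </class>
    <class>
        <table_name1>
            <attributes>
            </attributes>
        </table_name1>
    </class>
    <class>
        <table_name2>
            <attributes>
                <eee>0</eee>
                <fff></fff>
                <ggg></ggg>
            </attributes>
        </table_name2>
    </class>
</xxx>
"""


def parse_tables(xml_text):
    """Return {table tag: [row dict, ...]} for every child of every <class>."""
    root = ET.fromstring(xml_text)
    tables = defaultdict(list)
    for table in root.findall("./class/*"):
        attributes = table.find("attributes")
        row = {}
        for field in (attributes if attributes is not None else []):
            element = field.find("element")
            if element is not None:
                # Nested values such as <ccc> become a tuple of strings.
                row[field.tag] = tuple(child.text or "" for child in element)
            else:
                # Empty tags such as <fff></fff> become an empty string.
                row[field.tag] = field.text or ""
        tables[table.tag].append(row)
    return dict(tables)


def to_dataframes(tables):
    """Build one DataFrame per table name (requires pandas)."""
    import pandas as pd
    return {name: pd.DataFrame(rows) for name, rows in tables.items()}


tables = parse_tables(SAMPLE)
```

Calling to_dataframes(tables)["table_name"] should then give a two-row frame with the columns aaa, bbb, ccc and ddd, where the ddd value missing from the first row is filled with NaN. All values stay strings here; converting them to numbers is left out of the sketch.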

Processing multiple files

from simplified_scrapy import utils, SimplifiedDoc

xmlDir = 'test/'
# Collect the files under xmlDir
xmls = utils.getSubFile(xmlDir)
for x in xmls:
    xml = utils.getFileContent(x)
    # xml = '''your xml'''
    doc = SimplifiedDoc(xml)
    # One entry per <class>: the table elements it contains
    tablenames = doc.selects('class').children

    for tablename in tablenames:
        # One entry per table element: the fields inside its <attributes>
        table = tablename.child.children
        rows = []
        for attributes in table:
            # rows.append([attr.text for attr in attributes])
            row = []
            for attr in attributes:
                if attr.child:
                    # Nested field such as <ccc>: join the values under <element>
                    row.append(','.join(attr.child.children.text))
                else:
                    row.append(attr.text)
            rows.append(row)
        print(tablename[0].tag, rows)
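
If the rows from all files should end up in the same tables, they can be merged per table name while looping over the files. This is a rough standard-library sketch of that idea, not the simplified_scrapy approach above. The function name collect_rows is hypothetical, and it assumes the files end in .xml, sit directly in one folder, and are well-formed (so any spec: prefix is declared in the real files).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def collect_rows(xml_dir):
    """Merge rows from every *.xml file in xml_dir into {table tag: [row dict, ...]}."""
    tables = defaultdict(list)
    for path in sorted(Path(xml_dir).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        # iter("class") finds <class> at any depth, so the wrapper
        # elements around it do not matter.
        for cls in root.iter("class"):
            for table in cls:
                row = {}
                for field in table.iterfind("attributes/*"):
                    element = field.find("element")
                    if element is not None:
                        # Nested field: join the values, as in the answer above.
                        row[field.tag] = ",".join(child.text or "" for child in element)
                    else:
                        row[field.tag] = field.text or ""
                tables[table.tag].append(row)
    return dict(tables)
```

Each value of the returned dict is a list of row dicts, so passing it to pandas.DataFrame gives one frame per table name, with rows from all files stacked together.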
