首页 > 解决方案 > 如何将 xml 解析为具有兄弟元素的表?

问题描述

我有看起来像这样的xml:

xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""

我通过这样做将每个元素解析为一个表:

root = etree.fromstring(xml)

tag = []
text = []
parent = []
double_parent = []

for element in root.iter():
    try:
        element_parent = element.getparent().tag
    except AttributeError:
        element_parent = 'none'
    try:
        element_double_parent = element.getparent().getparent().tag
    except AttributeError:
        element_double_parent = 'none'
    tag.append(element.tag)
    text.append(element.text)
    parent.append(element_parent)
    double_parent.append(element_double_parent)

df = pd.DataFrame({'tag' : tag, 'text' : text, 'parent' : parent, 'double_parent' : double_parent})

这导致:

tag                 text      parent            double_parent
portfolio           \n        none              none
assets              600000    portfolio         none
assetClassDetails   \n        portfolio         none
assetClassName      Bonds     assetClassDetails portfolio
assetAmount         100000    assetClassDetails portfolio
assetClassDetails   \n        portfolio         none
assetClassName      Equities  assetClassDetails portfolio
assetAmount         500000    assetClassDetails portfolio
rateOfReturn        6.3       portfolio         none

我正在努力解决如何对数据进行透视,以便将资产类别名称和金额配对并绑定到投资组合标签(及其直系子代)。如何在结果中获得配对兄弟标签?

我想要的结果如下所示:

type        assets  rateOfReturn    assetClassName  assetAmount
portfolio   600000  6.3             Bonds           100000
portfolio   600000  6.3             Equities        500000

标签: pythonxmlxml-parsinglxml

解决方案


尝试类似:

rows = []
columns = ['assets',  'rateOfReturn',    'assetClassName',  'assetAmount']
for entry in root.xpath('//assetClassDetails'):
    row = []
    row.extend([entry.xpath('preceding-sibling::assets/text()')[0],
                entry.xpath('following-sibling::rateOfReturn/text()')[0],
                entry.xpath('./assetClassName/text()')[0],
                entry.xpath('./assetAmount/text()')[0]])
    rows.append(row)
pd.DataFrame(rows,columns=columns)

输出:

    assets  rateOfReturn    assetClassName  assetAmount
0   600000  6.3     Bonds   100000
1   600000  6.3     Equities    500000

另一种有趣的方法是使用另一个库来做到这一点:

import pandas_read_xml as pdx
df1 = pdx.read_xml(r'path\to\myfile.xml',['portfolio','assetClassDetails'])
df2 = pdx.read_xml(r'path\to\myfile.xml',['portfolio'])
pd.concat([df2[['assets','rateOfReturn']],df1], axis=1)

输出:

assets     rateOfReturn assetClassName  assetAmount
0   600000  6.3         Bonds             100000
1   600000  6.3         Equities        500000

推荐阅读