Creating unique codes for nested XML using ElementTree


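I have the nested XML below. Refer to the image below.


  1. The code highlighted in yellow is the 1st layer
  2. The code highlighted in blue is the 2nd layer
  3. The code highlighted in red is the 3rd layer

Refer below for the XML data.


My task is to create unique IDs for all 3 layers. Below is the code I wrote.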

import pandas as pd 
import xml.etree.ElementTree as ET

xml_file_path = r'C:\Desktop\data.xml'  # raw string so the backslashes are not treated as escapes

tree = ET.parse(xml_file_path)
root = tree.getroot()

sub_item_id = 0

cols = ['invoice','total','code','item_id','A','B','C']

dict_xml = {}
data = []
for trx in root.iter('trx'):

    invoice = trx.find('invoice').text
    total = trx.find('total').text

    item_id = 0

    a = 0
    for it in trx.findall('item'):
        a += 1
        b = -1
        for j in it.iter('item'):
            b += 1
            c = 0

            code = j.find('code').text

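            item_id += 1

            # collect one row per visited item
            data.append([invoice, total, code, item_id, a, b, c])

data = pd.DataFrame(data, columns=cols)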

I get the output below, where column A is correct but B and C are not.

|   | invoice  | total | code | item_id | A | B | C |
| 0 | 27844173 |   52  |  110 |    1    | 1 | 0 | 0 |
| 1 | 27844173 |   52  |  304 |    2    | 2 | 0 | 0 |
| 2 | 27844173 |   52  |  54  |    3    | 2 | 1 | 0 |
| 3 | 27844173 |   52  |  174 |    4    | 2 | 2 | 0 |
| 4 | 27844173 |   52  |  600 |    5    | 2 | 3 | 0 |
| 5 | 27844173 |   52  |  478 |    6    | 2 | 4 | 0 |
| 6 | 27844173 |   52  |  810 |    7    | 2 | 5 | 0 |

The output I want is:

|   | invoice  | total | code | item_id | A | B | C |
| 0 | 27844173 |   52  |  110 |    1    | 1 | 0 | 0 |
| 1 | 27844173 |   52  |  304 |    2    | 2 | 0 | 0 |
| 2 | 27844173 |   52  |  54  |    3    | 2 | 1 | 0 |
| 3 | 27844173 |   52  |  174 |    4    | 2 | 1 | 1 |
| 4 | 27844173 |   52  |  600 |    5    | 2 | 1 | 2 |
| 5 | 27844173 |   52  |  478 |    6    | 2 | 2 | 0 |
| 6 | 27844173 |   52  |  810 |    7    | 2 | 3 | 0 |

How and where should I increment the B and C variables to get the desired output?

Tags: python, xml, elementtree


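First, a preliminary observation: although you are using xml.etree, I prefer the lxml library because it has better XPath support. Obviously, you can try converting the code to xml.etree if you feel it is necessary.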


import pandas as pd
from lxml import etree

stuff = """[your xml above]"""

doc = etree.XML(stuff.encode())
tree = etree.ElementTree(doc)

#first off, get the invoice number and total as integers
inv = int(doc.xpath('/trx/invoice/text()')[0])
total = int(doc.xpath('/trx/total/text()')[0])

#initialize a few lists:
levels = [] #we'll need this to determine programmatically how many levels deep the xml is
codes = [] #collect the codes
tiers = [] #create rows for each tier

#next - how many levels deep is the xml? Not easy to find out:
for e in doc.iter('item'):
    path = tree.getpath(e)
    codes.append(e.xpath('./code/text()')[0]) #the code of this item
    #reduce a path such as /trx/item[2]/items/item[1] to the positions "2 1"
    #(presumably nested items sit in an <items> wrapper, which is left as 's' once 'item' is removed)
    tier = path.replace('/trx/','').replace('item','').replace('/s/',' ').replace('[','').replace(']','')
    tiers.append(tier.split(' '))
    levels.append(path.count('[')) #we now have the depth of each tier

#the length of each tier is a function of its level; so we pad each list to the highest level number (3 in this example):
max_level = max(levels)
tiers = [[*tier, *["0"] * (max_level - len(tier))] for tier in tiers]
#so all that work with counting levels was just to use this max(levels) value once...

#we now insert the other info you require at the front of each row:
for t,c in zip(tiers,codes):
    t.insert(0,c)
    t.insert(0,inv)
    t.insert(0,total)

#With all this prep out of the way, we get to the dataframe at last:
ids = list(range(1, len(tiers)+1)) #this is for the additional column you require
columns = ["total","invoice","code","A","B","C"]
df = pd.DataFrame(tiers,columns=columns)
df.insert(2, 'item_id', ids) #insert the extra column
print(df)

Output:

   total   invoice  item_id  code  A  B  C
0     52  27844173        1   110  1  0  0
1     52  27844173        2   304  2  0  0
2     52  27844173        3    54  2  1  0
3     52  27844173        4   174  2  1  1
4     52  27844173        5   600  2  1  2
5     52  27844173        6   478  2  2  0
6     52  27844173        7   810  2  3  0
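
If you would rather stay with xml.etree.ElementTree, it helps to see why B and C go wrong in the question: `it.iter('item')` visits the element itself and then every nested `item` at any depth, in document order, so layers 2 and 3 are flattened into one sequence and `b` simply counts all of them. Below is a minimal sketch of a recursive alternative that numbers siblings separately at each layer. The original XML is not shown, so the `<items>` wrapper and the sample document are assumptions, and `SAMPLE_XML`, `walk` and `build_rows` are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real XML is not shown, so the <items> wrapper is an assumption.
SAMPLE_XML = """
<trx>
  <invoice>27844173</invoice>
  <total>52</total>
  <item><code>110</code></item>
  <item>
    <code>304</code>
    <items>
      <item>
        <code>54</code>
        <items>
          <item><code>174</code></item>
          <item><code>600</code></item>
        </items>
      </item>
      <item><code>478</code></item>
      <item><code>810</code></item>
    </items>
  </item>
</trx>
"""


def walk(parent, prefix, rows, depth=3):
    """Append [code, A, B, C] for each <item>, numbering siblings from 1 per layer."""
    # findall('item') only returns direct children, unlike iter('item').
    for position, item in enumerate(parent.findall('item'), start=1):
        ids = prefix + [position]
        # Pad with zeros so a row has `depth` layer columns (no padding if nested deeper).
        rows.append([item.findtext('code')] + ids + [0] * (depth - len(ids)))
        nested = item.find('items')  # assumed wrapper around the next layer
        if nested is not None:
            walk(nested, ids, rows, depth)


def build_rows(xml_text):
    """Return rows shaped as [invoice, total, code, item_id, A, B, C]."""
    trx = ET.fromstring(xml_text)
    rows = []
    walk(trx, [], rows)
    invoice = trx.findtext('invoice')
    total = trx.findtext('total')
    return [[invoice, total, code, item_id, *abc]
            for item_id, (code, *abc) in enumerate(rows, start=1)]


rows = build_rows(SAMPLE_XML)
```

The rows can then be passed to `pd.DataFrame(rows, columns=['invoice','total','code','item_id','A','B','C'])`. Recursion keeps one counter per layer for free, which is the part that is awkward to do with nested `for` loops when the depth is not fixed.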
