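python - Creating unique IDs for nested XML using ElementTree
Problem description
I have the nested XML below. In the image below:
- The code highlighted in yellow is the 1st layer
- The code highlighted in blue is the 2nd layer
- The code highlighted in red is the 3rd layer
The XML data is as follows: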
<trx><invoice>27844173</invoice><total>52</total><item><code>110</code></item><item><code>304</code><items><item><code>54</code><items><item><code>174</code></item><item><code>600</code></item></items></item><item><code>478</code></item><item><code>810</code></item></items></item></trx>
My task is to create unique IDs for all 3 layers. Below is the code I wrote.
import pandas as pd
import xml.etree.ElementTree as ET

# raw string so the backslashes in the Windows path are not read as escapes
xml_file_path = r'C:\Desktop\data.xml'
tree = ET.parse(xml_file_path)
root = tree.getroot()
sub_item_id = 0
cols = ['invoice','total','code','item_id','A','B','C']
dict_xml = {}
data = []
for trx in root.iter('trx'):
    invoice = trx.find('invoice').text
    total = trx.find('total').text
    item_id = 0
    a = 0
    for it in trx.findall('item'):
        a += 1
        b = -1
        # iter() visits the item itself and every nested item, at any depth
        for j in it.iter('item'):
            b += 1
            c = 0
            code = j.find('code').text
            item_id += 1
            data.append({"invoice":invoice,"total":total,"code":code,
                         "item_id":item_id,"A":a,"B":b,"C":c})
data = pd.DataFrame(data)
data
I get the output below. Column A is correct, but B and C are not.
+---+----------+-------+------+---------+---+---+---+
| | invoice | total | code | item_id | A | B | C |
+---+----------+-------+------+---------+---+---+---+
| 0 | 27844173 | 52 | 110 | 1 | 1 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 1 | 27844173 | 52 | 304 | 2 | 2 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 2 | 27844173 | 52 | 54 | 3 | 2 | 1 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 3 | 27844173 | 52 | 174 | 4 | 2 | 2 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 4 | 27844173 | 52 | 600 | 5 | 2 | 3 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 5 | 27844173 | 52 | 478 | 6 | 2 | 4 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 6 | 27844173 | 52 | 810 | 7 | 2 | 5 | 0 |
+---+----------+-------+------+---------+---+---+---+
My expected result is as follows.
+---+----------+-------+------+---------+---+---+---+
| | invoice | total | code | item_id | A | B | C |
+---+----------+-------+------+---------+---+---+---+
| 0 | 27844173 | 52 | 110 | 1 | 1 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 1 | 27844173 | 52 | 304 | 2 | 2 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 2 | 27844173 | 52 | 54 | 3 | 2 | 1 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 3 | 27844173 | 52 | 174 | 4 | 2 | 1 | 1 |
+---+----------+-------+------+---------+---+---+---+
| 4 | 27844173 | 52 | 600 | 5 | 2 | 1 | 2 |
+---+----------+-------+------+---------+---+---+---+
| 5 | 27844173 | 52 | 478 | 6 | 2 | 2 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 6 | 27844173 | 52 | 810 | 7 | 2 | 3 | 0 |
+---+----------+-------+------+---------+---+---+---+
How and where should I increment the B and C variables to get the desired output?
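Solution
First, a preliminary observation: you are using xml.etree, but I prefer the lxml library because it has better XPath support. If you feel it is necessary, you can try converting the code to xml.etree.
There may be shorter ways to do this, but for now let's use the following approach, which I'll explain along the way: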
import pandas as pd
from lxml import etree

stuff = """[your xml above]"""
doc = etree.XML(stuff.encode())
tree = etree.ElementTree(doc)

# first off, get the invoice number and total as integers
inv = int(doc.xpath('/trx/invoice/text()')[0])
total = int(doc.xpath('/trx/total/text()')[0])

# initialize a few lists:
levels = []  # we'll need this to determine programmatically how many levels deep the xml is
codes = []   # collect the codes
tiers = []   # create rows for each tier

# next - how many levels deep is the xml? Not easy to find out:
for e in doc.iter('item'):
    path = tree.getpath(e)  # e.g. /trx/item[2]/items/item[1]
    tier = path.replace('/trx/','').replace('item','').replace('/s/',' ').replace('[','').replace(']','')
    tiers.append(tier.split(' '))
    codes.append(e.xpath('./code/text()')[0])
    levels.append(path.count('['))  # we now have the depth of each tier

# the length of each tier is a function of its level; so we pad the length of that list to the highest level number (3 in this example):
for i, tier in enumerate(tiers):
    tiers[i] = [*tier, *["0"] * (max(levels) - len(tier))]
# so all that work with counting levels was just to use this max(levels) variable once...

# we now insert the other info you require in each row:
for t, c in zip(tiers, codes):
    t.insert(0, c)
    t.insert(0, inv)
    t.insert(0, total)

# With all this prep out of the way, we get to the dataframe at last:
ids = list(range(1, len(tiers) + 1))  # this is for the additional column you require
columns = ["total", "invoice", "code", "A", "B", "C"]
df = pd.DataFrame(tiers, columns=columns)
df.insert(2, 'item_id', ids)  # insert the extra column
df
Output:
   total   invoice  item_id code  A  B  C
0     52  27844173        1  110  1  0  0
1     52  27844173        2  304  2  0  0
2     52  27844173        3   54  2  1  0
3     52  27844173        4  174  2  1  1
4     52  27844173        5  600  2  1  2
5     52  27844173        6  478  2  2  0
6     52  27844173        7  810  2  3  0
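If you would rather stay with xml.etree, the same A/B/C numbering can probably be obtained with a small recursive walk instead of parsing paths. The following is only a minimal sketch, not a drop-in replacement: walk_items is a hypothetical helper name I made up, the XML is the sample from the question pasted inline, and it assumes nested items always sit inside an <items> wrapper as they do in that sample.

```python
import xml.etree.ElementTree as ET

# The sample XML from the question, split across lines for readability.
xml_data = (
    "<trx><invoice>27844173</invoice><total>52</total>"
    "<item><code>110</code></item>"
    "<item><code>304</code><items>"
    "<item><code>54</code><items>"
    "<item><code>174</code></item><item><code>600</code></item>"
    "</items></item>"
    "<item><code>478</code></item><item><code>810</code></item>"
    "</items></item></trx>"
)


def walk_items(parent, prefix=()):
    """Yield (code, path) for every <item> under parent, in document order.

    path is a tuple of 1-based positions, one entry per nesting level.
    Hypothetical helper; assumes children live in an <items> wrapper.
    """
    # findall() only looks at direct children, so each level is counted separately
    for position, item in enumerate(parent.findall("item"), start=1):
        path = prefix + (position,)
        yield item.findtext("code"), path
        nested = item.find("items")
        if nested is not None:
            yield from walk_items(nested, path)


root = ET.fromstring(xml_data)
invoice = root.findtext("invoice")
total = root.findtext("total")

found = list(walk_items(root))
depth = max(len(path) for _, path in found)  # deepest nesting level seen

rows = []
for item_id, (code, path) in enumerate(found, start=1):
    padded = path + (0,) * (depth - len(path))  # fill unused levels with 0
    rows.append((invoice, total, code, item_id) + padded)

for row in rows:
    print(row)
```

The key difference from the code in the question is that findall('item') only counts direct children, whereas iter('item') flattens every depth into one counter. With three levels, as here, the rows can then be passed to pd.DataFrame(rows, columns=['invoice','total','code','item_id','A','B','C']).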