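python-3.x - Python: Flatten XML to CSV with nested child tags
Question
I want to flatten multiple XML files and am looking for a generic function or logic to convert XML into a flat file. Most answers rely on hard-coded tags. The closest one is "Python: Flatten xml to csv with parent tag repeated in child", but it is still a hard-coded solution. For the following input XML: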
<root>
  <child> child-val </child>
  <child2> child2-val2 </child2>
  <anotherchild>
    <childid> another child 45</childid>
    <childname> another child name </childname>
  </anotherchild>
  <group>
    <groupid> groupid-123</groupid>
    <grouplist>
      <groupzone>
        <groupname>first </groupname>
        <groupsize> 4</groupsize>
      </groupzone>
      <groupzone>
        <groupname>second </groupname>
        <groupsize> 6</groupsize>
      </groupzone>
      <groupzone>
        <groupname> third </groupname>
        <groupsize> 8 </groupsize>
      </groupzone>
    </grouplist>
  </group>
  <secondgroup>
    <secondgroupid> secondgroupid-42 </secondgroupid>
    <secondgrouptitle> second group title </secondgrouptitle>
    <secondgrouplist>
      <secondgroupzone>
        <secondgroupsub>
          <secondsub>v1</secondsub>
          <secondsubid>12</secondsubid>
        </secondgroupsub>
        <secondgroupname> third </secondgroupname>
        <secondgroupsize> 4</secondgroupsize>
      </secondgroupzone>
      <secondgroupzone>
        <secondgroupsub>
          <secondsub>v2</secondsub>
          <secondsubid>1</secondsubid>
        </secondgroupsub>
        <secondgroupname>fourth </secondgroupname>
        <secondgroupsize> 6</secondgroupsize>
      </secondgroupzone>
      <secondgroupzone>
        <secondgroupsub>
          <secondsub>v3</secondsub>
          <secondsubid>45</secondsubid>
        </secondgroupsub>
        <secondgroupname> tenth </secondgroupname>
        <secondgroupsize> 10 </secondgroupsize>
      </secondgroupzone>
    </secondgrouplist>
  </secondgroup>
  <child3> val3 </child3>
</root>
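I tried the pandas-read-xml package and got most of the values, but the anotherchild tag values end up in a single column (anotherchild) instead of anotherchild|childid and anotherchild|childname. If possible, please suggest generic logic for converting XML into a flat file.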
import pandas_read_xml as pdx
df = pdx.read_xml(xml_content, ['root'])
fully_fatten_df = pdx.fully_flatten(df)
fully_fatten_df.to_csv("stack.csv", index=False)
Output CSV:
anotherchild,child,child2,child3,group|groupzone|groupname,group|groupzone|groupsize,secondgroup|secondgroupzone|secondgroupname,secondgroup|secondgroupzone|secondgroupsize,secondgroup|secondgroupzone|secondgroupsub|secondsub,secondgroup|secondgroupzone|secondgroupsub|secondsubid
,child-val,child2-val2,val3,,,third,4,v1,12
,child-val,child2-val2,val3,,,fourth,6,v2,1
,child-val,child2-val2,val3,,,tenth,10,v3,45
,child-val,child2-val2,val3,first,4,,,,
,child-val,child2-val2,val3,second,6,,,,
,child-val,child2-val2,val3,third,8,,,,
another child 45,child-val,child2-val2,val3,,,,,,
another child name,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
,child-val,child2-val2,val3,,,,,,
Answer
In general, the XML nodes that hold values should become the columns. From what I can see in your XML sample, "child", "child2", "childid" and so on should be columns.
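To make "nodes that hold values" concrete, here is a minimal sketch, assuming a shortened sample that is hypothetical and not the full XML from the question. The helper name leaf_paths is my own; it only lists each childless element together with its pipe-separated path, which is the information the columns are built from.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; not the full XML from the question.
SAMPLE = """
<root>
  <child> child-val </child>
  <anotherchild>
    <childid> another child 45</childid>
  </anotherchild>
  <group>
    <grouplist>
      <groupzone><groupname>first </groupname></groupzone>
      <groupzone><groupname>second </groupname></groupzone>
    </grouplist>
  </group>
</root>
"""

def leaf_paths(element, prefix=""):
    """Yield (path, text) for every element that has no child elements."""
    path = f"{prefix}|{element.tag}" if prefix else element.tag
    if len(element) == 0:
        # item.text can be None for an empty tag, so fall back to "".
        yield path, (element.text or "").strip()
    for child in element:
        yield from leaf_paths(child, path)

pairs = list(leaf_paths(ET.fromstring(SAMPLE)))
for path, text in pairs:
    print(path, "=", text)
```

A path that shows up more than once (groupname here) is a repeated element, and those are the ones that need extra rows in the flat file.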
Based on the XML above, I wrote this code, which should be generic enough to adapt to similar samples.
import pandas as pd
import xml.etree.ElementTree as Xet

def getData(root, rows, columns, rowcount, name=None):
    # Build the column prefix from the full path so that equally named tags
    # under different parents do not collapse into one column; for example,
    # a node called "name" could sit under both group and secondgroup.
    if name is not None:
        name = "{0}|{1}".format(name, root.tag)
    else:
        name = root.tag

    for item in root:
        if len(item) == 0:  # leaf node: it holds a value
            colName = "{0}|{1}".format(name, item.tag)
            # colName = item.tag  # uncomment to use the bare tag instead of the full path, e.g. root|group|grouplist|groupzone|groupsize
            text = (item.text or "").strip()  # item.text is None for empty tags
            if colName not in columns:
                columns.append(colName)  # save the column to a list
                rowcount.append(0)  # row on which this column's value is stored
                rows[0].update({colName: text})  # a first occurrence always goes on row 0
            else:
                repeatPosition = columns.index(colName)  # column position of the repeated item
                rowcount[repeatPosition] += 1  # move this column on to the next row
                if len(rows) <= max(rowcount):
                    rows.append({})  # add a new row when the counter runs past the end
                rows[rowcount[repeatPosition]].update({colName: text})  # add the value on the new row
        getData(item, rows, columns, rowcount, name)  # recurse to walk through each list of elements

xmlParse = Xet.parse('example.xml')
root = xmlParse.getroot()

rows = [{}]  # start with one row; more are added as repeated elements appear
columns = []  # holds the names of the columns
rowcount = []  # for each column, the index of the row that received its latest value
getData(root, rows, columns, rowcount)

df = pd.DataFrame(rows, columns=columns)
print(df)
df.to_csv('parse.csv')
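The row bookkeeping is the least obvious part of the code above, so here is a minimal sketch of just that idea, separated from the XML walking. It is a variation rather than the same code: the helper name place and the dict next_row are my own, standing in for the parallel columns and rowcount lists, and the input pairs are hypothetical.

```python
def place(rows, next_row, column, value):
    """Write value into the first row where this column is still empty."""
    index = next_row.get(column, 0)   # a column seen for the first time starts at row 0
    while len(rows) <= index:         # grow the table when a column repeats
        rows.append({})
    rows[index][column] = value
    next_row[column] = index + 1      # the next occurrence goes one row lower

rows, next_row = [], {}
# Hypothetical (column, value) pairs in document order.
for column, value in [
    ("root|child", "child-val"),
    ("root|group|groupname", "first"),
    ("root|group|groupname", "second"),
    ("root|child3", "val3"),
]:
    place(rows, next_row, column, value)

print(rows)
```

Note that values are lined up by how many times a column has occurred, not by which parent they came from. If one repeated group is missing an optional child, the values of later groups for that column would shift up by a row, so this approach fits best when every repeated group has the same children.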
Running this code produces the following result, shown here as plain CSV:
,root|child,root|child2,root|anotherchild|childid,root|anotherchild|childname,root|group|groupid,root|group|grouplist|groupzone|groupname,root|group|grouplist|groupzone|groupsize,root|secondgroup|secondgroupid,root|secondgroup|secondgrouptitle,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsub,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsub|secondsubid,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupname,root|secondgroup|secondgrouplist|secondgroupzone|secondgroupsize,root|child3
0,child-val,child2-val2,another child 45,another child name,groupid-123,first,4,secondgroupid-42,second group title,v1,12,third,4,val3
1,,,,,,second,6,,,v2,1,fourth,6,
2,,,,,,third,8,,,v3,45,tenth,10,
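In the result above, the non-repeating values appear only on row 0. If you would rather have them repeated on every row, one option is to fill empty cells downwards before writing the file. The following is a minimal sketch, assuming a blank cell always means "same as the row above"; the helper name fill_down is my own, and the rows are a hypothetical subset of the output shown above.

```python
import csv
import io

def fill_down(rows, columns):
    """Return copies of rows where an empty cell repeats the last value seen above it."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = {}
        for column in columns:
            value = row.get(column, "")
            if value == "":
                value = last_seen.get(column, "")  # reuse the value from above, if any
            else:
                last_seen[column] = value
            new_row[column] = value
        filled.append(new_row)
    return filled

# Hypothetical subset of the columns and rows shown above.
columns = ["root|child", "root|group|groupid", "root|group|grouplist|groupzone|groupname"]
rows = [
    {"root|child": "child-val", "root|group|groupid": "groupid-123",
     "root|group|grouplist|groupzone|groupname": "first"},
    {"root|group|grouplist|groupzone|groupname": "second"},
    {"root|group|grouplist|groupzone|groupname": "third"},
]

filled = fill_down(rows, columns)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()
writer.writerows(filled)
print(buffer.getvalue())
```

The same function should work on the rows and columns lists built by getData. If you already have the DataFrame, df.ffill() is the pandas counterpart. Either way, filling down would also copy values into columns that simply have fewer repeats than others, so check that this is what you want for your data.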
Hope this gets you started in the right direction.