python - 使用python将复杂的xml转换为csv
问题描述
我有一个复杂的 XML 文件,如下所示:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE population SYSTEM "http://www.matsim.org/files/dtd/population_v6.dtd">
<population>
<person id="0">
<plan selected="yes">
<activity type="Home" x="324024" y="5814595" end_time="08:36:00"/>
<leg mode="walk"/>
<activity type="Study" x="323876" y="5813439" end_time="15:37:00"/>
<leg mode="pt"/>
<activity type="Pickup/Dropoff/Deliver" x="329427.1" y="5799654.6" end_time="15:53:00"/>
<leg mode="walk"/>
<activity type="Shop" x="329427.1" y="5799654.6" end_time="16:09:00"/>
<leg mode="walk"/>
<activity type="Personal" x="328992.7" y="5800844.4" end_time="21:01:30"/>
<leg mode="walk"/>
<activity type="Pickup/Dropoff/Deliver" x="329298" y="5800342" end_time="21:13:30"/>
<leg mode="walk"/>
<activity type="Social/Recreational" x="329298" y="5800342" end_time="21:26:00"/>
<leg mode="walk"/>
<activity type="Personal" x="328786" y="5802157" end_time="21:29:00"/>
<leg mode="walk"/>
<activity type="Pickup/Dropoff/Deliver" x="328853" y="5802718" end_time="22:33:30"/>
<leg mode="pt"/>
<activity type="Personal" x="325268" y="5804994" end_time="22:47:30"/>
<leg mode="pt"/>
<activity type="Home" x="324024" y="5814595" end_time="23:40:00"/>
</plan>
</person>
<person id="1">
<plan selected="yes">
<activity type="Home" x="323496" y="5814324" end_time="04:27:00"/>
<leg mode="pt"/>
<activity type="Work" x="330430" y="5794577" end_time="12:41:00"/>
<leg mode="walk"/>
<activity type="Personal" x="330050.4" y="5794864.2" end_time="14:12:00"/>
<leg mode="walk"/>
<activity type="Social/Recreational" x="329466" y="5795165.8" end_time="15:50:00"/>
<leg mode="walk"/>
<activity type="Other" x="329757.4" y="5795363.3" end_time="21:22:00"/>
<leg mode="walk"/>
<activity type="Pickup/Dropoff/Deliver" x="329870.2" y="5795125.6" end_time="22:46:00"/>
<leg mode="walk"/>
<activity type="Social/Recreational" x="332001.9" y="5792129.9" end_time="23:08:00"/>
<leg mode="pt"/>
<activity type="Home" x="323496" y="5814324" end_time="23:56:00"/>
</plan>
</person>
</population>
我想将其转换为 CSV 并使用 python 获取这些字段:
你知道我该怎么做吗?
谢谢您的帮助
解决方案
The way the XML file is structured makes this quite complicated and long-winded as far as I can see at a glance.
One problem is the <leg>
elem which is not part of the <activity>
elem but its mode
property needs to be included in the CSV row.
But, as per the documentation, you could try something like this:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml') # change this to the name of your XML file (in the same driectory /or absolute path)
root = tree.getroot()
data_dict = {}
for elem in root.iter('person'):
for child in elem:
if child.attrib['selected'] == 'yes':
data_dict[elem.attrib['id']] = []
for i in range(len(child)):
data_dict[elem.attrib['id']].append(child[i].attrib)
print(data_dict)
# {'0': [{'type': 'Home', 'x': '324024', 'y': '5814595', 'end_time': '08:36:00'}, {'mode': 'walk'}, {'type': 'Study', 'x': '323876', 'y': '5813439', 'end_time': '15:37:00'}, {'mode': 'pt'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '329427.1', 'y': '5799654.6', 'end_time': '15:53:00'}, {'mode': 'walk'}, {'type': 'Shop', 'x': '329427.1', 'y': '5799654.6', 'end_time': '16:09:00'}, {'mode': 'walk'}, {'type': 'Personal', 'x': '328992.7', 'y': '5800844.4', 'end_time': '21:01:30'}, {'mode': 'walk'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '329298', 'y': '5800342', 'end_time': '21:13:30'}, {'mode': 'walk'}, {'type': 'Social/Recreational', 'x': '329298', 'y': '5800342', 'end_time': '21:26:00'}, {'mode': 'walk'}, {'type': 'Personal', 'x': '328786', 'y': '5802157', 'end_time': '21:29:00'}, {'mode': 'walk'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '328853', 'y': '5802718', 'end_time': '22:33:30'}, {'mode': 'pt'}, {'type': 'Personal', 'x': '325268', 'y': '5804994', 'end_time': '22:47:30'}, {'mode': 'pt'}, {'type': 'Home', 'x': '324024', 'y': '5814595', 'end_time': '23:40:00'}], '1': [{'type': 'Home', 'x': '323496', 'y': '5814324', 'end_time': '04:27:00'}, {'mode': 'pt'}, {'type': 'Work', 'x': '330430', 'y': '5794577', 'end_time': '12:41:00'}, {'mode': 'walk'}, {'type': 'Personal', 'x': '330050.4', 'y': '5794864.2', 'end_time': '14:12:00'}, {'mode': 'walk'}, {'type': 'Social/Recreational', 'x': '329466', 'y': '5795165.8', 'end_time': '15:50:00'}, {'mode': 'walk'}, {'type': 'Other', 'x': '329757.4', 'y': '5795363.3', 'end_time': '21:22:00'}, {'mode': 'walk'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '329870.2', 'y': '5795125.6', 'end_time': '22:46:00'}, {'mode': 'walk'}, {'type': 'Social/Recreational', 'x': '332001.9', 'y': '5792129.9', 'end_time': '23:08:00'}, {'mode': 'pt'}, {'type': 'Home', 'x': '323496', 'y': '5814324', 'end_time': '23:56:00'}]}
You can see now though the problem with having leg
/mode
in a different element. Everything else is nicely grouped with x, y, end time etc. whereas mode
is alone in a separate dictionary. It is not impossible now to loop through the list of dictionaries under each id, check their length, and write to the row/column accordingly, but I wonder if it is really necessary? It might just be possible to include the mode
property inside the <activity>
element.
If the activity element looked liked this:
<activity type="Home" x="324024" y="5814595" end_time="08:36:00" mode="walk"/>
...it would be so much easier.
However, if this is not an option (or even if you had a mix of the above 2 strategies), you can try this next:
fields = ['person_id', 'leg_mode', 'type', 'x', 'y', 'end_time']
rows = []
for person in data_dict:
for activity in data_dict[person]:
if len(activity) > 1:
row = [person,
(activity.get('mode')
if activity.get('mode')
else data_dict[person][data_dict[person].index(activity) + 1]['mode']
if data_dict[person].index(activity) + 1 < len(data_dict[person])
and 'mode' in data_dict[person][data_dict[person].index(activity) + 1]
else ''), # this is the line that is particularly complicated due to the structure of the XML data.
activity['type'],
activity['x'],
activity['y'],
activity['end_time']]
rows.append(row)
with open('csv.csv', 'w') as f:
writer = csv.writer(f)
writer.writerow(fields)
writer.writerows(rows)
That will give you the csv file you are looking for.
For clarity, the complete script is as follows:
import csv
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
data_dict = {}
for elem in root.iter('person'):
for child in elem:
if child.attrib['selected'] == 'yes':
data_dict[elem.attrib['id']] = []
for i in range(len(child)):
data_dict[elem.attrib['id']].append(child[i].attrib)
fields = ['person_id', 'leg_mode', 'type', 'x', 'y', 'end_time']
rows = []
for person in data_dict:
for activity in data_dict[person]:
if len(activity) > 1:
row = [person,
(activity.get('mode')
if activity.get('mode')
else data_dict[person][data_dict[person].index(activity) + 1]['mode']
if data_dict[person].index(activity) + 1 < len(data_dict[person])
and 'mode' in data_dict[person][data_dict[person].index(activity) + 1]
else ''),
activity['type'],
activity['x'],
activity['y'],
activity['end_time']]
rows.append(row)
with open('csv.csv', 'w') as f:
writer = csv.writer(f)
writer.writerow(fields)
writer.writerows(rows)
推荐阅读
- r - 如何摆脱“案例和控件必须是数字的或有序的”。在 pROC 包中?
- python-3.x - 一种找到复杂算法时间复杂度的策略证明方法?
- javascript - easy-Autocomplete 实现问题:javascript 未执行且 JSON 未处理
- c# - Random() 创建相同的随机字母和数字
- python - 如何将信息从 Python 代码传递到 Web 信息显示?
- ios - 如何通过 iOS 中的键盘扩展更改文本字段中的字体输出
- c - 计算字符串中的一个字符 (C) - 问题
- vue.js - 用于 nuxt.js 构建的自定义 index.html
- javascript - useState 更新状态子数组对象不起作用
- android - 为什么网络蓝牙配对屏幕显示同一设备的多个条目?