首页 > 解决方案 > 使用python将复杂的xml转换为csv

问题描述

我有一个复杂的 XML 文件,如下所示:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE population SYSTEM "http://www.matsim.org/files/dtd/population_v6.dtd">
    <population>
    <person id="0">
      <plan selected="yes">
        <activity type="Home" x="324024" y="5814595" end_time="08:36:00"/>
        <leg mode="walk"/>
        <activity type="Study" x="323876" y="5813439" end_time="15:37:00"/>
        <leg mode="pt"/>
        <activity type="Pickup/Dropoff/Deliver" x="329427.1" y="5799654.6" end_time="15:53:00"/>
        <leg mode="walk"/>
        <activity type="Shop" x="329427.1" y="5799654.6" end_time="16:09:00"/>
        <leg mode="walk"/>
        <activity type="Personal" x="328992.7" y="5800844.4" end_time="21:01:30"/>
        <leg mode="walk"/>
        <activity type="Pickup/Dropoff/Deliver" x="329298" y="5800342" end_time="21:13:30"/>
        <leg mode="walk"/>
        <activity type="Social/Recreational" x="329298" y="5800342" end_time="21:26:00"/>
        <leg mode="walk"/>
        <activity type="Personal" x="328786" y="5802157" end_time="21:29:00"/>
        <leg mode="walk"/>
        <activity type="Pickup/Dropoff/Deliver" x="328853" y="5802718" end_time="22:33:30"/>
        <leg mode="pt"/>
        <activity type="Personal" x="325268" y="5804994" end_time="22:47:30"/>
        <leg mode="pt"/>
        <activity type="Home" x="324024" y="5814595" end_time="23:40:00"/>
      </plan>
    </person>
    <person id="1">
      <plan selected="yes">
        <activity type="Home" x="323496" y="5814324" end_time="04:27:00"/>
        <leg mode="pt"/>
        <activity type="Work" x="330430" y="5794577" end_time="12:41:00"/>
        <leg mode="walk"/>
        <activity type="Personal" x="330050.4" y="5794864.2" end_time="14:12:00"/>
        <leg mode="walk"/>
        <activity type="Social/Recreational" x="329466" y="5795165.8" end_time="15:50:00"/>
        <leg mode="walk"/>
        <activity type="Other" x="329757.4" y="5795363.3" end_time="21:22:00"/>
        <leg mode="walk"/>
        <activity type="Pickup/Dropoff/Deliver" x="329870.2" y="5795125.6" end_time="22:46:00"/>
        <leg mode="walk"/>
        <activity type="Social/Recreational" x="332001.9" y="5792129.9" end_time="23:08:00"/>
        <leg mode="pt"/>
        <activity type="Home" x="323496" y="5814324" end_time="23:56:00"/>
      </plan>
    </person>
    </population>

我想将其转换为 CSV 并使用 python 获取这些字段:
在此处输入图像描述

你知道我该怎么做吗?

谢谢您的帮助

标签: pythoncsv

解决方案


The way the XML file is structured makes this quite complicated and long-winded as far as I can see at a glance.

One problem is the <leg> elem which is not part of the <activity> elem but its mode property needs to be included in the CSV row.

But, as per the documentation, you could try something like this:

import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')    # change this to the name of your XML file (in the same driectory /or absolute path)
root = tree.getroot()

data_dict = {}

for elem in root.iter('person'):
    for child in elem:
        if child.attrib['selected'] == 'yes':
            data_dict[elem.attrib['id']] = []
            for i in range(len(child)):
                data_dict[elem.attrib['id']].append(child[i].attrib)

print(data_dict)

# {'0': [{'type': 'Home', 'x': '324024', 'y': '5814595', 'end_time': '08:36:00'}, {'mode': 'walk'}, {'type': 'Study', 'x': '323876', 'y': '5813439', 'end_time': '15:37:00'}, {'mode': 'pt'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '329427.1', 'y': '5799654.6', 'end_time': '15:53:00'}, {'mode': 'walk'}, {'type': 'Shop', 'x': '329427.1', 'y': '5799654.6', 'end_time': '16:09:00'}, {'mode': 'walk'}, {'type': 'Personal', 'x': '328992.7', 'y': '5800844.4', 'end_time': '21:01:30'}, {'mode': 'walk'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '329298', 'y': '5800342', 'end_time': '21:13:30'}, {'mode': 'walk'}, {'type': 'Social/Recreational', 'x': '329298', 'y': '5800342', 'end_time': '21:26:00'}, {'mode': 'walk'}, {'type': 'Personal', 'x': '328786', 'y': '5802157', 'end_time': '21:29:00'}, {'mode': 'walk'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '328853', 'y': '5802718', 'end_time': '22:33:30'}, {'mode': 'pt'}, {'type': 'Personal', 'x': '325268', 'y': '5804994', 'end_time': '22:47:30'}, {'mode': 'pt'}, {'type': 'Home', 'x': '324024', 'y': '5814595', 'end_time': '23:40:00'}], '1': [{'type': 'Home', 'x': '323496', 'y': '5814324', 'end_time': '04:27:00'}, {'mode': 'pt'}, {'type': 'Work', 'x': '330430', 'y': '5794577', 'end_time': '12:41:00'}, {'mode': 'walk'}, {'type': 'Personal', 'x': '330050.4', 'y': '5794864.2', 'end_time': '14:12:00'}, {'mode': 'walk'}, {'type': 'Social/Recreational', 'x': '329466', 'y': '5795165.8', 'end_time': '15:50:00'}, {'mode': 'walk'}, {'type': 'Other', 'x': '329757.4', 'y': '5795363.3', 'end_time': '21:22:00'}, {'mode': 'walk'}, {'type': 'Pickup/Dropoff/Deliver', 'x': '329870.2', 'y': '5795125.6', 'end_time': '22:46:00'}, {'mode': 'walk'}, {'type': 'Social/Recreational', 'x': '332001.9', 'y': '5792129.9', 'end_time': '23:08:00'}, {'mode': 'pt'}, {'type': 'Home', 'x': '323496', 'y': '5814324', 'end_time': '23:56:00'}]}

You can see now though the problem with having leg/mode in a different element. Everything else is nicely grouped with x, y, end time etc. whereas mode is alone in a separate dictionary. It is not impossible now to loop through the list of dictionaries under each id, check their length, and write to the row/column accordingly, but I wonder if it is really necessary? It might just be possible to include the mode property inside the <activity> element.

If the activity element looked liked this:

<activity type="Home" x="324024" y="5814595" end_time="08:36:00" mode="walk"/>

...it would be so much easier.

However, if this is not an option (or even if you had a mix of the above 2 strategies), you can try this next:

fields = ['person_id', 'leg_mode', 'type', 'x', 'y', 'end_time']

rows = []
for person in data_dict:
    for activity in data_dict[person]:
        if len(activity) > 1:
            row = [person,
                   (activity.get('mode')
                    if activity.get('mode')
                    else data_dict[person][data_dict[person].index(activity) + 1]['mode']
                   if data_dict[person].index(activity) + 1 < len(data_dict[person])
                      and 'mode' in data_dict[person][data_dict[person].index(activity) + 1]
                   else ''),    # this is the line that is particularly complicated due to the structure of the XML data.
                   activity['type'],
                   activity['x'],
                   activity['y'],
                   activity['end_time']]
            rows.append(row)

with open('csv.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(rows)

That will give you the csv file you are looking for.

For clarity, the complete script is as follows:

import csv
import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

data_dict = {}

for elem in root.iter('person'):
    for child in elem:
        if child.attrib['selected'] == 'yes':
            data_dict[elem.attrib['id']] = []
            for i in range(len(child)):
                data_dict[elem.attrib['id']].append(child[i].attrib)

fields = ['person_id', 'leg_mode', 'type', 'x', 'y', 'end_time']

rows = []

for person in data_dict:
    for activity in data_dict[person]:
        if len(activity) > 1:
            row = [person,
                   (activity.get('mode')
                    if activity.get('mode')
                    else data_dict[person][data_dict[person].index(activity) + 1]['mode']
                   if data_dict[person].index(activity) + 1 < len(data_dict[person])
                      and 'mode' in data_dict[person][data_dict[person].index(activity) + 1]
                   else ''),
                   activity['type'],
                   activity['x'],
                   activity['y'],
                   activity['end_time']]
            rows.append(row)

with open('csv.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(rows)


推荐阅读