How to convert certain lines of an XML file to CSV

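Problem description

I am currently trying to convert a few thousand XML files to CSV so that I can do some simpler data work. I am trying to convert just one of them first to make sure it works, and then I can loop over the rest.

I found a nice tutorial online and have been able to figure out most of it. My XML file looks like this: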


<?xml version="1.0" encoding="UTF-8"?>
<orbit id="14737">
    <frame>
        <time>2015-08-15T05:28:39.014</time>
        <sza>113.48 deg</sza>
        <alt>1552 km</alt>
        <lat>-66.96 deg</lat>
        <lon>196.11 deg</lon>
        <x>-0.58 Rm</x>
        <rho>1.33 Rm</rho>
        <hperiod>0</hperiod>
        <hperiodquality>0</hperiodquality>
        <vperiod delaytime="167.443 μs">0</vperiod>
        <vperiodquality>0</vperiodquality>
        <cutoff>0</cutoff>
        <ionospheretrace delaytime="167.443 μs"/>
        <maxfreqquality>0</maxfreqquality>
        <groundtrace delaytime="167.443 μs"/>
    </frame>
...

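And of course it goes on like this.

My problem is with lines such as the ionospheretrace delaytime one, which do not follow the general format of the rest of the XML file.

My Python code looks like this: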

import xml.etree.ElementTree as ET
import csv

tree = ET.parse("14737.xml")
root = tree.getroot()

# open a file for writing

Orbit_data = open('/csv/14737', 'w', newline='')

# create the csv writer object

csvwriter = csv.writer(Orbit_data)
orbit_head = []

orbit_head.append('time')            
orbit_head.append('sza')
orbit_head.append('alt')
orbit_head.append('lat')
orbit_head.append('lon')
orbit_head.append('x')
orbit_head.append('rho')
orbit_head.append('hperiod')
orbit_head.append('hperiodquality')
orbit_head.append('vperiod')
orbit_head.append('vperiodquality')
orbit_head.append('cutoff')
orbit_head.append('ionospheretrace delaytime')
orbit_head.append('maxfreqquality')
orbit_head.append('groundtrace delaytime')

csvwriter.writerow(orbit_head)


for member in root.findall('frame'):
    frame = []

    time = member.find('time').text
    frame.append(time)
    sza = member.find('sza').text
    frame.append(sza)
    alt = member.find('alt').text
    frame.append(alt)

    lat = member.find('lat').text
    frame.append(lat)
    lon = member.find('lon').text
    frame.append(lon)
    x = member.find('x').text
    frame.append(x)
    rho = member.find('rho').text
    frame.append(rho)
    hperiod = member.find('hperiod').text
    frame.append(hperiod)
    hperiodquality = member.find('hperiodquality').text
    frame.append(hperiodquality)

    vperiod = member.find('vperiod').text
    frame.append(vperiod)
    vperiodquality = member.find('vperiodquality').text
    frame.append(vperiodquality)
    cutoff = member.find('cutoff').text
    frame.append(cutoff)
    # the two delaytime lookups below are the ones I cannot get to work
    ionospheretrace_delaytime = member.find('ionospheretrace delaytime').text
    frame.append(ionospheretrace_delaytime)
    maxfreqquality = member.find('maxfreqquality').text
    frame.append(maxfreqquality)
    groundtrace_delaytime = member.find('groundtrace delaytime').text
    frame.append(groundtrace_delaytime)

    csvwriter.writerow(frame)
Orbit_data.close()

What I would like is to store the delay time somehow, but I am not sure how.

Thanks!

Tags: python, xml, parsing, data-conversion

Solution


Below is a generic way to collect the data.

The idea is to mark the "special" tags (the ones where we need to use the attribute value).
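
To make the distinction concrete, here is a minimal sketch of how ElementTree exposes element text versus attribute values. The inline `<frame>` snippet is a cut-down stand-in for the question's file, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# A cut-down stand-in for one <frame> from the question's file.
frame = ET.fromstring(
    '<frame>'
    '<alt>1552 km</alt>'
    '<vperiod delaytime="167.443 μs">0</vperiod>'
    '<ionospheretrace delaytime="167.443 μs"/>'
    '</frame>'
)

alt = frame.find('alt')
vperiod = frame.find('vperiod')
trace = frame.find('ionospheretrace')

alt_text = alt.text                       # '1552 km': text between the tags
vperiod_text = vperiod.text               # '0': this element has text...
vperiod_delay = vperiod.get('delaytime')  # ...and also an attribute
trace_text = trace.text                   # None: a self-closing element has no text
trace_delay = trace.get('delaytime')      # the value lives only in the attribute
```

Note that `find()` takes the tag name only; `delaytime` is read from the element that `find()` returns, not passed as part of the tag name.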

I skipped the CSV generation, since your main challenge is how to extract the data from the XML.

import xml.etree.ElementTree as ET

ATTRIBUTE_BASED_ELEMENTS = ['ionospheretrace', 'vperiod', 'groundtrace']

tree = ET.parse('56116141.xml')
root = tree.getroot()

data = []

for frame in root.findall('.//frame'):
    one_frame = []
    for child in list(frame):
        if child.tag in ATTRIBUTE_BASED_ELEMENTS:
            one_frame.append(child.attrib['delaytime'])
        else:
            one_frame.append(child.text)
    data.append(one_frame)

for frame in data:
    print(frame)

56116141.xml

<?xml version="1.0" encoding="UTF-8"?>
<orbit id="14737">
    <frame>
        <time>2015-08-15T05:28:39.014</time>
        <sza>113.48 deg</sza>
        <alt>1552 km</alt>
        <lat>-66.96 deg</lat>
        <lon>196.11 deg</lon>
        <x>-0.58 Rm</x>
        <rho>1.33 Rm</rho>
        <hperiod>0</hperiod>
        <hperiodquality>0</hperiodquality>
        <vperiod delaytime="167.443 μs">0</vperiod>
        <vperiodquality>0</vperiodquality>
        <cutoff>0</cutoff>
        <ionospheretrace delaytime="167.443 μs"/>
        <maxfreqquality>0</maxfreqquality>
        <groundtrace delaytime="167.443 μs"/>
    </frame>
    <frame>
        <time>2016-08-15T05:28:39.014</time>
        <sza>113.42 deg</sza>
        <alt>1553 km</alt>
        <lat>-66.16 deg</lat>
        <lon>196.41 deg</lon>
        <x>-0.56 Rm</x>
        <rho>1.39 Rm</rho>
        <hperiod>1</hperiod>
        <hperiodquality>1</hperiodquality>
        <vperiod delaytime="107.443 μs">0</vperiod>
        <vperiodquality>1</vperiodquality>
        <cutoff>1</cutoff>
        <ionospheretrace delaytime="167.343 μs"/>
        <maxfreqquality>1</maxfreqquality>
        <groundtrace delaytime="967.443 μs"/>
    </frame>
</orbit>   

Output

['2015-08-15T05:28:39.014', '113.48 deg', '1552 km', '-66.96 deg', '196.11 deg', '-0.58 Rm', '1.33 Rm', '0', '0', '167.443 μs', '0', '0', '167.443 μs', '0', '167.443 μs']
['2016-08-15T05:28:39.014', '113.42 deg', '1553 km', '-66.16 deg', '196.41 deg', '-0.56 Rm', '1.39 Rm', '1', '1', '107.443 μs', '1', '1', '167.343 μs', '1', '967.443 μs']
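
The CSV step that was skipped could look roughly like the sketch below. It reuses the extraction loop from above; `HEADER`, `frames_to_rows`, `xml_to_csv` and `convert_all` are hypothetical names introduced here, and the column label `vperiod delaytime` is an assumption that reflects the fact that the attribute, not the text, is stored for `vperiod`.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

ATTRIBUTE_BASED_ELEMENTS = ['ionospheretrace', 'vperiod', 'groundtrace']

# Assumed column labels, in the order the children appear inside each <frame>.
HEADER = ['time', 'sza', 'alt', 'lat', 'lon', 'x', 'rho', 'hperiod',
          'hperiodquality', 'vperiod delaytime', 'vperiodquality', 'cutoff',
          'ionospheretrace delaytime', 'maxfreqquality',
          'groundtrace delaytime']


def frames_to_rows(root):
    """Return one list of values per <frame>, using the same rule as above."""
    rows = []
    for frame in root.findall('.//frame'):
        row = []
        for child in frame:
            if child.tag in ATTRIBUTE_BASED_ELEMENTS:
                # .get() returns None instead of raising if the attribute is missing
                row.append(child.get('delaytime'))
            else:
                row.append(child.text)
        rows.append(row)
    return rows


def xml_to_csv(xml_path, csv_path):
    """Hypothetical helper: convert one orbit XML file into one CSV file."""
    root = ET.parse(xml_path).getroot()
    # newline='' follows the csv module's recommendation for writer files
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(HEADER)
        writer.writerows(frames_to_rows(root))


def convert_all(xml_dir, csv_dir):
    """Hypothetical batch loop: write one CSV per *.xml file in xml_dir."""
    csv_dir = Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)
    for xml_file in sorted(Path(xml_dir).glob('*.xml')):
        xml_to_csv(xml_file, csv_dir / (xml_file.stem + '.csv'))
```

For a single file this would be called as `xml_to_csv('56116141.xml', '56116141.csv')`. For the few thousand files mentioned in the question, `convert_all` takes an input folder and an output folder, both of which you would choose yourself.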
