首页 > 解决方案 > 在python中将xml数据转换为pandas数据框

问题描述

我想把下面的数据变成一个数据框,de xml文件是这样的:


<?xml version="1.0" encoding="utf-8"?>
<file>
  <SORT_INFO>
    <sort_type>sort order</sort_type>
  </SORT_INFO>
  <ALL_INSTANCES>
    <instance>
      <ID>1</ID>
      <start>11.24</start>
      <end>31.24</end>
      <code>19. Luis Alehandru Paz Mulato</code>
      <label>
        <group>Team</group>
        <text>America de Cali</text>
      </label>
      <label>
        <group>Action</group>
        <text>Passes accurate</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>52.6</pos_x>
      <pos_y>34.2</pos_y>
    </instance>
    <instance>
      <ID>2</ID>
      <start>11.24</start>
      <end>31.24</end>
      <code>19. Luis Alehandru Paz Mulato</code>
      <label>
        <group>Team</group>
        <text>America de Cali</text>
      </label>
      <label>
        <group>Action</group>
        <text>Positional attacks</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>52.6</pos_x>
      <pos_y>34.2</pos_y>
    </instance>
    <instance>
      <ID>3</ID>
      <start>14.43</start>
      <end>34.43</end>
      <code>6. Felipe Jaramillo Velasquez</code>
      <label>
        <group>Team</group>
        <text>America de Cali</text>
      </label>
      <label>
        <group>Action</group>
        <text>Positional attacks</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>38.6</pos_x>
      <pos_y>26.2</pos_y>
    </instance>
  </ALL_INSTANCES>

在 </ALL_INSTANCES> 之后还有一点点,但我只需要“实例”标签内的数据,就好像那些是行一样。所以我的专栏如下:

['ID', 'start', 'end', 'code', 'group1', 'text1', 'group2', 'text2', 'group3', 'text3', 'pos_x','pos_y']

第一行是:

['1', '11.24', '31.24', '19. Luis Alehandru Paz Mulato', 'Team', 'America de Cali', 'Action', 'Passes accurate', 'Half', '1st half', '52.6','34.2']

这是我第一次使用 xml 文件,非常感谢您的帮助。

我已经尝试了以下方法:

import xml.etree.ElementTree as et 

xtree = et.parse("players_vs_nacional_0-3.xml")
xroot = xtree.getroot()
    

for node in xroot:         
    id_action = node.find("ID").text   
    start = node.find("start").text 
    end = node.find("end").text 
    code = node.find("code").text 
    posx = node.find("pos_x").text 
    posy = node.find("pos_y").text 

但我得到了错误:

AttributeError:“NoneType”对象没有属性“文本”

我相信这是因为第一个节点不包含我正在寻找的内容(“ID”、“start”、“end”等),但我无法通过它,而且我也很难弄清楚了解如何从标签中获取信息,因为每个标签都有三个,我需要它们都在同一行。

标签: pythonxmlpandasdataframe

解决方案


我相信你可能正在寻找这样的东西:

from lxml import etree
import pandas as pd
inst = """[your xml above, corrected; it's missing a closing </file> tag]"""
doc = etree.XML(inst.encode())

rows =[]
targets = doc.xpath('//instance')
columns = [elem.tag for elem in targets[0].xpath('.//*') if len(elem.getchildren())==0 ]

for target in targets:
    row = []
    for item in target.xpath('.//*'):            
        if len(item.getchildren())>0:
            continue
        else:
            row.append(item.text.strip())
    rows.append(row)
df = pd.DataFrame(rows,columns=columns)
df

输出:

ID  start   end     code    group   text    group   text    group   text    pos_x   pos_y
0   1   11.24   31.24   19. Luis Alehandru Paz Mulato   Team    America de Cali     Action  Passes accurate     Half    1st half    52.6    34.2
1   2   11.24   31.24   19. Luis Alehandru Paz Mulato   Team    America de Cali     Action  Positional attacks  Half    1st half    52.6    34.2
2   3   14.43   34.43   6. Felipe Jaramillo Velasquez   Team    America de Cali     Action  Positional attacks  Half    1st half    38.6    26.2

推荐阅读