Python - XML file to Pandas DataFrame

Question

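I'm new to Python and would appreciate some help converting an XML file to a Pandas DataFrame. I've searched other resources but am still stuck. I'd like all the fields between the tags placed into a single table. Any help is greatly appreciated. Thank you.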

Below is the code I tried, but it doesn't work properly.

import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('5249009-08-34-59-126029.xml', 'r').read()
root = ET.XML(xml_data)

data = []
cols = []
for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)

df = pd.DataFrame(data).T 
df.columns = cols 

print(df)

Here is the sample input data:

<?xml version="1.0"?>

<RECORDING>

<IDENT>0</IDENT>

<DEVICEID>133242232</DEVICEID>

<DEVICEALIAS>52232009</DEVICEALIAS>

<GROUP>1823481655</GROUP>

<GATE>1011655</GATE>

<ANI>7777777777</ANI>

<DNIS>777777777</DNIS>

<USER1>00:07:53.2322691,00:03:21.34232761</USER1>

<USER2>text</USER2>

<USER3/>

<USER4/>

<USER5>34fc0a8d-d5632c9b1</USER5>

<USER6>000dfsdf98701596638094</USER6>

<USER7>97</USER7>

<USER8>00701596638094</USER8>

<USER9>10155</USER9>

<USER10/>

<USER11/>

<USER12/>

<USER13>Text</USER13>

<USER14>4</USER14>

<USER15>10</USER15>

<CALLSEGMENTID/>

<CALLID>9870</CALLID>

<FILENAME>\\folderpath\folderpath\folderpath\folderpath\2020\Aug\05\5249009\52343109-234234-34-59-1234234029</FILENAME>

<DURATION>201</DURATION>

<STARTYEAR>2020</STARTYEAR>

<STARTMONTH>08</STARTMONTH>

<STARTMONTHNAME>August</STARTMONTHNAME>

<STARTDAY>05</STARTDAY>

<STARTDAYNAME>Wednesday</STARTDAYNAME>

<STARTHOUR>08</STARTHOUR>

<STARTMINUTE>34</STARTMINUTE>

<STARTSECOND>59</STARTSECOND>

<PRIORITY>50</PRIORITY>

<RECORDINGTYPE>S</RECORDINGTYPE>

<CALLDIRECTION>I</CALLDIRECTION>

<SCREENCAPTURE>7</SCREENCAPTURE>

<KEEPCALLFORDAYS>90</KEEPCALLFORDAYS>

<BLACKOUTREMOTEAUDIO>false</BLACKOUTREMOTEAUDIO>

<BLACKOUTS/>

</RECORDING>

Tags: python, xml, pandas, dataframe

Answer


One possible way to parse the file:

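import pandas as pd
from bs4 import BeautifulSoup  # the "xml" parser needs lxml installed

# Parse the file; the context manager closes the handle afterwards
with open("your_file.xml", "r") as f:
    soup = BeautifulSoup(f, "xml")

# Map each direct child of <RECORDING> to its stripped text
d = {}
for tag in soup.RECORDING.find_all(recursive=False):
    d[tag.name] = tag.get_text(strip=True)

# A one-element list of dicts gives a single-row DataFrame
df = pd.DataFrame([d])
print(df)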

Prints:

  IDENT   DEVICEID DEVICEALIAS       GROUP     GATE         ANI       DNIS                               USER1 USER2 USER3 USER4               USER5                   USER6 USER7           USER8  USER9 USER10 USER11 USER12 USER13 USER14 USER15 CALLSEGMENTID CALLID                                           FILENAME DURATION STARTYEAR STARTMONTH STARTMONTHNAME STARTDAY STARTDAYNAME STARTHOUR STARTMINUTE STARTSECOND PRIORITY RECORDINGTYPE CALLDIRECTION SCREENCAPTURE KEEPCALLFORDAYS BLACKOUTREMOTEAUDIO BLACKOUTS
0     0  133242232    52232009  1823481655  1011655  7777777777  777777777  00:07:53.2322691,00:03:21.34232761  text              34fc0a8d-d5632c9b1  000dfsdf98701596638094    97  00701596638094  10155                        Text      4     10                 9870  \\folderpath\folderpath\folderpath\folderpath\...      201      2020         08         August       05    Wednesday        08          34          59       50             S             I             7              90               false          
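
For comparison, the same result can probably be reached with only the standard library's ElementTree, which is what the question started from. The loop in the question looks for grandchildren of the root (`for subchild in child`), but in the sample every child of `<RECORDING>` is a leaf, so each row comes out empty. The sketch below is not part of the original answer; it assumes the file has the same flat layout as the sample and uses a shortened inline copy of it in place of the real file.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened copy of the sample input (assumption: the real file has the
# same flat <RECORDING> layout, just with more fields).
xml_data = """<?xml version="1.0"?>
<RECORDING>
<IDENT>0</IDENT>
<DEVICEID>133242232</DEVICEID>
<USER2>text</USER2>
<USER3/>
<DURATION>201</DURATION>
</RECORDING>"""

root = ET.fromstring(xml_data)

# Each direct child of the root is one field: tag -> text.
# Empty elements such as <USER3/> have text None, so fall back to "".
record = {child.tag: (child.text or "").strip() for child in root}

# A one-element list of dicts gives a single-row DataFrame
df = pd.DataFrame([record])
print(df)
```

To read from disk instead, `ET.parse("your_file.xml").getroot()` would take the place of `ET.fromstring(xml_data)`. All values arrive as strings, so numeric columns such as DURATION would still need converting, for example with `pd.to_numeric`.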
