List the paths and data in an XML file to store in a DataFrame

Problem description

Here is an XML file:

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative></designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

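I want to store in a DataFrame: 1) the path of each element and 2) the text of the element at that path. To build this DataFrame, I am thinking of using a dictionary that holds both. So first I would like to get a dictionary like this, with each value keyed by its path: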

{'/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': 'ADD', '/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': 'LandIndex', ...}

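With that, I would only need to call df = pd.DataFrame() to create a DataFrame that I can export to an Excel sheet. I already have part of the list of paths, but I cannot get the text from those paths. I don't understand how the lxml library works. I tried .text() and text_content(), but I get an error.

Here is my code: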

from lxml import etree
import xml.etree.ElementTree as et
from bs4 import BeautifulSoup
import pandas as pd

filename = 'file_try.xml'

with open(filename, 'r') as f: 
    soap = f.read() 

root = etree.XML(soap.encode())    
tree = etree.ElementTree(root)


mylist_path = []
mylist_data = []
mydico = {}
mylist = []

for target in root.xpath('//text()'):

    if len(target.strip())>0:       
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        mydico[path] = target.text()

        mylist_path.append(path)
        mylist_data.append(target.text())
        mylist.append(mydico)

df=pd.DataFrame(mylist)
df.to_excel("data_xml.xlsx") 

print(mylist_path)
print(mylist_data)

Thanks for your help!

Tags: python, xml, dataframe, xpath, elementtree

Solution

Below is an example that traverses the XML tree. This calls for a recursive function. Fortunately, lxml provides everything needed for it.

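from lxml import etree as et
from collections import defaultdict
import pandas as pd

# `xml` is assumed to be a string holding the XML document shown above.
d = defaultdict(list)
root = et.fromstring(xml)
tree = et.ElementTree(root)

def traverse(el, d):
    # Recurse into elements that have children; record text only for leaves.
    if len(el) > 0:
        for child in el:
            traverse(child, d)
    elif el.text is not None:
        # Key: path relative to the root; value: list of texts at that path.
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

df.head()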

Output (the contents of d):

{
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001       4860'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']
}
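
The keys above are in Clark notation (`{namespace-uri}tag`) and are relative to the root element, so `Envelope` itself does not appear. If the slash-separated form from the question is preferred, the namespace part could be stripped with a regular expression. This is only a minimal sketch; `strip_namespaces` and its `root_tag` parameter are hypothetical names, not part of lxml:

```python
import re

# Matches a Clark-notation namespace part such as
# "{http://schemas.xmlsoap.org/soap/envelope/}".
_NS = re.compile(r"\{[^}]*\}")

def strip_namespaces(path, root_tag="Envelope"):
    """Drop every {uri} part and prepend the root tag (hypothetical helper)."""
    return "/" + root_tag + "/" + _NS.sub("", path)

key = "{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status"
print(strip_namespaces(key))
# /Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status
```

The same function could be applied to every key, for example with `{strip_namespaces(k): v for k, v in d.items()}`, before building the DataFrame.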

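Note that the values in the dictionary d are lists. This is because an element can repeat in the XML, and with plain assignment the last value would overwrite the earlier ones. If that cannot happen in your particular XML, use a regular dict instead of defaultdict (d = {}) and assign instead of appending: d[tree.getelementpath(el)] = el.text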
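
The regular-dict variant might look like the sketch below. The `xml` string here is a shortened stand-in for the document in the question, and it assumes that no leaf path occurs twice:

```python
from lxml import etree as et

# Shortened stand-in document (assumption: no repeated leaf paths).
xml = "<root><BSR><status>ADD</status><NOUN>LandIndex</NOUN></BSR><empty></empty></root>"

root = et.fromstring(xml)
tree = et.ElementTree(root)

def traverse(el, d):
    # Recurse into elements that have children; store text only for leaves.
    if len(el) > 0:
        for child in el:
            traverse(child, d)
    elif el.text is not None:
        # Plain assignment: a later element with the same path would overwrite this.
        d[tree.getelementpath(el)] = el.text

d = {}
traverse(root, d)
print(d)
# {'BSR/status': 'ADD', 'BSR/NOUN': 'LandIndex'}
```

One thing to watch for: with scalar values, `pd.DataFrame(d)` raises a ValueError asking for an index. Wrapping the dict in a list, as in `pd.DataFrame([d])`, gives a single-row frame instead.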

The same approach works when reading from a file:

from lxml import etree as et
from collections import defaultdict
import pandas as pd

d = defaultdict(list)

# et.parse returns an ElementTree, so it can be used for the paths directly.
with open('output.xml', 'rb') as file:
    tree = et.parse(file)

root = tree.getroot()

def traverse(el, d):
    if len(el) > 0:
        for child in el:
            traverse(child, d)
    elif el.text is not None:
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

print(d)
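
The XPath loop from the question can also be made to work. In lxml, each result of `//text()` is a string subclass that offers `getparent()` but has no `.text()` method, which is consistent with the error described in the question; the text is simply the result itself. The sketch below assumes a shortened stand-in document and collects `(path, text)` pairs in a list named `rows` (a name chosen here for illustration):

```python
from lxml import etree

# Shortened stand-in for the SOAP document in the question.
xml = b"""<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <BSR>
      <status>ADD</status>
      <NOUN>LandIndex</NOUN>
    </BSR>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

root = etree.XML(xml)
tree = etree.ElementTree(root)

rows = []
for target in root.xpath('//text()'):
    # Skip the whitespace-only nodes that come from indentation.
    if target.strip():
        # getparent() gives the element that owns this text node.
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:', '')
        # The XPath result is already a string; str() makes it a plain one.
        rows.append((path, str(target)))

for path, text in rows:
    print(path, '=>', text)
```

Passing `rows` to `pd.DataFrame(rows, columns=['path', 'text'])` would then give one row per element, which can be written out with `df.to_excel("data_xml.xlsx")` as in the question.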
