How do you correctly fetch nodes from this nested XML?

Problem description

I have the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <columns>
    <Leftover index="5">Leftover</Leftover>
    <NODE5 index="6"></NODE5>
    <NODE6 index="7"></NODE6>
    <NODE8 index="9"></NODE8>
    <Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
    <Year index="8">2020</Year>
    <Name index="1">Name</Name>
    <Value_code index="3">Value code</Value_code>
  </columns>
  <records>
    <record index="1">
      <Leftover>Leftover</Leftover>
      <NODE5>Test1</NODE5>
      <NODE6>Test2</NODE6>
      <NODE8>Test3</NODE8>
      <Nomenk__Nr_></Nomenk__Nr_>
      <Name></Name>
      <Value_code></Value_code>
    </record>
  ... (it repeats itself with different values and the index value increments)

My code is:

import lxml
import lxml.etree as et
xml = open('C:\outputfile.xml', 'rb')
xml_content = xml.read()
tree = et.fromstring(xml_content)
for bad in tree.xpath("//records[@index=\'*\']/NODE5"):
  bad.getparent().remove(bad)     # here I grab the parent of the element to call the remove directly on it
result = (et.tostring(tree, pretty_print=True, xml_declaration=True))
f = open( 'outputxml.xml', 'w' )
f.write( str(result) )
f.close()

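What I need to do is remove NODE5, NODE6, and NODE8. I tried using a wildcard and then specifying one of the nodes (see line 6), but that does not seem to work. I also get a syntax error on the first character after the loop, although the code still executes.

My other problem is that when the file is "exported", lxml sets the encoding to ASCII.

Update: I get this error on line 8: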

    return = ...
    ^
SyntaxError: invalid syntax

I took some of the code from https://stackoverflow.com/a/7981894/1987598

Tags: python, xml, xpath, lxml

Solution


What I need to do is remove NODE5, NODE6, and NODE8.

The following code does that:

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="UTF-8"?>
<data>
   <columns>
      <Leftover index="5">Leftover</Leftover>
      <NODE5 index="6" />
      <NODE6 index="7" />
      <NODE8 index="9" />
      <Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
      <Year index="8">2020</Year>
      <Name index="1">Name</Name>
      <Value_code index="3">Value code</Value_code>
   </columns>
   <records>
      <record index="1">
         <Leftover>Leftover</Leftover>
         <NODE5>Test1</NODE5>
         <NODE6>Test2</NODE6>
         <NODE8>Test3</NODE8>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>
      <record index="21">
         <Leftover>Leftover</Leftover>
         <NODE5>Test11</NODE5>
         <NODE6>Test21</NODE6>
         <NODE8>Test39</NODE8>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>      
   </records>
</data>'''

root = ET.fromstring(xml)

# Tags to delete from <columns> and from every <record>
tags_to_remove = ['NODE5', 'NODE6', 'NODE8']

# Remove the matching column definitions
col = root.find('./columns')
for tag in tags_to_remove:
    for node in col.findall('./' + tag):
        col.remove(node)

# Remove the matching fields from each record
records = root.find('./records')
for r in records.findall('./record'):
    for tag in tags_to_remove:
        for node in r.findall('./' + tag):
            r.remove(node)

ET.dump(root)

Output

<data>
   <columns>
      <Leftover index="5">Leftover</Leftover>
      <Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
      <Year index="8">2020</Year>
      <Name index="1">Name</Name>
      <Value_code index="3">Value code</Value_code>
   </columns>
   <records>
      <record index="1">
         <Leftover>Leftover</Leftover>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>
      <record index="21">
         <Leftover>Leftover</Leftover>
         <Nomenk__Nr_ />
         <Name />
         <Value_code />
      </record>      
   </records>
</data>
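
The question also mentions that the exported file ends up with the wrong encoding. The following is a minimal sketch, not the original answer's code, that generalizes the removal into a helper and serializes the tree as UTF-8 bytes with an XML declaration. The names `remove_tags`, `to_utf8_bytes`, and the shortened `sample` document are hypothetical and exist only for illustration; it uses the standard library `xml.etree.ElementTree` rather than lxml.

```python
import io
import xml.etree.ElementTree as ET


def remove_tags(root, tags):
    """Remove every element whose tag is in `tags`, anywhere below `root`.

    Hypothetical helper; returns the number of removed elements.
    """
    tags = set(tags)
    removed = 0
    # ElementTree elements have no parent pointer, so visit each potential
    # parent and filter its direct children. The list() snapshots keep the
    # iteration stable while the tree is being modified.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)
                removed += 1
    return removed


def to_utf8_bytes(root):
    """Serialize `root` as UTF-8 bytes, including the XML declaration."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
    return buf.getvalue()


# Shortened stand-in for the document in the question (assumption).
sample = """<data>
  <columns><NODE5 index="6" /><Year index="8">2020</Year></columns>
  <records>
    <record index="1"><NODE5>Test1</NODE5><NODE6>Test2</NODE6><Name /></record>
  </records>
</data>"""

root = ET.fromstring(sample)
removed_count = remove_tags(root, ["NODE5", "NODE6", "NODE8"])
xml_bytes = to_utf8_bytes(root)

# To save the result, open the file in binary mode and write the bytes:
#     with open("outputxml.xml", "wb") as f:
#         f.write(xml_bytes)
print(xml_bytes.decode("utf-8"))
```

Two details from the question's code are worth noting here. Calling `str()` on a bytes object and writing it in text mode produces a `b'...'` literal in the file instead of XML, which is why the sketch keeps the data as bytes and suggests binary mode. Also, in the XPath `//records[@index='*']`, the predicate compares the attribute with the literal string `*`, and the `index` attribute sits on `record`, not `records`, so that expression matches nothing.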
