Update XML with specific elements missing from a good XML

Problem description

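I have a "good" XML file that contains the required elements and "bad" XML files that are missing specific elements. The good and bad XML files are shown below.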

good.xml:

<?xml version="1.0" encoding="UTF-8"?>
<level1>
<level2>
<level3>
<l3v1>hello</l3v1>
<l3v2>world</l3v2>
</level3>
</level2>
</level1>

bad1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<level1>
<level2>
<level4>
<l4v1>inconsequential</l4v1>
</level4>
</level2>
</level1>

bad2.xml:

<?xml version="1.0" encoding="UTF-8"?>
<level1>
<level2>
<level3>
<l3v3>inconsequential also</l3v3>
</level3>
</level2>
</level1>

I want to read good.xml and check whether bad1.xml and bad2.xml contain level3/l3v1 and level3/l3v2. If they do not, I want to add those elements.

The code I have so far is:

import xml.etree.ElementTree as ET
import functools

def update_configxml(element_name: str,bad_xml: str, good_xml: str):
  tree = ET.parse(good_xml) #Correct XML
  root = tree.getroot()
  for item in root.findall(element_name): #element of interest
    print(item.tag)

  t2 = ET.parse(bad_xml) #Incorrect XML
  r2 = t2.getroot()

  try:
    if r2.findall(element_name) == []:
      print ("{} Not found.\nAppending...".format(item.tag))
      #r2.append(item) #This does not create level3 under level2, but puts level3 under level1
      parent_str = functools.reduce(lambda q,r: str(q)+"/"+str(r), element_name.split('/')[0:-1])
      parent = r2.find(parent_str)
      if parent is None:
        parent = ET.SubElement(r2,parent_str)
        item = ET.SubElement(parent,item)
      else:
        item = ET.SubElement(parent,item)
    else:
      print ("{} found".format(item.tag))
  except UnboundLocalError as notfound:
    print(notfound)
    print("The good config also does not seem to have the required tag")

  print(ET.tostring(r2, encoding='utf8').decode('utf8'))



if __name__ == '__main__':
  element = './level2/level3' #Element to add to bad XML files
  update_configxml(element,'bad1.xml','good.xml')

But I get:

(base) C:\Users\myneni\jenkins>python3 update_xml_test.py
level3
level3 Not found.
Appending...
Traceback (most recent call last):
  File "update_xml_test.py", line 35, in <module>
    update_configxml(element,'bad1.xml','good.xml')
  File "update_xml_test.py", line 29, in update_configxml
    print(ET.tostring(r2, encoding='utf8').decode('utf8'))
  File "C:\Python36\lib\xml\etree\ElementTree.py", line 1135, in tostring
    short_empty_elements=short_empty_elements)
  File "C:\Python36\lib\xml\etree\ElementTree.py", line 773, in write
    qnames, namespaces = _namespaces(self._root, default_namespace)
  File "C:\Python36\lib\xml\etree\ElementTree.py", line 885, in _namespaces
    _raise_serialization_error(tag)
  File "C:\Python36\lib\xml\etree\ElementTree.py", line 1057, in _raise_serialization_error
    "cannot serialize %r (type %s)" % (text, type(text).__name__)
TypeError: cannot serialize <Element 'level3' at 0x0000021616D81098> (type Element)

How can I copy the parent tree and add it to the bad.xml file so that it contains the elements of interest? Thanks!

Tags: python, xml, elementtree

Solution


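In your solution I noticed one error: in item = ET.SubElement(parent, item), the second argument should be the name of the element to create (a string), but you passed an Element. The call itself does not complain; the non-string tag is only rejected later, when ET.tostring serializes the tree, which is where your TypeError comes from.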
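To make this concrete, here is a minimal sketch using the standard-library xml.etree.ElementTree. The element names are borrowed from the question; the variable names (good_item, fixed_parent, error) are made up for this illustration.

```python
import xml.etree.ElementTree as ET

parent = ET.Element("level2")
good_item = ET.Element("level3")  # stands in for the element found in good.xml

# Wrong: the second argument must be a tag name (str), not an Element.
# Nothing is validated here, so the call itself succeeds.
ET.SubElement(parent, good_item)

try:
    ET.tostring(parent)
    error = None
except TypeError as exc:
    # Serialization is where the non-string tag is finally rejected.
    error = str(exc)
print(error)

# Right: pass the tag name as a string.
fixed_parent = ET.Element("level2")
ET.SubElement(fixed_parent, good_item.tag)
print(ET.tostring(fixed_parent, encoding="unicode"))
```

Note that the corrected call only creates an empty level3 element; it does not bring along the children of the element from the good file.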

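Another doubtful point is whether an element should be placed in more than one tree. The standard-library ElementTree does not track an element's parent, so appending the same Element to a second tree leaves a single object shared by both trees, and a change made through one tree shows up in the other. Since you are trying to copy a whole subtree, you should probably use deepcopy instead and append the created copy to the bad tree.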
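A deepcopy-based approach could look roughly like the sketch below, which stays with the standard-library ElementTree. The helper names merge_missing and update_from_good and the inline XML strings are assumptions made for this illustration, not part of the original code. It assumes the two files share the same root tag and that sibling tags are unique under each parent (as in the sample files), and it merges the whole tree from the root rather than a single path, which gives the same result for these samples.

```python
import copy
import xml.etree.ElementTree as ET

GOOD = "<level1><level2><level3><l3v1>hello</l3v1><l3v2>world</l3v2></level3></level2></level1>"
BAD1 = "<level1><level2><level4><l4v1>inconsequential</l4v1></level4></level2></level1>"
BAD2 = "<level1><level2><level3><l3v3>inconsequential also</l3v3></level3></level2></level1>"


def merge_missing(good_elem: ET.Element, bad_elem: ET.Element) -> None:
    """Copy into bad_elem every child of good_elem that it lacks (matched by tag)."""
    for good_child in good_elem:
        bad_child = bad_elem.find(good_child.tag)
        if bad_child is None:
            # Append an independent copy so the two trees share no nodes.
            bad_elem.append(copy.deepcopy(good_child))
        else:
            # The child exists; check its own children recursively.
            merge_missing(good_child, bad_child)


def update_from_good(good_xml: str, bad_xml: str) -> str:
    """Parse both documents, merge missing elements into the bad one, serialize it."""
    good_root = ET.fromstring(good_xml)
    bad_root = ET.fromstring(bad_xml)
    merge_missing(good_root, bad_root)
    return ET.tostring(bad_root, encoding="unicode")


fixed1 = update_from_good(GOOD, BAD1)
fixed2 = update_from_good(GOOD, BAD2)
print(fixed1)
print(fixed2)
```

Because the copy is deep, the text of l3v1 and l3v2 (hello, world) comes along with the elements, and the tree parsed from the good document is left untouched.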

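Rather than correcting your solution, I propose a different one based on lxml. That module has more useful features, and my solution is (in my opinion) more modular, more readable, and easier to maintain.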

Start with the import:

from lxml import etree

Then create two functions:

  1. The first function creates the missing elements along the given xpath in the given tree:

    def createElem(tree: etree._ElementTree, xpath: str):
        nodes = tree.xpath(xpath)
        if nodes:         # Target element exists
            pass  # print(f'Target element exists.')
        else:
            # Drop empty string before 1st "/" and the root node name
            parts = xpath.split('/')[2:]
            p = tree.getroot()  # Start scanning from the root
            for part in parts:
                nodes = p.xpath(part)  # Attempt to descend 1 level
                if nodes:  # Child element exists
                    # print(f'Element {part} exists.')
                    p = nodes[0]
                else:  # No such child element
                    n = etree.SubElement(p, part)  # Create child
                    # print(f'Element {part} created.')
                    p = n      # Parent for the next step
    
  2. The second function (a redesigned version of yours) reads both XML files, adds the missing elements to the bad tree, and returns it:

    def update_configxml(element_name: str, bad_xml: str, good_xml: str) -> etree._ElementTree:
        parser = etree.XMLParser(remove_blank_text=True)
        tree = etree.parse(good_xml, parser)  # Read good XML
        root = tree.getroot()
        t2 = etree.parse(bad_xml, parser)     # Read bad XML
        basePath = '/' + root.tag + element_name[1:]
        for it in tree.iter():
            pth = tree.getpath(it)
            if pth.startswith(basePath):
                # print(f'Check for: {it.tag:6}  {pth}')
                createElem(t2, pth)
        return t2
    

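Compared with your code, I changed the signature of update_configxml: it now returns the updated tree, so that the caller can do something with it other than just printing.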
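The path-walking idea in createElem does not depend on lxml. As an illustration only, the sketch below performs the same descent with the standard-library ElementTree. The helper name ensure_path and the inline XML are assumptions for this example, and it handles only plain tag steps (no predicates such as [1], which getpath can emit when siblings share a tag).

```python
import xml.etree.ElementTree as ET


def ensure_path(root: ET.Element, path: str) -> ET.Element:
    """Walk a '/'-separated path of tag names below root, creating missing steps."""
    current = root
    for tag in path.strip("/").split("/"):
        child = current.find(tag)
        if child is None:
            # No such child yet: create it and keep descending into it.
            child = ET.SubElement(current, tag)
        current = child
    return current


bad_root = ET.fromstring(
    "<level1><level2><level4><l4v1>inconsequential</l4v1></level4></level2></level1>"
)
# Paths are relative to the root element, mirroring the targets from good.xml.
for path in ("level2/level3/l3v1", "level2/level3/l3v2"):
    ensure_path(bad_root, path)

result_xml = ET.tostring(bad_root, encoding="unicode")
print(result_xml)
```

The first call creates level3 and l3v1; the second finds level3 already present and only adds l3v2.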

I called update_configxml as follows:

result = update_configxml('./level2/level3', 'bad1.xml', 'good.xml')

and printed its content:

print(etree.tostring(result, encoding='unicode', pretty_print=True))

The result:

<level1>
  <level2>
    <level4>
      <l4v1>inconsequential</l4v1>
    </level4>
    <level3>
      <l3v1/>
      <l3v2/>
    </level3>
  </level2>
</level1>

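As you can see, level3 and its children have been added. Note that they are created empty: the text content from good.xml (hello, world) is not copied.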

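A note on the update_configxml function: I added a parser with the remove_blank_text option there so that the final printout is properly indented. To see the difference:

  • change the reading of the bad XML to just t2 = etree.parse(bad_xml),
  • run the modified code, and
  • print the result (just as I did).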
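The effect of remove_blank_text can also be seen in isolation. The sketch below assumes lxml is installed; the inline document and the helper name add_level3 are made up for this illustration. It parses the same indented input with and without the option, appends a new element, and pretty-prints both trees.

```python
from lxml import etree

# Indented input, so the parsed tree contains whitespace-only text nodes.
BAD = b"""<level1>
  <level2>
    <level4>
      <l4v1>inconsequential</l4v1>
    </level4>
  </level2>
</level1>"""


def add_level3(parser: etree.XMLParser) -> str:
    """Parse BAD with the given parser, append level3/l3v1 under level2, pretty-print."""
    root = etree.fromstring(BAD, parser)
    level3 = etree.SubElement(root.find("level2"), "level3")
    etree.SubElement(level3, "l3v1")
    return etree.tostring(root, encoding="unicode", pretty_print=True)


kept = add_level3(etree.XMLParser())  # whitespace-only nodes kept
stripped = add_level3(etree.XMLParser(remove_blank_text=True))  # whitespace-only nodes dropped

print(kept)
print(stripped)
```

With the whitespace nodes dropped, pretty_print is free to re-indent the whole tree, so the new level3 lines up with level4. When they are kept, the existing whitespace is preserved as-is and the appended element is not indented to match.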
