Exact substring matching within a string in Python

Problem description

I know this is a common question, but my example below is a bit more complicated than the title suggests.

Suppose I have the following "test.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent xsi:type="parentType">
    <child xsi:type="childtype">
      <grandchild>
        <greatgrandchildone>greatgrandchildone</greatgrandchildone>
        <greatgrandchildtwo>greatgrandchildtwo</greatgrandchildtwo>
      </grandchild><!--random comment -->
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--another random comment -->
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--third random comment -->
    </child>
  </parent>
</test:xml>

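In the program below, I do two main things:

  1. Find all nodes in the XML that have a "type" attribute
  2. Loop over every node in the XML and work out whether it is a child of an element that has a "type" attribute

Here is my code: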

from lxml import etree
import re

xmlDoc = etree.parse("test.xml")
root = xmlDoc.getroot()

nsmap = {
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}

nodesWithType = []

def check_type_in_path(nodesWithType, path, root):
    typesInPath = []
    elementType = ""

    for node in nodesWithType:
        print("checking node: ", node, " and path: ", path)

        if re.search(r"\b{}\b".format(
            node), path, re.IGNORECASE) is not None:

            element = root.find('.//{0}'.format(node))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                print("found an element for this path. adding to list")
                typesInPath.append(elementType)
        else:
            print("element: ", node, " not found in path: ", path)

    print("path ", path ," has types: ", elementType)
    print("-------------------")
    return typesInPath

def get_all_node_types(xmlDoc):
    nodesWithType = []
    root = xmlDoc.getroot()

    for node in xmlDoc.iter():

        path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])

        if "COMMENT" not in path.upper():
            element = root.find('.//{0}'.format(path))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                nodesWithType.append(path)

    return nodesWithType

nodesWithType = get_all_node_types(xmlDoc)
print("nodesWithType: ", nodesWithType)

for node in xmlDoc.xpath('//*'):
    path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
    typesInPath = check_type_in_path(nodesWithType, path, root)

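The code should return all the types contained in a given path. For example, consider the path parent/child[3]/greatgrandchildfour. This path is a child (direct or distant) of two nodes that carry a "type" attribute: parent and parent/child[3]. I would therefore expect the typesInPath list for that node to contain both "parentType" and "childtype".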

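However, according to the output below, the typesInPath list for this node contains only "parentType" and not "childtype". The main point of this logic is to check whether the path of a node that has a type is contained in the path of the node in question (hence the check for an exact string match). But this clearly does not work. I am not sure whether it is because the condition contains array-style index notation (such as [3]) that is not being matched, or whether it is something else.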

For the example above, the printed output is:

checking node:  parent  and path:  parent/child[3]/greatgrandchildfour
found an element for this path. adding to list
checking node:  parent/child[1]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[1]  not found in path:  parent/child[3]/greatgrandchildfour
checking node:  parent/child[2]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[2]  not found in path:  parent/child[3]/greatgrandchildfour
checking node:  parent/child[3]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[3]  not found in path:  parent/child[3]/greatgrandchildfour
path  parent/child[3]/greatgrandchildfour  has types:  parentType
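
The "not found" lines above can be reproduced without any XML. The following is a minimal sketch that uses only the standard `re` module and the path strings from the printout; the helper names `naive_match` and `escaped_match` are hypothetical. In a regular expression, `[3]` is a character class that matches the single character `3`, so the unescaped pattern looks for the text `child3`. Even after `re.escape`, the trailing `\b` cannot match between `]` and `/`, because neither is a word character.

```python
import re

path = "parent/child[3]/greatgrandchildfour"

def naive_match(node, path):
    # Same pattern as in the question: node is interpolated unescaped,
    # so "[3]" is read as a character class rather than literal text.
    return re.search(r"\b{}\b".format(node), path, re.IGNORECASE) is not None

def escaped_match(node, path):
    # Escaping makes "[3]" literal, but \b needs a word character on
    # exactly one side, and "]" followed by "/" provides none.
    pattern = r"\b{}\b".format(re.escape(node))
    return re.search(pattern, path, re.IGNORECASE) is not None

print(naive_match("parent", path))              # True
print(naive_match("parent/child[3]", path))     # False
print(escaped_match("parent/child[3]", path))   # False
```

So escaping alone is probably not enough here; the word-boundary anchors also need to be replaced by something that understands `/` as the separator.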

Tags: python, xml, lxml

Solution
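
A minimal sketch of one possible fix, assuming the goal is to test whether one `getpath`-style path is an ancestor of (or the same node as) another. It compares whole path segments instead of using a regular expression, so `[3]` is treated as plain text. The function names `is_ancestor_or_self` and `types_in_path` and the `types_by_path` dictionary are hypothetical; the dictionary stands in for the `root.find` lookup done in the question.

```python
def is_ancestor_or_self(node_path, path):
    """Return True if node_path equals path or is a leading run of its segments."""
    node_segments = node_path.split("/")
    path_segments = path.split("/")
    return path_segments[:len(node_segments)] == node_segments

def types_in_path(types_by_path, path):
    """Collect the type of every typed node that lies on the way to path."""
    return [node_type for node_path, node_type in types_by_path.items()
            if is_ancestor_or_self(node_path, path)]

# Paths and xsi:type values taken from the sample test.xml.
types_by_path = {
    "parent": "parentType",
    "parent/child[1]": "childtype",
    "parent/child[2]": "childtype",
    "parent/child[3]": "childtype",
}

print(types_in_path(types_by_path, "parent/child[3]/greatgrandchildfour"))
# ['parentType', 'childtype']
```

Comparing segments rather than raw strings also avoids false positives that a plain `startswith` would allow, such as treating `parent/child` as a prefix of `parent/children`.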

