python - Skipping "nested tags" when parsing XML with Python

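Problem description

I currently have an XML file that I want to parse with Python. I am using Python's ElementTree, and it works fine, except for one problem.

The file currently looks like this: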

<Instance>
  <TextContent>
    <Sentence>Hello, my name is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>

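What I basically want to do is skip the tag nested inside the <Sentence> tag (i.e. <Thing>). One way I found to do this is to get the text content up to the nested tag, then the text content of the nested tag, and concatenate them. The code I am using is: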

import xml.etree.ElementTree as ET


xtree = ET.parse('some_file.xml')
xroot = xtree.getroot()

for node in xroot:
    text_before = node[0][0].text
    text_nested = node[0][0][0].text

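How do I get the part of the text that comes after the nested tag?
Better still, is there a way to ignore the nested tag entirely?

Thanks in advance.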

Tags: python, xml, elementtree

I changed your source XML file slightly, so that Sentence contains two child elements:

<Instance>
  <TextContent>
    <Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>

To find the Sentence element, run: st = xroot.find('.//Sentence').
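
For a quick experiment without a file on disk, the same lookup can be sketched by parsing the XML from a string. The xml_source variable is an assumption made for illustration; the answer itself works on the tree parsed from some_file.xml.

```python
import xml.etree.ElementTree as ET

# Assumed inline copy of the modified XML shown above (illustrative only).
xml_source = """<Instance>
  <TextContent>
    <Sentence>Hello, my <Thing>name</Thing> is John and his <Thing>name</Thing> is Tom.</Sentence>
  </TextContent>
</Instance>"""

xroot = ET.fromstring(xml_source)

# './/Sentence' searches all descendants of the root and returns
# the first Sentence element it finds (or None if there is none).
st = xroot.find('.//Sentence')

print(st.tag)    # tag name of the element that was found
print(len(st))   # number of direct child elements of Sentence
```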

Then define the following generator:

def allTextNodes(root):
    if root.text is not None:
        yield root.text
    for child in root:
        if child.tail is not None:
            yield child.tail
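
The generator relies on the fact that ElementTree stores the text following a child's closing tag in that child's tail attribute, not on the parent. A minimal sketch of where each fragment lives, using an inline copy of the single-Thing sentence from the question (the variable names sentence and thing are chosen here for illustration):

```python
import xml.etree.ElementTree as ET

sentence = ET.fromstring(
    "<Sentence>Hello, my name is John and his <Thing>name</Thing> is Tom.</Sentence>"
)
thing = sentence[0]  # the nested Thing element

text_before = sentence.text  # text from the start of Sentence up to <Thing>
text_nested = thing.text     # text inside <Thing>...</Thing>
text_after = thing.tail      # text between </Thing> and the next tag

print(repr(text_before))
print(repr(text_nested))
print(repr(text_after))
```

So the part of the text after the nested tag is available as the tail of the nested element.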

To see a list of all the text nodes that are direct children of Sentence, run:

lst = list(allTextNodes(st))

The result is:

['Hello, my ', ' is John and his ', ' is Tom.']

To get the concatenated text as a single variable, run:

txt = ''.join(allTextNodes(st))

This gives: Hello, my  is John and his  is Tom. (Note the double spaces, which "surround" each omitted Thing element.)
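
If the goal is instead to drop only the markup and keep the nested words, Element.itertext() from the standard library yields every text fragment in document order, including text inside child elements. The sketch below shows that variant and also collapses the double spaces left behind when the nested elements are skipped; the whitespace normalization is an addition for illustration, not part of the answer above.

```python
import xml.etree.ElementTree as ET

st = ET.fromstring(
    "<Sentence>Hello, my <Thing>name</Thing> is John and his "
    "<Thing>name</Thing> is Tom.</Sentence>"
)

# Variant 1: keep the nested text, drop only the tags.
full_text = ''.join(st.itertext())
print(full_text)

# Variant 2: skip the nested elements entirely (text plus each child's tail),
# then collapse runs of whitespace into single spaces.
parts = [st.text or ''] + [child.tail or '' for child in st]
skipped = ' '.join(''.join(parts).split())
print(skipped)
```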
