python - Extracting content together with its tags from an XML file using lxml
Problem description
I am trying to extract data from an XML file using lxml. For example, test.xml:
<document>
<body>
<title>test title</title>
<subtitle>test subtitle</subtitle>
<content>
<p>blabla bla bla <em>bla bla</em> blabla bla bla <strong>blabla</strong> blabla</p>
<p>blabla bla bla blabla bla bla blabla</p>
<p>blabla bla bla <em>bla bla</em> blabla</p>
</content>
</body>
</document>
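To extract the title or the subtitle, I can do:
from lxml import etree

xmlData = {}  # a dict, so values can be stored by key
tree = etree.parse('/folder/test/xml')  # parse the file into an element tree
for title in tree.xpath("/document/body/title"):
    xmlData['title'] = title.text
for subtitle in tree.xpath("/document/body/subtitle"):
    xmlData['subtitle'] = subtitle.text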
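For the content, however, it is different. This does not work:
for content in tree.xpath("/document/body/content")
so I tried:
for content in tree.xpath("/document/body/content/p")
But that way I cannot extract the contents of the em and strong elements.
I would need to call tree.xpath("/document/body/content/p/em") and tree.xpath("/document/body/content/p/strong"). In that case, however, the content is split into three parts, and I cannot put them back together in the right order. For example, if I try something like: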
for content in tree.xpath("/document/body/content/p"):
    for em in tree.xpath("/document/body/content/p/em"):
        for strong in tree.xpath("/document/body/content/p/strong"):
            xmlData['content'] = content.text + em.text + strong.text
then every paragraph ends up with the same em and strong content, even paragraphs that do not contain those tags.
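A note on why this happens: an XPath that starts with "/" is evaluated from the document root, so the inner loops visit every em and strong in the file no matter which p the outer loop is on. A minimal sketch of a per-paragraph lookup follows. It assumes the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages (lxml elements offer the same findall and itertext methods), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml = '''<content>
<p>blabla bla bla <em>bla bla</em> blabla bla bla <strong>blabla</strong> blabla</p>
<p>blabla bla bla blabla bla bla blabla</p>
<p>blabla bla bla <em>bla bla</em> blabla</p>
</content>'''

content = ET.fromstring(xml)

paragraphs = []
for p in content.findall('p'):
    # Search relative to this <p>, so only its own children are found.
    em_texts = [em.text for em in p.findall('em')]
    strong_texts = [s.text for s in p.findall('strong')]
    # itertext() yields every text and tail string in document order.
    plain = ''.join(p.itertext())
    paragraphs.append({'em': em_texts, 'strong': strong_texts, 'text': plain})

for item in paragraphs:
    print(item)
```

This keeps the pieces in the right order per paragraph, but it still drops the tags themselves; it only illustrates the relative lookup.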
In addition, if I want to keep the HTML tags, I have to add them myself:
for content in tree.xpath("/document/body/content/p"):
    xmlData['content'] = '<p>' + content.text + '</p>'
Can I write code that extracts everything between <content> and </content> and keeps all the tags inside?
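Solution
Use ElementTree, the XML library in the Python standard library. No external library is needed. The idea is to serialize each <p> element with ET.tostring, so the nested tags stay in place and in their original order.
The code collects the required information into a dictionary.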
import xml.etree.ElementTree as ET
xml = '''<document>
<body>
<title>test title</title>
<subtitle>test subtitle</subtitle>
<content>
<p>jack<em>dan</em>ben<strong>jim</strong>steve</p>
<p>blabla bla bla blabla bla bla blabla</p>
<p>A<em>B</em>C</p>
</content>
</body>
</document>'''
root = ET.fromstring(xml)
title = root.find('.//title').text
subtitle = root.find('.//subtitle').text
data = dict(title=title, subtitle=subtitle)
p_list = []
for p in root.findall('.//p'):
    # tostring() returns bytes and includes the whitespace after </p>, hence strip() and decode()
    p_list.append(ET.tostring(p).strip().decode())
data['content'] = ' '.join(p_list)
print(data)
Output
{'title': 'test title', 'subtitle': 'test subtitle', 'content': '<p>jack<em>dan</em>ben<strong>jim</strong>steve</p> <p>blabla bla bla blabla bla bla blabla</p> <p>A<em>B</em>C</p>'}
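As an alternative to ET.tostring, the markup inside an element can also be rebuilt by a small recursive function that walks the element and collects its text, child elements, and tails. The sketch below is only an illustration under some assumptions: the helper name inner_markup is hypothetical, and it ignores attributes, comments, and namespaces, which the sample document does not use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_markup(elem):
    """Rebuild the markup inside elem from its text, children, and tails."""
    parts = [escape(elem.text or '')]
    for child in elem:
        # Recurse into the child, then append the text that follows it.
        parts.append('<{0}>{1}</{0}>'.format(child.tag, inner_markup(child)))
        parts.append(escape(child.tail or ''))
    return ''.join(parts)


xml = '''<document>
<body>
<content>
<p>jack<em>dan</em>ben<strong>jim</strong>steve</p>
<p>A<em>B</em>C</p>
</content>
</body>
</document>'''

root = ET.fromstring(xml)
content = root.find('.//content')
# Everything between <content> and </content>, with the inner tags kept.
inner = inner_markup(content).strip()
print(inner)
```

For documents with attributes or namespaces, serializing with ET.tostring as in the answer is likely the safer choice, since it handles those cases itself.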