Get only XML data from a text file using Python

Problem description

I have a text file that contains some XML data and some HTML data. Both start with "<". I want to extract only the XML data and save it to another file. How can I do that?

File example:

xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

xyz data
<bold>xyz</bold>

text 
text 
text

<bold>xyz</bold>

again XML data

Note: This file is in .txt format.

Tags: python, xml, xml-parsing, etl, data-extraction

Solution


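I would not treat your entire input as XML. I would treat it as an HTML fragment instead. HTML can contain non-standard elements, so <note> and the like are fine.

For convenience, I suggest pyquery for working with HTML. It works almost the same way as jQuery, so if you have used jQuery before, it should feel familiar.

It is pretty straightforward: load your data, wrap it in "<html></html>", parse, and query.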

from pyquery import PyQuery as pq

data = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

xyz data
<bold>xyz</bold>

text 
text 
text

<bold>xyz</bold>

again XML data"""

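# Wrap the fragment so the parser sees a single root element.
doc = pq(f"<html><body>{data}</body></html>")
# Select the <note> element, then read the text of its <body> child.
note = doc.find("note")

print(note.find("body").text())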

This prints "Don't forget me this weekend!"
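
The question also asks how to save the XML part to another file, which the snippet above does not show. Here is a minimal sketch of one way to do that with only the standard library. It is a different technique from the pyquery approach: it assumes you already know the root tag name of the XML you want ("note" here) and that those blocks are not nested inside each other. The helper names extract_xml_blocks and save_blocks and the output file name are hypothetical, chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET


def extract_xml_blocks(text, root_tag):
    """Return every <root_tag>...</root_tag> block in text that parses as XML.

    Assumption: blocks with this tag name are not nested in one another,
    because the non-greedy match stops at the first closing tag.
    """
    tag = re.escape(root_tag)
    pattern = re.compile(rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL)
    blocks = []
    for match in pattern.finditer(text):
        candidate = match.group(0)
        try:
            ET.fromstring(candidate)  # keep only well-formed XML
        except ET.ParseError:
            continue
        blocks.append(candidate)
    return blocks


def save_blocks(blocks, path):
    """Write the extracted blocks to path, one after another."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(blocks) + "\n")


sample = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

xyz data
<bold>xyz</bold>

again XML data"""

blocks = extract_xml_blocks(sample, "note")
print(len(blocks))
```

Calling save_blocks(blocks, "notes.xml") would then write the extracted <note> block to a file. The <bold> fragments are skipped because only the tag name passed in is searched for. If you do not know the root tag names in advance, this sketch does not apply as written.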

