python - Remove a child tag but keep its text in XML?

Problem description

I have an XML document that looks like this:

<?xml version='1.0' encoding='utf8'?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation><x> </x><label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg><x>, </x><affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
<affiliation><x> </x><label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg><x>, </x><affnorg>College of Optoelectronic Engineering</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
</all>

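I need to remove every <x> tag inside the affiliation elements while keeping its text. With ElementTree I can remove the tags, but that also removes their text. I want the text to stay in the parent tag, so that the new XML looks like this: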

<?xml version='1.0' encoding='utf8'?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation> <label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg>, <affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
<affiliation> <label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg>, <affnorg>College of Optoelectronic Engineering</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
</all>

Tags: python, regex, xml

Solution

With BeautifulSoup you can use the unwrap() method, which removes a tag but leaves its contents in place:

data = '''<?xml version='1.0' encoding='utf8'?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation><x> </x><label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg><x>, </x><affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
<affiliation><x> </x><label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg><x>, </x><affnorg>College of Optoelectronic Engineering</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
</all>'''

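from bs4 import BeautifulSoup

# The 'xml' parser is provided by lxml, which must be installed.
soup = BeautifulSoup(data, 'xml')

# Select only the <x> elements inside <affiliation>;
# the <x> in <articletitle> is left untouched.
for x in soup.select('affiliation x'):
    x.unwrap()

print(soup)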

Prints:

<?xml version="1.0" encoding="utf-8"?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation> <label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg>, <affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
<affiliation> <label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg>, <affnorg>College of Optoelectronic Engineering</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
</all>
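
Since the question mentions ElementTree, the same idea can also be sketched with the standard library alone. ElementTree keeps the text that follows an element in that element's tail, and removing the element drops its tail too, which is why the text disappears. The sketch below moves the text and tail of each removed <x> onto the previous sibling's tail (or onto the parent's text when there is no previous sibling). The helper name unwrap_children and the shortened sample string are hypothetical, and the code assumes each <x> contains only text, no nested elements.

```python
import xml.etree.ElementTree as ET


def unwrap_children(parent, tag):
    """Remove direct children named `tag`, keeping their text in `parent`.

    Hypothetical helper; assumes the removed elements hold only text.
    """
    prev = None  # last child that was kept
    for child in list(parent):  # iterate over a copy, since we remove items
        if child.tag != tag:
            prev = child
            continue
        # Preserve the element's own text plus the text that followed it.
        kept = (child.text or "") + (child.tail or "")
        if prev is None:
            parent.text = (parent.text or "") + kept
        else:
            prev.tail = (prev.tail or "") + kept
        parent.remove(child)


# Shortened version of the document from the question (an assumption for brevity).
sample = (
    '<all><articletitle>text1<x> </x></articletitle>'
    '<affiliation><x> </x><label id="aff1">12</label>'
    '<affnorg>Shenzhen University</affnorg><x>, </x>'
    '<affncountry>PR China</affncountry><x>.</x></affiliation></all>'
)

root = ET.fromstring(sample)
for aff in root.iter("affiliation"):
    unwrap_children(aff, "x")

result = ET.tostring(root, encoding="unicode")
print(result)
```

Only the affiliation elements are passed to the helper, so the <x> inside articletitle is left as it is, matching the desired output in the question.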
