Remove tags from HTML except specific ones (but keep their contents)

Question

I use this code to strip all tags from a piece of HTML. I need to keep <br> and <br/>, so I use this code:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)

The output is:

aaaRadio and<BR> television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

The result is correct, but now I want to keep <p> and </p> as well as <br> and <br/>.

How should I modify my code?

Tags: python, regex, python-3.x, parsing, html-parsing

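Answer

Using an HTML parser is more robust than using regular expressions. Regular expressions should not be used to parse nested structures such as HTML.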
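As a small illustration of the robustness point, the sketch below applies the tag-matching part of the question's pattern to an input whose attribute value contains a ">" character. The input string is a hypothetical example, not taken from the question. The pattern stops at the first ">" it sees, so part of the attribute leaks into the result.

```python
import re

# Hypothetical input: the attribute value itself contains a ">" character.
html = '<a title="1 > 0">link</a>'

# The tag-matching part of the pattern from the question.
stripped = re.sub(r'<[^>]*>', '', html)

# Part of the attribute survives because the match ended at the first ">".
print(repr(stripped))  # ' 0">link'
```

A parser tracks quoting inside the tag and would treat the whole `<a ...>` as one element.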

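Here is a working implementation. It iterates over all HTML tags and unwraps every tag that is not p or br, which removes the tag itself but keeps its contents: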

from bs4 import BeautifulSoup

mystring = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'

soup = BeautifulSoup(mystring,'html.parser')
for e in soup.find_all():
    if e.name not in ['p','br']:
        e.unwrap()
print(soup)

Output:

aaa<p>Radio and<br/> television.<br/></p><p>very<br> popular in the world today.</br></p><p>Millions of people watch TV. </p><p>That’s because a radio is very small 98.2%</p><p>and it‘s easy to carry. haha100%</p>bb
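If installing a third-party package is not an option, the same idea can probably be sketched with the standard library's html.parser.HTMLParser. This is only a minimal sketch under stated assumptions, not the method from the answer above: the names TagWhitelistStripper and strip_tags_except are hypothetical, and the code assumes reasonably well-formed input. Unlike the BeautifulSoup version, it echoes kept start tags exactly as written instead of normalizing them.

```python
from html.parser import HTMLParser


class TagWhitelistStripper(HTMLParser):
    """Hypothetical helper: drop tags not in `keep`, pass text through."""

    def __init__(self, keep):
        # convert_charrefs=False so entities such as &amp; are not decoded.
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # `tag` arrives lowercased; get_starttag_text() is the raw source text.
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: emit once, without a synthetic end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags_except(html, keep=("p", "br")):
    """Return `html` with every tag removed except those named in `keep`."""
    parser = TagWhitelistStripper(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags_except('a<p>x<BR>y<br/></p><span class="c">z</span>b'))
# a<p>x<BR>y<br/></p>zb
```

Passing the allowed tag names as a parameter makes it easy to keep a different set later, for example strip_tags_except(text, keep=("p", "br", "b")).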
