python - Remove unmatched closing tags from HTML

Problem description

I have some HTML stored in a string. The HTML is invalid and contains an unmatched closing </span> inside a <td>, i.e.:

<table>
  <tr><td>
    <p>First section of text.</p>
    <p>Second section of text.</span></p>
    <table>
      <tr><td>
        <p>Third section of text.</p>
      </td></tr>
    </table>
  </td></tr>
</table>

<p>Fourth section of text.</p>

I'd like to use BeautifulSoup to amend the HTML. When I load this HTML into BS and extract it as a string using:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

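BS substantially alters the structure: the nested table is moved out of the outer <td> and becomes a sibling of the outer table.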

<table>
 <tr>
  <td>
   <p>
    First section of text.
   </p>
   <p>
    Second section of text.
   </p>
  </td>
 </tr>
</table>
<table>
 <tr>
  <td>
   <p>
    Third section of text.
   </p>
  </td>
 </tr>
</table>
<p>
 Fourth section of text.
</p>

Without the unmatched </span>, BS outputs the following, as I would expect:

<table>
 <tr>
  <td>
   <p>
    First section of text.
   </p>
   <p>
    Second section of text.
   </p>
   <table>
    <tr>
     <td>
      <p>
       Third section of text.
      </p>
     </td>
    </tr>
   </table>
  </td>
 </tr>
</table>
<p>
 Fourth section of text.
</p>

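What I'd like to do is remove the unmatched </span> from the HTML. How can I do this without writing my own parser to look for unmatched tags? I was hoping I could use BS to clean up the code, but it doesn't work.

Tags: python, html, beautifulsoup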

You can split the string on "</span>" and then join the pieces back together.

from bs4 import BeautifulSoup
data='''
<table>
  <tr><td>
    <p>First section of text.</p>
    <p>Second section of text.</span></p>
    <table>
      <tr><td>
        <p>Third section of text.</p>
      </td></tr>
    </table>
  </td></tr>
</table>

<p>Fourth section of text.</p>
'''


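# Split on the stray closing tag and join the pieces back together.
data = "".join(item.strip() for item in data.split("</span>"))
# Parse only after the cleanup, so the parser never sees the stray tag.
soup = BeautifulSoup(data, 'html.parser')
print(data)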

This is the printed output.

<table>
  <tr><td>
    <p>First section of text.</p>
    <p>Second section of text.</p>
    <table>
      <tr><td>
        <p>Third section of text.</p>
      </td></tr>
    </table>
  </td></tr>
</table>

<p>Fourth section of text.</p>
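
Splitting on "</span>" removes every closing span, so it only suits input that has no legitimate <span> elements. If matched spans need to survive, one possible approach is a small pass with the standard library's html.parser that re-emits the markup and skips only closing tags with no open counterpart. The following is a rough sketch, not a full HTML repair tool; the names UnmatchedEndTagStripper and strip_unmatched_end_tags are made up for this example, and closing tags are written back in lowercase.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class UnmatchedEndTagStripper(HTMLParser):
    """Re-emit the input, skipping closing tags that have no opener."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as entities.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # names of the currently open elements
        self.out = []        # pieces of the cleaned document

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit it, but do not track it.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the nearest matching opener and anything left open in it.
            while self.open_tags.pop() != tag:
                pass
            self.out.append(f"</{tag}>")
        # Otherwise the closing tag is unmatched and is dropped.

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_unmatched_end_tags(html):
    """Return html with unmatched closing tags removed (hypothetical helper)."""
    parser = UnmatchedEndTagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_unmatched_end_tags("<p>Second section of text.</span></p>"))
# prints: <p>Second section of text.</p>
```

The cleaned string can then be passed to BeautifulSoup as usual. A matched <span>...</span> pair is left untouched, because the closing tag finds its opener on the stack.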

If a complete <span> tag is present in your HTML and you want to remove it from the HTML, use decompose().

from bs4 import BeautifulSoup
data='''
<table>
  <tr><td>
    <p>First section of text.</p>
    <p>Second section of text.<span>xxxxx</span></p>
    <table>
      <tr><td>
        <p>Third section of text.</p>
      </td></tr>
    </table>
  </td></tr>
</table>

<p>Fourth section of text.</p>
'''


soup=BeautifulSoup(data, 'html.parser')

soup.span.decompose()
print(soup)
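
Note that decompose() also discards whatever is inside the tag (the "xxxxx" text here). If the goal is to drop only the <span> wrapper and keep its contents, BeautifulSoup's unwrap() may be a better fit. Below is a small comparison on a one-line snippet; the variable names are only for illustration, and soup.span targets just the first <span> in the document.

```python
from bs4 import BeautifulSoup

snippet = "<p>Second section of text.<span>xxxxx</span></p>"

# decompose() removes the tag together with everything inside it.
soup = BeautifulSoup(snippet, "html.parser")
soup.span.decompose()
removed = str(soup)

# unwrap() removes only the tag itself and leaves its children in place.
soup = BeautifulSoup(snippet, "html.parser")
soup.span.unwrap()
unwrapped = str(soup)

print(removed)    # <p>Second section of text.</p>
print(unwrapped)  # <p>Second section of text.xxxxx</p>
```

To handle every <span> rather than the first one, loop over soup.find_all("span") and call the same method on each result.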
