How to remove text inside the outermost parentheses

Problem description

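I am trying to use a regular expression in Python to remove, from a web page, all text inside the outermost parentheses that contain links, but without success.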

I tried a few regex patterns, such as this one:

paragraph = re.sub(r'\(.*[<a]+\)', '', p)

The idea was to check whether a link tag is present between the outermost parentheses.
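A side note on the pattern above, as a minimal sketch: in Python's `re` syntax, `[<a]+` is a character class, so it matches one or more characters that are each either `<` or `a`, not the two-character sequence `<a`. The sample strings below are made up for illustration.

```python
import re

# Character class: one or more characters, each being '<' or 'a'.
char_class = re.compile(r'[<a]+')
# Literal sequence: '<' immediately followed by 'a'.
literal = re.compile(r'<a')

# Hypothetical sample strings, not taken from the page.
for sample in ['<a href="x">', 'aaa', 'a<<', 'banana']:
    print(repr(sample), bool(char_class.search(sample)), bool(literal.search(sample)))
```

Every sample matches the character class because each contains an `a`, while only the first one contains the literal `<a`.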

Take this example from Wikipedia:

    Rwanda (/ruˈɑːndə, -ˈæn-/ (About this soundlisten); Kinyarwanda: U Rwanda [u.ɾɡwaː.nda] (About this soundlisten)), officially the Republic of Rwanda (Kinyarwanda: Repubulika y'u Rwanda; Swahili: Jamhuri ya Rwanda; French: République du Rwanda) , is a country in Central ...

Input text:

'<p><b>Rwanda</b> (<span class="nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="\'r\' in \'rye\'">r</span><span title="/u/: \'u\' in \'influence\'">u</span><span title="/ˈ/: primary stress follows">ˈ</span><span title="/ɑː/: \'a\' in \'father\'">ɑː</span><span title="\'n\' in \'nigh\'">n</span><span title="\'d\' in \'dye\'">d</span><span title="/ə/: \'a\' in \'about\'">ə</span></span>, <wbr/>-<span style="border-bottom:1px dotted"><span title="/ˈ/: primary stress follows">ˈ</span><span title="/æ/: \'a\' in \'bad\'">æ</span><span title="\'n\' in \'nigh\'">n</span></span>-/</a></span> <span class="nowrap" style="font-size:85%"><bracket><span class="unicode haudio"><span class="fn"><span style="white-space:nowrap;margin-right:.25em;"><a href="/wiki/File:Rwanda_pronunciation.ogg" title="About this sound"><img alt="About this sound" data-file-height="20" data-file-width="20" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/11px-Loudspeaker.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/17px-Loudspeaker.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/22px-Loudspeaker.svg.png 2x" width="11"/></a></span><a class="internal" href="//upload.wikimedia.org/wikipedia/commons/2/2c/Rwanda_pronunciation.ogg" title="Rwanda pronunciation.ogg">listen</a></span></span>)</span></span>; <a class="mw-redirect" href="/wiki/Kinyarwanda_language" title="Kinyarwanda language">Kinyarwanda</a>:  <small></small><span class="IPA" title="Representation in the International Phonetic Alphabet <bracket>IPA)"><a href="/wiki/Help:IPA" title="Help:IPA">[u.ɾɡwaː.nda]</a></span> <span class="nowrap" style="font-size:85%"><bracket><span class="unicode haudio"><span class="fn"><span style="white-space:nowrap;margin-right:.25em;"><a 
href="/wiki/File:Rwanda_<bracket>rw)_pronunciation.ogg" title="About this sound"><img alt="About this sound" data-file-height="20" data-file-width="20" decoding="async" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/11px-Loudspeaker.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/17px-Loudspeaker.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/22px-Loudspeaker.svg.png 2x" width="11"/></a></span><a class="internal" href="//upload.wikimedia.org/wikipedia/commons/3/34/Rwanda_%28rw%29_pronunciation.ogg" title="Rwanda <bracket>rw) pronunciation.ogg">listen</a></span></span>)</span>), officially the <b>Republic of Rwanda</b> <bracket><a class="mw-redirect" href="/wiki/Kinyarwanda_language" title="Kinyarwanda language">Kinyarwanda</a>: ; <a class="mw-redirect" href="/wiki/Kiswahili" title="Kiswahili">Swahili</a>: ; <a href="/wiki/French_language" title="French language">French</a>: ), is a country  in <a href="/wiki/Central_Africa" title="Central Africa">Central</a> ... </p>'

I expect the following output:

Rwanda, officially the Republic of Rwanda ...

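However, it fails: it matches everything from the first opening parenthesis to the last closing parenthesis, instead of matching only the first outer group.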
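The over-matching can be reproduced on a shortened, made-up string (a stand-in for the paragraph, not the real page). This sketch assumes the simplified patterns `\(.*\)` and `\(.*?\)` rather than the exact pattern above, and shows that neither the greedy nor the non-greedy form respects nesting.

```python
import re

# Hypothetical, shortened stand-in for the paragraph text.
text = "Rwanda (a (b); c (d)), officially the Republic of Rwanda (e) , is a country"

# Greedy: '.*' runs on to the last ')' in the string,
# so the text between the two outer groups is lost as well.
greedy = re.sub(r'\(.*\)', '', text)
print(repr(greedy))

# Non-greedy: '.*?' stops at the first ')' it meets,
# which cuts a nested group in half and leaves a stray ')'.
lazy = re.sub(r'\(.*?\)', '', text)
print(repr(lazy))
```

A single regex of this kind has no notion of nesting depth, which is why the groups have to be tracked some other way.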

Can I do this with a regular expression, or do I have to look elsewhere?

Tags: python, regex

Solution


You can turn this into an HTML-parsing problem and let BeautifulSoup handle the nesting (assuming the parentheses are balanced):

s = '''
Rwanda (/ruˈɑːndə, -ˈæn-/ (About this soundlisten); Kinyarwanda: U Rwanda [u.ɾɡwaː.nda] (About this soundlisten)), officially the Republic of Rwanda (Kinyarwanda: Repubulika y'u Rwanda; Swahili: Jamhuri ya Rwanda; French: République du Rwanda)
'''

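import re
from bs4 import BeautifulSoup

# Turn every parenthesis into a pseudo-tag so that nesting becomes tag nesting.
s = re.sub(r'\(', r'<bracket>', s)
s = re.sub(r'\)', r'</bracket>', s)

# Parse the result (the 'lxml' parser requires the lxml package).
soup = BeautifulSoup(s, 'lxml')

# Remove each <bracket> element together with everything nested inside it.
for bracket in soup.select('bracket'):
    bracket.extract()

# Drop the whitespace left in front of commas.
s = re.sub(r'\s+,', r',', soup.body.text.strip())

print(s)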

Prints:

Rwanda, officially the Republic of Rwanda
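If pulling in an HTML parser only to track nesting feels heavy, a plain depth counter is one possible alternative. The following is a sketch, not part of the original answer: `strip_parens` is a hypothetical helper name, and the sample string is a shortened stand-in for the paragraph text. It assumes the input is already plain text (tags stripped).

```python
import re


def strip_parens(text: str) -> str:
    """Remove every balanced (...) group, nested groups included."""
    kept = []
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            if depth > 0:
                depth -= 1
            else:
                kept.append(ch)  # unmatched ')': keep it as ordinary text
        elif depth == 0:
            kept.append(ch)  # only keep characters outside all groups
    return ''.join(kept)


# Hypothetical, shortened stand-in for the paragraph text.
sample = "Rwanda (a (b); c (d)), officially the Republic of Rwanda (e)"

# Same comma clean-up as in the answer above.
cleaned = re.sub(r'\s+,', ',', strip_parens(sample)).strip()
print(cleaned)
```

Like the BeautifulSoup approach, this relies on balanced parentheses: an unmatched `(` makes the counter discard everything up to the end of the string.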
