python - Given an HTML paragraph and a link, is there a way to retrieve the text before and the text after the link inside the paragraph in Python?
Problem description
I am using urllib3 to get the HTML of some pages.
I want to retrieve the text of the paragraph that contains a given link, with the text before and the text after the link stored separately.
For example:
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
r = http.request('GET', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')
for a in soup.find_all('a'):
    if a.has_attr('href'):
        if a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general":
            link_text = a
            link_para = a.find_parent("p")
            print(link_text)
            print(link_para)
Paragraph
<p>The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An <nobr>October 2000</nobr> article in <a
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a> didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the <a
href="http://www.ecunet.org/whatisecupage.html"
onmouseout="window.status='';return true"
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a>
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:</p>
Link
<a href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a>
Text to be retrieved (2 parts)
The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy ... was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An October 2000 article in
didn’t mention anything about little Michael’s medical
condition but said that his family was ... turned 3 years old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:
I can't simply call get_text() and then split on the link text, because the link text might be repeated elsewhere in the paragraph.
I thought I might add a counter to see how many times the link text is repeated, use split(), then use a loop to get the parts I want.
I would appreciate a cleaner, less messy method, though.
Solution
You can iterate over the contents of the <a> tag's parent and check whether each child is the <a> tag itself. When it is, one part is complete and you start building the next:
data = '''<p>The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An <nobr>October 2000</nobr> article in <a
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a> didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the <a
href="http://www.ecunet.org/whatisecupage.html"
onmouseout="window.status='';return true"
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a>
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:</p>'''
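from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
link_url = 'http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general'
a = soup.find('a', href=link_url)
# Walk the direct children of the link's parent, accumulating markup
# until the link itself is reached, then start a new part.
s, parts = '', []
for t in a.parent.contents:
    # Compare by identity: == on tags compares content, so a second,
    # identical link in the same paragraph would also match.
    if t is a:
        parts.append(s)
        s = ''
        continue
    s += str(t)
parts.append(s)
# Re-parse each part to strip the remaining tags (<nobr>, other links).
for part in parts:
    print(BeautifulSoup(part, 'lxml').body.text.strip())
    print('*' * 80)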
Prints:
The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An October 2000 article in
********************************************************************************
didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the Ecunet
mailing list indicated that Michael had just turned 3 years old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:
********************************************************************************
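The loop above only looks at the direct children of a.parent, so it relies on the link being a direct child of the paragraph. If the link may sit deeper (for example inside a <b> or <span>), one possible variation is to walk every descendant of the paragraph in document order and collect the text nodes on either side of the link. This is only a sketch: the helper name split_around_link and the sample HTML are made up for illustration, and 'html.parser' is used here just so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup, NavigableString


def split_around_link(paragraph, link):
    """Return (text_before, text_after) for `link` inside `paragraph`.

    Hypothetical helper: walks the paragraph's descendants in document
    order and skips any text that lives inside the link itself.
    """
    before, after = [], []
    seen_link = False
    for node in paragraph.descendants:
        if node is link:
            seen_link = True
            continue
        if isinstance(node, NavigableString):
            # Text nodes that belong to the link are not part of either side.
            if any(parent is link for parent in node.parents):
                continue
            (after if seen_link else before).append(str(node))
    return "".join(before), "".join(after)


# Made-up sample: the link is nested in <b>, and its text ("site")
# is repeated later in the paragraph.
html = ('<p>See <b>the <a href="http://example.com/x">site</a> today</b>'
        ' or the site later.</p>')
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a", href="http://example.com/x")
before, after = split_around_link(a.find_parent("p"), a)
print(repr(before))  # 'See the '
print(repr(after))   # ' today or the site later.'
```

Because the link is matched by identity rather than by its text, repeated link text elsewhere in the paragraph does not affect where the split happens.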