BeautifulSoup - Getting the text after a tag

Problem description

Tags: python, html, web-scraping, beautifulsoup

Solution


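I think this is what you're looking for. For each strong element it finds the parent p element, converts that soup object to a string, removes the strong element's markup from the string, and then parses the string back into a soup object.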

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><strong>High School Honors: </strong><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>", 'html.parser')
headerList = []
infoList = []

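for strong_tag in soup.find_all('strong'):
    # Locate the enclosing <p> element.
    parent = strong_tag.find_parent('p')
    # Serialize the paragraph and cut out the <strong> markup.
    content = str(parent).replace(str(strong_tag), '')
    # Re-parse so the remainder is a soup object again.
    souped_content = BeautifulSoup(content, 'html.parser')
    infoList.append(souped_content)
    headerList.append(strong_tag)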

print(headerList)
print(infoList)

This prints the following:

[<strong>High School Honors: </strong>]
[<p><em>Parade </em>All-American; <em>Chicago Sun-Times </em>Illinois Player of the Year honors; rushed for 2,100 yards and 31 TDs as a senior; led team to 14-0 record and Class 4A State Championship as a junior with 1,820 yards and 26 TDs; also lettered in baseball.</p>]
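
If only the plain text after the strong tag is needed, the string round trip can probably be avoided by walking the tag's `next_siblings` instead. The following is a minimal sketch, not part of the original answer: the shortened HTML string and the names `html`, `headers`, and `infos` are made up for illustration, and it assumes everything of interest sits in the same parent as the strong tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample markup (an assumption for this sketch, not the full original string).
html = (
    "<p><strong>High School Honors: </strong><em>Parade </em>"
    "All-American; also lettered in baseball.</p>"
)
soup = BeautifulSoup(html, "html.parser")

headers = []
infos = []

for strong_tag in soup.find_all("strong"):
    # Text of the header itself, with surrounding whitespace trimmed.
    headers.append(strong_tag.get_text(strip=True))

    # next_siblings yields every node that follows the tag inside its parent:
    # nested tags (such as <em>) and bare text nodes alike.
    pieces = []
    for sibling in strong_tag.next_siblings:
        if isinstance(sibling, Tag):
            pieces.append(sibling.get_text())
        else:
            pieces.append(str(sibling))
    infos.append("".join(pieces).strip())

print(headers)
print(infos)
```

Unlike the string-replacement version, this yields plain strings rather than soup objects, so the em markup is dropped; which form is preferable depends on whether the inner tags are still needed.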
