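python - Unable to extract all content after a certain "b" tag up to the next "b" tag
Problem description
I am trying to scrape all the content that follows any given b tag, up to the next b tag on the web page. To help you visualize the whole picture, I have attached the relevant HTML elements.
The following three are the contents of the b tags: Job Description, This is an internship, and Qualifications. So, when I pick any one b tag, I want to scrape whatever lies between that particular b tag and the next b tag.
This is what I tried: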
import requests
from bs4 import BeautifulSoup
from itertools import takewhile
link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
desc = [i.get_text(strip=True) for i in takewhile(lambda tag: tag.name!='b', soup.select("div > b:contains('Job Description') ~ *"))]
print(desc)
The output I get:
['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.', '', 'This is an internship.', 'Qualifications']
The output I would like to get (dropping the contents of the last two b tags):
['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.']
Edit:
Here is another link for you to test with.
Solution
Try this:
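import requests
from bs4 import BeautifulSoup
link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
soup = BeautifulSoup(requests.get(link).text, "lxml")
# Collect every non-empty text node in the document whose direct parent
# is not a <b> tag. string=True is the current name for the older text=True.
desc = [
    i.strip() for i in soup.find_all(string=True)
    if i.strip() and i.parent.name != "b"
]
print("\n".join(desc))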
Output:
This job entails researching developing testing and deploying mechanical solutions.
Typical activities include:
Designing and developing thermal or mechanical tooling systems.
The ideal candidate should exhibit the following behavioral traits:
Work in a technically diverse environment-Adapt to changing requirements.
Verbal and written communication.
Project management.
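The answer above drops every string whose parent is a b tag anywhere on the page. If the goal is strictly the content between one chosen b tag and the next one, a possible approach is to walk the document in order from the chosen tag and stop at the first following b tag, however deeply it is nested. The following is only a minimal sketch: the inline HTML is a made-up stand-in for the real page (whose exact markup is not shown here), the function name text_between_b_tags is hypothetical, and the built-in "html.parser" is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Made-up sample markup (an assumption, not the real page). Note that some
# <b> tags are wrapped in <p>, so they are not direct siblings of the first one.
HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails researching mechanical solutions.</p>
  <br>
  <ul><li>Verbal and written communication.</li><li>Project management.</li></ul>
  <p><b>This is an internship.</b></p>
  <p><b>Qualifications</b></p>
  <p>Bachelor's degree.</p>
</div>
"""


def text_between_b_tags(soup, heading):
    """Return the non-empty strings between the <b> tag whose text is
    exactly `heading` and the next <b> tag in document order."""
    start = soup.find("b", string=heading)
    if start is None:
        return []

    collected = []
    # next_elements yields every following node in document order,
    # beginning with the children of `start` itself.
    for node in start.next_elements:
        if isinstance(node, Tag):
            if node.name == "b":
                break  # reached the next <b>, at any nesting depth
            continue
        if isinstance(node, NavigableString):
            # Skip the heading's own text, which lives inside `start`.
            if any(parent is start for parent in node.parents):
                continue
            text = node.strip()
            if text:
                collected.append(text)
    return collected


soup = BeautifulSoup(HTML, "html.parser")
print(text_between_b_tags(soup, "Job Description"))
print(text_between_b_tags(soup, "Qualifications"))
```

Because the loop follows document order rather than sibling order, it stops at a b tag even when that tag is nested inside another element, which a sibling-based takewhile would not see. Against the real page, the same function could be called on a soup built from requests.get(link).text.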