Unable to extract all content after a given "b" tag up to the next "b" tag

Problem description

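I'm trying to scrape all the content that follows any b tag, up to the next b tag on the web page. To help you picture the whole thing, I've attached the relevant HTML elements.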

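The b tags contain the following three pieces of text: "Job Description", "This is an internship." and "Qualifications". So, when I pick any b tag, I want to scrape everything between that particular b tag and the next b tag.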

I tried this:

import requests
from bs4 import BeautifulSoup
from itertools import takewhile

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
desc = [i.get_text(strip=True) for i in takewhile(lambda tag: tag.name!='b', soup.select("div > b:contains('Job Description') ~ *"))]
print(desc)

Output I'm getting:

['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.', '', 'This is an internship.', 'Qualifications']

Output I'd like to get (dropping the content of the last two b tags):

['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.']

Edit:

Here is another link for you to test with.

Tags: python, python-3.x, web-scraping, beautifulsoup

Solution


Try this:

import requests
from bs4 import BeautifulSoup

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
soup = BeautifulSoup(requests.get(link).text, "lxml")
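# Keep every non-empty text node whose direct parent is not a <b> tag.
# (string=True is the current spelling of the older text=True argument.)
desc = [
    i.strip() for i in soup.find_all(string=True)
    if i.strip() and i.parent.name != "b"
]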
print("\n".join(desc))

Output:

This job entails researching developing testing and deploying mechanical solutions.
Typical activities include:
Designing and developing thermal or mechanical tooling systems.
The ideal candidate should exhibit the following behavioral traits:
Work in a technically diverse environment-Adapt to changing requirements.
Verbal and written communication.
Project management.
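
The snippet above filters out b-tag text across the whole document, so it only gives the desired result when the page contains nothing beyond this one section. If the goal is the content between one specific b tag and the next, a rough sketch along the following lines may help. It assumes the headings are plain b tags whose text matches exactly. `SAMPLE_HTML` and `text_after_heading` are hypothetical names made up for illustration, and the sample markup is a stand-in, since the real page's HTML is not reproduced here.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical stand-in markup; the structure of the real page is an assumption.
SAMPLE_HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails researching and testing.</p>
  <ul><li>Verbal and written communication.</li><li>Project management.</li></ul>
  <b>This is an internship.</b>
  <b>Qualifications</b>
  <p>A degree in engineering.</p>
</div>
"""


def text_after_heading(soup, heading):
    """Return the non-empty strings between the <b> whose text is
    `heading` and the next <b> in document order."""
    start = soup.find("b", string=heading)
    if start is None:
        return []
    end = start.find_next("b")  # None when no later <b> exists
    collected = []
    # next_elements walks tags and strings in document order,
    # beginning with the heading's own text node.
    for node in start.next_elements:
        if node is end:
            break
        # Skip tags, and skip the heading's own text.
        if isinstance(node, NavigableString) and node.parent is not start:
            text = node.strip()
            if text:
                collected.append(text)
    return collected


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_after_heading(soup, "Job Description"))
```

Because the walk follows document order instead of siblings, it should also cope with headings or content nested at different depths, though that has only been checked against the sample markup here, not the original page.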
