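python - Splitting by bs4 tags / getting the text between two tags
Problem description
I am currently trying to read the text between two tags on a web page.
This is my code so far: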
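from bs4 import BeautifulSoup

# r is the requests response for the page being scraped
soup = BeautifulSoup(r.text, 'lxml')
text = soup.get_text()
tag_one = soup.select_one('div.first-header')
tag_two = soup.select_one('div.second-header')
# str.split() only accepts strings, so split on each tag's text, not on the Tag object
text = text.split(tag_one.get_text(), 1)[1]
text = text.split(tag_two.get_text(), 1)[0]
print(text)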
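Basically, I am trying to get the text between the first and second headers by identifying their tags. I plan to do this by splitting on the first tag and then on the second tag. Is this even possible? Is there a smarter way to do it?
Example: if you look at https://en.wikipedia.org/wiki/Python_(programming_language)
I want to find a way to extract the text under "History" by identifying the tags for "History" and "Features and philosophy" and splitting on those tags.
Solution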
Beautiful Soup 4.7+ has much better CSS selector support (provided by the Soup Sieve library). This task can be done with the CSS level 4 :has()
pseudo-class, which select() now supports:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)").text
soup = BeautifulSoup(html, "lxml")
# Match every sibling that follows the "History" h2 and precedes the "Features and philosophy" h2
els = soup.select('h2:has(span#History) ~ *:has(~ h2:has(span#Features_and_philosophy))')
with open('text.txt', 'w', encoding='utf-8') as f:
    for el in els:
        text = el.get_text()
        print(text)
        f.write(text + '\n')  # also save the section to text.txt
The output:
Guido van Rossum at OSCON 2006.Main article: History of PythonPython was conceived in the late 1980s[31] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL)[32], capable of exception handling and interfacing with the Amoeba operating system.[7] Its implementation began in December 1989.[33] Van Rossum's long influence on Python is reflected in the title given to him by the Python community: Benevolent Dictator For Life (BDFL) – a post from which he gave himself permanent vacation on July 12, 2018.[34]
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode.[35]
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible.[36] Many of its major features were backported to Python 2.6.x[37] and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.[38]
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.[39][40] In January 2017, Google announced work on a Python 2.7 to Go transcompiler to improve performance under concurrent workloads.[41]
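If the selector feels hard to read, the same range can probably also be collected by walking the siblings of the first heading until the second one is reached. The following is only a minimal sketch under stated assumptions: SAMPLE_HTML is a hypothetical miniature page that mimics the heading layout the selector above expects, text_between is an illustrative helper name (not part of Beautiful Soup), and html.parser is used instead of lxml so the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page (an assumption for illustration, not real Wikipedia markup).
SAMPLE_HTML = """
<div>
  <h2><span id="Intro">Intro</span></h2>
  <p>Skipped.</p>
  <h2><span id="History">History</span></h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <h2><span id="Features_and_philosophy">Features and philosophy</span></h2>
  <p>Also skipped.</p>
</div>
"""


def text_between(soup, start_selector, end_selector):
    """Return the text of each sibling element between two marker elements.

    Illustrative helper: assumes both markers share the same parent.
    """
    start = soup.select_one(start_selector)
    end = soup.select_one(end_selector)
    if start is None or end is None:
        return []
    texts = []
    # With no arguments, find_next_siblings() returns the following Tag siblings in order.
    for sibling in start.find_next_siblings():
        if sibling is end:  # identity check: stop at the second marker itself
            break
        texts.append(sibling.get_text())
    return texts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
section = text_between(
    soup, "h2:has(span#History)", "h2:has(span#Features_and_philosophy)"
)
print(section)  # ['First paragraph.', 'Second paragraph.']
```

Note that if the end marker is not a sibling of the start marker, this sketch collects everything up to the end of the parent, so both approaches depend on the page's actual markup matching the selectors.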