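python - Unable to group output by headings and their corresponding paragraphs
Problem description
I'm trying to fetch each heading and its corresponding paragraphs from the html below. The result should be stored as a list of dictionaries, one per heading. Everything I've tried so far produces ludicrously haphazard output. I've deliberately left out my current output to keep the question short.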
html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have...</p>
<p>How often have we been told</p>
<h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
<p>Building a model of society based...</p>
<p>All macroscopic systems...</p>
<h3>COMPLEXITY PROFILE</h3>
<p>It is much easier to think about the...</p>
<p>A formal definition of scale considers...</p>
<p>The complexity profile counts...</p>
<h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
<p>Using this argument it is straightforward...</p>
<h2>Conclusion</h2>
<p>There are two natural conclusions...</p>
"""
I've tried this (it produces messy output):
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
data = []
for item in soup.select("h1,h2,h3,h4,h5,h6"):
    d = {}
    d['title'] = item.text
    d['text'] = [i.text for i in item.find_next_siblings('p')]
    data.append(d)
print(json.dumps(data,indent=4))
Output I wish to get:
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
Solution
You can use find_previous() to check whether you are still in the correct "section":
import re
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
r = re.compile(r"h\d+", flags=re.I)
data = []
for h in soup.find_all(name=r):
    data.append({"title": h.get_text(strip=True), "text": []})
    p = h.find_next_sibling("p")
    # Keep a paragraph only while the nearest preceding heading is still h
    while p and p.find_previous(name=r) == h:
        data[-1]["text"].append(p.get_text())
        p = p.find_next_sibling("p")
print(json.dumps(data, indent=4))
Prints:
[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
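To see why the find_previous() check matters, here is a minimal sketch on a shortened, made-up document (the two-heading html and the variable names are illustrative assumptions, and "html.parser" is used only to avoid the lxml dependency). It shows that find_next_siblings("p") does not stop at the next heading, while find_previous() tells you which heading a paragraph actually belongs to:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical two-section document, much shorter than the one in the question
html = """
<h1>First</h1>
<p>a</p>
<h2>Second</h2>
<p>b</p>
"""

soup = BeautifulSoup(html, "html.parser")
heading = re.compile(r"^h[1-6]$")

# find_next_siblings("p") returns every later <p> sibling,
# including the one that sits under the next heading.
h1 = soup.find("h1")
all_following = [p.get_text() for p in h1.find_next_siblings("p")]

# find_previous(heading) returns the nearest heading before each paragraph.
owners = [(p.get_text(), p.find_previous(heading).get_text())
          for p in soup.find_all("p")]

print(all_following)
print(owners)
```

That is the likely reason the attempt in the question collects paragraphs from later sections under every earlier heading.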
Edit: A good idea is to use itertools.takewhile:
import re
import json
from bs4 import BeautifulSoup
from itertools import takewhile

soup = BeautifulSoup(html, "lxml")
r = re.compile(r"h\d+", flags=re.I)
data = []
for h in soup.find_all(name=r):
    data.append({"title": h.get_text(strip=True), "text": []})
    for p in takewhile(lambda p: p.find_previous(name=r) == h, h.find_next_siblings("p")):
        data[-1]["text"].append(p.get_text())
print(json.dumps(data, indent=4))
Using only a list comprehension:
data = [
    {
        "title": h.get_text(strip=True),
        "text": [
            p.get_text()
            for p in takewhile(
                lambda p: p.find_previous(name=r) == h,
                h.find_next_siblings("p"),
            )
        ],
    }
    for h in soup.find_all(name=r)
]
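As a possible alternative that is not part of the answer above, the same grouping can be sketched as a single pass over headings and paragraphs in document order, which avoids calling find_previous() for every paragraph. The function name group_by_heading, the shortened sample html, and the choice of "html.parser" are assumptions made for illustration:

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical sample in the same shape as the question's html
html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have...</p>
<p>How often have we been told</p>
<h2>Conclusion</h2>
<p>There are two natural conclusions...</p>
"""

def group_by_heading(markup):
    """Attach each <p> to the most recent heading seen in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = []
    # find_all returns matching tags in document order
    for tag in soup.find_all(re.compile(r"^(h[1-6]|p)$")):
        if tag.name == "p":
            # Paragraphs that appear before any heading are skipped
            if sections:
                sections[-1]["text"].append(tag.get_text())
        else:
            sections.append({"title": tag.get_text(strip=True), "text": []})
    return sections

data = group_by_heading(html)
```

Note one difference in behavior: this version groups by document order rather than by sibling relationship, so paragraphs nested inside other elements would also be attached to the preceding heading.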