Unable to group output by heading and its corresponding paragraphs

Problem description

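I am trying to get each heading and its corresponding paragraphs from the HTML below. The result should be stored as a list of dictionaries, one per heading. Everything I have tried so far produces ludicrously haphazard output. I have deliberately left out the current output for the sake of brevity.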

html = """
<h1>a Complexity Profile</h1>
<p>Since time immemorial humans have...</p>
<p>How often have we been told</p>

<h2>INDIVIDUAL AND COLLECTIVE BEHAVIOR</h2>
<p>Building a model of society based...</p>
<p>All macroscopic systems...</p>

<h3>COMPLEXITY PROFILE</h3>
<p>It is much easier to think about the...</p>
<p>A formal definition of scale considers...</p>
<p>The complexity profile counts...</p>

<h2>CONTROL IN HUMAN ORGANIZATIONS</h2>
<p>Using this argument it is straightforward...</p>

<h2>Conclusion</h2>
<p>There are two natural conclusions...</p>
"""

What I tried (produces messy output):

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
data = []
for item in soup.select("h1,h2,h3,h4,h5,h6"):
    d = {}
    d['title'] = item.text
    d['text'] = [i.text for i in item.find_next_siblings('p')]
    data.append(d)

print(json.dumps(data,indent=4))

Expected output:

[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]

Tags: python, python-3.x, web-scraping, beautifulsoup

Solution


You can use find_previous() to check whether you are still in the correct "section":

import re
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
r = re.compile(r"h\d+", flags=re.I)

data = []
for h in soup.find_all(name=r):
    data.append({"title": h.get_text(strip=True), "text": []})
    p = h.find_next_sibling("p")
    while p and p.find_previous(name=r) == h:
        data[-1]["text"].append(p.get_text())
        p = p.find_next_sibling("p")

print(json.dumps(data, indent=4))

Prints:

[
    {
        "title": "a Complexity Profile",
        "text": [
            "Since time immemorial humans have...",
            "How often have we been told"
        ]
    },
    {
        "title": "INDIVIDUAL AND COLLECTIVE BEHAVIOR",
        "text": [
            "Building a model of society based...",
            "All macroscopic systems..."
        ]
    },
    {
        "title": "COMPLEXITY PROFILE",
        "text": [
            "It is much easier to think about the...",
            "A formal definition of scale considers...",
            "The complexity profile counts..."
        ]
    },
    {
        "title": "CONTROL IN HUMAN ORGANIZATIONS",
        "text": [
            "Using this argument it is straightforward..."
        ]
    },
    {
        "title": "Conclusion",
        "text": [
            "There are two natural conclusions..."
        ]
    }
]
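
To see why the find_previous() check matters: on its own, find_next_siblings("p") returns every later <p> sibling, regardless of any headings in between, which is why paragraphs leak into the wrong titles without it. Below is a minimal sketch on a shortened, hypothetical snippet (not the HTML from the question). It uses the built-in "html.parser" instead of "lxml" purely to avoid an extra dependency, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup for illustration only.
snippet = "<h1>A</h1><p>a1</p><h2>B</h2><p>b1</p>"
soup = BeautifulSoup(snippet, "html.parser")

h1 = soup.find("h1")

# Without a boundary check, the paragraph after the <h2> is picked up as well.
all_following = [p.get_text() for p in h1.find_next_siblings("p")]
print(all_following)  # ['a1', 'b1']

# find_previous() returns the closest preceding heading of each paragraph,
# which shows which section the paragraph actually belongs to.
owners = [
    p.find_previous(["h1", "h2"]).get_text()
    for p in h1.find_next_siblings("p")
]
print(owners)  # ['A', 'B']
```

Only the first paragraph has the <h1> as its nearest preceding heading, so the loop in the solution stops as soon as that comparison fails.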

Edit: a good idea is to use itertools.takewhile:

import re
import json
from bs4 import BeautifulSoup
from itertools import takewhile

soup = BeautifulSoup(html, "lxml")
r = re.compile(r"h\d+", flags=re.I)

data = []
for h in soup.find_all(name=r):
    data.append({"title": h.get_text(strip=True), "text": []})
    for p in takewhile(lambda p: p.find_previous(name=r) == h, h.find_next_siblings("p")):
        data[-1]["text"].append(p.get_text())

print(json.dumps(data, indent=4))

Using only a list comprehension:

data = [
    {
        "title": h.get_text(strip=True),
        "text": [
            p.get_text()
            for p in takewhile(
                lambda p: p.find_previous(name=r) == h,
                h.find_next_siblings("p"),
            )
        ],
    }
    for h in soup.find_all(name=r)
]
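
As a possible alternative, not part of the original answer, the same grouping can be sketched as a single pass over headings and paragraphs in document order, appending each paragraph to the most recently seen heading. The function name group_by_heading, the HEADINGS list and the sample markup are hypothetical, and "html.parser" is used here only to avoid the lxml dependency.

```python
import json
from bs4 import BeautifulSoup

HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def group_by_heading(markup):
    """Return [{"title": ..., "text": [...]}, ...] in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    data = []
    # find_all() with a list of names yields matching tags in document order.
    for tag in soup.find_all(HEADINGS + ["p"]):
        if tag.name in HEADINGS:
            # A heading starts a new section.
            data.append({"title": tag.get_text(strip=True), "text": []})
        elif data:
            # A paragraph belongs to the most recent heading;
            # paragraphs before the first heading are skipped.
            data[-1]["text"].append(tag.get_text())
    return data


# Shortened, hypothetical markup for illustration only.
sample = "<h1>A</h1><p>a1</p><p>a2</p><h2>B</h2><p>b1</p>"
result = group_by_heading(sample)
print(json.dumps(result, indent=4))
```

Unlike the sibling-based versions above, this sketch also collects <p> tags that are nested inside other elements rather than being direct siblings of the heading, which may or may not be what you want.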
