How do I scrape multiple <p>s under multiple <h3>s?

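Problem description

I am learning BS4 and have set myself the task of scraping a generic multi-page website. I want to scrape the pages and then put the material into a JSON file. I referred to "How to print paragraphs...", "Python Beautiful..", and "Extract the text between ...", but could not get the result I want.

This is my tag structure: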

<div class ="article-body">
<section class="xyz" >
<h2>title</h2>

<h3>subtitle</h3>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>

<h3>subtitle</h3>
<p>para 1</p>

<h3>subtitle</h3>
<p>para 1</p>  
<p>para 2</p> 
  
</section>

<section> ANOTHER SECTION </section>
</div>

This is my code. I only get the first title, the first subtitle, and the first paragraph:

import urllib.request

from bs4 import BeautifulSoup

url = "https://www.project.com/page-1.aspx"
url2 = "https://www.project.com/page-13.aspx"
    
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

span = soup.find_all('section', {'id': 'section-body'}, {'class': 'text-body'})

for items in span:
    
    for item in items.find_all(['h2']):
        # for title
        title = item.find_next("h2").text
        
        # for subtitle
        subtitle = item.find_next("h3").text
        
        # for para
        para = item.find_next("p").text
        
        print(para)

Here is another approach I tried, which gives me everything joined together:

for items in span:
    # all joined
    data = '\n'.join([item.text for item in items.find_all(["h3","p"])])
   

I even tried merging the paragraphs:

subtitle = []
para = []
for items in span:
    for item in items.find_all(['p', 'h3']):
        if item.name == 'h3':
            title = item.text
            subtitle.append(title)
            print(title)
            
        if item.name =='p':
                if item.find_next('h3'):
                    soup = soup + item.text
                    para.append(soup)
                    soup = ''
                    print(para)
                    print('******\n\n')
                else:
                    soup = item.text
                    para.append(soup)

My desired output, in JSON format:

[
    {
        "page": "page-1",
        "section_one": [
            {
                "subtitle": "subtitle here h3",
                "para": "para 1 + para 2 + para 3.. joined"
            },
            {
                "subtitle": "subtitle here h3",
                "para": "para 1"
            },
            {
                "subtitle": "subtitle here h3",
                "para": "para 1 + para 2.. joined"
            }
        ],
        "section_two": [
            {
                "subtitle": "subtitle here h3",
                "para": "para 1 + para 2 + para 3.. joined"
            }
        ]
    },
    {
        "page": "page-2"
        // Comment - Page 2 related stuff
    }
]

Tags: python, json, web-scraping, beautifulsoup

Solution
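
One possible approach, offered as a sketch rather than a definitive answer: `find_next` returns only the first matching element, which is consistent with getting just the first title, subtitle, and paragraph. Instead, for each `<h3>` you can walk its following siblings and collect `<p>` text until the next heading appears. The sample markup, the selector `div.article-body section`, the helper names `parse_sections` and `build_page_record`, the numeric keys `section_1`, `section_2` (in place of `section_one`, `section_two`), and joining paragraphs with a space are all assumptions made for illustration.

```python
import json

from bs4 import BeautifulSoup

# Assumed sample markup mirroring the structure in the question;
# the real pages are not available here.
SAMPLE_HTML = """
<div class="article-body">
<section class="xyz">
<h2>title</h2>
<h3>subtitle A</h3>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>
<h3>subtitle B</h3>
<p>para 1</p>
<h3>subtitle C</h3>
<p>para 1</p>
<p>para 2</p>
</section>
<section><h3>subtitle D</h3><p>para 1</p></section>
</div>
"""


def parse_sections(html):
    """Return one list per <section>, each holding {"subtitle", "para"} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    # Assumed selector: every <section> inside div.article-body.
    for section in soup.select("div.article-body section"):
        entries = []
        for h3 in section.find_all("h3"):
            paragraphs = []
            # Walk the tags that follow this <h3>; stop at the next heading.
            for sibling in h3.find_next_siblings():
                if sibling.name in ("h2", "h3"):
                    break
                if sibling.name == "p":
                    paragraphs.append(sibling.get_text(strip=True))
            entries.append({
                "subtitle": h3.get_text(strip=True),
                "para": " ".join(paragraphs),
            })
        sections.append(entries)
    return sections


def build_page_record(page_name, html):
    """Build one page entry; the section keys are numbered (hypothetical naming)."""
    record = {"page": page_name}
    for index, entries in enumerate(parse_sections(html), start=1):
        record[f"section_{index}"] = entries
    return record


pages = [build_page_record("page-1", SAMPLE_HTML)]
print(json.dumps(pages, indent=2))
```

For real pages, the `html` argument could come from `urllib.request.urlopen(url).read()` as in the question, with one `build_page_record` call per URL, and the final list could be written to a file with `json.dump`. This sketch assumes the `<p>` tags are direct siblings of their `<h3>`; if the real markup wraps them in other elements, the sibling walk would need adjusting.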

