How to split the elements inside a <p> tag when web scraping

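Problem description

I am trying to scrape a URL, but the output is not in the format I need. I only want the branch name and the address. How can I split this information out of the p tag?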

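    import re

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.bukopin.co.id/page/jaringankantor"
    page = requests.get(url)
    Branch_list = []

    soup = BeautifulSoup(page.content, 'html.parser')

    for i in soup.find_all('div', class_="col-md-9 text-left"):
        # find_all returns an empty list when nothing matches, so no fallback is needed
        Branch = i.find_all('p')
        for k in Branch:
            # Strip the tags with a regex; name, address and phone stay in one string
            k = re.sub(r'<(.*?)>', '', str(k))
            Branch_list.append(k)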

Tags: web-scraping, beautifulsoup

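Solution

Try this. It reads the text of the first p tag in each matching div and removes the phone number that follows the address: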

    import re

    import requests
    from bs4 import BeautifulSoup

    page = requests.get("https://www.bukopin.co.id/page/jaringankantor")

    # Collect every <div class="col-md-9 text-left"> on the page.
    divs = BeautifulSoup(page.text, "html.parser").find_all("div", class_="col-md-9 text-left")

    # Take the text of the first <p> in each div and drop everything from "Tel" onward.
    paragraphs = [re.sub(r"Tel.+", "", div.find("p").getText(strip=True)) for div in divs]

    for paragraph in paragraphs:
        print(paragraph)

Output:

    KCP Rasuna SaidGd. Kementerian Koperasi & UKM, Lt. 1. Jl. HR. Rasuna Said Kav. 3 - 5, Jakarta Selatan 12940
    KCP Plaza AsiaJl. Jend. Sudirman Kav. 59 No. 77 Lt. GF No. GF - D Blok A Senayan, Kebayoran Baru, Jakarta Selatan
    KCP Bulog IIGedung Diklat Bulog II Jl. Kuningan Timur Blok M2 No.5 Jakarta Selatan 12950
    KCP Pondok Indah Plaza VPlaza V Pondok Indah Kav.A11 Jl. Marga Guna Raya - Pondok Indah Jakarta Selatan
    KCP Kebayoran LamaJl. Raya Kebayoran Lama No.10 Jakarta Selatan 12220
    KCP Kebayoran BaruJl. RS. Fatmawati No.7 Blok A Kebayoran Baru Jakarta Selatan12140
    KCP MelawaiJl. Melawai Raya Kebayoran Baru No. 66 Jakarta Selatan 12160
    KK PLN Lenteng AgungJl. Raya Tanjung Barat No. 55 Jakarta Selatan 12610
    and so on...
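
The branch name and the address run together in the output above (for example `KCP Rasuna SaidGd. ...`) because `getText(strip=True)` joins the stripped text nodes with an empty string. The following is a minimal sketch of one way to keep them apart by passing a `separator` to `get_text`. The HTML snippet is an assumed stand-in for the page's markup (a `<strong>` branch name followed by the address and a `Telp.` line inside one `<p>`), the phone number is made up, and `branch_lines` is a hypothetical helper name; the real page may be structured differently.

```python
import re

from bs4 import BeautifulSoup

# Assumed markup, for illustration only; the phone number is invented.
SAMPLE_HTML = """
<div class="col-md-9 text-left">
  <p><strong>KCP Melawai</strong><br/>
  Jl. Melawai Raya Kebayoran Baru No. 66 Jakarta Selatan 12160<br/>
  Telp. 000-0000000</p>
</div>
"""


def branch_lines(html):
    """Return one 'name | address' string per matching div."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for div in soup.find_all("div", class_="col-md-9 text-left"):
        # The separator is placed between text nodes, so the <strong> text
        # no longer fuses with the address that follows it.
        text = div.find("p").get_text(separator=" | ", strip=True)
        # Drop the phone part together with the separator in front of it.
        lines.append(re.sub(r"\s*\|\s*Tel.*", "", text))
    return lines


print(branch_lines(SAMPLE_HTML))
```

With a separator in place, each line can be split on `" | "` afterwards, provided the separator string does not occur in the addresses themselves.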

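Edit: to get a pandas DataFrame, try this. The branch name is read from the strong tag and sliced off the front of the paragraph text, which leaves the address: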

    import re

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    page = requests.get("https://www.bukopin.co.id/page/jaringankantor")
    divs = BeautifulSoup(page.text, "html.parser").find_all("div", class_="col-md-9 text-left")

    data = []
    for div in divs:
        # The branch name is the text of the <strong> tag.
        branch = div.find("strong").getText(strip=True)
        # Full paragraph text: branch name, address and phone number run together.
        text = div.find("p").getText(strip=True)
        # Slice off the leading branch name, then drop everything from "Telp" onward.
        address = re.sub(r"Telp.+", "", text[len(branch):])
        data.append([branch, address])

    print(pd.DataFrame(data, columns=["Branch", "Address"]))

Output:

                                                   Branch                                            Address
    0                                     KCP Rasuna Said  Gd. Kementerian Koperasi & UKM, Lt. 1. Jl. HR....
    1                                      KCP Plaza Asia  Jl. Jend. Sudirman Kav. 59 No. 77 Lt. GF No. G...
    2                                        KCP Bulog II  Gedung Diklat Bulog II Jl. Kuningan Timur Blok...
    3                            KCP Pondok Indah Plaza V  Plaza V Pondok Indah Kav.A11 Jl. Marga Guna Ra...
    4                                  KCP Kebayoran Lama  Jl. Raya Kebayoran Lama No.10 Jakarta Selatan ...
    5                                  KCP Kebayoran Baru  Jl. RS. Fatmawati No.7 Blok A Kebayoran Baru J...
    ...
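
The slicing step in the code above (`[len(branch):]`) assumes that the paragraph text begins with exactly the branch name. Below is a small sketch of that step as a standalone function with a guard for the case where it does not. `split_branch` is a hypothetical helper name, the stricter `Telp?\b` pattern is a swapped-in variation on the answer's regex rather than part of it, and the phone number in the usage line is invented.

```python
import re


def split_branch(paragraph_text, branch):
    """Split fused 'name + address + phone' text into (branch, address)."""
    branch = branch.strip()
    text = paragraph_text.strip()
    # Only slice the name off if the text really starts with it.
    if text.startswith(branch):
        text = text[len(branch):]
    # Cut everything from "Tel" or "Telp" onward. The \b keeps words such as
    # "Teluk" intact, and DOTALL lets '.' run across line breaks.
    address = re.sub(r"\s*Telp?\b.*", "", text, flags=re.DOTALL)
    return branch, address.strip()


fused = ("KCP MelawaiJl. Melawai Raya Kebayoran Baru No. 66 "
         "Jakarta Selatan 12160Telp. 000-0000000")
print(split_branch(fused, "KCP Melawai"))
```

The returned tuples can be appended to the `data` list directly, since `pd.DataFrame` accepts a list of two-item rows with `columns=["Branch", "Address"]`.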
