web-scraping - How to split the elements inside a p tag when web scraping
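Problem description
I am trying to scrape a URL, but the output is not in the required format. I only need the branch name and the address. How can I split this information out of the p tag?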
import re
import requests
from bs4 import BeautifulSoup

page = requests.get(url)
Branch_list = []
soup = BeautifulSoup(page.content, 'html.parser')
for i in soup.find_all('div', class_="col-md-9 text-left"):
    # Collect every <p> inside the branch container
    Branch = i.find_all('p') if i.find_all('p') else ''
    for k in Branch:
        # Strip the HTML tags, leaving only the text
        k = re.sub(r'<(.*?)>', '', str(k))
        Branch_list.append(k)
Solution
Try this:
import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.bukopin.co.id/page/jaringankantor")
soup = BeautifulSoup(page.text, 'html.parser').find_all('div', class_="col-md-9 text-left")
# Take the text of the first <p> in each div and drop everything from "Tel" onward
paragraphs = [re.sub(r"Tel.+", "", p.find("p").getText(strip=True)) for p in soup]
for paragraph in paragraphs:
    print(paragraph)
Output:
KCP Rasuna SaidGd. Kementerian Koperasi & UKM, Lt. 1. Jl. HR. Rasuna Said Kav. 3 - 5, Jakarta Selatan 12940
KCP Plaza AsiaJl. Jend. Sudirman Kav. 59 No. 77 Lt. GF No. GF - D Blok A Senayan, Kebayoran Baru, Jakarta Selatan
KCP Bulog IIGedung Diklat Bulog II Jl. Kuningan Timur Blok M2 No.5 Jakarta Selatan 12950
KCP Pondok Indah Plaza VPlaza V Pondok Indah Kav.A11 Jl. Marga Guna Raya - Pondok Indah Jakarta Selatan
KCP Kebayoran LamaJl. Raya Kebayoran Lama No.10 Jakarta Selatan 12220
KCP Kebayoran BaruJl. RS. Fatmawati No.7 Blok A Kebayoran Baru Jakarta Selatan12140
KCP MelawaiJl. Melawai Raya Kebayoran Baru No. 66 Jakarta Selatan 12160
KK PLN Lenteng AgungJl. Raya Tanjung Barat No. 55 Jakarta Selatan 12610
and so on...
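In the output above the branch name runs straight into the address ("KCP Rasuna SaidGd. ...") because getText(strip=True) joins the text pieces with an empty separator. A minimal offline sketch of one way around that, passing a separator to get_text. The SAMPLE_HTML markup, the placeholder phone number, and the extract_lines helper are assumptions made for illustration; the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble one branch entry on the page.
# The phone number is a placeholder, not real data.
SAMPLE_HTML = """
<div class="col-md-9 text-left">
  <p><strong>KCP Rasuna Said</strong><br/>
  Gd. Kementerian Koperasi &amp; UKM, Lt. 1. Jl. HR. Rasuna Said Kav. 3 - 5, Jakarta Selatan 12940<br/>
  Telp. 000-0000000</p>
</div>
"""


def extract_lines(html):
    """Return one 'name<newline>address' string per branch container."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="col-md-9 text-left"):
        p = div.find("p")
        if p is None:
            continue  # skip containers without a paragraph
        # Join the text pieces with a newline instead of an empty string
        text = p.get_text(separator="\n", strip=True)
        # Drop the phone part: everything from "Tel" to the end of the text
        text = re.sub(r"\n?Tel.+", "", text, flags=re.DOTALL)
        results.append(text)
    return results


for entry in extract_lines(SAMPLE_HTML):
    print(entry)
```

With the separator in place, each entry can be split on the newline to get the name and the address as separate strings.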
Edit: to get a pandas DataFrame
Try this:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("https://www.bukopin.co.id/page/jaringankantor")
soup = BeautifulSoup(page.text, 'html.parser').find_all('div', class_="col-md-9 text-left")
data = []
for div in soup:
    # The branch name is the text of the <strong> element
    branch = div.find("strong").getText()
    address = div.find("p").getText(strip=True)
    # Cut the branch name off the front and the phone part off the end
    data.append([branch, re.sub(r"Telp.+", "", address[len(branch):])])
print(pd.DataFrame(data, columns=["Branch", "Address"]))
Output:
                     Branch                                            Address
0           KCP Rasuna Said  Gd. Kementerian Koperasi & UKM, Lt. 1. Jl. HR....
1            KCP Plaza Asia  Jl. Jend. Sudirman Kav. 59 No. 77 Lt. GF No. G...
2              KCP Bulog II  Gedung Diklat Bulog II Jl. Kuningan Timur Blok...
3  KCP Pondok Indah Plaza V  Plaza V Pondok Indah Kav.A11 Jl. Marga Guna Ra...
4        KCP Kebayoran Lama  Jl. Raya Kebayoran Lama No.10 Jakarta Selatan ...
5        KCP Kebayoran Baru  Jl. RS. Fatmawati No.7 Blok A Kebayoran Baru J...
...
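Slicing with address[len(branch):] relies on the paragraph text starting with exactly the branch name. A possible alternative, sketched offline below, iterates over the paragraph's stripped_strings so the name and the address arrive as separate pieces. The SAMPLE_HTML markup, the placeholder phone numbers, and the split_branches helper are hypothetical; this assumes the name is the first text piece of each paragraph and that the phone line starts with "Tel".

```python
from itertools import takewhile
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble two branch entries on the page.
# Phone numbers are placeholders, not real data.
SAMPLE_HTML = """
<div class="col-md-9 text-left">
  <p><strong>KCP Kebayoran Lama</strong><br/>
  Jl. Raya Kebayoran Lama No.10 Jakarta Selatan 12220<br/>
  Telp. 000-0000000</p>
</div>
<div class="col-md-9 text-left">
  <p><strong>KCP Melawai</strong><br/>
  Jl. Melawai Raya Kebayoran Baru No. 66 Jakarta Selatan 12160<br/>
  Telp. 000-0000000</p>
</div>
"""


def split_branches(html):
    """Return a list of (branch, address) tuples, one per container."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.find_all("div", class_="col-md-9 text-left"):
        p = div.find("p")
        if p is None:
            continue  # skip containers without a paragraph
        # Each non-empty text piece of the paragraph, already stripped
        parts = list(p.stripped_strings)
        if not parts:
            continue
        branch, rest = parts[0], parts[1:]
        # Keep address pieces up to (not including) the first "Tel..." piece
        address = " ".join(takewhile(lambda s: not s.startswith("Tel"), rest))
        rows.append((branch, address))
    return rows


for branch, address in split_branches(SAMPLE_HTML):
    print(branch, "|", address)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=["Branch", "Address"]) in the same way as the data list in the answer.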