BeautifulSoup scraping data - specific rows? Subcategories?

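Problem description

I'm new to Python scripting and would appreciate some help.

I have been using BeautifulSoup to scrape web data, and I am trying to extract the spaceports listed in the 'Pages in category "Spaceports"' section of the Wikipedia page. I managed to scrape the page, but I also end up pulling in the top-level subcategories. I have been using the code below; could you give me some pointers?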

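import pandas as pd
import requests
from bs4 import BeautifulSoup

data = requests.get("https://en.wikipedia.org/wiki/Category:Spaceports").text
soup = BeautifulSoup(data, 'html.parser')
splist = []  # the loop that fills this list is not shown in the post
sp_df = pd.DataFrame({"Spaceport": splist})
sp_df.head()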

Output

   Spaceport
0  Spaceport
1  List of rocket launch sites
2  Alcântara Launch Center
3  Anheung Proving Ground
4  Baikonur Cosmodrome

Tags: python, beautifulsoup

Solution


Try this:

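import requests
from bs4 import BeautifulSoup

data = requests.get("https://en.wikipedia.org/wiki/Category:Spaceports").text
# Search only inside the "mw-pages" block, which holds the page list and not the subcategories.
soup = BeautifulSoup(data, 'html.parser').find("div", {"id": "mw-pages"})
# Skip the first three links so the list starts at Alcântara Launch Center.
spaceports = [f"https://en.wikipedia.org{a['href']}" for a in soup.find_all("a")[3:]]

for spaceport in spaceports:
    print(spaceport)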

Output:

https://en.wikipedia.org/wiki/Alc%C3%A2ntara_Launch_Center
https://en.wikipedia.org/wiki/Anheung_Proving_Ground
https://en.wikipedia.org/wiki/Baikonur_Cosmodrome
https://en.wikipedia.org/wiki/Barreira_do_Inferno_Launch_Center
https://en.wikipedia.org/wiki/Biak_Spaceport
https://en.wikipedia.org/wiki/Broglio_Space_Center
https://en.wikipedia.org/wiki/Carnarvon,_Western_Australia
https://en.wikipedia.org/wiki/Churchill_Rocket_Research_Range
https://en.wikipedia.org/wiki/Dombarovsky_Air_Base
https://en.wikipedia.org/wiki/Guiana_Space_Centre
https://en.wikipedia.org/wiki/Hammaguir
https://en.wikipedia.org/wiki/Naro_Space_Center
https://en.wikipedia.org/wiki/Odyssey_(launch_platform)
https://en.wikipedia.org/wiki/Palmachim_Airbase
https://en.wikipedia.org/wiki/Reggane
https://en.wikipedia.org/wiki/Resolute_Bay
https://en.wikipedia.org/wiki/Rocket_Lab_Launch_Complex_1
https://en.wikipedia.org/wiki/Sanirajak
https://en.wikipedia.org/wiki/Satish_Dhawan_Space_Centre
https://en.wikipedia.org/wiki/Semnan_spaceport
https://en.wikipedia.org/wiki/Sonmiani_(space_facility)
https://en.wikipedia.org/wiki/Svobodny_Cosmodrome
https://en.wikipedia.org/wiki/Tanegashima_Space_Center
https://en.wikipedia.org/wiki/Thumba_Equatorial_Rocket_Launching_Station
https://en.wikipedia.org/wiki/Tilla_Satellite_Launch_Centre
https://en.wikipedia.org/wiki/Uchinoura_Space_Center
https://en.wikipedia.org/wiki/Vostochny_Cosmodrome
https://en.wikipedia.org/wiki/Yoshinobu_Launch_Complex
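The `[3:]` slice depends on exactly three unwanted links coming first, which could change if the page is edited. A possible alternative, shown here only as a sketch, is to keep links from alphabetical groups and skip groups whose heading is blank or `*`. The class name `mw-category-group`, the `h3` group headings, the helper name `spaceport_links`, and the `SAMPLE_HTML` snippet are all assumptions made for illustration; the sample is a heavily simplified stand-in, not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the category page markup (assumption).
SAMPLE_HTML = """
<div id="mw-subcategories">
  <div class="mw-category-group"><h3>S</h3>
    <ul><li><a href="/wiki/Category:Spaceports_by_country">Spaceports by country</a></li></ul>
  </div>
</div>
<div id="mw-pages">
  <p>This list may not reflect recent changes (<a href="/wiki/Help">learn more</a>).</p>
  <div class="mw-category-group"><h3>&#160;</h3>
    <ul><li><a href="/wiki/Spaceport">Spaceport</a></li></ul>
  </div>
  <div class="mw-category-group"><h3>*</h3>
    <ul><li><a href="/wiki/List_of_rocket_launch_sites">List of rocket launch sites</a></li></ul>
  </div>
  <div class="mw-category-group"><h3>A</h3>
    <ul><li><a href="/wiki/Anheung_Proving_Ground">Anheung Proving Ground</a></li></ul>
  </div>
  <div class="mw-category-group"><h3>B</h3>
    <ul><li><a href="/wiki/Baikonur_Cosmodrome">Baikonur Cosmodrome</a></li></ul>
  </div>
</div>
"""

BASE_URL = "https://en.wikipedia.org"


def spaceport_links(html):
    """Return (name, url) pairs from the alphabetical groups inside #mw-pages."""
    soup = BeautifulSoup(html, "html.parser")
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:
        return []
    links = []
    for group in pages.find_all("div", class_="mw-category-group"):
        heading = group.find("h3")
        # Keep only groups with a letter heading; blank and "*" groups are skipped.
        if heading is None or not heading.get_text(strip=True).isalpha():
            continue
        for a in group.find_all("a", href=True):
            links.append((a.get_text(strip=True), BASE_URL + a["href"]))
    return links


for name, url in spaceport_links(SAMPLE_HTML):
    print(name, url)
```

To try it against the live page, pass the `data` string fetched with `requests` to `spaceport_links`. Note that the letter-only filter would also drop a numeric group heading, if the category had one.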

Edit: to get the names of the spaceports, change the last line:

    print(spaceport)

to this:

    print(spaceport.rsplit("/")[-1])
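The last URL segment is still percent-encoded and uses underscores, so a name such as `Alc%C3%A2ntara_Launch_Center` is printed as-is. A minimal sketch of one way to turn it into a readable title follows; the helper name `name_from_url` is hypothetical and only for illustration.

```python
from urllib.parse import unquote


def name_from_url(url):
    """Turn the last path segment of a Wikipedia URL into a readable title."""
    slug = url.rsplit("/", 1)[-1]            # text after the final slash
    return unquote(slug).replace("_", " ")   # decode %XX escapes, then underscores to spaces


print(name_from_url("https://en.wikipedia.org/wiki/Alc%C3%A2ntara_Launch_Center"))
```

A list of such names can then be passed to `pd.DataFrame({"Spaceport": names})`, as in the question.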
