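python - Scraping data with BeautifulSoup: specific rows? Subcategories?
Question
I'm new to Python scripting and would appreciate some help.
I've been using BeautifulSoup to scrape web data, and I'm trying to extract the spaceports listed in the 'Pages in category "Spaceports"' section of a Wikipedia page. I managed to scrape the page, but I also end up pulling in the top-level subcategory entries. Below is the code I've been using. Could you give me some pointers?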
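import pandas as pd
import requests
from bs4 import BeautifulSoup

data = requests.get("https://en.wikipedia.org/wiki/Category:Spaceports").text
soup = BeautifulSoup(data, 'html.parser')
splist = []
# NOTE: the loop that filled splist is not included in the post.
sp_df = pd.DataFrame({"Spaceport": splist})
sp_df.head()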
Output
                     Spaceport
0                    Spaceport
1  List of rocket launch sites
2      Alcântara Launch Center
3       Anheung Proving Ground
4          Baikonur Cosmodrome
Answer
Try this:
import requests
from bs4 import BeautifulSoup

data = requests.get("https://en.wikipedia.org/wiki/Category:Spaceports").text
# Restrict the search to the "Pages in category" block.
soup = BeautifulSoup(data, 'html.parser').find("div", {"id": "mw-pages"})
# Skip the first three links in that block; they precede the entries wanted here.
spaceports = [f"https://en.wikipedia.org{a['href']}" for a in soup.find_all("a")[3:]]
for spaceport in spaceports:
    print(spaceport)
Output:
https://en.wikipedia.org/wiki/Alc%C3%A2ntara_Launch_Center
https://en.wikipedia.org/wiki/Anheung_Proving_Ground
https://en.wikipedia.org/wiki/Baikonur_Cosmodrome
https://en.wikipedia.org/wiki/Barreira_do_Inferno_Launch_Center
https://en.wikipedia.org/wiki/Biak_Spaceport
https://en.wikipedia.org/wiki/Broglio_Space_Center
https://en.wikipedia.org/wiki/Carnarvon,_Western_Australia
https://en.wikipedia.org/wiki/Churchill_Rocket_Research_Range
https://en.wikipedia.org/wiki/Dombarovsky_Air_Base
https://en.wikipedia.org/wiki/Guiana_Space_Centre
https://en.wikipedia.org/wiki/Hammaguir
https://en.wikipedia.org/wiki/Naro_Space_Center
https://en.wikipedia.org/wiki/Odyssey_(launch_platform)
https://en.wikipedia.org/wiki/Palmachim_Airbase
https://en.wikipedia.org/wiki/Reggane
https://en.wikipedia.org/wiki/Resolute_Bay
https://en.wikipedia.org/wiki/Rocket_Lab_Launch_Complex_1
https://en.wikipedia.org/wiki/Sanirajak
https://en.wikipedia.org/wiki/Satish_Dhawan_Space_Centre
https://en.wikipedia.org/wiki/Semnan_spaceport
https://en.wikipedia.org/wiki/Sonmiani_(space_facility)
https://en.wikipedia.org/wiki/Svobodny_Cosmodrome
https://en.wikipedia.org/wiki/Tanegashima_Space_Center
https://en.wikipedia.org/wiki/Thumba_Equatorial_Rocket_Launching_Station
https://en.wikipedia.org/wiki/Tilla_Satellite_Launch_Centre
https://en.wikipedia.org/wiki/Uchinoura_Space_Center
https://en.wikipedia.org/wiki/Vostochny_Cosmodrome
https://en.wikipedia.org/wiki/Yoshinobu_Launch_Complex
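The `[3:]` slice depends on exactly three unwanted links appearing first, so it can break if the page changes. Here is a rough sketch of an alternative that filters by the group heading instead. It assumes the category listing is split into `div.mw-category-group` blocks, each with an `<h3>` heading, and that the unwanted entries sit under a non-letter heading such as `*`. Those class names, the `category_members` helper, and `SAMPLE_HTML` are my own assumptions for illustration, not something taken from the question, so check them against the live page before relying on this.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org"


def category_members(html):
    """Return (title, url) pairs for pages listed under lettered headings."""
    soup = BeautifulSoup(html, "html.parser")
    pages = soup.find("div", id="mw-pages")
    if pages is None:  # section not present: nothing to extract
        return []
    members = []
    # Assumed structure: one div.mw-category-group per heading.
    for group in pages.select("div.mw-category-group"):
        heading = group.find("h3")
        # Skip groups whose heading is missing or not a letter (e.g. "*").
        if heading is None or not heading.get_text(strip=True).isalpha():
            continue
        for a in group.select("li > a[href]"):
            members.append((a.get_text(strip=True), BASE_URL + a["href"]))
    return members


# Hypothetical, hand-written sample that mimics the assumed page structure.
SAMPLE_HTML = """
<div id="mw-pages">
  <p>Intro text. <a href="/wiki/Help:Categories">Learn more</a></p>
  <div class="mw-category">
    <div class="mw-category-group"><h3>*</h3>
      <ul><li><a href="/wiki/List_of_rocket_launch_sites">List of rocket launch sites</a></li></ul>
    </div>
    <div class="mw-category-group"><h3>A</h3>
      <ul><li><a href="/wiki/Anheung_Proving_Ground">Anheung Proving Ground</a></li></ul>
    </div>
  </div>
</div>
"""

for title, url in category_members(SAMPLE_HTML):
    print(title, url)
# prints: Anheung Proving Ground https://en.wikipedia.org/wiki/Anheung_Proving_Ground
```

To run it on the real page, pass `requests.get(...).text` to `category_members`. The titles can then go straight into the DataFrame from the question, e.g. `pd.DataFrame({"Spaceport": [t for t, _ in members]})`.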
Edit: to get the spaceport names, change the last line from:
print(spaceport)
to this:
print(spaceport.rsplit("/")[-1])
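Note that the last URL segment is still percent-encoded and uses underscores, so the line above prints `Alc%C3%A2ntara_Launch_Center` rather than the page title. If you want readable names, something like the following should work. The `url_to_name` helper is a hypothetical name of mine, and it assumes the title itself contains no `/`.

```python
from urllib.parse import unquote


def url_to_name(url):
    """Turn a Wikipedia article URL into a readable page title."""
    slug = url.rsplit("/", 1)[-1]  # keep only the last path segment
    # Decode percent-escapes (UTF-8) and swap underscores for spaces.
    return unquote(slug).replace("_", " ")


print(url_to_name("https://en.wikipedia.org/wiki/Alc%C3%A2ntara_Launch_Center"))
# prints: Alcântara Launch Center
```

Alternatively, reading `a.get_text()` from each link while scraping avoids the decoding step entirely.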