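python - How do I extract certain content from a URL?
Question
I'm a student working on an assignment. I was asked to use the BeautifulSoup library to parse a page (https://www.edb.gov.hk/en/about-edb/press/press-releases/index.html) and extract a table or list, then store the data in a Python list, a dict, or a pandas DataFrame. (This is the requirement.)
I managed to extract the links and titles with a for loop over the "a" tags and their "href" attributes. However, I don't know how to extract the dates from the page.
Could someone give me some advice, using "div:nth-of-type" or another method?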
Solution
To get the dates, titles, and links into a DataFrame, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.edb.gov.hk/en/about-edb/press/press-releases/index.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
# Keep only the rows that contain .table_text_mobile cells.
for row in soup.select(".circulars_result_row:has(.table_text_mobile)"):
    # Each matching cell holds one field: the date, then the title.
    data = [d.get_text(strip=True) for d in row.select(".table_text_mobile")]
    # The first <a> in the row carries the link to the press release.
    all_data.append(data + [row.a["href"]])

df = pd.DataFrame(all_data, columns=["Date", "Title", "Link"])
print(df)
Prints:
Date Title Link
0 11 Oct 2021 EDB provides latest guidelines on display of national flag and conduct of national flag raising ceremony in schools https://www.info.gov.hk/gia/general/202110/11/P2021101100304.htm?fontSize=1
1 11 Oct 2021 Hong Kong Scholarship for Excellence Scheme opens for applications https://www.info.gov.hk/gia/general/202110/11/P2021100800399.htm?fontSize=1
2 11 Oct 2021 Profiles of kindergartens posted online https://www.info.gov.hk/gia/general/202110/11/P2021101100297.htm
3 05 Oct 2021 "Active Students, Active People" Campaign cum e-Gallery launching ceremony https://www.info.gov.hk/gia/general/202110/05/P2021100400564.htm?fontSize=1
4 30 Sep 2021 EDB launches "SENSE" information website https://www.info.gov.hk/gia/general/202109/30/P2021092900520.htm?fontSize=1
5 23 Sep 2021 Launch of School Nominations Direct Admission Scheme for local universities https://www.info.gov.hk/gia/general/202109/23/P2021092300346.htm?fontSize=1
6 21 Sep 2021 Study Subsidy Scheme for Designated Professions/Sectors for 2022/23 cohort announced https://www.info.gov.hk/gia/general/202109/21/P2021092000818.htm
7 20 Sep 2021 EDB announces arrangements for student grant of 2021/22 school year https://www.info.gov.hk/gia/general/202109/20/P2021092000578.htm
8 16 Sep 2021 Parents reminded to submit application form for admission to Primary One https://www.info.gov.hk/gia/general/202109/16/P2021091600299.htm?fontSize=1
9 13 Sep 2021 Junior Secondary History e-Reading Award Scheme fosters students' positive values and attitudes https://www.info.gov.hk/gia/general/202109/13/P2021091300317.htm?fontSize=1
10 03 Sep 2021 SED on Primary One admission https://www.info.gov.hk/gia/general/202109/03/P2021090300473.htm?fontSize=1
11 02 Sep 2021 EDB introduces newly developed Curriculum Framework on Parent Education (Kindergarten) https://www.info.gov.hk/gia/general/202109/02/P2021090200238.htm?fontSize=1
12 01 Sep 2021 Appointments to Curriculum Development Council https://www.info.gov.hk/gia/general/202109/01/P2021090100172.htm
13 01 Sep 2021 SED speaks on first school day https://www.info.gov.hk/gia/general/202109/01/P2021090100352.htm
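Since the question asks about "div:nth-of-type", here is a minimal offline sketch of that alternative. It runs against a small hand-written HTML sample rather than the live page: only the class names come from the answer above, while the sample markup, the example.com links, and the helper name parse_rows are assumptions made for illustration, so the real page may need different selectors. The sketch also converts the date text to datetime objects, assuming the "11 Oct 2021" format shown in the output and English month abbreviations.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup: only the class names are taken from the answer above.
SAMPLE_HTML = """
<div class="circulars_result_row">
  <div>Date</div><div>Title</div>
</div>
<div class="circulars_result_row">
  <div class="table_text_mobile">11 Oct 2021</div>
  <div class="table_text_mobile"><a href="https://example.com/a.htm">First release</a></div>
</div>
<div class="circulars_result_row">
  <div class="table_text_mobile">05 Oct 2021</div>
  <div class="table_text_mobile"><a href="https://example.com/b.htm">Second release</a></div>
</div>
"""


def parse_rows(html):
    """Return one dict per data row (hypothetical helper, not from the answer)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # :has(...) skips rows, such as the header, without .table_text_mobile cells.
    for row in soup.select(".circulars_result_row:has(.table_text_mobile)"):
        # div:nth-of-type(n) matches a <div> that is the n-th <div> among its siblings.
        date_text = row.select_one("div:nth-of-type(1)").get_text(strip=True)
        title = row.select_one("div:nth-of-type(2)").get_text(strip=True)
        rows.append(
            {
                # Assumes dates look like "11 Oct 2021".
                "Date": datetime.strptime(date_text, "%d %b %Y"),
                "Title": title,
                "Link": row.a["href"],
            }
        )
    return rows


rows = parse_rows(SAMPLE_HTML)
for item in rows:
    print(item["Date"].date(), item["Title"], item["Link"])
```

The resulting list of dicts already satisfies the "list or dict" part of the assignment, and passing it to pd.DataFrame(rows) gives the same three columns as the answer above.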