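python - How to get the title and url from an HTML page using Python
Problem description
I want to reach the department entries on the page and select/print only the name and url of each one. I tried the following, but I can't figure out how to get into department and pick out those two fields. How can I get the "name" and "url" of all the links?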
import json
import urllib.request
from bs4 import BeautifulSoup

def getContent():
    # target site url
    url = "www.xyz.com"
    # requesting the url for data
    request = urllib.request.Request(url)
    # get the html, whole page
    htmlpage = urllib.request.urlopen(request).read()
    bsoup = BeautifulSoup(htmlpage, "html.parser")
    # print(bsoup.prettify())
    # main_table = bsoup.find("div",attrs)
    # print(main_table)
    # print(bsoup.find_all('name'))
    # nav = bsoup.nav
    # print(bsoup.title.department.url)
    # for url in find_all('a'):
    #     print(url.get('href'))
    for link in bsoup.find_all("a"):
        print("Title: {}".format(link.get("name")))
        print("href: {}".format(link.get("href")))
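Solution
You can get the name and url with the json module. The code below reads the page's script tag of type application/ld+json, parses its contents as JSON, and loops over the "department" list: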
import json
import urllib.request
from bs4 import BeautifulSoup

def get_content():
    url = "http://www.ucdenver.edu/pages/ucdwelcomepage.aspx"
    request = urllib.request.Request(url)
    # download the whole page as bytes
    html_page = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html_page, 'html.parser')
    # the department data is embedded as JSON-LD inside a <script> tag
    json_data = json.loads(soup.find("script", type="application/ld+json").string)
    # each department entry is a dict with "name" and "url" keys
    for data in json_data["department"]:
        print("{:<60} {}".format(data["name"], data["url"]))

get_content()
Output:
Center for Undergraduate Exploration and Advising https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising
Commencement https://www.ucdenver.edu/commencement
Counseling Center https://www.ucdenver.edu/counseling-center
First Year Experiences https://www.ucdenver.edu/first-year-experiences
Health Programs https://www.ucdenver.edu/programs/health-programs
Housing and Dining https://www.ucdenver.edu/housing-and-dining
...
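For trying the same idea without network access, here is a minimal sketch that uses only the standard library (html.parser instead of BeautifulSoup, so it is a swapped-in technique, not the answer's exact method). The SAMPLE_HTML string, the example.com URLs, and the names JsonLdExtractor and extract_departments are hypothetical and exist only for illustration. The sketch also skips pages where the JSON-LD block or the "department" key is missing, a case the code above does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        # script content may arrive in several pieces, so collect them
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def extract_departments(html):
    """Return (name, url) pairs from the "department" list of any JSON-LD block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    pairs = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # ignore blocks that are not valid JSON
        if not isinstance(data, dict):
            continue
        for dept in data.get("department", []):
            pairs.append((dept.get("name"), dept.get("url")))
    return pairs


# Hypothetical page content, made up for this example only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "CollegeOrUniversity",
 "department": [
   {"name": "Example Department A", "url": "https://example.com/a"},
   {"name": "Example Department B", "url": "https://example.com/b"}
 ]}
</script>
</head><body><a href="/x">x</a></body></html>
"""

for name, url in extract_departments(SAMPLE_HTML):
    print("{:<30} {}".format(name, url))
```

Returning a list of pairs rather than printing inside the function keeps the parsing step separate from the formatting step, so the same data can be written to a file or filtered before display.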