python-3.x - How to extract all page URLs from a website using Python 3?
Question
I want a list of all page URLs in a website. The following code returns nothing:
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')
soup = BeautifulSoup(response.text, 'html.parser')
urls = []
for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
Solution
The server responds with HTTP 406 (Not Acceptable). You can get around this by specifying a User-Agent header.
>>> response = requests.get(base_url + '/a', headers={"User-Agent": "XY"})
>>> soup = BeautifulSoup(response.text, 'html.parser')
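If you want the request to fail loudly instead of silently giving you an error page to parse, a small helper along these lines may help. This is only a sketch: the name `fetch_html` and the 10-second timeout are my own choices, and `session` is assumed to be a `requests.Session` (or anything with a compatible `get` method).

```python
def fetch_html(session, url, user_agent="XY", timeout=10):
    """Fetch url with an explicit User-Agent and return the body text.

    `session` is assumed to behave like requests.Session: its get() accepts
    `headers` and `timeout`, and the response has raise_for_status() and .text.
    """
    response = session.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raise on 4xx/5xx (such as the 406 seen here) rather than parsing an error page.
    response.raise_for_status()
    return response.text
```

With requests installed, this could be called as `fetch_html(requests.Session(), base_url + '/a')` and the result passed to `BeautifulSoup(..., 'html.parser')`.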
You can then get the URLs:
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://www.instagram.com/techadvisorblog
//www.pinterest.com/pin/create/button/?url=https://techadvisorblog.com/about-us/
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/
https://techadvisorblog.com/what-is-world-wide-web-www/
https://techadvisorblog.com/best-free-password-manager-for-windows-10/
https://techadvisorblog.com/solved-failed-to-start-emulator-the-emulator-was-not-properly-closed/
https://techadvisorblog.com/is-telegram-safe/
https://techadvisorblog.com/will-technology-ever-rule-the-world/
https://techadvisorblog.com/category/android/
https://techadvisorblog.com/category/knowledge/basic-computer/
https://techadvisorblog.com/category/games/
https://techadvisorblog.com/category/knowledge/
https://techadvisorblog.com/category/security/
http://Techadvisorblog.com/
http://Techadvisorblog.com
None
None
None
None
None
>>>
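The output above still contains `None` entries, a `#content` anchor, protocol-relative and external links, and several spellings of the same page. One possible way to reduce it to a list of unique on-site URLs is sketched below using only `urllib.parse`. The function names are hypothetical, and treating `www.` and bare hosts, `http` and `https`, and trailing slashes as equivalent is my assumption about this site, not something the server guarantees.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def _host_key(url):
    """Lower-cased host without a leading 'www.' (assumed equivalent here)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def internal_page_urls(hrefs, base_url):
    """Return unique absolute URLs on base_url's host, in first-seen order."""
    base_host = _host_key(base_url)
    seen = set()
    result = []
    for href in hrefs:
        # Skip <a> tags without href (None), empty values, and same-page anchors.
        if not href or href.startswith("#"):
            continue
        # Resolve relative and protocol-relative links, then drop any fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript: links
        if _host_key(absolute) != base_host:
            continue  # external site
        # Assumption: scheme and trailing slash do not distinguish pages.
        key = (parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue
        seen.add(key)
        result.append(absolute)
    return result
```

It could be fed directly from the loop above, for example `internal_page_urls((a.get('href') for a in soup.find_all('a')), base_url)`. Note that this only covers links found on one page; listing every page of the site would mean repeating the fetch and extraction for each newly discovered URL.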