首页 > 解决方案 > 如何使用 python 3 从网站中提取所有页面 URL?

问题描述

我想要一个网站中所有页面 URL 的列表。以下代码不返回任何内容:

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')
soup = BeautifulSoup(response.text, 'html.parser')

urls = []

for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])

标签: python-3.xbeautifulsouppython-requests

解决方案


来自后端的响应是 406。您可以通过指定用户代理来克服它。

>>> response = requests.get(base_url + '/a', headers={"User-Agent": "XY"})

Python 请求 HTTP 响应 406

你可以得到网址

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://www.instagram.com/techadvisorblog
//www.pinterest.com/pin/create/button/?url=https://techadvisorblog.com/about-us/
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/
https://techadvisorblog.com/what-is-world-wide-web-www/
https://techadvisorblog.com/best-free-password-manager-for-windows-10/
https://techadvisorblog.com/solved-failed-to-start-emulator-the-emulator-was-not-properly-closed/
https://techadvisorblog.com/is-telegram-safe/
https://techadvisorblog.com/will-technology-ever-rule-the-world/
https://techadvisorblog.com/category/android/
https://techadvisorblog.com/category/knowledge/basic-computer/
https://techadvisorblog.com/category/games/
https://techadvisorblog.com/category/knowledge/
https://techadvisorblog.com/category/security/
http://Techadvisorblog.com/
http://Techadvisorblog.com
None
None
None
None
None
>>>

推荐阅读