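python - Beautiful Soup in Python is not giving me the correct number of links on a page
Problem description
I am trying to count the number of links on a web page with the following code: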
import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup
import pandas as pd

webpage = "https://www.isode.com/products/index.html"
try:
    response = requests.get(webpage)
    #response.raise_for_status()
except HTTPError:
    print("An HTTP error has occurred")
except Exception as err:
    print(err)
else:
    print("The request of the webpage was a success!")

contents = response.content
contents
soup = BeautifulSoup(contents, features="html.parser")

# Count only the <a> tags that carry an href attribute
a = 0
for link in soup.find_all("a"):
    if link.get("href"):
        a = a + 1
        print(link.get("href"))
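I expected the answer to be 86, but this code gives me 83, and I can't see what is going wrong.
Also, regarding the counter variable: surely there is a better way to do this?
Solution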
import requests
from bs4 import BeautifulSoup

links = []
with requests.Session() as req:
    r = req.get('https://www.isode.com/products/index.html')
    soup = BeautifulSoup(r.text, 'html.parser')
    if r.status_code == 200:
        for item in soup.find_all('a'):
            item = item.get('href')
            # .get('href') returns None when the <a> tag has no href attribute
            if item is not None:
                links.append(item)
        print(len(links))
Output:
83
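However, if you remove the `if item is not None:` condition, you will get 86. The difference of three comes from `<a>` tags that have no `href` attribute, for which `.get('href')` returns None.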
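The difference between the two counts, and a way to count without a counter variable, can be sketched on a small inline document. The markup below is a made-up sample, not the real isode.com page; it only assumes that some `<a>` tags lack an `href` attribute. Passing `href=True` to `find_all` keeps only the tags that have that attribute, so `len()` of the result gives the count directly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (not the real page): three anchors, one without href.
SAMPLE_HTML = """
<a href="/products/index.html">Products</a>
<a name="top">Named anchor, no href</a>
<a href="https://www.isode.com/">Home</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

all_anchors = soup.find_all("a")            # every <a> tag, with or without href
with_href = soup.find_all("a", href=True)   # only <a> tags that have an href attribute
hrefs = [a["href"] for a in with_href]      # the href values themselves

print(len(all_anchors), len(with_href))
```

On the page from the question, the same two calls would be expected to give the 86 and 83 reported above, assuming the page has not changed since.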
Detailed version:
import requests
from bs4 import BeautifulSoup

links = []
with requests.Session() as req:
    r = req.get('https://www.isode.com/products/index.html')
    soup = BeautifulSoup(r.text, 'html.parser')
    if r.status_code == 200:
        count = 0  # number of <a> tags without an href attribute
        for item in soup.find_all('a'):
            item = item.get('href')
            if item is not None:
                # Turn relative links into absolute URLs
                if item.startswith('..'):
                    item = item.replace('..', 'https://www.isode.com')
                elif item.startswith('http'):
                    pass
                else:
                    item = f"https://www.isode.com/{item}"
                print(item)
                links.append(item)
            else:
                count += 1
        print(f"Total Links: {len(links)}")
        print(f"Total None: {count}")
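As an alternative to the string replacements above, the standard library's `urllib.parse.urljoin` can resolve relative links against the page URL. This is a sketch rather than a drop-in replacement: the helper name `absolutize` and the sample hrefs are made up for illustration. Note that `urljoin` resolves a bare relative link such as `m-switch.html` against the page's own directory (`/products/`), the way a browser would, whereas the code above prefixes it with the site root.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (taken from the question).
BASE_URL = "https://www.isode.com/products/index.html"


def absolutize(hrefs, base_url=BASE_URL):
    """Hypothetical helper: resolve each href against base_url, skipping None."""
    return [urljoin(base_url, href) for href in hrefs if href is not None]


# Made-up sample values covering the cases handled in the detailed version.
sample_hrefs = [
    "../company/index.html",   # parent-relative link
    "m-switch.html",           # link relative to the page's directory
    "https://example.com/x",   # already absolute, left unchanged
    None,                      # <a> tag without href
    "/index.html",             # root-relative link
]

for url in absolutize(sample_hrefs):
    print(url)
```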