Beautiful Soup in Python does not give me the correct number of links on a page

Problem description

I am trying to count the number of links on a web page using the following code:

import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup

webpage = "https://www.isode.com/products/index.html"

try:
    response = requests.get(webpage)
    #response.raise_for_status()
except HTTPError:
    print("An HTTP error has occurred")
except Exception as err:
    print(err)
else:
    print("The request of the webpage was a success!")

contents = response.content

soup = BeautifulSoup(contents, features="html.parser")

# Count only the <a> tags whose href is present and non-empty.
a = 0
for link in soup.find_all("a"):
    if link.get("href"):
        a = a + 1
        print(link.get("href"))
print(a)

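The answer I expected is 86, but this code gives me 83, and I cannot see what is going wrong.

Also, regarding the counter variable: surely there is a better way to do this?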

Tags: python, beautifulsoup, request

Solution


import requests
from bs4 import BeautifulSoup

links = []
with requests.Session() as req:
    r = req.get('https://www.isode.com/products/index.html')
    # Parse the page only if the request succeeded.
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        for item in soup.find_all('a'):
            item = item.get('href')
            # get() returns None when the <a> tag has no href attribute.
            if item is not None:
                links.append(item)
print(len(links))

Output:

83

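However, if you remove the "if item is not None:" condition, you will get 86. The page contains 86 <a> tags, but item.get('href') returns None for the 3 that have no href attribute, so only 83 of them are actual links.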
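
To see why the two counts differ, here is a minimal sketch on a small hand-written HTML snippet. The markup is made up for illustration and is not taken from the isode.com page. It shows that get('href') returns None for an <a> tag without an href attribute, and that passing href=True to find_all selects only the anchors that have the attribute, which also removes the need for a manual counter variable.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: three <a> tags, one without href.
html = """
<a href="https://www.isode.com/">Home</a>
<a name="top">Named anchor, no href</a>
<a href="../company/index.html">Company</a>
"""

soup = BeautifulSoup(html, "html.parser")

all_anchors = soup.find_all("a")              # every <a> tag
with_href = soup.find_all("a", href=True)     # only <a> tags that have an href attribute
hrefs = [a.get("href") for a in all_anchors]  # None where href is missing

print(len(all_anchors))   # 3
print(len(with_href))     # 2
print(hrefs.count(None))  # 1
```

Note that href=True also matches an empty href="", whereas a truthiness check such as if link.get("href") skips it, so the two approaches can give different counts on pages that contain empty links.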


A more detailed version:

import requests
from bs4 import BeautifulSoup

links = []
count = 0  # number of <a> tags without an href attribute
with requests.Session() as req:
    r = req.get('https://www.isode.com/products/index.html')
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, 'html.parser')
        for item in soup.find_all('a'):
            item = item.get('href')
            if item is not None:
                if item.startswith('..'):
                    # Replace only the leading '..' with the site root.
                    item = item.replace('..', 'https://www.isode.com', 1)
                elif item.startswith('http'):
                    pass  # already an absolute URL
                else:
                    item = "https://www.isode.com/" + item
                print(item)
                links.append(item)
            else:
                count += 1

print(f"Total Links: {len(links)}")
print(f"Total None: {count}")
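
As an alternative to the string replacements above, relative links can be resolved with urllib.parse.urljoin from the standard library. This is only a sketch and not the method used in the original answer: the helper name absolutize and the sample href values are hypothetical, and only the base URL comes from the question.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.isode.com/products/index.html"

def absolutize(hrefs, base=BASE_URL):
    """Resolve each non-None href against base; skip missing hrefs."""
    return [urljoin(base, h) for h in hrefs if h is not None]

# Hypothetical href values of the kinds handled above, plus a missing one.
sample = ["../company/index.html", "https://example.com/x", "m-vault.html", None]
resolved = absolutize(sample)
for url in resolved:
    print(url)
```

Unlike plain prefix concatenation, urljoin resolves a bare file name relative to the directory of the page itself (/products/ here), which is how a browser would interpret it.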
