BeautifulSoup "find" method inexplicably returning NoneType

Problem description

I am using the BeautifulSoup module to find images and page links for different kinds of jelly fungi, write them to an HTML file, and display them to the user. Here is my code:

import os
import cfscrape
import webbrowser
from bs4 import BeautifulSoup

spider = cfscrape.CloudflareScraper()

#Creating a session.
with spider:
    #Scraping the contents of the main page.
    data = spider.get("https://en.wikipedia.org/wiki/Jelly_fungus").content

    #Grabbing data on each of the types of jelly fungi.
    soup = BeautifulSoup(data, "lxml")
    ul_tags = soup.find_all("ul")
    mushroom_hrefs = ul_tags[1]

    #Creating list to store page links.
    links = []

    #Grabbing the page links for each jelly fungi, and appending them to the links list.
    for mushroom in mushroom_hrefs.find_all("li"):
        for link in mushroom.find_all("a", href=True):
            links.append(link["href"])

    #Creating list to store image links.
    images = []

    #Grabbing the image links from each jelly fungi's page, and appending them to the images list.
    for i, link in enumerate(links, start=1):
        link = "https://en.wikipedia.org" + link
        data = spider.get(link).content

        soup = BeautifulSoup(data, "lxml")
        fungus_info = soup.find("table", {"class": "infobox biota"})
        print(i)

        img = fungus_info.find("img")
        images.append("https:" + img["src"])

#Checking for an existing html file, if there is one, delete it.
if os.path.isfile("fungus images.html"):
    os.remove("fungus images.html")

#Iterating through the jelly fungi images and placing them accordingly in the html file.
for i, img in enumerate(images):
    links[i] = "https://en.wikipedia.org" + links[i]
    with open("fungus images.html", "a") as html:
        if i == 0:
            html.write(f"""
<!DOCTYPE html>
<html>
<head>
<title>Fungus</title>
</head>
<body>
<h1>Fungus Images</h1>
<a href="{links[i]}">
<img src="{img}">
</a>
            """)

        elif i < len(images) - 1:
            html.write(f"""
<a href="{links[i]}">
<img src="{img}">
</a>
            """)

        else:
            html.write(f"""
<a href="{links[i]}">
<img src="{img}">
</a>
</body>
</html>
            """)

webbrowser.open("fungus images.html")

At line 45, I start iterating through each fungus's page in order to find the info table that contains its image. This works for the first 17 pages, but for some reason it returns a NoneType value on the Tremeldendron fungus. I don't know why this happens, since its table has the same class as those of the other fungi.

Tags: python, beautifulsoup

Solution


The NoneType comes from the Wikipedia page you are scraping. The link you think points to your Tremeldendron fungus at that index is actually a citation reference: its href is #cite-note-3, which does not link to a Wikipedia article page, so your scrape fails. Make sure your links point at pages rather than references ;)
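A minimal sketch of the fix: filter the collected hrefs so that footnote anchors like `#cite-note-3` never reach the per-page scrape. The hrefs below are hypothetical stand-ins for what the asker's `<ul>` loop yields, not values taken from the real page.

```python
# Hypothetical hrefs, as the asker's link-collecting loop might gather them.
hrefs = [
    "/wiki/Tremella_mesenterica",        # a real article link
    "#cite_note-3",                      # a citation anchor, no article behind it
    "/wiki/Auricularia_auricula-judae",  # another real article link
]

# Keep only hrefs that point at article pages, not footnote anchors.
links = [h for h in hrefs if h.startswith("/wiki/")]
print(links)  # ['/wiki/Tremella_mesenterica', '/wiki/Auricularia_auricula-judae']

# As a second line of defence inside the scraping loop, guard the result of
# find() before using it, since find() returns None when nothing matches:
#
#     fungus_info = soup.find("table", {"class": "infobox biota"})
#     if fungus_info is None:
#         continue  # skip pages without an infobox instead of crashing
```

Either measure alone would avoid the AttributeError; doing both keeps the scraper robust against other pages that happen to lack an `infobox biota` table.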
