How to skip errors during web scraping with Python?

Problem description

I am new to Python and want to learn web scraping with it. My first project is the German Yellow Pages (Gelbe Seiten).

When I run my code, I get the following IndexError after scraping 12 pages:

Traceback (most recent call last):
  File "C:/Users/Zorro/PycharmProjects/scraping/venv/Lib/site-packages/pip-19.0.3-py3.6.egg/pip/_vendor/pytoml/test.py", line 25, in
    city = city_container[0].text.strip()
IndexError: list index out of range

Process finished with exit code 1

I would like to know how to skip this error so that Python does not stop scraping.

I tried using try and except blocks, but without success.

from bs4 import BeautifulSoup as soup
import requests


page_title = "/Seite-"
page_number = 1

for i in range(25):

    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"

    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")

    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:
        name_container = container.findAll("div", {"class": "h2"})
        name = name_container[0].text.strip()

        street_container = container.findAll("span", {"itemprop": "streetAddress"})
        street = street_container[0].text.strip()

        city_container = container.findAll("span", {"itemprop": "addressLocality"})
        city = city_container[0].text.strip()

        plz_container = container.findAll("span", {"itemprop": "postalCode"})
        plz_name = plz_container[0].text.strip()

        tele_container = container.findAll("li", {"class": "phone"})
        tele = tele_container[0].text.strip()

        print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
        print()

    page_number += 1

Tags: python, exception, web-scraping

Solution


OK, the formatting seems to have suffered a bit when the code was posted. Two things:

1) When web scraping, it is generally recommended to add some downtime between successive requests, so that the server doesn't kick you off and you don't tie up too many resources. I added time.sleep(5) between page requests, to wait 5 seconds before loading another page.
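As a small illustration of that idea (a sketch, not part of the original answer; `polite_pause` and its jitter parameter are hypothetical), the fixed delay can be combined with a random offset so that requests don't arrive at a perfectly regular cadence:

```python
import random
import time


def polite_pause(base=5.0, jitter=2.0):
    """Sleep for `base` seconds plus up to `jitter` extra seconds,
    and return the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Inside the scraping loop you would call `polite_pause()` in place of the plain `time.sleep(5)`.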

2) For me, try/except worked fine once you add pass to the except clause. Of course, you can get more sophisticated about how you handle the exceptions.

from bs4 import BeautifulSoup as soup
import requests
import time


page_title = "/Seite-"
page_number = 1

for i in range(25):
    print(page_number)
    time.sleep(5)
    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"

    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")

    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:

        try:
            name_container = container.findAll("div", {"class": "h2"})
            name = name_container[0].text.strip()

            street_container = container.findAll("span", {"itemprop": "streetAddress"})
            street = street_container[0].text.strip()

            city_container = container.findAll("span", {"itemprop": "addressLocality"})
            city = city_container[0].text.strip()

            plz_container = container.findAll("span", {"itemprop": "postalCode"})
            plz_name = plz_container[0].text.strip()

            tele_container = container.findAll("li", {"class": "phone"})
            tele = tele_container[0].text.strip()

            print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
            print()

        except:
            pass

    page_number += 1
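One caveat about the bare `except: pass` above: it silences every failure, including bugs you would want to see. A slightly more selective sketch (the helper name `first_text` is hypothetical, and plain strings stand in here for the `.text` of BeautifulSoup tags) catches only the empty-result-list case that caused the original IndexError:

```python
def first_text(items, default=""):
    """Return the stripped text of the first element of a findAll-style
    result list, or `default` if the list is empty."""
    try:
        return items[0].strip()
    except IndexError:
        return default


# A non-empty result list yields its first entry, stripped.
print(first_text(["  Trattoria Roma  "]))
# An empty result list (e.g. a listing without a phone number) falls
# back to the default instead of crashing the whole scrape.
print(first_text([], default="(no phone)"))
```

With a helper like this, each field in the loop gets a sensible placeholder and one missing field no longer skips the entire listing.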
