Error when web scraping with Python BeautifulSoup: extracting content from a GitHub profile

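Problem description

This is Python code that uses the BeautifulSoup library to scrape content from a GitHub repositories page. I am getting the error:

'NoneType' object has no attribute 'text'

in this simple code. The error occurs on the two lines I have marked with comments.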

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for':'your-repos-filter'}) 

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    repo['desc'] = row.find('div').p.text  # first error position
    repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text  # second error position
    repos.append(repo) 

filename = 'extract.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

Output

Traceback (most recent call last):
  File "webscrapping.py", line 16, in <module>
    repo['desc'] = row.find('div').p.text
AttributeError: 'NoneType' object has no attribute 'text'

Tags: python, web-scraping, beautifulsoup

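Solution

This happens because looking up an element with BeautifulSoup behaves like a dict.get() call. When you call find, it tries to get the element from the element tree. If it cannot find one, it returns None instead of raising an Exception. None does not have the attributes an element would have, such as text, attrs, and so on. So when you access .text on the result without a try/except and without checking the type first, you are betting that the element will always exist.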
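To make that concrete, here is a minimal sketch. It assumes a small hand-written HTML snippet (not GitHub's real markup) in which the second list item has no description paragraph, so the lookup on that row comes back as None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: the second <li> has no <p>.
html = """
<ul>
  <li><div><h3><a>repo-one</a></h3><p>First repo</p></div></li>
  <li><div><h3><a>repo-two</a></h3></div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("li")

# Attribute-style access (.p) is shorthand for find("p"):
# it returns None when nothing matches instead of raising.
first_desc = rows[0].find("div").p
second_desc = rows[1].find("div").p

print(first_desc.text)  # First repo
print(second_desc)      # None

# Accessing .text on None is what produces the error in the question.
try:
    second_desc.text
except AttributeError as exc:
    message = str(exc)
    print(message)
```

The same thing happens on the real page whenever a repository has no description or no language tag.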

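I would probably first store the element that is causing the problem in a temporary variable so that you can check its type. The alternative is to use try/except.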

Type checking

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text


    p = row.find('div').p
    if p is not None:
        repo['desc'] = p.text
    else:
        repo['desc'] = None

    lang = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'})

    if lang is not None:
        # The span exists, so its text is safe to read
        repo['lang'] = lang.text
    else:
        repo['lang'] = None
    repos.append(repo)
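
If the same None check is needed for several fields, it can be factored into a small helper. This is only a sketch: `text_or_default` is a hypothetical name, not part of BeautifulSoup, and the `SimpleNamespace` object merely stands in for an element that was found.

```python
from types import SimpleNamespace


def text_or_default(node, default=None):
    """Return the stripped text of node, or default when node is None."""
    if node is None:
        return default
    return node.text.strip()


# Stand-in for a found element; a real BeautifulSoup tag also exposes .text.
found = SimpleNamespace(text="  Python \n")

print(text_or_default(found))        # Python
print(text_or_default(None))         # None
print(text_or_default(None, "n/a"))  # n/a
```

Inside the loop this would be used as `repo['desc'] = text_or_default(row.find('div').p)`. Note that it only guards the last step of a chain: if an earlier find in the chain returns None, the following call on it still fails.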

Try/except

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    try:
        repo['desc'] = row.find('div').p.text  # first error position
    except AttributeError:  # raised when .p is None
        repo['desc'] = None
    try:
        repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text  # second error position
    except AttributeError:  # raised when either find returns None
        repo['lang'] = None
    repos.append(repo)
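
One detail matters for the try/except variant: accessing an attribute on None raises AttributeError, as the traceback in the question shows, so that is the exception the handler needs to catch. A handler for TypeError would let the error through. The short check below illustrates this, with a plain None standing in for the result of a find that matched nothing.

```python
missing = None  # stands in for the result of a find() that matched nothing

try:
    missing.text
except TypeError:
    caught = "TypeError"
except AttributeError:
    caught = "AttributeError"

print(caught)  # AttributeError
```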

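Personally, I lean towards try/except because it is more concise, and catching exceptions is a good habit for making a program more robust.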

