Scraping tables in Python with Beautiful Soup: AttributeError: 'NoneType'

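Problem description

I am scraping two tables from this web page: https://www.transfermarkt.com/premier-league/legionaereeinsaetze/wettbewerb/GB1/plus/?option=spiele&saison_id=2017&altersklasse=alle

I am trying to get data for many countries and many years, so I set up a list containing the country URLs.

Here is my code: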

for l in range(0, len(league_urls)):
    time.sleep(0.5)
    #The second loop is for each year we want to scrape
    for n in range(2007,2020):
        time.sleep(0.5)
        df_soccer1 = None
        url = league_urls[l] + str(n) + str('&altersklasse=alle')
        headers = {"User-Agent":"Mozilla/5.0"}
        response = requests.get(url, headers=headers, verify=False)
        time.sleep(0.5)
        soup = BeautifulSoup(response.text, 'html.parser')

        #Table 1 with information about the value
        table = soup.find("table", {"class" : "items"})

        team = []
        players_used = []
        minutes_nonforeign = []
        minutes_foreign = []

        for row in table.find_all('tr')[1:]:
            try:
                col = row.find_all('td')
                team_ = col[1].text
                players_used_ = col[2].text
                minutes_nonforeign_ = col[3].text
                minutes_foreign_ = col[4].text
                team.append(team_)
                players_used.append(players_used_)
                minutes_nonforeign.append(minutes_nonforeign_)
                minutes_foreign.append(minutes_foreign_)
            except:
                team.append('')
                players_used.append('')
                minutes_nonforeign.append('')
                minutes_foreign.append('')

        team = [elem.replace('\n','').replace('\xa0','').strip() for elem in team]
        
        #Table 2 with information about placement, goals and points
        df_soccer2 = None

        table2 = soup.find("div", {"class" : "box tab-print"})

        team2 = []
        place = []
        matches = []
        difference = []
        pts = []

        for row in table2.find_all('tr'):
            try:
                col = row.findAll('td')
                team2_ = col[2].text
                place_  = col[0].text
                matches_ = col[3].text
                difference_ = col[4].text
                pts_ = col[5].text
                team2.append(team2_)
                place.append(place_)
                matches.append(matches_)
                difference.append(difference_)
                pts.append(pts_)
            except:
                team2.append('')
                place.append('')
                matches.append('')
                difference.append('')
                pts.append('')
               

        team2 = [elem.replace('\n','').replace('\xa0','').strip() for elem in team2]

        df_soccer1 = pd.DataFrame({'Team': team[1:], 'Season': [n]*(len(team)-1), 'Players used': players_used[1:], 
                                    'Minutes nonforeign': minutes_nonforeign[1:], 'Minutes foreign': minutes_foreign[1:]})
        
        df_soccer2 = pd.DataFrame({'Team': team2, 'Place': place, 'Matches': matches, 'Difference': difference,
                                     'Points': pts})

When scraping the first table, I run into this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-46-b4cd681f68e8> in <module>
     21         minutes_foreign = []
     22 
---> 23         for row in table.find_all("tr")[1:]:
     24             try:
     25                 col = row.find_all('td')

AttributeError: 'NoneType' object has no attribute 'find_all'

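Note that league_urls is a long list of URLs.

I used similar code on another part of the website and it worked fine. I can't figure out why it doesn't work for this one.

Also, when I run the code with only one URL, it works fine. Could the problem come from the fact that I loop over 12 years for 55 different URLs?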

Tags: python, web-scraping, beautifulsoup

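Solution


soup.find() returns None when the page contains no matching element, and calling find_all on None raises the AttributeError above. Test whether table is None before using it, for example: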

import requests
from bs4 import BeautifulSoup

url = 'https://www.transfermarkt.com/remier-liga/legionaereeinsaetze/wettbewerb/RU1/plus/?option=spiele&saison_id=2011&altersklasse=alle'
headers = {"User-Agent":"Mozilla/5.0"}
response = requests.get(url, headers=headers, verify=False)
#time.sleep(0.5)
soup = BeautifulSoup(response.text, 'html.parser')

#Table 1 with information about the value
table = soup.find("table", {"class" : "items"})

team = []
players_used = []
minutes_nonforeign = []
minutes_foreign = []

# Only walk the rows if the table was actually found on this page
if table is not None:
    for row in table.find_all('tr')[1:]:
        try:
            col = row.find_all('td')
            team_ = col[1].text
            players_used_ = col[2].text
            minutes_nonforeign_ = col[3].text
            minutes_foreign_ = col[4].text
            team.append(team_)
            players_used.append(players_used_)
            minutes_nonforeign.append(minutes_nonforeign_)
            minutes_foreign.append(minutes_foreign_)
        except IndexError:
            # Row has fewer cells than expected: keep a blank placeholder
            team.append('')
            players_used.append('')
            minutes_nonforeign.append('')
            minutes_foreign.append('')
else:
    # No matching table on this page: keep a blank placeholder
    team.append('')
    players_used.append('')
    minutes_nonforeign.append('')
    minutes_foreign.append('')
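
The same None check can also be wrapped in a small helper so the loop over leagues and seasons stays short. The following is only a minimal sketch, not the original poster's code: the function name parse_items_table and the two inline HTML snippets are made up for illustration, and it skips incomplete rows instead of appending blank placeholders.

```python
from bs4 import BeautifulSoup


def parse_items_table(html):
    """Return the data rows of the 'items' table, or [] if the table is missing.

    Hypothetical helper for illustration; the column positions follow the
    question's code (team, players used, minutes nonforeign, minutes foreign).
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "items"})
    if table is None:
        # find() returns None when nothing matches, so there are no rows to read.
        return []

    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cols = row.find_all("td")
        if len(cols) < 5:
            continue  # not a full data row; skipped rather than padded with ''
        rows.append({
            "Team": cols[1].get_text(strip=True),
            "Players used": cols[2].get_text(strip=True),
            "Minutes nonforeign": cols[3].get_text(strip=True),
            "Minutes foreign": cols[4].get_text(strip=True),
        })
    return rows


# Made-up HTML standing in for real responses (assumption, not site data).
good_html = """
<table class="items">
  <tr><th>#</th><th>Club</th><th>Used</th><th>Non-foreign</th><th>Foreign</th></tr>
  <tr><td>1</td><td>Club A</td><td>25</td><td>1000</td><td>2000</td></tr>
</table>
"""
bad_html = "<html><body><p>No data</p></body></html>"

print(parse_items_table(good_html))
print(parse_items_table(bad_html))
```

Inside the loop, an empty result could be handled by printing the url together with response.status_code and moving on with continue. That would show which league and season pages come back without the table, which may be what happens for some of the URLs in a long list, although the question does not confirm the cause.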
