Scraping multiple pages with bs4 Beautiful Soup - only scrapes the first page

Problem description

*** My code is for practice only!

I'm trying to scrape the name and team of every FPL player from their website https://www.premierleague.com/, but I've run into a problem with my code.

The problem is that it only fetches the page with "-1" at the end of the URL, which I didn't even include in my list of pages!

There is no logic to the pages - the base URL is https://www.premierleague.com/players?se=363&cl= and the numbers after the "=" seem random. So I created a list of numbers and used a for loop to append them to the URL:

My code:

import requests
from bs4 import BeautifulSoup
import pandas

plplayers = []

pl_url = 'https://www.premierleague.com/players?se=363&cl='
pages_list = ['1', '2', '131', '34']
for page in pages_list:
    # Fetch one listing page and collect the player links on it
    r = requests.get(pl_url + page)
    c = r.content
    soup = BeautifulSoup(c, 'html.parser')
    player_names = soup.find_all('a', {'class': 'playerName'})

    for x in player_names:
        player_d = {}
        player_teams = []
        # Follow each player's link to their profile page
        player_href = x.get('href')
        player_info_url = 'https://www.premierleague.com/' + player_href
        player_r = requests.get(player_info_url, headers=headers)
        player_c = player_r.content
        player_soup = BeautifulSoup(player_c, 'html.parser')
        # Pull the player's teams out of the history table
        team_tag = player_soup.find_all('td', {'class': 'team'})
        for team in team_tag:
            try:
                team_name = team.find('span', {'class': 'long'}).text
                if '(Loan)' in team_name:
                    team_name.replace('  (Loan) ', '')
                if team_name not in player_teams:
                    player_teams.append(team_name)
                player_d['NAME'] = x.text
                player_d['TEAMS'] = player_teams
            except:
                pass
        plplayers.append(player_d)

df = pandas.DataFrame(plplayers)
df.to_csv('plplayers.txt')

Tags: python, web-scraping, beautifulsoup

Solution

I would have left this as a comment, but I'm new and don't have enough reputation, so I have to put it in an answer.

It looks like when you make the request that you store in player_r, you specify a headers argument, but you never actually create a headers variable.

If you instead replace player_r = requests.get(player_info_url, headers=headers) with player_r = requests.get(player_info_url), your code should run perfectly. At least, it did on my machine.
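
Alternatively, if you do want to send headers (some sites reject requests that lack a browser-like User-Agent), you can define the headers variable yourself before the loop. A minimal sketch, where the User-Agent string is only an illustrative value, not something required by the site:

# Hypothetical fix: define headers before using them in requests.get
headers = {'User-Agent': 'Mozilla/5.0'}
player_r = requests.get(player_info_url, headers=headers)

Either way, the point is the same: headers must exist as a variable before it is passed to requests.get, otherwise the call raises a NameError.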

