
Problem Description

I'm trying to extract hrefs from the Premier League website, but I can't seem to get any of the links beyond the first page:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.premierleague.com/players/')
soup = BeautifulSoup(r.content, 'lxml')
# get the player index
table = soup.find('div', {'class': 'table playerIndex'})
# <a> is where the href is stored
href_names = [link.get('href') for link in table.find_all('a')]
football_string = 'https://www.premierleague.com'
# concatenate to get the full link
full_links = [football_string + str(x) for x in href_names]

This only returns the first page. I've tried selenium, but an ad pops up on the Premier League site every time I use it, which stops it from working properly. Any ideas on how to get all of the links?
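
For reference, the ad that blocks selenium is often a cookie-consent overlay that can be clicked away before scraping. A minimal sketch, assuming Selenium with Chrome; the button id below is a OneTrust-style guess, not confirmed for this site, so verify it in the browser's dev tools:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.premierleague.com/players/')
try:
    # wait for the consent overlay and dismiss it (the id is an assumption)
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, 'onetrust-accept-btn-handler'))
    ).click()
except TimeoutException:
    pass  # no overlay appeared within the timeout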

Tags: python, web-scraping, beautifulsoup

Solution


If I understand your question correctly, the page loads its player list lazily via JavaScript from a JSON API, so rather than parsing the HTML you can page through that API directly. The following approach should work:

import requests

base = 'https://www.premierleague.com/players/{}/'
link = 'https://footballapi.pulselive.com/football/players'
payload = {
    'pageSize': '30',
    'compSeasons': '418',  # season id
    'altIds': 'true',
    'page': 0,             # incremented below to walk through every page
    'type': 'player',
    'id': '-1',
    'compSeasonId': '418'
}

with requests.Session() as s:
    # send the headers a browser would, as the API may reject bare requests
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    s.headers['referer'] = 'https://www.premierleague.com/'
    s.headers['origin'] = 'https://www.premierleague.com'

    while True:
        res = s.get(link, params=payload)
        content = res.json()['content']
        if not content:  # an empty page means there are no players left
            break
        for item in content:
            print(base.format(int(item['id'])))
        payload['page'] += 1

The output looks something like this (truncated):

https://www.premierleague.com/players/19970/
https://www.premierleague.com/players/13279/
https://www.premierleague.com/players/13286/
https://www.premierleague.com/players/10905/
https://www.premierleague.com/players/4852/
https://www.premierleague.com/players/4328/
https://www.premierleague.com/players/90665/
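
If you want the full links in a list (as in your original comprehension) rather than printed, you can collect them inside the same loop. A minimal variant, reusing s, link, payload, and base from the answer above:

# inside the `with requests.Session() as s:` block from above
player_links = []  # collect the URLs instead of printing them
while True:
    content = s.get(link, params=payload).json()['content']
    if not content:
        break
    player_links.extend(base.format(int(item['id'])) for item in content)
    payload['page'] += 1

print(len(player_links))  # roughly 30 per page times the number of pages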
