Beautiful Soup: trying to get the children of a div

Question

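I am trying to get the team names and scores for the Overwatch League from: https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1

What I need to do is scrape a series of child elements from the children of a larger div.

So far I have: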

match_schedule = []
matches = bs.find_all('div', {'class': 'schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt'})

for match in matches:
    rows = match.find_all('div', {'class': 'schedule-boardstyles__ContainerMatchCard-j4x5cc-9 esCuul match-cardstyles__Container-sc-1rgscfz-0 doBeIs'})
    print("here")
    for row in rows:
        print('here2')
        team1 = row.find('p', {'class': 'match-cardstyles__MiddleText-sc-1rgscfz-12 hueupq'})
        score1 = row.find('p', {'class': 'match-cardstyles__ScoreText-sc-1rgscfz-23 gOtrSB'})
        score2 = row.find('p', {'class': 'match-cardstyles__ScoreText-sc-1rgscfz-23 jRejaZ'})
        team2 = row.find('p', {'class': 'match-cardstyles__MiddleText-sc-1rgscfz-12 cLYgmY'})
        temp = 'team_1:{}, score":{}-{}", team_2:{}'.format(team1.text, score1.text, score2.text, team2.text)
        print(temp)
        match_schedule.append(temp)

But it returns nothing, not even from the initial find_all for the matches. What am I doing wrong?

Tags: python, html, beautifulsoup

Answer


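The information is generated dynamically, so a browser would normally be needed to build it. The HTML that requests downloads does not contain the match card divs, which is why the find_all calls return nothing. However, the data can also be extracted in two steps using the site's API. First, request the main page to determine the required schedule ID. That ID can then be used to request the relevant matches. The information is returned in JSON format.

For example: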

import json

import requests
from bs4 import BeautifulSoup

url = "https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1"
session = requests.Session()

# Step 1: load the schedule page and read the JSON embedded in the
# __NEXT_DATA__ script tag to find the schedule ID.
r_main = session.get(url)
soup = BeautifulSoup(r_main.content, "html.parser")
js = soup.find('script', id="__NEXT_DATA__")
data_main = json.loads(js.string)
# The index [2] reflects the page layout when this was written and may change.
schedule = data_main['props']['pageProps']['blocks'][2]['schedule']['uid']

# Headers taken from the request the browser sends to the API.
headers = {
    "Referer" : "https://overwatchleague.com/",
    "x-origin" : "overwatchleague.com",
    "Origin" : "https://overwatchleague.com",
    "DNT": "1",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
}

# Step 2: request week 1 of that schedule from the API; the reply is JSON.
r_schedule = session.get(f'https://wzavfvwgfk.execute-api.us-east-2.amazonaws.com/production/v2/content-types/schedule/{schedule}/week/1?locale=en-us', headers=headers)
data_schedule = r_schedule.json()

matches = []

# Collect one (team, score, team, score) row per match and print it.
for match in data_schedule['data']['tableData']['events'][0]['matches']:
    competitors = [c['name'] for c in match['competitors']]
    scores = match['scores']
    row = (competitors[0], scores[0], competitors[1], scores[1])
    matches.append(row)

    print(f"{row[0]:25}  {row[1]:2}  {row[2]:25}  {row[3]}")

This gives you:

Houston Outlaws             3  Dallas Fuel                2
Los Angeles Gladiators      1  San Francisco Shock        3
Guangzhou Charge            0  Shanghai Dragons           3
Los Angeles Valiant         1  Chengdu Hunters            3
Philadelphia Fusion         3  Seoul Dynasty              1
Toronto Defiant             3  Vancouver Titans           1
Atlanta Reign               1  Florida Mayhem             3
Dallas Fuel                 3  Los Angeles Gladiators     1
Guangzhou Charge            0  Seoul Dynasty              3
Chengdu Hunters             3  Shanghai Dragons           0
Philadelphia Fusion         3  Los Angeles Valiant        0
Houston Outlaws             3  San Francisco Shock        2
Florida Mayhem              3  Vancouver Titans           1
Toronto Defiant             3  Atlanta Reign              2
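
The script reads the schedule ID from a fixed position, blocks[2], which depends on how the page happened to be laid out. As a hedged alternative, the sketch below searches the blocks for the first one that carries a schedule entry with a uid. The helper name find_schedule_uid and the sample dictionary are hypothetical; the sample only imitates the shape the script relies on and is not real site data.

```python
def find_schedule_uid(page_data):
    """Return the uid of the first block that has a 'schedule' entry.

    Assumes the same shape the script above relies on:
    page_data['props']['pageProps']['blocks'] is a list of dicts.
    """
    blocks = page_data["props"]["pageProps"]["blocks"]
    for block in blocks:
        schedule = block.get("schedule") if isinstance(block, dict) else None
        if isinstance(schedule, dict) and "uid" in schedule:
            return schedule["uid"]
    raise KeyError("no block with a schedule uid found")


# Hypothetical stand-in for data_main; the real payload is much larger.
sample = {
    "props": {
        "pageProps": {
            "blocks": [
                {"header": {}},
                {"schedule": {"uid": "example-uid"}},
            ]
        }
    }
}

print(find_schedule_uid(sample))
```

With the real page, the call would be find_schedule_uid(data_main) in place of the hard-coded lookup. If no block matches, the KeyError makes the failure explicit instead of silently picking the wrong block.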

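I strongly recommend printing out the JSON, for example data_schedule, to get a better understanding of all the information that is returned. The other details in the script were obtained by using the browser's developer tools to see which requests are made as the page loads.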
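
Printing a large JSON reply in full can be hard to read. One option is json.dumps(data_schedule, indent=2); another, sketched below, is to print only an outline of the keys and value types down to a chosen depth. The helper name outline and the sample data are hypothetical; the sample merely mimics the nesting the script walks through.

```python
def outline(node, depth=0, max_depth=3, indent="  "):
    """Return lines listing keys and value types of nested JSON data.

    For lists, only the first element is described.
    """
    lines = []
    if depth > max_depth:
        return lines
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent * depth}{key}: {type(value).__name__}")
            lines.extend(outline(value, depth + 1, max_depth, indent))
    elif isinstance(node, list) and node:
        first = node[0]
        lines.append(f"{indent * depth}[0 of {len(node)}]: {type(first).__name__}")
        lines.extend(outline(first, depth + 1, max_depth, indent))
    return lines


# Hypothetical sample shaped like the part of the reply used above.
sample = {
    "data": {
        "tableData": {
            "events": [
                {"matches": [{"competitors": [{"name": "A"}, {"name": "B"}],
                              "scores": [3, 2]}]}
            ]
        }
    }
}

print("\n".join(outline(sample, max_depth=2)))
```

Raising max_depth reveals deeper levels one step at a time, which helps locate fields such as competitors and scores without scrolling through the whole reply.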

