How can I use bs4 to find a div nested inside another div?

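Problem description

I am trying to build a Python script that scrapes live scores from the UEFA website, but I cannot find the element that contains the match score, because it is nested inside another div.

Here is the code: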

from datetime import date
import requests
from bs4 import BeautifulSoup

today = date.today()
d = today.strftime("%Y-%m-%d")

page = requests.get("https://www.uefa.com/livescores/?date=" + d)

soup = BeautifulSoup(page.content, "html.parser")

matches_list = soup.find_all("div", class_="matches-list")

print(matches_list) 

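I would like to know whether I can search for that element directly from the top level, without drilling down through three levels of nesting.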

Tags: python, html, web-scraping, beautifulsoup

Solution


This site loads its data through an API call:

GET https://match.uefa.com/v2/matches

with query parameters for the date range, pagination, and competition IDs.
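As a rough illustration of how those query parameters end up in the request URL, the sketch below builds the query string with the standard library only, without touching the network. The parameter names are the ones used in the answer's code; the helper name `build_matches_url`, the fixed date, and the shortened competition ID list are placeholders chosen for this example.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://match.uefa.com/v2/matches"


def build_matches_url(day, competition_ids, offset=0, limit=100):
    """Return the request URL for one day's matches (hypothetical helper)."""
    params = {
        "fromDate": day.isoformat(),  # date filter, formatted as YYYY-MM-DD
        "toDate": day.isoformat(),
        "order": "ASC",
        "offset": offset,             # pagination parameters
        "limit": limit,
        # the IDs are sent as one comma-separated string
        "competitionId": ",".join(str(c) for c in competition_ids),
    }
    return BASE_URL + "?" + urlencode(params)


url = build_matches_url(date(2021, 6, 20), [18, 39, 14])
print(url)
```

Passing the same dictionary to `requests.get(..., params=...)` should yield an equivalent URL, since requests encodes the parameters for you.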

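It requires an API key that is embedded in a JavaScript tag on the page. One solution is to extract this API key with a regular expression and then use requests to make the call: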

from datetime import date
import requests
import re

today = date.today()
d = today.strftime("%Y-%m-%d")

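r = requests.get("https://www.uefa.com/livescores/?date=" + d)
# the key sits in an inline script; capture the quoted value after "apiKey"
reg = re.search("apiKey.*['\"](.*)['\"]", r.text)
if reg is None:
    raise RuntimeError("apiKey not found in page source")
apiKey = reg.group(1)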

r = requests.get("https://match.uefa.com/v2/matches",
                 params={
                     "fromDate": d,
                     "toDate": d,
                     "order": "ASC",
                     "offset": 0,
                     "limit": 100,
                     "competitionId": "18,39,14,27,38,22,19,2014,2017,5,28,9,1,13,3,2018,101,17,2008,23"
                 },
                 headers={
                     "x-api-key": apiKey
                 })
result = r.json()
data = [{
    "awayTeam": t["awayTeam"]["internationalName"],
    "homeTeam": t["homeTeam"]["internationalName"],
    "datetime": t["kickOffTime"]["dateTime"],
    "score": t["score"]["total"] if t.get("score") else {},
    "winner": {
        "reason": t["winner"]["match"]["reason"],
        "team": t["winner"]["match"]["team"]["internationalName"] if t["winner"]["match"].get("team") else ""
    } if t.get("winner") else {}
}
    for t in result
]
print(data)

This prints the match information, including scores where they are available at the time of the request:

[{
    'awayTeam': 'Turkey',
    'homeTeam': 'Switzerland',
    'datetime': '2021-06-20T16:00:00Z',
    'score': {},
    'winner': {}
}, {
    'awayTeam': 'Wales',
    'homeTeam': 'Italy',
    'datetime': '2021-06-20T16:00:00Z',
    'score': {},
    'winner': {}
}]
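To see what the regular expression in the answer actually captures, here is a small offline sketch. The `sample_html` and `one_line` strings are hypothetical stand-ins for the page source; the real markup may look different.

```python
import re

PATTERN = "apiKey.*['\"](.*)['\"]"  # same pattern as in the answer

# Hypothetical page fragment: the key sits on its own line in an inline script.
sample_html = """
<script>
  var config = {
    apiKey: "abc123",
    locale: "en"
  };
</script>
"""

match = re.search(PATTERN, sample_html)
api_key = match.group(1) if match else None
print(api_key)  # abc123

# Hypothetical one-line variant with two quoted strings on the same line.
one_line = 'apiKey: "abc123", locale: "en"'
last_value = re.search(PATTERN, one_line).group(1)
print(last_value)  # en
```

Because the first `.*` is greedy, the pattern captures the last quoted string on the line that contains `apiKey`. That is fine when the key is alone on its line, but it would pick up the wrong value if the script were minified onto a single line.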


Edit

It turns out you don't even need the API key, which makes this simpler:

from datetime import date
import requests

today = date.today()
d = today.strftime("%Y-%m-%d")

r = requests.get("https://match.uefa.com/v2/matches",
                 params={
                     "fromDate": d,
                     "toDate": d,
                     "order": "ASC",
                     "offset": 0,
                     "limit": 100,
                     "competitionId": "18,39,14,27,38,22,19,2014,2017,5,28,9,1,13,3,2018,101,17,2008,23"
                 })
result = r.json()
data = [{
    "awayTeam": t["awayTeam"]["internationalName"],
    "homeTeam": t["homeTeam"]["internationalName"],
    "datetime": t["kickOffTime"]["dateTime"],
    "score": t["score"]["total"] if t.get("score") else {},
    "winner": {
        "reason": t["winner"]["match"]["reason"],
        "team": t["winner"]["match"]["team"]["internationalName"] if t["winner"]["match"].get("team") else ""
    } if t.get("winner") else {}
}
    for t in result
]
print(data)
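The list comprehension above guards the optional `score` and `winner` keys with `.get()`. The sketch below pulls the same logic into a function and runs it on two hand-written records, so the behavior can be checked without calling the API. The function name `summarize` and every value in the sample records (team names, the shape of `score["total"]`, the `reason` string) are invented for illustration; only the key names come from the answer's code.

```python
def summarize(match):
    """Flatten one match record into the fields the answer prints."""
    winner = match.get("winner")
    return {
        "awayTeam": match["awayTeam"]["internationalName"],
        "homeTeam": match["homeTeam"]["internationalName"],
        "datetime": match["kickOffTime"]["dateTime"],
        # score and winner are missing until they are known
        "score": match["score"]["total"] if match.get("score") else {},
        "winner": {
            "reason": winner["match"]["reason"],
            "team": (winner["match"]["team"]["internationalName"]
                     if winner["match"].get("team") else ""),
        } if winner else {},
    }


# Hypothetical record for a match that has not started yet.
upcoming = {
    "awayTeam": {"internationalName": "Team B"},
    "homeTeam": {"internationalName": "Team A"},
    "kickOffTime": {"dateTime": "2021-06-20T16:00:00Z"},
}

# Hypothetical record for a finished match; the inner shapes are made up.
finished = dict(
    upcoming,
    score={"total": {"home": 1, "away": 0}},
    winner={"match": {"reason": "EXAMPLE_REASON",
                      "team": {"internationalName": "Team A"}}},
)

print(summarize(upcoming))
print(summarize(finished))
```

A record without `score` or `winner` yields empty dictionaries for those fields, which matches the sample output shown earlier for matches that had not been played yet.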
