Web Scraping - Understat top players data

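Problem description

I have been trying to scrape data from the Understat website (https://understat.com/league/EPL). I can easily scrape the top players' data, but I cannot do the same for the top teams' data. Please help me solve this. Here is my code.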

import json
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

scrape_url="https://understat.com/league/EPL/2020"
page_connect = urlopen(scrape_url)

page_html=BeautifulSoup(page_connect, 'html.parser')
page_html.findAll(name="script")

json_raw_string= page_html.findAll(name="script")[1].string
json_raw_string

start_ind = json_raw_string.index("\\")
stop_ind = json_raw_string.index("')")

data = json_raw_string[start_ind:stop_ind]
data = data.encode("utf8").decode("unicode_escape")
json.loads(data)

df = pd.json_normalize(json.loads(data))
df.head()

Tags: python, html, web-scraping

Solution

The team data is in the script element at index 2, but you need to build the final table yourself. For example:

import json
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

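scrape_url = "https://understat.com/league/EPL/2020"
page_connect = urlopen(scrape_url)

page_html = BeautifulSoup(page_connect, "html.parser")

# The team data is in the third <script> element (index 2).
json_raw_string = page_html.find_all(name="script")[2].string

# The JSON is embedded as an escaped string literal: it starts at the first
# backslash and ends just before the closing "')".
start_ind = json_raw_string.index("\\")
stop_ind = json_raw_string.index("')")

data = json_raw_string[start_ind:stop_ind]
# Turn the escape sequences back into plain characters.
data = data.encode("utf8").decode("unicode_escape")

data = json.loads(data)

# One row per team, then one row per entry in each team's "history" list.
df = pd.DataFrame(data.values())
df = df.explode("history")
# Expand each history dict into its own columns next to the team columns.
h = df.pop("history")
df = pd.concat([df.reset_index(drop=True), pd.DataFrame(h.tolist())], axis=1)

# For example, print total xG per team:
print(df.groupby("title")["xG"].sum().sort_values(ascending=False))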

Running the scraping script prints each team's total xG, sorted in descending order:

title
Manchester City            77.715218
Liverpool                  72.207518
Chelsea                    68.655594
Manchester United          63.172237
West Ham                   60.338271
Leeds                      59.258638
Leicester                  58.800116
Aston Villa                56.715489
Tottenham                  56.676279
Brighton                   53.819028
Arsenal                    52.247381
Everton                    49.237118
Southampton                45.284568
Newcastle United           43.959188
Fulham                     41.055309
Wolverhampton Wanderers    38.619038
Burnley                    38.127929
Crystal Palace             35.286608
West Bromwich Albion       34.971290
Sheffield United           33.159177
Name: xG, dtype: float64
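
The extraction and reshaping steps can also be tried offline, which may help when debugging. The following is only a minimal sketch under stated assumptions: `sample_script`, `sampleData`, the team names, and the numbers are all made up for illustration, and the helper names `extract_json` and `flatten_history` are hypothetical. The only thing carried over from the answer is the technique: slice from the first backslash to `')`, decode with `unicode_escape`, then `explode` the `history` list and expand it into columns.

```python
import json

import pandas as pd

# Made-up payload with the same shape the answer relies on:
# a dict of teams, each with a "title" and a "history" list of dicts.
payload = {
    "1": {"id": "1", "title": "Team A",
          "history": [{"xG": 1.5, "scored": 2}, {"xG": 0.5, "scored": 0}]},
    "2": {"id": "2", "title": "Team B",
          "history": [{"xG": 2.25, "scored": 3}]},
}

# Hex-escape every character so the embedded string starts with a backslash.
# This is an illustrative stand-in, not a copy of the real page content.
escaped = "".join(f"\\x{ord(ch):02X}" for ch in json.dumps(payload))
sample_script = f"var sampleData = JSON.parse('{escaped}');"


def extract_json(script_text):
    """Slice out the escaped JSON literal and decode it into Python objects."""
    start = script_text.index("\\")  # first backslash: start of the literal
    stop = script_text.index("')")   # closing quote and parenthesis
    raw = script_text[start:stop]
    return json.loads(raw.encode("utf8").decode("unicode_escape"))


def flatten_history(data):
    """Return one row per (team, history entry) with history keys as columns."""
    df = pd.DataFrame(list(data.values()))
    df = df.explode("history")       # one row per history entry
    history = df.pop("history")      # Series of dicts
    return pd.concat(
        [df.reset_index(drop=True), pd.DataFrame(history.tolist())],
        axis=1,
    )


data = extract_json(sample_script)
table = flatten_history(data)
totals = table.groupby("title")["xG"].sum().sort_values(ascending=False)
print(totals)
```

Because slicing by the first backslash and by `')` depends on how the page embeds its data, checking the decoded result against a known small input like this is a cheap way to confirm the slicing logic before pointing it at the live site.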
