What is the best plot to show the increase in medals won over the years in the Olympics dataset?

Problem description

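I am trying to learn pandas by playing with Kaggle's Olympics dataset. The animated scatter plot in Plotly Express (the one built on the gapminder dataset) looks very impressive. I am trying to make a similar plot showing how the total number of medals won by different countries at the Olympics has changed over the years.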

After some aggregation steps, I have a dataframe, df3, with the columns Country, Year, Total and Cum_Total.

This is what I tried:

px.scatter(df3, x='Cum_Total', 
           y='Total', 
           animation_group='Country',
           animation_frame='Year',
           size='Cum_Total', size_max=100,
           color='Country', hover_name='Country',
           range_y=[1,300], range_x=[1,3000])

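The animation does not look continuous: points seem to appear and disappear as discrete dots. I tried sorting the dataframe by "Year" and "Cum_Total" before plotting, but I am still not happy with the output. Can someone help me understand where I am going wrong?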

Tags: pandas, plotly

Solution


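  • It is all about data preparation, i.e. pandas
  • The dataframe needs to be uniform across years, countries and medals: every year must have a row for every country and every medal
  • Note that this is the combined total of Summer and Winter Olympics medals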
import kaggle.cli
import sys, math
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import plotly.express as px

# download data set
sys.argv = [
    sys.argv[0]
] + "datasets download heesoo37/120-years-of-olympic-history-athletes-and-results".split(
    " "
)
kaggle.cli.main()

zfile = ZipFile("120-years-of-olympic-history-athletes-and-results.zip")
zfile.infolist()

# use CSV
df = pd.read_csv(zfile.open(zfile.infolist()[0]))
dfnoc = pd.read_csv(zfile.open(zfile.infolist()[1]))

idcols = [
    "Year",
    "Medal",
    "NOC",
]
# make data uniform.  i.e. every year has every country and every medal
dfp = (
    pd.merge(
        # calc medals for each year and country
        df.dropna(subset=["Medal"]).groupby(idcols).size().to_frame(),
        pd.DataFrame(
            index=pd.MultiIndex.from_product(
                [df[c].dropna().unique() for c in idcols],
                names=idcols,
            )
        ),
        on=idcols,
        how="right",
    )
    .fillna(0)
    .sort_values(idcols)
    # calc cum sums across years
    .groupby(idcols[1:])
    .cumsum()
    .rename(columns={0: "Cum Count"})
    .reset_index()
    .merge(dfnoc, on="NOC")
)

# what countries are top performing... too many countries to fit on y-axis
topx = (
    dfp.loc[dfp["Year"].eq(dfp["Year"].max())]
    .groupby("NOC")
    .agg({"Cum Count": "sum"})
    .sort_values("Cum Count", ascending=False)
    .head(50)
)
# px.scatter(df2, x="Year", y="Team", size=0, color="Medal", animation_frame="Year")
fig = px.scatter(
    dfp.loc[dfp["NOC"].isin(topx.index)],
    x="region",
    y="Cum Count",
    log_y=True,
    size="Cum Count",
    size_max=100,
    color_discrete_sequence=["brown","gold","silver"],
    color="Medal",
    animation_frame="Year",
)
fig.update_layout(
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
    yaxis={"range": [0, math.ceil(math.log10(dfp["Cum Count"].max()))]},
)
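
The "make the data uniform" step is easier to see on a small frame shaped like the question's df3. The following is only a sketch with made-up numbers: the column names Country, Year, Total and Cum_Total come from the question, while the values and the helper names full_index and uniform are assumptions for illustration. It fills in every missing (Country, Year) pair with 0 medals and recomputes the running total, so each country has a point in every animation frame.

```python
import pandas as pd

# Hypothetical medal counts with gaps: CAN has no row for 1996, USA none for 2000.
df3 = pd.DataFrame(
    {
        "Country": ["USA", "USA", "CAN", "CAN"],
        "Year": [1992, 1996, 1992, 2000],
        "Total": [108, 101, 18, 14],
    }
)

# Every (Country, Year) pair, whether or not medals were won that year.
full_index = pd.MultiIndex.from_product(
    [sorted(df3["Country"].unique()), sorted(df3["Year"].unique())],
    names=["Country", "Year"],
)

uniform = (
    df3.set_index(["Country", "Year"])
    .reindex(full_index, fill_value=0)  # missing pairs become 0 medals
    .reset_index()
    .sort_values(["Country", "Year"])
)
# Running total per country, so a year without medals keeps the earlier total.
uniform["Cum_Total"] = uniform.groupby("Country")["Total"].cumsum()

print(uniform)
```

Passing a frame prepared this way to the px.scatter call from the question, with animation_frame='Year' and animation_group='Country', should give each country a marker in every frame instead of points that appear and disappear.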
