Sorting a pandas DataFrame series by month and year?

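Question

I am very new to Python and pandas, and I am struggling with handling dates and times.

I have a DataFrame containing the dates on which different events occurred. I want to bin and sort the events by when they occurred (month and year). Each event has a corresponding count. I want to sort the data chronologically and get the sum of the events that occurred in each month. The code I tried is shown below.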

df = df.sort_values(by='Date')
all_events_dates=df.groupby(df['Date'].dt.strftime('%B %Y'))['Count'].sum()
month_year = ['January 2013','February 2013','March 2013', 'April 2013','May 2013','June 2013','July 2013','August 2013','September 2013','October 2013','November 2013','December 2013','January 2014','February 2014','March 2014', 'April 2014','May 2014','June 2014','July 2014','August 2014','September 2014','October 2014','November 2014','December 2014','January 2015','February 2015','March 2015', 'April 2015','May 2015','June 2015','July 2015','August 2015','September 2015','October 2015','November 2015','December 2015']
all_events_dates.index = pd.CategoricalIndex(set(df['Date'].dt.strftime('%B %Y')), categories=month_year, ordered=True)
all_events_dates = all_events_dates.sort_index()
print(all_events_dates)

The output is as follows (some months omitted):

January 2013       2
February 2013      6
March 2013        12
April 2013        11
May 2013           2
June 2013          4
July 2013          2
September 2013     1
...
August 2015        3
September 2015     8
October 2015       7
November 2015      4
December 2015      2

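This is binned incorrectly (the per-month totals are wrong). When I opened a new project to try the same thing, this time sorting by location over time rather than by event, I used the following code: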

df = df.sort_values(by='Date')
all_loc_dates=df.groupby(df['Date'].dt.strftime('%B %Y'))['Count'].sum()
month_year = ['January 2013','February 2013','March 2013', 'April 2013','May 2013','June 2013','July 2013','August 2013','September 2013','October 2013','November 2013','December 2013','January 2014','February 2014','March 2014', 'April 2014','May 2014','June 2014','July 2014','August 2014','September 2014','October 2014','November 2014','December 2014','January 2015','February 2015','March 2015', 'April 2015','May 2015','June 2015','July 2015','August 2015','September 2015','October 2015','November 2015','December 2015']
all_loc_dates.index = pd.CategoricalIndex(set(df['Date'].dt.strftime('%B %Y')), categories=month_year, ordered=True)
all_loc_dates = all_loc_dates.sort_index()
print(all_loc_dates)

This appears to be the same code, but I get the following output:

January 2013       9
February 2013      1
March 2013         7
April 2013         7
May 2013           4
June 2013          9
July 2013          4
September 2013    10
October 2013       1
November 2013      8
December 2013      5
...
June 2015          4
July 2015          2
August 2015       12
September 2015     4
October 2015       3
November 2015      2
December 2015      2

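The df is the same in both cases (just in different projects). I am not sure why the per-month totals differ between the two runs and are incorrect in both.

I found that the problem seems to be in the following lines of code: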

all_loc_dates.index = pd.CategoricalIndex(set(df['Date'].dt.strftime('%B %Y')), categories=month_year, ordered=True)
all_loc_dates = all_loc_dates.sort_index()

Excluding these lines gives the correct values, although in the wrong order. The output is as follows:

April 2013         4
April 2014         4
April 2015         7
August 2014        5
August 2015        6
December 2013      2
December 2014      3
December 2015     10
February 2013      4
February 2014      4
February 2015      3
January 2013       7
January 2014       8
January 2015       5
...
September 2013     2
September 2014     1
September 2015     3

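These appear to be the correct values, but of course they are now sorted alphabetically rather than chronologically. I am not sure why pd.CategoricalIndex changes the values. Also, how can I sort the series by date without using it?

Tags: python, pandas, pandas-groupby

Solution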


You need to change your month names to month numbers, e.g. Jan 2013 to 01 2013. Then sort on that, and then change it back to the month name and year format.

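# Parse strings such as 'Jan 2013' into datetimes (use '%B %Y' for full month names)
df['date value'] = pd.to_datetime(df['date value'], format='%b %Y')

# Sort chronologically on the parsed dates
df = df.sort_values('date value', ascending=True)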
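
As for why the values changed in the question: a Python set has no guaranteed order, while the groupby result is ordered by its (alphabetically sorted) string keys. Assigning labels built from set(...) as the new index therefore attaches month names to sums they do not belong to, and the mismatch can differ from run to run. One way to avoid relabeling altogether is to group on a monthly period, which sorts chronologically by itself, and only format the labels at the end. The following is a minimal sketch under that assumption; the sample df is made up for illustration, and only the 'Date' and 'Count' column names come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's df
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-02-10", "2013-01-05", "2014-01-20", "2013-01-25", "2013-04-02",
    ]),
    "Count": [6, 2, 8, 5, 4],
})

# Group on a monthly Period instead of a formatted string:
# the keys are real time values, so the result is in chronological order
monthly = df.groupby(df["Date"].dt.to_period("M"))["Count"].sum()

# Turn the keys into 'Month Year' labels only for display, after sorting
monthly.index = monthly.index.strftime("%B %Y")

print(monthly)
```

If you would rather keep the categorical approach, building the index from the grouped series itself, i.e. pd.CategoricalIndex(all_events_dates.index, categories=month_year, ordered=True), should keep each label attached to its own sum before sort_index() is called.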
