首页 > 解决方案 > 获取熊猫数据框中特定列的累积最频繁状态

问题描述

我有一个数据框:

# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']

# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)

df 

ID  election_date   status
0   2   2010-08-31    b
1   2   2010-07-31    d
2   2   2010-06-30    d
3   2   2010-05-31    a
4   1   2010-04-30    c
5   1   2010-03-31    b
6   1   2010-02-28    a
7   1   2010-01-31    b

我想获得每个列的累积最频繁状态。这是我所期望的:statusID

    ID  election_date   status  cum_most_freq_status
0   2   2010-08-31        b          d
1   2   2010-07-31        d          d
2   2   2010-06-30        d          a
3   2   2010-05-31        a          NaN
4   1   2010-04-30        c          b
5   1   2010-03-31        b          a 
6   1   2010-02-28        a          b
7   1   2010-01-31        b          NaN

解释:

你会怎么做?

标签: pandasdataframenumpycountcumulative-sum

解决方案


您可以先创建一个 DataFrame,并使用IDelection_date作为其索引,以及 one-hot-encodedstatus值,然后计算cumsum.

cumsum如果计数相同,我们想选择最近的状态,所以我要为当前状态添加一个小数字(小于 1) ,所以当我们应用idxmax它时,它将选择最近的状态,以防万一一个领带。

在找到最频繁的累积状态后,idxmax我们可以merge使用原始 DataFrame:

# make one-hot-encoded status dataframe
z = (df
         .groupby(['ID', 'election_date', 'status'])
         .size().unstack().fillna(0))

# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)

# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()

# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())

输出:

   ID election_date status cum_most_freq_status
0   2    2010-08-31      b                    d
1   2    2010-07-31      d                    d
2   2    2010-06-30      d                    a
3   2    2010-05-31      a                  NaN
4   1    2010-04-30      c                    b
5   1    2010-03-31      b                    a
6   1    2010-02-28      a                    b
7   1    2010-01-31      b                  NaN

推荐阅读