Python pandas: adding a sequence along two conditional variables

Problem description

In R, you can easily add a sequence along two (or even more) conditional variables using ave(), as shown below:

# create a dataframe
dat = data.frame(
    FactorA = c(rep('a1', 10), rep('a2', 10)),
    FactorB = c(rep('b1', 5), rep('b2', 5), rep('b1', 5), rep('b2', 5)),
    DependentVar = rnorm(20)
)

# add ordering given combination of two factors
dat$Order <- ave(dat$DependentVar, dat$FactorA, dat$FactorB,
    FUN=seq_along)

What is the analogue in Python's pandas?


Added on June 22, 2020:

Furthermore, suppose you interleave the levels of FactorA and FactorB by "shuffling" them, e.g.:

# a slightly "shuffled" dataframe
dat2 = data.frame(
    FactorA = c(rep('a1', 6), rep('a2', 6),
                rep('a1', 4), rep('a2', 4)),
    FactorB = c(rep('b1', 3), rep('b2', 3), rep('b1', 3), rep('b2', 3),
                rep('b1', 2), rep('b2', 2), rep('b1', 2), rep('b2', 2)),
    DependentVar = rnorm(20)
)

ave() will keep continuing the numbering within each combination, even across the separated blocks:

dat2$Order <- ave(dat2$DependentVar, dat2$FactorA, dat2$FactorB,
    FUN=seq_along)
dat2
   FactorA FactorB DependentVar Order
1       a1      b1    1.3814360     1
2       a1      b1    1.0702582     2
3       a1      b1   -1.1974390     3
4       a1      b2   -1.1687711     1
5       a1      b2   -0.7584645     2
6       a1      b2   -0.5541912     3
7       a2      b1   -0.3083331     1
8       a2      b1    0.7707984     2
9       a2      b1    2.4709730     3
10      a2      b2    0.1768273     1
11      a2      b2    0.5687605     2
12      a2      b2    0.7360105     3
13      a1      b1    0.9253223     4
14      a1      b1   -0.3190011     5
15      a1      b2   -0.2657454     4
16      a1      b2   -0.1617810     5
17      a2      b1    0.9634501     4
18      a2      b1   -0.6749173     5
19      a2      b2    0.8138765     4
20      a2      b2   -1.1075720     5

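Can Python (1) mark which "appearance" of the combination each row belongs to, and (2) reset the ordering at each new appearance, as shown below?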

   FactorA FactorB DependentVar Order OrderReset WhichAppearance
1       a1      b1    1.3814360     1          1               1
2       a1      b1    1.0702582     2          2               1
3       a1      b1   -1.1974390     3          3               1
4       a1      b2   -1.1687711     1          1               1
5       a1      b2   -0.7584645     2          2               1
6       a1      b2   -0.5541912     3          3               1
7       a2      b1   -0.3083331     1          1               1
8       a2      b1    0.7707984     2          2               1
9       a2      b1    2.4709730     3          3               1
10      a2      b2    0.1768273     1          1               1
11      a2      b2    0.5687605     2          2               1
12      a2      b2    0.7360105     3          3               1
13      a1      b1    0.9253223     4          1               2
14      a1      b1   -0.3190011     5          2               2
15      a1      b2   -0.2657454     4          1               2
16      a1      b2   -0.1617810     5          2               2
17      a2      b1    0.9634501     4          1               2
18      a2      b1   -0.6749173     5          2               2
19      a2      b2    0.8138765     4          1               2
20      a2      b2   -1.1075720     5          2               2

Tags: python, pandas

Solution


In Python with pandas, you can do this:

df_data['Order'] = df_data.groupby(['FactorA', 'FactorB']).cumcount() + 1

MCVE (minimal, complete, verifiable example):

import pandas as pd
from io import StringIO
dat_text = StringIO("""   FactorA  FactorB  DependentVar
1       a1      b1   -1.1435908
2       a1      b1   -0.5799404
3       a1      b1    0.0680380
4       a1      b1    0.1143230
5       a1      b1    0.7673287
6       a1      b2    1.4769585
7       a1      b2   -1.3399984
8       a1      b2   -0.4832071
9       a1      b2   -2.3764355
10      a1      b2    0.2668480
11      a2      b1   -0.7376859
12      a2      b1   -0.4141878
13      a2      b1   -0.5159797
14      a2      b1   -1.3888258
15      a2      b1    0.1497270
16      a2      b2    0.1803052
17      a2      b2    0.8547880
18      a2      b2    0.2372080
19      a2      b2    0.3139455
20      a2      b2    0.7266356""")

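# Split on runs of two or more whitespace characters; the raw string avoids an invalid-escape warning
df_data = pd.read_csv(dat_text, sep=r'\s\s+', engine='python')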

print(df_data)

Output:

   FactorA FactorB  DependentVar
1       a1      b1     -1.143591
2       a1      b1     -0.579940
3       a1      b1      0.068038
4       a1      b1      0.114323
5       a1      b1      0.767329
6       a1      b2      1.476958
7       a1      b2     -1.339998
8       a1      b2     -0.483207
9       a1      b2     -2.376435
10      a1      b2      0.266848
11      a2      b1     -0.737686
12      a2      b1     -0.414188
13      a2      b1     -0.515980
14      a2      b1     -1.388826
15      a2      b1      0.149727
16      a2      b2      0.180305
17      a2      b2      0.854788
18      a2      b2      0.237208
19      a2      b2      0.313945
20      a2      b2      0.726636

Use groupby together with cumcount:

df_data['Order'] = df_data.groupby(['FactorA', 'FactorB']).cumcount() + 1

print(df_data)

Output:

   FactorA FactorB  DependentVar  Order
1       a1      b1     -1.143591      1
2       a1      b1     -0.579940      2
3       a1      b1      0.068038      3
4       a1      b1      0.114323      4
5       a1      b1      0.767329      5
6       a1      b2      1.476958      1
7       a1      b2     -1.339998      2
8       a1      b2     -0.483207      3
9       a1      b2     -2.376435      4
10      a1      b2      0.266848      5
11      a2      b1     -0.737686      1
12      a2      b1     -0.414188      2
13      a2      b1     -0.515980      3
14      a2      b1     -1.388826      4
15      a2      b1      0.149727      5
16      a2      b2      0.180305      1
17      a2      b2      0.854788      2
18      a2      b2      0.237208      3
19      a2      b2      0.313945      4
20      a2      b2      0.726636      5
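
To illustrate why this matches ave(..., FUN=seq_along): cumcount() numbers the rows of each (FactorA, FactorB) group in order of appearance, starting from 0, whether or not the group's rows sit next to each other. The following is a minimal sketch on made-up data; the frame name demo and its values are illustrative assumptions, not taken from the question.

```python
import pandas as pd

# Hypothetical toy data: the (a1, b1) combination appears in two separate runs.
demo = pd.DataFrame({
    "FactorA": ["a1", "a1", "a1", "a2", "a1", "a2"],
    "FactorB": ["b1", "b1", "b2", "b1", "b1", "b1"],
})

# cumcount() is 0-based, so add 1 to mimic R's 1-based seq_along.
demo["Order"] = demo.groupby(["FactorA", "FactorB"]).cumcount() + 1

print(demo)
```

The fifth row continues the (a1, b1) count at 3 even though other groups come in between, which is the same behaviour ave() shows on the shuffled data in the question.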

Update to answer the "Added on June 22, 2020" part:

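#Here df is assumed to be the shuffled dataframe from the question, with 'Order' already added
#Let's create a helper column that starts a new group whenever FactorA or FactorB changes from the previous row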
df['newgroup'] = (df[['FactorA', 'FactorB']] != df[['FactorA', 'FactorB']].shift()).any(axis=1).cumsum()

#Use cumcount to count rows in groups
df['Order Reset'] = df.groupby('newgroup').cumcount() + 1

#Use factorize to count appearances of groups
df['Appearance'] = df.groupby(['FactorA', 'FactorB'])['newgroup'].transform(lambda x: x.factorize()[0]+1)

df

Output:

   FactorA FactorB  DependentVar  Order  newgroup       Order Reset  Appearance
1       a1      b1      1.381436      1         1                 1           1
2       a1      b1      1.070258      2         1                 2           1
3       a1      b1     -1.197439      3         1                 3           1
4       a1      b2     -1.168771      1         2                 1           1
5       a1      b2     -0.758465      2         2                 2           1
6       a1      b2     -0.554191      3         2                 3           1
7       a2      b1     -0.308333      1         3                 1           1
8       a2      b1      0.770798      2         3                 2           1
9       a2      b1      2.470973      3         3                 3           1
10      a2      b2      0.176827      1         4                 1           1
11      a2      b2      0.568761      2         4                 2           1
12      a2      b2      0.736010      3         4                 3           1
13      a1      b1      0.925322      4         5                 1           2
14      a1      b1     -0.319001      5         5                 2           2
15      a1      b2     -0.265745      4         6                 1           2
16      a1      b2     -0.161781      5         6                 2           2
17      a2      b1      0.963450      4         7                 1           2
18      a2      b1     -0.674917      5         7                 2           2
19      a2      b2      0.813877      4         8                 1           2
20      a2      b2     -1.107572      5         8                 2           2
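
The update above can be reproduced end to end with the sketch below. It rebuilds only the factor layout of the shuffled frame from the question (the random DependentVar values are left out), and it uses the column names OrderReset and WhichAppearance from the question's desired output. As a variation on the factorize lambda, the appearance counter here is computed with a dense rank of newgroup within each combination; this gives the same numbers only because newgroup increases from top to bottom.

```python
import pandas as pd

# Factor layout of the shuffled frame in the question (values column omitted).
df = pd.DataFrame({
    "FactorA": ["a1"] * 6 + ["a2"] * 6 + ["a1"] * 4 + ["a2"] * 4,
    "FactorB": (["b1"] * 3 + ["b2"] * 3) * 2 + (["b1"] * 2 + ["b2"] * 2) * 2,
})

keys = df[["FactorA", "FactorB"]]

# Running count within each combination, like ave(..., FUN=seq_along).
df["Order"] = df.groupby(["FactorA", "FactorB"]).cumcount() + 1

# A new block starts wherever either factor differs from the previous row.
df["newgroup"] = keys.ne(keys.shift()).any(axis=1).cumsum()

# Count rows inside each contiguous block.
df["OrderReset"] = df.groupby("newgroup").cumcount() + 1

# Dense rank of the block id within each combination: 1 for the first
# appearance of that combination, 2 for the second, and so on.
df["WhichAppearance"] = (
    df.groupby(["FactorA", "FactorB"])["newgroup"]
    .rank(method="dense")
    .astype(int)
)

print(df)
```

The helper column newgroup can be dropped afterwards with df.drop(columns="newgroup") if it is not needed in the result.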
