Pandas DataFrame with a concatenated column

Problem description

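I have a Pandas DataFrame like the one built by the code below. I need to add a dynamic column that, for each row, concatenates every value in the sequence up to and including that row. A loop sounds like the logical solution, but it is very inefficient on very large DataFrames (1M+ rows).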

import pandas as pd

user_id=[1,1,1,1,2,2,2,3,3,3,3,3]
variable=["A","B","C","D","A","B","C","A","B","C","D","E"]
sequence=[0,1,2,3,0,1,2,0,1,2,3,4]
df=pd.DataFrame(list(zip(user_id,variable,sequence)),columns=['User_ID', 'Variables','Seq'])

# Desired result: this column needs to be generated dynamically
df['dynamic_column']=["A","AB","ABC","ABCD","A","AB","ABC","A","AB","ABC","ABCD","ABCDE"]

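I need to be able to build the dynamic column efficiently, based on user_id and the sequence number. I have experimented with the pandas shift function, but that only ends up requiring a loop. I am looking for a simple, efficient way to create the dynamically concatenated column.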

Tags: python, pandas

Solution


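This is a cumsum (cumulative sum) applied per user: on strings, a cumulative sum is a running concatenation.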

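# group_keys=False keeps the original row index, so the result aligns with df on assignment
df['dynamic_column'] = df.groupby('User_ID', group_keys=False).Variables.apply(lambda x: x.cumsum())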

Output:

0         A
1        AB
2       ABC
3      ABCD
4         A
5        AB
6       ABC
7         A
8        AB
9       ABC
10     ABCD
11    ABCDE
Name: Variables, dtype: object
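
The answer above relies on the rows already being in Seq order within each user, and on cumsum accepting string values. As a hedged alternative, here is a minimal sketch that sorts by Seq explicitly and builds the running concatenation with itertools.accumulate from the standard library. The helper name running_concat is hypothetical and not part of pandas, and the sample data is the one from the question.

```python
from itertools import accumulate

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "User_ID":   [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "Variables": ["A", "B", "C", "D", "A", "B", "C", "A", "B", "C", "D", "E"],
    "Seq":       [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4],
})


def running_concat(values):
    # Hypothetical helper: accumulate() with its default operator (+)
    # yields running totals, which for strings is a running concatenation.
    return pd.Series(list(accumulate(values)), index=values.index)


# Sort so the concatenation follows Seq inside each user. The original
# index is preserved, so the assignment aligns back to the rows of df.
ordered = df.sort_values(["User_ID", "Seq"])
df["dynamic_column"] = ordered.groupby("User_ID")["Variables"].transform(running_concat)

print(df)
```

This still calls a Python function once per user, so it is not fully vectorized, but it avoids an explicit row-by-row loop over the DataFrame.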
