Using multiple functions with pandas transform

Question

I have a dataset that looks like this:

   entity_id transaction_date transaction_month  net_flow    inflow   outflow
0         51       2018-07-02        2018-07-01  10161.06  20161.06  10000.00
1         51       2018-07-03        2018-07-01   5823.73   5867.37     43.64
2         51       2018-07-05        2018-07-01  17835.79  24107.29   6271.50
3         51       2018-07-06        2018-07-01  -3544.72  31782.84  35327.56
4         51       2018-07-09        2018-07-01  18252.42  18332.42     80.00

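I am trying to compute rolling metrics across entity_id using rolling and transform. I have several variables to create and would like to compute them in a single call.

For example, if I were creating these measures with agg, I would do something like this: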

transactions = (
    raw_transactions
    .groupby(['entity_id','transaction_month'])[['inflow','outflow']]
    .agg([
        'sum','skew',
        ( 'coef_var', lambda x: x.std() / x.mean() ),
        ( 'kurtosis', lambda x: x.kurtosis() )
        ])
    .reset_index()
)

However, I can't do the same with transform. When I try to pass the functions as a dict or a list, I get a TypeError because a list or dict is not hashable.

>>> transactions.groupby(['entity_id'])[['inflow','outflow']].transform(['skew','mean'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-4ef49d836b3f> in <module>
----> 1 transactions.groupby(['entity_id'])[['inflow','outflow']].transform(['skew','mean'])

/jupyter/packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
   1354 
   1355         # optimized transforms
-> 1356         func = self._get_cython_func(func) or func
   1357 
   1358         if not isinstance(func, str):

/jupyter/packages/pandas/core/base.py in _get_cython_func(self, arg)
    335         if we define an internal function for this argument, return it
    336         """
--> 337         return self._cython_table.get(arg)
    338 
    339     def _is_builtin_func(self, arg):

TypeError: unhashable type: 'list'

Tags: python, pandas

Answer


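I don't think this is possible with transform. You have (at least) two workarounds. The first is to merge the result of groupby.agg back onto the original dataframe: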

tmp_ = (
    raw_transactions
    .groupby(['entity_id','transaction_month'])[['inflow','outflow']]
    .agg([
        'sum','skew',
        ( 'coef_var', lambda x: x.std() / x.mean() ),
        ( 'kurtosis', lambda x: x.kurtosis() )
        ]) #no reset_index here
)
# need to flatten multiindex columns
tmp_.columns = ['_'.join(cols) for cols in tmp_.columns] 

# then merge with original dataframe
res = raw_transactions.merge(tmp_, on=['entity_id','transaction_month'])
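As a minimal, self-contained sketch of the merge approach: the frame below is hypothetical (column names follow the question, values are partly invented), the names `keys` and `stats` are illustrative, and only `sum` and `mean` are used to keep it short.

```python
import pandas as pd

# Hypothetical sample data: column names follow the question, values are illustrative.
raw_transactions = pd.DataFrame({
    "entity_id": [51, 51, 51, 52, 52],
    "transaction_month": ["2018-07-01"] * 3 + ["2018-08-01"] * 2,
    "inflow": [20161.06, 5867.37, 24107.29, 100.0, 300.0],
    "outflow": [10000.00, 43.64, 6271.50, 50.0, 150.0],
})

keys = ["entity_id", "transaction_month"]

# Aggregate per group: one row per group, MultiIndex columns such as ('inflow', 'sum').
stats = raw_transactions.groupby(keys)[["inflow", "outflow"]].agg(["sum", "mean"])

# Flatten ('inflow', 'sum') -> 'inflow_sum' so the columns merge cleanly.
stats.columns = ["_".join(cols) for cols in stats.columns]

# Broadcast the group-level values back onto every original row.
res = raw_transactions.merge(stats.reset_index(), on=keys, how="left")
```

The result keeps one row per original transaction, which is the shape transform would have produced, with every row of a group carrying that group's aggregated values.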

Or use a list comprehension to call transform once per function, then concat the results with the original data:

# group once
gr = raw_transactions.groupby(['entity_id'])[['inflow','outflow']]

# concat each dataframe of transformed values with the original data
res = pd.concat([raw_transactions] + 
                [gr.transform(func) 
                 for func in ('skew', 'mean', lambda x: x.std() / x.mean() )], 
                axis=1, keys=('', 'skew', 'mean', 'coef_var'))

You can then tidy up the column names.
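One possible way to handle the column names, as a sketch: the sample frame, the `funcs` dict and the `column_key` naming scheme below are assumptions for illustration, not part of the original answer. Because concat with `keys` produces MultiIndex columns of the form `(key, column)`, they can be flattened with a list comprehension.

```python
import pandas as pd

# Hypothetical sample data with round numbers so the results are easy to check.
raw_transactions = pd.DataFrame({
    "entity_id": [51, 51, 51, 52, 52, 52],
    "inflow": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0],
    "outflow": [5.0, 5.0, 20.0, 2.0, 4.0, 6.0],
})

# Group once and reuse the groupby object for every function.
gr = raw_transactions.groupby("entity_id")[["inflow", "outflow"]]

# Map an output label to each function (labels are illustrative).
funcs = {"mean": "mean", "coef_var": lambda x: x.std() / x.mean()}

# One transform per function; keys label each block in the column MultiIndex.
res = pd.concat(
    [raw_transactions] + [gr.transform(f) for f in funcs.values()],
    axis=1,
    keys=["", *funcs],
)

# Flatten ('mean', 'inflow') -> 'inflow_mean'; original columns keep their names.
res.columns = [f"{col}_{key}" if key else col for key, col in res.columns]
```

Keeping the labels and functions together in one dict avoids the keys and the function list drifting out of sync when a metric is added or removed.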

