How to apply a function to multiple columns while iterating over each row

Problem description

Data

I have a dataset showing the latest booking data grouped by company and month (empty values are NaN):

company    month  year_ly  bookings_ly  year_ty  bookings_ty 
company a  1      2018     432          2019     253         
company a  2      2018     265          2019     635         
company a  3      2018     345          2019     525         
company a  4      2018     233          2019              
company a  5      2018     7664         2019             
...        ...    ...      ...          ...      ...         
company a  12     2018     224          2019     321         
company b  1      2018     543          2019     576        
company b  2      2018     23           2019     43          
company b  3      2018     64           2019     156
company b  4      2018     143          2019     
company b  5      2018     41           2019              
company b  6      2018     90           2019     
...        ...    ...      ...          ...      ...             

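What I want

I want to create a new column, or update the bookings_ty column wherever its value is NaN (whichever is easier), by applying the following calculation to each row (grouped by company):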

((SUM of previous 3 rows (or months) of bookings_ty) /(SUM of previous 3 rows (or months) of bookings_ly)) * bookings_ly

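If a row's bookings_ty is NaN, I want the next iteration of the formula to use the newly calculated value as part of bookings_ty. In other words, the formula should fill in the NaN values in bookings_ty one after another, with each filled value feeding into the rows that follow.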
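To make the recursion concrete, here is a minimal pure-Python sketch for a single company, using the company a values for months 1 to 5 from the table above. The helper name `fill_forward` is hypothetical, and the sketch assumes the rows are already in month order and that the first three rows are known.

```python
import math

def fill_forward(ty, ly, window=3):
    """Fill NaN entries of ty in order, so later rows can reuse earlier fills."""
    filled = list(ty)
    for i, value in enumerate(filled):
        if math.isnan(value) and i >= window:
            # ratio of this year's to last year's bookings over the previous rows
            ratio = sum(filled[i - window:i]) / sum(ly[i - window:i])
            filled[i] = ratio * ly[i]
    return filled

# company a, months 1-5 from the table above
ly = [432, 265, 345, 233, 7664]
ty = [253, 635, 525, float("nan"), float("nan")]

filled = fill_forward(ty, ly)
print([round(v, 2) for v in filled])
```

Month 4 is computed from months 1 to 3, and month 5 then uses the freshly filled month 4 value as one of its three previous rows.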

My attempt

import numpy as np

df_bkgs.set_index(['company', 'month'], inplace=True)

def calc(df_bkgs):
    df_bkgs['bookings_calc'] = df_bkgs['bookings_ty'].copy()
    # Sum of the previous 3 rows within each company
    df_bkgs['bookings_ty_l3m'] = df_bkgs.groupby(level=0)['bookings_ty'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3) )
    df_bkgs['bookings_ly_l3m'] = df_bkgs.groupby(level=0)['bookings_ly'].transform(lambda x: x.shift(1) + x.shift(2) + x.shift(3) )
    df_bkgs['bookings_factor'] = df_bkgs['bookings_ty_l3m']/df_bkgs['bookings_ly_l3m']
    df_bkgs['bookings_calc'] = df_bkgs['bookings_factor'] * df_bkgs['bookings_ly']
    return df_bkgs

df_bkgs = df_bkgs.groupby(level=0, group_keys=False).apply(calc)

# Keep the real bookings_ty where it exists, otherwise use the calculated value
df_bkgs['bookings_calc'] = np.where(df_bkgs['bookings_ty'].isna(), df_bkgs['bookings_calc'], df_bkgs['bookings_ty'])

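The problem with this code is that it only produces a calculated value for the first empty/NaN bookings_ty in each group. What I want is an iterative or loop-like process that takes the previous 3 rows in the group and, where bookings_ty is empty/NaN, uses that row's calculated value instead.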
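The behaviour described above can be reproduced on a small frame. The sketch below uses the company a values for months 1 to 5 and the same shift-based sum as the attempt; the helper name `prev3` is hypothetical. Because the sums are computed once from the original column, any window that contains an unfilled NaN is itself NaN.

```python
import numpy as np
import pandas as pd

# company a, months 1-5 from the question; NaN marks missing bookings_ty
df = pd.DataFrame({
    "bookings_ly": [432, 265, 345, 233, 7664],
    "bookings_ty": [253, 635, 525, np.nan, np.nan],
})

def prev3(s):
    """Sum of the 3 previous rows, NaN if any of them is missing."""
    return s.shift(1) + s.shift(2) + s.shift(3)

# Computed in one vectorized pass from the ORIGINAL columns
calc = prev3(df["bookings_ty"]) / prev3(df["bookings_ly"]) * df["bookings_ly"]
print(calc)
# Row 3 gets a value; row 4 stays NaN because its window still
# contains the original NaN from row 3.
```

Filling row 4 requires a second pass (or a row-by-row loop) that sees the value already written to row 3.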

Thanks

Tags: python, pandas

Solution

Here is a solution:

import numpy as np
import pandas as pd

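# Sort the rows first if they are not already in order
df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x):
    # Each pass fills every NaN whose 3 previous rows are already known;
    # repeat until nothing is missing. This assumes each company starts with
    # at least 3 known bookings_ty values, otherwise the loop never ends.
    while x['bookings_ty'].isnull().any():
        x['bookings_ty'] = np.where((x['bookings_ty'].isnull()),
                                    (x['bookings_ty'].shift(1) +
                                     x['bookings_ty'].shift(2) +
                                     x['bookings_ty'].shift(3)) /
                                    (x['bookings_ly'].shift(1) +
                                     x['bookings_ly'].shift(2) +
                                     x['bookings_ly'].shift(3)) *
                                    x['bookings_ly'], x['bookings_ty'])
    return x

# group_keys=False keeps the original row index in the result
df = df.groupby(['company'], group_keys=False).apply(process)

# Convert to int64 if needed (this truncates the decimals), or keep the float values
df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)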

Initial DataFrame:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          525
3   company_a      4     2018          233     2019          NaN
4   company_a      5     2018         7664     2019          NaN
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          NaN
10  company_b      5     2018           41     2019          NaN
11  company_b      6     2018           90     2019          NaN

Result:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty   
0   company_a      1     2018          432     2019          253   
1   company_a      2     2018          265     2019          635   
2   company_a      3     2018          345     2019          525   
3   company_a      4     2018          233     2019          315 **
4   company_a      5     2018         7664     2019        13418 **
5   company_a     12     2018          224     2019          321   
6   company_b      1     2018          543     2019          576   
7   company_b      2     2018           23     2019           43   
8   company_b      3     2018           64     2019          156   
9   company_b      4     2018          143     2019          175 **
10  company_b      5     2018           41     2019           66 **
11  company_b      6     2018           90     2019          144 **
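The while loop above relies on every company having three known bookings_ty values before its first NaN. If a group starts with a NaN, no pass can ever fill it and the loop would not terminate. The following is a sketch of one possible guard, not part of the original answer: the function name `process_guarded` and the sample group are hypothetical, and the pass itself is the same shift-based estimate.

```python
import numpy as np
import pandas as pd

def process_guarded(x):
    """Same pass as above, but stop when a pass fills nothing new."""
    x = x.copy()
    while x["bookings_ty"].isnull().any():
        missing_before = x["bookings_ty"].isnull().sum()
        ty, ly = x["bookings_ty"], x["bookings_ly"]
        estimate = ((ty.shift(1) + ty.shift(2) + ty.shift(3))
                    / (ly.shift(1) + ly.shift(2) + ly.shift(3)) * ly)
        # only NaN positions are replaced; known values are kept
        x["bookings_ty"] = ty.fillna(estimate)
        if x["bookings_ty"].isnull().sum() == missing_before:
            break  # no progress: remaining NaN rows lack 3 known predecessors
    return x

# Hypothetical group whose first row is NaN, so nothing can be filled
group = pd.DataFrame({
    "bookings_ly": [432, 265, 345, 233, 7664],
    "bookings_ty": [np.nan, 635, 525, np.nan, np.nan],
})
result = process_guarded(group)
print(result)
```

Rows that cannot be estimated stay NaN, so a later `astype(np.int64)` would need those rows handled first.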

If you want a different rolling window, or if NaN values may appear near the start of a company's rows, you can use this more general solution:

df = df.sort_values(['company', 'year_ty', 'month']).reset_index(drop=True)

def process(x, m):
    # Index labels of the rows to fill, in order
    idx = x.loc[x['bookings_ty'].isnull()].index.to_list()
    for i in idx:
        # Position of row i within its group (the index is contiguous after reset_index)
        pos = i - x.index[0]
        # Use up to m previous rows; fewer if the NaN is near the start of the group
        start = 0 if pos < m else pos - m
        sum_ty = sum(x['bookings_ty'].to_list()[start:pos])
        sum_ly = sum(x['bookings_ly'].to_list()[start:pos])
        ly = x.at[i, 'bookings_ly']
        x.at[i, 'bookings_ty'] = sum_ty / sum_ly * ly
    return x

rolling_month = 3
df = df.groupby(['company'], group_keys=False).apply(lambda x: process(x, rolling_month))

df['bookings_ty'] = df['bookings_ty'].astype(np.int64)
print(df)

Initial DataFrame:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019        253.0
1   company_a      2     2018          265     2019        635.0
2   company_a      3     2018          345     2019          NaN
3   company_a      4     2018          233     2019          NaN
4   company_a      5     2018         7664     2019          NaN
5   company_a     12     2018          224     2019        321.0
6   company_b      1     2018          543     2019        576.0
7   company_b      2     2018           23     2019         43.0
8   company_b      3     2018           64     2019        156.0
9   company_b      4     2018          143     2019          NaN
10  company_b      5     2018           41     2019          NaN
11  company_b      6     2018           90     2019          NaN     

Final result:

      company  month  year_ly  bookings_ly  year_ty  bookings_ty
0   company_a      1     2018          432     2019          253
1   company_a      2     2018          265     2019          635
2   company_a      3     2018          345     2019          439  ** computed from only the 2 previous rows
3   company_a      4     2018          233     2019          296  **
4   company_a      5     2018         7664     2019        12467  **
5   company_a     12     2018          224     2019          321
6   company_b      1     2018          543     2019          576
7   company_b      2     2018           23     2019           43
8   company_b      3     2018           64     2019          156
9   company_b      4     2018          143     2019          175 **
10  company_b      5     2018           41     2019           66 **
11  company_b      6     2018           90     2019          144 **
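The general solution depends on the index being contiguous after `reset_index` and on how `groupby(...).apply` returns its result. As an alternative sketch under the same formula, the rows can be filled by position with NumPy arrays and an explicit loop over the groups. The function name `fill_bookings` is hypothetical, and the sample frame reuses the company a rows from the table above. A NaN in the very first row of a company has no previous rows and is left as NaN here.

```python
import numpy as np
import pandas as pd

def fill_bookings(df, window=3):
    """Fill NaN bookings_ty per company from up to `window` previous rows."""
    out = df.sort_values(["company", "year_ty", "month"]).reset_index(drop=True)
    ty = out["bookings_ty"].to_numpy(dtype=float, copy=True)
    ly = out["bookings_ly"].to_numpy(dtype=float)
    # groupby(...).indices maps each company to the positions of its rows
    for positions in out.groupby("company").indices.values():
        for k, pos in enumerate(positions):
            if np.isnan(ty[pos]) and k > 0:
                prev = positions[max(0, k - window):k]
                # ty already holds earlier fills, so they feed into later rows
                ty[pos] = ty[prev].sum() / ly[prev].sum() * ly[pos]
    out["bookings_ty"] = ty
    return out

df = pd.DataFrame({
    "company": ["company_a"] * 6,
    "month": [1, 2, 3, 4, 5, 12],
    "year_ly": [2018] * 6,
    "bookings_ly": [432, 265, 345, 233, 7664, 224],
    "year_ty": [2019] * 6,
    "bookings_ty": [253, 635, np.nan, np.nan, np.nan, 321],
})
filled = fill_bookings(df)
print(filled["bookings_ty"].astype(np.int64).tolist())
```

Working on plain arrays avoids repeated `to_list()` conversions inside the loop, which may also help on larger frames.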

If you want to speed up the process, you can try:

df.set_index(['company'], inplace=True)
df = df.groupby(level=(0)).apply(lambda x: process(x))

instead of

df = df.groupby(['company']).apply(lambda x: process(x))
