How best to perform recursion on a pandas DataFrame column

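Problem description

I am trying to calculate an index value for a time series in a pandas DataFrame. After the first row, each index value depends on the result from the previous row. I tried to do this recursively while iterating over the DataFrame rows, but only the first two rows of the calculation come out correct; the third and subsequent rows are inaccurate.

I think this is because, after the initial value, one of the index calculations goes wrong, which then throws off every calculation that follows it.

What is causing this inaccuracy? Is there a better approach than the one I have taken?

Sample output is shown below: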


ticket_cat   Sector   Year       factor        Incorrect_index_value   correct_index_value   prev_row
Revenue      LSE      Jan 2004                 100.00                  100.00
Revenue      LSE      Jan 2005   4.323542894   104.3235                104.3235              100.00
Revenue      LSE      Jan 2006   3.096308080   98.823                  107.5537                         <-- incorrect row
Revenue      LSE      Jan 2007   6.211666      107.476                 114.2345                         <-- incorrect row
Revenue      LD       Jan 2004                 100.00                  100.0000
Revenue      LD       Jan 2005   3.5218        103.5218                103.5218
Revenue      LD       Jan 2006   2.7417        99.2464                 106.3602                         <-- incorrect row
Revenue      LD       Jan 2007   3.3506        104.1353                109.9239                         <-- incorrect row

My code snippet is below; stpassrev is the DataFrame:

#insert initial value for index
stpassrev['index_value'] = np.where(
       (stpassrev['Year'] == 'Jan 2004' ) & (stpassrev['Ticket_cat']=='Revenue'),
        100.00,np.nan )

#set up initial values for prev_row column
stpassrev['prev_row'] = np.where(
        #only have relevant row impacted
        (stpassrev['Year'] == 'Jan 2005' ) & (stpassrev['Ticket_cat']=='Revenue'),
        100.00,
        np.nan
        )

#calculate the index_value
for i in range(1,len(stpassrev)):
        stpassrev.loc[i,'index_value'] = np.where(
            (stpassrev.loc[i,'Ticket_cat']=='Revenue'  )  & (pd.isna(stpassrev.loc[i,'factor'])==False),
                ((100+stpassrev.loc[i,'factor'] ) /stpassrev.loc[i-1,'index_value'])*100,
                stpassrev.loc[i,'index_value'])

        stpassrev.loc[i,'prev_row'] = stpassrev.loc[i-1,'index_value']

Tags: python, pandas, recursion

Solution


Based on your updated question, this is all you need:

# assign a new temp_factor with initial values and prep for cumprod
stpassrev['temp_factor'] = np.where(stpassrev['factor'].isna(), 1, stpassrev['factor'].add(100).div(100))

# calculate the cumprod based on the temp_factor (grouped by Sector) and multiply by 100 for index_value
stpassrev['index_value'] = stpassrev.groupby('Sector')['temp_factor'].cumprod().mul(100)

Result:

  ticket_cat Sector      Year    factor  temp_factor  index_value
0    Revenue    LSE  Jan 2004       NaN     1.000000   100.000000
1    Revenue    LSE  Jan 2005  4.323543     1.043235   104.323543
2    Revenue    LSE  Jan 2006  3.096308     1.030963   107.553721
3    Revenue    LSE  Jan 2007  6.211666     1.062117   114.234599
4    Revenue     LD  Jan 2004       NaN     1.000000   100.000000
5    Revenue     LD  Jan 2005  3.521800     1.035218   103.521800
6    Revenue     LD  Jan 2006  2.741700     1.027417   106.360057
7    Revenue     LD  Jan 2007  3.350600     1.033506   109.923757
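
To reproduce the result above end to end, here is a self-contained sketch. The sample frame is rebuilt by hand from the table in the question (an assumption about how the real stpassrev looks), and the two assignments are the same ones shown in this answer.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the question's table (assumed layout of stpassrev).
# The first year of each sector has no factor.
stpassrev = pd.DataFrame({
    "ticket_cat": ["Revenue"] * 8,
    "Sector": ["LSE"] * 4 + ["LD"] * 4,
    "Year": ["Jan 2004", "Jan 2005", "Jan 2006", "Jan 2007"] * 2,
    "factor": [np.nan, 4.323542894, 3.096308080, 6.211666,
               np.nan, 3.5218, 2.7417, 3.3506],
})

# Turn each percentage factor into a growth multiplier; a missing factor
# becomes 1 so the first row of a sector leaves the running product unchanged.
stpassrev["temp_factor"] = np.where(
    stpassrev["factor"].isna(), 1, stpassrev["factor"].add(100).div(100)
)

# Running product within each sector, scaled so every sector starts at 100.
stpassrev["index_value"] = (
    stpassrev.groupby("Sector")["temp_factor"].cumprod().mul(100)
)

print(stpassrev)
```

Grouping by Sector is what restarts the index at 100 for each sector, so no explicit prev_row column is needed.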

If you need to round to 4 decimal places, add .round(4) after .mul(100):

stpassrev['index_value'] = stpassrev.groupby('Sector')['temp_factor'].cumprod().mul(100).round(4)

  ticket_cat Sector      Year    factor  temp_factor  index_value
0    Revenue    LSE  Jan 2004       NaN     1.000000     100.0000
1    Revenue    LSE  Jan 2005  4.323543     1.043235     104.3235
2    Revenue    LSE  Jan 2006  3.096308     1.030963     107.5537
3    Revenue    LSE  Jan 2007  6.211666     1.062117     114.2346
4    Revenue     LD  Jan 2004       NaN     1.000000     100.0000
5    Revenue     LD  Jan 2005  3.521800     1.035218     103.5218
6    Revenue     LD  Jan 2006  2.741700     1.027417     106.3601
7    Revenue     LD  Jan 2007  3.350600     1.033506     109.9238
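
As for what caused the inaccuracy: the incorrect values in the question are consistent with the loop dividing by the previous index value instead of multiplying by it. For Jan 2006 in LSE, (100 + 3.096308) / 104.3235 * 100 is roughly 98.82, which matches the incorrect column, while 104.3235 * (100 + 3.096308) / 100 is roughly 107.55, which matches the correct one. The second row only looks right because the previous value is exactly 100, where dividing and multiplying happen to give the same number. If an explicit loop is preferred, a minimal sketch of the corrected recurrence follows. The helper name index_by_loop is hypothetical, and it assumes rows are already ordered by sector and year, with a missing factor marking the start of each sector.

```python
import numpy as np
import pandas as pd


def index_by_loop(df, base=100.0):
    """Hypothetical helper: build the index row by row.

    Assumes rows are ordered by sector and year, and that a missing
    factor marks the first row of a sector (the index restarts at base).
    """
    values = []
    for factor in df["factor"]:
        if pd.isna(factor):
            # Start (or restart) the index at the base value.
            values.append(base)
        else:
            # Multiply the previous index value by the growth multiplier.
            previous = values[-1] if values else base
            values.append(previous * (100 + factor) / 100)
    return pd.Series(values, index=df.index, name="index_value")


# Sample data typed in from the question's table.
df = pd.DataFrame({
    "Sector": ["LSE"] * 4 + ["LD"] * 4,
    "factor": [np.nan, 4.323542894, 3.096308080, 6.211666,
               np.nan, 3.5218, 2.7417, 3.3506],
})

looped = index_by_loop(df)

# The vectorised version from this answer, for comparison.
multiplier = df["factor"].add(100).div(100).fillna(1)
vectorised = multiplier.groupby(df["Sector"]).cumprod().mul(100)

# The question's formula applied to the third row, dividing instead of multiplying.
divided = (100 + df.loc[2, "factor"]) / looped[1] * 100

print(looped.round(4).tolist())
print(np.allclose(looped, vectorised))
print(round(divided, 3))
```

The loop and the grouped cumprod compute the same recurrence; the vectorised form is usually the better choice on large frames because it avoids a Python-level pass over the rows.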
