首页 > 解决方案 > 根据另一列(熊猫)中的值计算一列中的天数差异

问题描述

我有一个像这样的熊猫 df(称为 df2):

id | orderdate |
___________________
123|2020-11-01 |
123|2020-08-01 |
233|2020-07-01 |
233|2020-11-04 |
444|2020-11-04 |
444|2020-05-03 |
444|2020-04-01 |
444|2020-11-25 |

orderdate 的值是日期时间,格式为 '%Y%m%d'。它们代表客户的订单。我想为每个 id(每个客户端)计算第一个订单和第二个订单之间的增量时间。

我想出:

for i in list(set(df2.id)):
  list_sorted=list(set((df2.loc[df2['id']==i, 'orderdate'] ))) 
  list_sorted= sorted(list_sorted) #get sorted list of the order dates in ascending order
  min_list= list_sorted[0] # first element is first order
  
  df2.loc[df2['id']==i, 'First Order']= min_list

  if len(list_sorted)>1:
    penultimate_list= list_sorted[1]
    df2.loc[df2['id']==i, 'Second Order']= penultimate_list # second element is second order


    df2.loc[df2['id']==i, 'Delta orders']= min_list - penultimate_list #calculate delta


    
  
  else:
    df2.loc[df2['id_user']==i, 'Delta orders']= None

我的预期结果是:

id | orderdate | First Order | Second Order| Delta Orders
______________________________________________
123|2020-11-01 |2020-08-01   | 2020-11-01  | 92 days
123|2020-08-01 |2020-08-01   | 2020-11-01  | 92 days
233|2020-07-01 |2020-07-01   | 2020-11-04  | 126 days
233|2020-11-04 |2020-07-01   | 2020-11-04  | 126 days
444|2020-11-04 |2020-04-01   | 2020-05-03  | 32 days
444|2020-05-03 |2020-04-01   | 2020-05-03  | 32 days
444|2020-04-01 |2020-04-01   | 2020-05-03  | 32 days
444|2020-11-25 |2020-04-01   | 2020-05-03  | 32 days

它有效,但我觉得它很麻烦。有什么更简单的方法吗?

标签: pythonpandaspython-datetime

解决方案


与您想要的略有不同,但这是一个开始:

import pandas as pd
from io import StringIO

data = StringIO(
"""id|orderdate
123|2020-11-01 
123|2020-08-01 
233|2020-07-01 
233|2020-11-04 
444|2020-11-04 
444|2020-05-03 
444|2020-04-01 
444|2020-11-25 """)

df = pd.read_csv(data, sep='|')

df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)

def date_diff(df):
    df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
    df = df.dropna()
    return df

# this calculates all order differences
df.groupby('id').apply(date_diff)

# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)

推荐阅读