Grouping multiple dates from a DataFrame

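Problem description

I'm having trouble doing the transformation below in Python. I was using Excel, but the table has become too large to calculate. The table consists of many rows of processes and their stages (20 stages), with a status and a completion date for each stage, like this: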

ID  Stage 1 Status  Stage 1 Completion   Stage 2 Status  Stage 2 Completion
1   Complete        2021-01-15 14:15:00  Lost            2021-02-28 13:48:00
2   Complete        2021-01-21 15:00:00  Complete        2021-01-21 15:00:00
3   Complete        2021-02-03 14:08:00  Open

My goal is to group the data by year and month and count how many processes are in each stage, like this:

Complete

        Stage 1  Stage 2
Jan-21  2        1
Feb-21  1        0

Open

        Stage 1  Stage 2
Jan-21  0        0
Feb-21  0        1

This is my first post here, so any feedback on how the question is asked is very welcome. Thanks in advance!

Tags: python, pandas, dataframe

Solution


Here is a Python/pandas workflow that should get you what you need. The comments in the code will guide you.

I suggest running it line by line in an interpreter and inspecting how the variables change to see what is happening.

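Note that there is usually more than one way to do things in pandas: sometimes it is only a matter of style, but pandas also keeps evolving and offering better features. Suggestions for improvement are welcome.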
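As one illustration of "more than one way": the code in this answer keys each month with a formatted string such as '01-21'. A possible alternative, sketched here on made-up timestamps (the `completed` series is hypothetical, not the asker's data), is to key on monthly periods via `dt.to_period('M')`, which sort chronologically even across years, and to format the labels only for display.

```python
import pandas as pd

# Hypothetical sample: completion timestamps that span two years
completed = pd.Series(pd.to_datetime([
    '2021-12-05 09:00:00',
    '2022-01-10 10:30:00',
    '2021-12-20 16:45:00',
]))

# Monthly periods sort chronologically, unlike 'MM-YY' strings
counts = completed.dt.to_period('M').value_counts().sort_index()

# Turn the periods into labels such as Dec-21 only at display time
counts.index = [p.strftime('%b-%y') for p in counts.index]
print(counts)
```

The month abbreviations produced by `%b` depend on the active locale, so the labels may differ on a non-English system.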

import numpy as np
import pandas as pd
from datetime import datetime

# create dataframe from example
df = pd.DataFrame(data={
  'Stage 1 Status': ['Complete'] * 3,
  'Stage 1 Completion': ['2021-01-15 14:15:00', '2021-01-21 15:00:00', '2021-02-03 14:08:00'],
  'Stage 2 Status': ['Lost', 'Complete', 'Open'],
  'Stage 2 Completion': ['2021-02-28 13:48:00', '2021-01-21 15:00:00', np.nan]
})

# convert string to a suitable date-time type
df['Stage 1 Completion'] = pd.to_datetime(df['Stage 1 Completion'])
df['Stage 2 Completion'] = pd.to_datetime(df['Stage 2 Completion'])

# === PART 1: get a basic result dataframe with correct groups and counts ===

# fill blanks (NaNs) in Stage 2 completion (presumably all from Open status) with the Stage 1 completion date
df['Stage 2 Completion'] = df['Stage 2 Completion'].fillna(df['Stage 1 Completion'])

# format completion times as 01-21, 02-21, etc.
df['Stage 1 Completion Formatted'] = df['Stage 1 Completion'].dt.strftime('%m-%y')
df['Stage 2 Completion Formatted'] = df['Stage 2 Completion'].dt.strftime('%m-%y')

# temporary dataframes for each stage with the counts
df1 = df.groupby(['Stage 1 Status', 'Stage 1 Completion Formatted'])['Stage 1 Completion'].count().rename('Stage 1 Count').to_frame()
df1.index.names = ['Status', 'Completion']
df2 = df.groupby(['Stage 2 Status', 'Stage 2 Completion Formatted'])['Stage 2 Completion'].count().rename('Stage 2 Count').to_frame()
df2.index.names = ['Status', 'Completion']

result_df = df1.join(df2, how='outer').fillna(0) # missing => zero count


# === PART 2: make the results look right ===

# first, split the results on Open vs Complete; ignore Lost for now
complete_df = result_df.loc[('Complete', slice(None),), :].reset_index(level=0, drop=True)
open_df = result_df.loc[('Open', slice(None),), :].reset_index(level=0, drop=True)

# next, find the union of MonthYears (i.e., strings like '02-21') across all status types and make that the index for each DF
# note: plain string sorting would put '01-22' before '12-21', so sort by the parsed date to stay chronological across years
t = sorted(result_df.index.get_level_values('Completion').unique(), key=lambda z: datetime.strptime(z, "%m-%y"))

# with the new common index, make sure that missing entries are represented with zero count
complete_df = complete_df.reindex(t).fillna(0.0)
open_df = open_df.reindex(t).fillna(0.0)

# counts are integers (preferably), not floats; both frames hold only the two count columns at this point
complete_df = complete_df.astype("Int32")
open_df = open_df.astype("Int32")

# next, create a column with date formatted as Jan-21, Feb-21, etc. and make that the index
complete_df['MonthYear'] = complete_df.index.map(lambda z: datetime.strptime(z, "%m-%y").strftime("%b-%y"))
complete_df = complete_df.reset_index(drop=True).set_index('MonthYear')
open_df['MonthYear'] = open_df.index.map(lambda z: datetime.strptime(z, "%m-%y").strftime("%b-%y"))
open_df = open_df.reset_index(drop=True).set_index('MonthYear')

print('=== Completed ===')
print(complete_df)

print('')

print('=== Open ===')
print(open_df)
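
The question mentions 20 stages, while the code above spells out two. A minimal sketch of one way to generalize, assuming the columns follow the 'Stage <k> Status' / 'Stage <k> Completion' naming used in the example and that a blank completion date inherits the previous stage's date (the same assumption made in PART 1). The function name `count_stages_by_month` and the `n_stages` parameter are illustrative, not part of the original answer.

```python
import pandas as pd

def count_stages_by_month(df, n_stages):
    """Count rows per (status, month) for each stage.

    Assumes columns named 'Stage <k> Status' and 'Stage <k> Completion',
    with the completion columns already converted to datetime.
    """
    parts = []
    previous = None
    for k in range(1, n_stages + 1):
        dates = df[f'Stage {k} Completion']
        if previous is not None:
            # Assumption: a blank date inherits the previous stage's date
            dates = dates.fillna(previous)
        previous = dates
        parts.append(pd.DataFrame({
            'Stage': k,  # an integer keeps stage 10 after stage 2 when sorted
            'Status': df[f'Stage {k} Status'],
            'Month': dates.dt.to_period('M'),
        }))
    long_df = pd.concat(parts, ignore_index=True)
    counts = (long_df.groupby(['Status', 'Month', 'Stage'])
                     .size()
                     .unstack('Stage', fill_value=0))
    # Give every status a row for every month that occurs anywhere
    full_index = pd.MultiIndex.from_product(
        [counts.index.levels[0], counts.index.levels[1]],
        names=['Status', 'Month'])
    return counts.reindex(full_index, fill_value=0)

# The three example rows from the question
example = pd.DataFrame({
    'Stage 1 Status': ['Complete'] * 3,
    'Stage 1 Completion': pd.to_datetime(
        ['2021-01-15 14:15:00', '2021-01-21 15:00:00', '2021-02-03 14:08:00']),
    'Stage 2 Status': ['Lost', 'Complete', 'Open'],
    'Stage 2 Completion': pd.to_datetime(
        ['2021-02-28 13:48:00', '2021-01-21 15:00:00', None]),
})

counts = count_stages_by_month(example, n_stages=2)
print(counts.loc['Complete'])
print(counts.loc['Open'])
```

Reshaping to one long table first means the grouping is written once regardless of the number of stages; `counts.loc['Lost']` would give the same layout for the third status.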
