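python - Group by multiple dates from a dataframe
Problem description
I'm having trouble doing the transformation below in Python. I was using Excel, but the table became too large to compute. The table consists of many rows of processes and their stages (20 stages), with a status and a date for each stage, as follows: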
ID | Stage 1 Status | Stage 1 Completion | Stage 2 Status | Stage 2 Completion |
---|---|---|---|---|
1 | Complete | 2021-01-15 14:15:00 | Lost | 2021-02-28 13:48:00 |
2 | Complete | 2021-01-21 15:00:00 | Complete | 2021-01-21 15:00:00 |
3 | Complete | 2021-02-03 14:08:00 | Open | |
My goal is to group the data by year and month and count how many processes are in each stage, as follows:
Complete
Month | Stage 1 | Stage 2 |
---|---|---|
Jan-21 | 2 | 1 |
Feb-21 | 1 | 0 |
Open
Month | Stage 1 | Stage 2 |
---|---|---|
Jan-21 | 0 | 0 |
Feb-21 | 0 | 1 |
This is my first post here, so any feedback on how the question is asked is very welcome. Thanks in advance!
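Solution
Here is a Python/Pandas workflow that should get you what you need. The comments in the code will guide you through it.
I suggest running it line by line in an interpreter and inspecting how the variables change to see what is happening.
Note that there is usually more than one way to do things in Pandas: sometimes it is just a matter of style, but Pandas is also continually evolving and offering better functionality. Suggestions for improvement are welcome.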
import numpy as np
import pandas as pd
from datetime import datetime
# create dataframe from example
df = pd.DataFrame(data={
    'Stage 1 Status': ['Complete'] * 3,
    'Stage 1 Completion': ['2021-01-15 14:15:00', '2021-01-21 15:00:00', '2021-02-03 14:08:00'],
    'Stage 2 Status': ['Lost', 'Complete', 'Open'],
    'Stage 2 Completion': ['2021-02-28 13:48:00', '2021-01-21 15:00:00', np.nan]
})
# convert string to a suitable date-time type
df['Stage 1 Completion'] = pd.to_datetime(df['Stage 1 Completion'])
df['Stage 2 Completion'] = pd.to_datetime(df['Stage 2 Completion'])
# === PART 1: get a basic result dataframe with correct groups and counts ===
# fill blanks (NaNs) in Stage 2 completion (presumably all from Open status) with the Stage 1 completion date
df['Stage 2 Completion'] = df['Stage 2 Completion'].fillna(df['Stage 1 Completion'])
# format completion times as 01-21, 02-21, etc.
df['Stage 1 Completion Formatted'] = df['Stage 1 Completion'].dt.strftime('%m-%y')
df['Stage 2 Completion Formatted'] = df['Stage 2 Completion'].dt.strftime('%m-%y')
# temporary dataframes for each stage with the counts
df1 = df.groupby(['Stage 1 Status', 'Stage 1 Completion Formatted'])['Stage 1 Completion'].count().rename('Stage 1 Count').to_frame()
df1.index.names = ['Status', 'Completion']
df2 = df.groupby(['Stage 2 Status', 'Stage 2 Completion Formatted'])['Stage 2 Completion'].count().rename('Stage 2 Count').to_frame()
df2.index.names = ['Status', 'Completion']
result_df = df1.join(df2, how='outer').fillna(0) # missing => zero count
# === PART 2: make the results look right ===
# first, split the results on Open vs Complete; ignore Lost for now
complete_df = result_df.loc[('Complete', slice(None),), :].reset_index(level=0, drop=True)
open_df = result_df.loc[('Open', slice(None),), :].reset_index(level=0, drop=True)
# next, find the union of MonthYears (i.e., strings like '02-21') across all status types and make that the index for each DF
# note that plain string sorting is only chronological within a single year ('01-22' would sort before '12-21'),
# so sort on the parsed date instead
t = sorted(result_df.index.get_level_values('Completion').unique(), key=lambda z: datetime.strptime(z, "%m-%y"))
# with the new common index, make sure that missing entries are represented with zero count
complete_df = complete_df.reindex(t).fillna(0.0)
open_df = open_df.reindex(t).fillna(0.0)
# counts are integers (preferably), not floats
complete_df['Stage 1 Count'] = complete_df['Stage 1 Count'].astype("Int32")
complete_df['Stage 2 Count'] = complete_df['Stage 2 Count'].astype("Int32")
open_df['Stage 1 Count'] = open_df['Stage 1 Count'].astype("Int32")
open_df['Stage 2 Count'] = open_df['Stage 2 Count'].astype("Int32")
# next, create a column with date formatted as Jan-21, Feb-21, etc. and make that the index
complete_df['MonthYear'] = complete_df.index.map(lambda z: datetime.strptime(z, "%m-%y").strftime("%b-%y"))
complete_df = complete_df.reset_index(drop=True).set_index('MonthYear')
open_df['MonthYear'] = open_df.index.map(lambda z: datetime.strptime(z, "%m-%y").strftime("%b-%y"))
open_df = open_df.reset_index(drop=True).set_index('MonthYear')
print('=== Completed ===')
print(complete_df)
print('')
print('=== Open ===')
print(open_df)
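
The workflow above hard-codes two stages, while the question mentions 20. As a rough sketch of how the same idea might be generalized, the code below loops over the stages, stacks them into one long table, and counts per status. The helper name `count_by_month` and its `n_stages` parameter are made up for illustration, and the column names are assumed to follow the `Stage N Status` / `Stage N Completion` pattern from the example. It also keeps the assumption made above that a missing completion date falls back to the previous stage's date. Months are keyed as `YYYY-MM` strings so that they sort chronologically across years.

```python
import numpy as np
import pandas as pd

# Example data from the question (only 2 of the 20 stages are shown there).
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Stage 1 Status': ['Complete', 'Complete', 'Complete'],
    'Stage 1 Completion': ['2021-01-15 14:15:00', '2021-01-21 15:00:00', '2021-02-03 14:08:00'],
    'Stage 2 Status': ['Lost', 'Complete', 'Open'],
    'Stage 2 Completion': ['2021-02-28 13:48:00', '2021-01-21 15:00:00', np.nan],
})


def count_by_month(df, n_stages):
    """Hypothetical helper: return {status: table of counts per month and stage}."""
    stages = [f'Stage {i}' for i in range(1, n_stages + 1)]
    frames = []
    prev_date = None
    for stage in stages:
        date = pd.to_datetime(df[f'{stage} Completion'])
        if prev_date is not None:
            # Assumption (same as above): a blank date falls back to the previous stage's date.
            date = date.fillna(prev_date)
        # One long-format frame per stage: stage name, status, and a sortable 'YYYY-MM' month key.
        frames.append(pd.DataFrame({
            'Stage': stage,
            'Status': df[f'{stage} Status'],
            'Month': date.dt.strftime('%Y-%m'),
        }))
        prev_date = date

    # Rows that still have no date (e.g. a blank Stage 1 date) are dropped.
    long_df = pd.concat(frames, ignore_index=True).dropna(subset=['Month'])

    # Common month index shared by every status table, with display labels like 'Jan-21'.
    months = sorted(long_df['Month'].unique())
    labels = pd.to_datetime(months, format='%Y-%m').strftime('%b-%y')

    tables = {}
    for status, grp in long_df.groupby('Status'):
        table = grp.groupby(['Month', 'Stage']).size().unstack(fill_value=0)
        # Missing months or stages become zero counts; columns follow stage order.
        table = table.reindex(index=months, columns=stages, fill_value=0)
        table.index = labels
        tables[status] = table
    return tables


tables = count_by_month(df, n_stages=2)
print('=== Completed ===')
print(tables['Complete'])
print('')
print('=== Open ===')
print(tables['Open'])
```

On the example data this should give the same counts as the two tables in the question. For the full dataset, `n_stages=20` would be passed instead, provided every stage has both of its columns.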