python - Grouping multiple dates from a dataframe

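Problem description

I'm having trouble doing the transformation below in Python. I was using Excel, but the table has become too large to calculate. The table consists of many rows of processes and their stages (20 stages), with a status and a completion date per stage in each row, like this: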

ID	Stage 1 Status	Stage 1 Completion	Stage 2 Status	Stage 2 Completion
1	Complete	2021-01-15 14:15:00	Lost	2021-02-28 13:48:00
2	Complete	2021-01-21 15:00:00	Complete	2021-01-21 15:00:00
3	Complete	2021-02-03 14:08:00	Open

My goal is to group the data by year and month and count how many processes are in each stage, as follows:

Complete

	Stage 1	Stage 2
Jan-21	2	1
Feb-21	1	0

Open

	Stage 1	Stage 2
Jan-21	0	0
Feb-21	0	1

This is my first post here, so any feedback on how the question is asked is very welcome. Thanks in advance!

Tags: python, pandas, dataframe

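Solution

Here is a Python/Pandas workflow that should get you what you need. The comments in the code will guide you.

I recommend running it line by line in an interpreter and inspecting how the variables change to see what is happening.

Note that there is usually more than one way to do things in Pandas: sometimes it is just a matter of style, but Pandas also keeps evolving and adding better features. Suggestions for improvement are welcome.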

import numpy as np
import pandas as pd
from datetime import datetime

# create dataframe from example
df = pd.DataFrame(data={
  'Stage 1 Status': ['Complete'] * 3,
  'Stage 1 Completion': ['2021-01-15 14:15:00', '2021-01-21 15:00:00', '2021-02-03 14:08:00'],
  'Stage 2 Status': ['Lost', 'Complete', 'Open'],
  'Stage 2 Completion': ['2021-02-28 13:48:00', '2021-01-21 15:00:00', np.nan]
})

# convert string to a suitable date-time type
df['Stage 1 Completion'] = pd.to_datetime(df['Stage 1 Completion'])
df['Stage 2 Completion'] = pd.to_datetime(df['Stage 2 Completion'])

# === PART 1: get a basic result dataframe with correct groups and counts ===

# fill blanks (NaNs) in Stage 2 completion (presumably all from Open status) with the Stage 1 completion date
df['Stage 2 Completion'] = df['Stage 2 Completion'].fillna(df['Stage 1 Completion'])

# format completion times as 01-21, 02-21, etc.
df['Stage 1 Completion Formatted'] = df['Stage 1 Completion'].dt.strftime('%m-%y')
df['Stage 2 Completion Formatted'] = df['Stage 2 Completion'].dt.strftime('%m-%y')

# temporary dataframes for each stage with the counts
df1 = df.groupby(['Stage 1 Status', 'Stage 1 Completion Formatted'])['Stage 1 Completion'].count().rename('Stage 1 Count').to_frame()
df1.index.names = ['Status', 'Completion']
df2 = df.groupby(['Stage 2 Status', 'Stage 2 Completion Formatted'])['Stage 2 Completion'].count().rename('Stage 2 Count').to_frame()
df2.index.names = ['Status', 'Completion']

result_df = df1.join(df2, how='outer').fillna(0) # missing => zero count


# === PART 2: make the results look right ===

# first, split the results on Open vs Complete; ignore Lost for now
complete_df = result_df.loc[('Complete', slice(None),), :].reset_index(level=0, drop=True)
open_df = result_df.loc[('Open', slice(None),), :].reset_index(level=0, drop=True)

# next, find the union of MonthYears (i.e., strings like '02-21') across all status types and make that the index for each DF
# note that plain string sorting only works within a single year ('01-22' would sort before '12-21'), so sort by the parsed date
t = sorted(result_df.index.get_level_values('Completion').unique(), key=lambda z: datetime.strptime(z, '%m-%y'))

# with the new common index, make sure that missing entries are represented with zero count
complete_df = complete_df.reindex(t).fillna(0.0)
open_df = open_df.reindex(t).fillna(0.0)

# counts are integers (preferably), not floats
complete_df['Stage 1 Count'] = complete_df['Stage 1 Count'].astype("Int32")
complete_df['Stage 2 Count'] = complete_df['Stage 2 Count'].astype("Int32")
open_df['Stage 1 Count'] = open_df['Stage 1 Count'].astype("Int32")
open_df['Stage 2 Count'] = open_df['Stage 2 Count'].astype("Int32")

# next, create a column with date formatted as Jan-21, Feb-21, etc. and make that the index
complete_df['MonthYear'] = complete_df.index.map(lambda z: datetime.strptime(z, "%m-%y").strftime("%b-%y"))
complete_df = complete_df.reset_index(drop=True).set_index('MonthYear')
open_df['MonthYear'] = open_df.index.map(lambda z: datetime.strptime(z, "%m-%y").strftime("%b-%y"))
open_df = open_df.reset_index(drop=True).set_index('MonthYear')

print('=== Completed ===')
print(complete_df)

print('')

print('=== Open ===')
print(open_df)
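
The question mentions 20 stages, while the code above spells out two. One possible way to generalise it is sketched below; treat it as a starting point rather than a drop-in replacement. It assumes every stage has a pair of columns named '<stage> Status' and '<stage> Completion', and it keeps the same assumption as above that a blank completion date is dated by the first stage's completion. The function name stage_counts is my own invention. Months are stored as pandas Periods, so they sort chronologically across years.

```python
import numpy as np
import pandas as pd

# same example data as above, plus the ID column from the question
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Stage 1 Status': ['Complete', 'Complete', 'Complete'],
    'Stage 1 Completion': ['2021-01-15 14:15:00', '2021-01-21 15:00:00', '2021-02-03 14:08:00'],
    'Stage 2 Status': ['Lost', 'Complete', 'Open'],
    'Stage 2 Completion': ['2021-02-28 13:48:00', '2021-01-21 15:00:00', np.nan],
})


def stage_counts(df):
    """Count processes per (status, month) for every stage (hypothetical helper)."""
    # assumption: stage names can be read off the '<stage> Status' column headers
    stages = [c[:-len(' Status')] for c in df.columns if c.endswith(' Status')]

    # assumption carried over from above: blank dates fall back to the first stage's date
    fallback = pd.to_datetime(df[f'{stages[0]} Completion'])

    # stack all stages into one long table: one row per (process, stage)
    parts = []
    for stage in stages:
        completion = pd.to_datetime(df[f'{stage} Completion']).fillna(fallback)
        parts.append(pd.DataFrame({
            'Stage': stage,
            'Status': df[f'{stage} Status'],
            'Month': completion.dt.to_period('M'),
        }))
    long_df = pd.concat(parts, ignore_index=True)

    # count rows per status and month, with one column per stage
    counts = (long_df.groupby(['Status', 'Month', 'Stage'])
                     .size()
                     .unstack('Stage', fill_value=0))

    # give every status the same set of months and stages, with zeros where nothing was counted
    months = counts.index.get_level_values('Month').unique().sort_values()
    statuses = counts.index.get_level_values('Status').unique()
    full_index = pd.MultiIndex.from_product([statuses, months], names=['Status', 'Month'])
    return counts.reindex(index=full_index, columns=stages, fill_value=0)


result = stage_counts(df)

for status in ['Complete', 'Open']:
    table = result.loc[status].copy()
    # label months as Jan-21, Feb-21, etc. for display only
    table.index = table.index.strftime('%b-%y')
    print(f'=== {status} ===')
    print(table)
    print('')
```

Passing columns=stages to reindex keeps the stages in their original column order, which matters once names like 'Stage 10' appear, since those would otherwise sort before 'Stage 2'.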
