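python - Reshape a dataset with start and end dates to create time series counts and aggregated sums per day/month/quarter
Problem description
I have a dataset that looks exactly like this: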
ProjectID Start End Type
Project 1 01/01/2019 27/04/2019 HR
Project 2 15/01/2019 11/11/2019 Marketing
Project 3 25/02/2019 30/07/2019 Finance
Project 4 22/02/2019 15/04/2019 HR
Project 5 05/03/2019 29/09/2019 HR
Project 6 11/04/2019 01/12/2019 Marketing
Project 7 29/07/2019 23/08/2019 Finance
Project 8 25/08/2019 23/12/2019 Operations
Project 9 31/10/2019 29/11/2019 Operations
Project 10 10/12/2019 25/12/2019 Operations
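I want to know how many projects are outstanding (that is, still open) over time, by creating a daily/monthly/quarterly time series. First I want an overall total across all projects, and then I also want to know how many projects of each type are outstanding. Having done this manually in Excel, I believe I have to resample the data somehow, but I'm not sure how, or along which dimensions. When I do this in Excel, the final output should look like this:
[screenshot of Excel output, not included] and [screenshot of Excel output, not included]
How can I reshape the data with pandas for this analysis?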
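Solution
One approach is to take a date range (for example, 1 year) and then join every project with every date, so that each project/date pair can be checked for whether the project is open on that date. I'm using hvplot to create a nice interactive plot of the final result.
Here's a working example with your sample data: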
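The join-everything-to-everything idea can be sketched on a toy example first. The two projects below are hypothetical and only serve as illustration; `how="cross"` is an assumption about the installed pandas version (it needs pandas 1.2 or later, older versions need a dummy key column instead).

```python
import pandas as pd

# Two hypothetical projects, purely for illustration.
projects = pd.DataFrame({
    "ProjectID": ["A", "B"],
    "Start": pd.to_datetime(["2019-01-01", "2019-01-02"]),
    "End": pd.to_datetime(["2019-01-02", "2019-01-04"]),
})

# Every day in the window of interest.
dates = pd.DataFrame(
    {"date": pd.date_range("2019-01-01", "2019-01-04", freq="D")}
)

# Cartesian product: one row per (project, date) pair.
pairs = projects.merge(dates, how="cross")

# A project counts as open on a date if Start <= date <= End (inclusive).
pairs["is_open"] = pairs["date"].between(pairs["Start"], pairs["End"])

# Summing the boolean flags per date gives the number of open projects.
open_per_day = pairs.groupby("date")["is_open"].sum()
print(open_per_day)
```

The cross join produces len(projects) * len(dates) rows, which is fine for a few hundred projects over a year but grows quickly for larger inputs.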
from io import StringIO
import pandas as pd
import hvplot.pandas
text = """
ProjectID Start End Type
Project1 01/01/2019 27/04/2019 HR
Project2 15/01/2019 11/11/2019 Marketing
Project3 25/02/2019 30/07/2019 Finance
Project4 22/02/2019 15/04/2019 HR
Project5 05/03/2019 29/09/2019 HR
Project6 11/04/2019 01/12/2019 Marketing
Project7 29/07/2019 23/08/2019 Finance
Project8 25/08/2019 23/12/2019 Operations
Project9 31/10/2019 29/11/2019 Operations
Project10 10/12/2019 25/12/2019 Operations
"""
# raw string so that \s is passed to the regex engine and not treated as a string escape
df = pd.read_csv(StringIO(text), header=0, sep=r'\s+')
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
# create a dummy key with which we can join all projects with all dates
df['key'] = 'key'
# create a daterange so that we can count all open projects for all days
df2 = pd.DataFrame(pd.date_range(start='01-01-2019', periods=365, freq='d'), columns=['date'])
# create a dummy key with which we can join all projects with all dates
df2['key'] = 'key'
# join all dates with all projects on dummy key = cartesian product
df3 = pd.merge(df, df2, on=['key'])
# check if date is within project dates
df3['count_projects'] = df3['date'].ge(df3['Start']) & df3['date'].le(df3['End'])
# group per day: count all open projects
group_overall = df3.groupby(
    'date', as_index=False)['count_projects'].sum()
# group per day per department: count all projects
group_per_department = df3.groupby(
    ['date', 'Type'], as_index=False)['count_projects'].sum()
# plot overall result
plot_overall = group_overall.hvplot.line(
    x='date', y='count_projects',
    title='Open projects Overall',
    width=1000,
)
# plot per department
plot_per_department = group_per_department.hvplot.line(
    x='date', y='count_projects',
    by='Type',
    title='Open projects per Department',
    width=1000,
)
# show both plots using hvplot
(plot_overall + plot_per_department).cols(1)
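Resulting plot: two stacked line charts, 'Open projects Overall' and 'Open projects per Department' (image not included).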
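The example above produces daily counts only. For the monthly and quarterly series the question asks about, one possible reading is "the number of distinct projects that were open on at least one day of the period". The sketch below follows that assumption; the variable names are illustrative, and `how="cross"` again assumes pandas 1.2 or later. If a different definition is wanted (for instance the average daily count, or the count on the last day of the period), resampling the daily series would be the more natural route.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ProjectID": [f"Project {i}" for i in range(1, 11)],
    "Start": ["01/01/2019", "15/01/2019", "25/02/2019", "22/02/2019",
              "05/03/2019", "11/04/2019", "29/07/2019", "25/08/2019",
              "31/10/2019", "10/12/2019"],
    "End": ["27/04/2019", "11/11/2019", "30/07/2019", "15/04/2019",
            "29/09/2019", "01/12/2019", "23/08/2019", "23/12/2019",
            "29/11/2019", "25/12/2019"],
    "Type": ["HR", "Marketing", "Finance", "HR", "HR", "Marketing",
             "Finance", "Operations", "Operations", "Operations"],
})
df["Start"] = pd.to_datetime(df["Start"], format="%d/%m/%Y")
df["End"] = pd.to_datetime(df["End"], format="%d/%m/%Y")

# All days of 2019, cross-joined with all projects.
dates = pd.DataFrame(
    {"date": pd.date_range("2019-01-01", "2019-12-31", freq="D")}
)
pairs = df.merge(dates, how="cross")

# Keep only the (project, day) pairs on which the project is open.
is_open = pairs["date"].between(pairs["Start"], pairs["End"])
open_days = pairs.loc[is_open].copy()

# Label each day with its calendar month and quarter.
open_days["month"] = open_days["date"].dt.to_period("M")
open_days["quarter"] = open_days["date"].dt.to_period("Q")

# Distinct projects open at some point in each period, overall.
monthly_overall = open_days.groupby("month")["ProjectID"].nunique()
quarterly_overall = open_days.groupby("quarter")["ProjectID"].nunique()

# Same count split by project type; periods without a type show 0.
monthly_per_type = (
    open_days.groupby(["month", "Type"])["ProjectID"]
    .nunique()
    .unstack(fill_value=0)
)

print(monthly_overall)
print(quarterly_overall)
print(monthly_per_type)
```

Counting distinct `ProjectID` values rather than summing the daily flags avoids counting a project once for every day it is open within the period.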