Aggregating hourly totals for a combination of two columns

Problem description

User traffic on our portal is recorded in a database, and we have extracted the traffic information in the following format:

LOB  timestamp            Transaction Hits
PRO  2020-09-03 17:51:16  LOGIN       1
PRO  2020-09-03 17:51:15    ELG       1
PRO  2020-09-03 17:51:12  LOGIN       4
PRO  2020-09-03 17:51:13    ELG      11
PRO  2020-09-03 17:51:14  LOGIN       3
PRO  2020-09-03 17:51:11    ELG       2

I want to find the portal's hourly hits for each combination of LOB and Transaction. The output needs to be in this format:

2020-09-03 17:00:00 14    PRO  ELG
                    8     PRO  LOGIN

How can I do this with pandas?

Tags: python, pandas

Solution


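You can do this by truncating the datetime values to the hour before grouping. I chose to create a new column holding the "fixed" hourly time and to group by it: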

Create a demo df:

import pandas as pd
from io import StringIO

csv_string = StringIO("""LOB,  timestamp,            Transaction, Hits
PRO,  2020-09-03 17:51:16,  LOGIN,       1
PRO,  2020-09-03 17:51:15,    ELG,       1
PRO,  2020-09-03 17:51:12,  LOGIN,       4
PRO,  2020-09-03 17:51:13,    ELG,      11
PRO,  2020-09-03 17:51:14,  LOGIN,       3
PRO,  2020-09-03 17:51:11,    ELG,       2
PRO,  2020-09-03 18:51:12,  LOGIN,      24
PRO,  2020-09-03 18:51:13,    ELG,      21
PRO,  2020-09-03 18:51:14,  LOGIN,      23
PRO,  2020-09-03 18:51:11,    ELG,      22""" )

df = pd.read_csv(csv_string, sep=",", skipinitialspace=True)

And use it:

# convert timestamp column to datetime
df["timestamp"] = pd.to_datetime(df["timestamp"])

# create a fixed time column truncated to the hour
# kudos: https://stackoverflow.com/a/43400370/7505395
# (lowercase unit "h"; the uppercase "H" alias is deprecated in recent pandas)
df["by_hour"] = pd.to_datetime(df["timestamp"].dt.date) + \
                pd.to_timedelta(df["timestamp"].dt.hour, unit="h")

print(df)

# group by hour and transaction; the group keys become the index
grouped = df.groupby(by=["by_hour", "Transaction"], as_index=True)
# sum only the numeric Hits column and print
print(grouped[["Hits"]].sum())

Output:

  LOB           timestamp Transaction  Hits             by_hour
0  PRO 2020-09-03 17:51:16       LOGIN     1 2020-09-03 17:00:00
1  PRO 2020-09-03 17:51:15         ELG     1 2020-09-03 17:00:00
2  PRO 2020-09-03 17:51:12       LOGIN     4 2020-09-03 17:00:00
3  PRO 2020-09-03 17:51:13         ELG    11 2020-09-03 17:00:00
4  PRO 2020-09-03 17:51:14       LOGIN     3 2020-09-03 17:00:00
5  PRO 2020-09-03 17:51:11         ELG     2 2020-09-03 17:00:00
6  PRO 2020-09-03 18:51:12       LOGIN    24 2020-09-03 18:00:00
7  PRO 2020-09-03 18:51:13         ELG    21 2020-09-03 18:00:00
8  PRO 2020-09-03 18:51:14       LOGIN    23 2020-09-03 18:00:00
9  PRO 2020-09-03 18:51:11         ELG    22 2020-09-03 18:00:00

                                Hits
by_hour             Transaction
2020-09-03 17:00:00 ELG            14
                    LOGIN           8
2020-09-03 18:00:00 ELG            43
                    LOGIN          47
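
The question asks for totals per combination of LOB and Transaction, while the grouping above uses only the hour and Transaction. As a minimal sketch of one way to include LOB as well, the variant below truncates the timestamps with `Series.dt.floor` and groups by all three keys. The sample frame repeats the six rows from the question; the names `hourly` and `by_hour` are only illustrative, and this assumes a pandas version that accepts the lowercase "h" frequency alias.

```python
import pandas as pd

# Sample data copied from the question (six rows within the 17:00 hour).
df = pd.DataFrame(
    {
        "LOB": ["PRO"] * 6,
        "timestamp": pd.to_datetime(
            [
                "2020-09-03 17:51:16",
                "2020-09-03 17:51:15",
                "2020-09-03 17:51:12",
                "2020-09-03 17:51:13",
                "2020-09-03 17:51:14",
                "2020-09-03 17:51:11",
            ]
        ),
        "Transaction": ["LOGIN", "ELG", "LOGIN", "ELG", "LOGIN", "ELG"],
        "Hits": [1, 1, 4, 11, 3, 2],
    }
)

# Truncate each timestamp to the start of its hour (assumed alias: "h").
by_hour = df["timestamp"].dt.floor("h").rename("by_hour")

# Group by hour, LOB and Transaction, then sum only the Hits column.
hourly = (
    df.groupby([by_hour, "LOB", "Transaction"])["Hits"]
    .sum()
    .reset_index()
)

print(hourly)
```

Calling `reset_index()` turns the group keys back into ordinary columns, which makes it easier to reorder them into the layout requested in the question. Leaving it out keeps the keys as a MultiIndex, as in the answer's output.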
