Python: how to multiply daily values in a DataFrame by hourly percentages from a dictionary to get a DataFrame of hourly values

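Problem description

I have a DataFrame of daily transit ridership for each station in a city, and I also have a dictionary containing the distribution of ridership by hour, expressed as a share of the daily total.

I want to create a DataFrame of hourly transit ridership for each station by multiplying the daily ridership values in the DataFrame by the hourly shares in the dictionary.

For example, the DataFrame looks like this: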

            Austin-Forest Park  Harlem-Lake
date
2018-11-01              2248.0       4021.0
2018-11-02              1983.0       3850.0
2018-11-03               837.0       2308.0
2018-11-04               604.0       1443.0

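The hourly ridership distributions look like this. Each key/value pair is a specific hour of the day and that hour's share of the daily ridership, written as a fraction (0.017 means 1.7%).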

hourly_distribution = {0:0.017, 1:0.017, 2:0.008, 3:0.008, 4:0.004, 
                          5:0.004, 6:0.008, 7:0.021, 8:0.051, 9:0.042,
                          10:0.042, 11:0.038, 12:0.034, 13:0.038, 14:0.051, 
                          15:0.068, 16:0.084, 17:0.11, 18:0.101, 19:0.084,
                          20:0.059, 21:0.051, 22:0.034, 23:0.025}


hourly_distribution_weekend_days = {0:0.015, 1:0.015, 2:0.008, 3:0.008,4:0.008, 5:0.008, 
                         6:0.015, 7:0.023, 8:0.038, 9:0.046, 10:0.054, 
                         11:0.077, 12:0.092, 13:0.092, 14:0.092, 15:0.092,
                         16:0.062, 17:0.054, 18:0.054, 19:0.054, 20:0.031, 
                         21:0.031, 22:0.015, 23:0.015}
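
Since every hourly value is the daily total times one of these shares, a quick sanity check on the dictionaries can catch a mistyped share before it distorts the hourly figures. The following is a minimal sketch; the helper name `check_distribution` and the tolerance of 0.005 are illustrative assumptions, not part of the original data.

```python
import math

hourly_distribution = {0:0.017, 1:0.017, 2:0.008, 3:0.008, 4:0.004,
                       5:0.004, 6:0.008, 7:0.021, 8:0.051, 9:0.042,
                       10:0.042, 11:0.038, 12:0.034, 13:0.038, 14:0.051,
                       15:0.068, 16:0.084, 17:0.11, 18:0.101, 19:0.084,
                       20:0.059, 21:0.051, 22:0.034, 23:0.025}

hourly_distribution_weekend_days = {0:0.015, 1:0.015, 2:0.008, 3:0.008, 4:0.008, 5:0.008,
                                    6:0.015, 7:0.023, 8:0.038, 9:0.046, 10:0.054,
                                    11:0.077, 12:0.092, 13:0.092, 14:0.092, 15:0.092,
                                    16:0.062, 17:0.054, 18:0.054, 19:0.054, 20:0.031,
                                    21:0.031, 22:0.015, 23:0.015}


def check_distribution(dist, tol=0.005):
    """Hypothetical helper: return the total share, or raise if the
    keys are not exactly the hours 0..23 or the total is far from 1."""
    if sorted(dist) != list(range(24)):
        raise ValueError("expected exactly the hours 0..23 as keys")
    total = sum(dist.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"shares sum to {total:.3f}, expected about 1")
    return total


weekday_total = check_distribution(hourly_distribution)
weekend_total = check_distribution(hourly_distribution_weekend_days)
```

Both dictionaries sum to 0.999 rather than exactly 1, presumably because the shares were rounded to three decimals, so the hourly values will add up to slightly less than the daily total.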

My expected result for Austin-Forest Park on 2018-11-01 would be:

    Austin-Forest Park
Date    
2018-11-01 00:00:00 38.2
2018-11-01 01:00:00 38.2
2018-11-01 02:00:00 18.0
2018-11-01 03:00:00 18.0
2018-11-01 04:00:00 9.0
2018-11-01 05:00:00 9.0
2018-11-01 06:00:00 18.0
2018-11-01 07:00:00 47.2
2018-11-01 08:00:00 114.6
2018-11-01 09:00:00 94.4
2018-11-01 10:00:00 94.4
2018-11-01 11:00:00 85.4
2018-11-01 12:00:00 76.4
2018-11-01 13:00:00 85.4
2018-11-01 14:00:00 114.6
2018-11-01 15:00:00 152.9
2018-11-01 16:00:00 188.8
2018-11-01 17:00:00 247.3
2018-11-01 18:00:00 227.0
2018-11-01 19:00:00 188.8
2018-11-01 20:00:00 132.6
2018-11-01 21:00:00 114.6
2018-11-01 22:00:00 76.4
2018-11-01 23:00:00 56.2

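For this small sample, the expected shape of the new DataFrame would be (96, 2): 2 columns and 4 days x 24 hours of hourly ridership values.

Does anyone know how to write this in Python?

Thanks!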

Tags: python, dataframe, dictionary

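Solution


You can use numpy.outer (the outer product) to multiply every daily value by every hourly share, and a list comprehension with pandas.to_datetime to build the new datetime index, as follows: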

import pandas as pd
import numpy as np
import datetime

idx = pd.to_datetime(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04'])
df_daily = pd.DataFrame({'Austin-Forest Park': [2248.0, 1983.0, 837.0, 604.0],
                         'Harlem-Lake': [4021.0, 3850.0, 2308.0, 1443.0]},
                         index=idx)
df_daily.index.name = 'date'


# Hourly shares of the daily ridership, copied from the question.
hourly_distribution = {0:0.017, 1:0.017, 2:0.008, 3:0.008, 4:0.004,
                          5:0.004, 6:0.008, 7:0.021, 8:0.051, 9:0.042,
                          10:0.042, 11:0.038, 12:0.034, 13:0.038, 14:0.051,
                          15:0.068, 16:0.084, 17:0.11, 18:0.101, 19:0.084,
                          20:0.059, 21:0.051, 22:0.034, 23:0.025}

# Shares as a list in hour order 0..23 (dicts preserve insertion order).
distrib = [hourly_distribution[key] for key in hourly_distribution]

# One timestamp per (day, hour) pair, with the day varying slowest.
datetime_idx = pd.to_datetime([datetime.datetime(i.year, i.month, i.day, key) for i in idx for key in hourly_distribution])
# The outer product has shape (days, 24); ravel() flattens it row by row,
# which matches the order of datetime_idx.
data = np.outer(df_daily['Austin-Forest Park'], distrib).ravel()

df = pd.DataFrame({'Austin-Forest Park': data}, index=datetime_idx)
df.index.name = 'date'

which outputs:

                     Austin-Forest Park
date                                   
2018-11-01 00:00:00              38.216
2018-11-01 01:00:00              38.216
2018-11-01 02:00:00              17.984
2018-11-01 03:00:00              17.984
2018-11-01 04:00:00               8.992
...                                 ...
2018-11-04 19:00:00              50.736
2018-11-04 20:00:00              35.636
2018-11-04 21:00:00              30.804
2018-11-04 22:00:00              20.536
2018-11-04 23:00:00              15.100

[96 rows x 1 columns]
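
The code above covers a single station and only the weekday profile, while the question asks for a (96, 2) result and also supplies a weekend dictionary. One possible extension is sketched below using NumPy broadcasting instead of numpy.outer. It assumes that Saturdays and Sundays (dayofweek 5 and 6) should use hourly_distribution_weekend_days, which the question implies but never states. The names `weekday_share`, `weekend_share`, `shares`, and `df_hourly` are illustrative.

```python
import numpy as np
import pandas as pd

# Daily ridership per station, as in the question.
idx = pd.to_datetime(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04'])
df_daily = pd.DataFrame({'Austin-Forest Park': [2248.0, 1983.0, 837.0, 604.0],
                         'Harlem-Lake': [4021.0, 3850.0, 2308.0, 1443.0]},
                        index=idx)
df_daily.index.name = 'date'

hourly_distribution = {0:0.017, 1:0.017, 2:0.008, 3:0.008, 4:0.004,
                       5:0.004, 6:0.008, 7:0.021, 8:0.051, 9:0.042,
                       10:0.042, 11:0.038, 12:0.034, 13:0.038, 14:0.051,
                       15:0.068, 16:0.084, 17:0.11, 18:0.101, 19:0.084,
                       20:0.059, 21:0.051, 22:0.034, 23:0.025}

hourly_distribution_weekend_days = {0:0.015, 1:0.015, 2:0.008, 3:0.008, 4:0.008, 5:0.008,
                                    6:0.015, 7:0.023, 8:0.038, 9:0.046, 10:0.054,
                                    11:0.077, 12:0.092, 13:0.092, 14:0.092, 15:0.092,
                                    16:0.062, 17:0.054, 18:0.054, 19:0.054, 20:0.031,
                                    21:0.031, 22:0.015, 23:0.015}

hours = range(24)
weekday_share = np.array([hourly_distribution[h] for h in hours])
weekend_share = np.array([hourly_distribution_weekend_days[h] for h in hours])

# Assumption: Saturday (5) and Sunday (6) use the weekend profile.
is_weekend = np.asarray(df_daily.index.dayofweek) >= 5           # shape (days,)
shares = np.where(is_weekend[:, None], weekend_share, weekday_share)  # (days, 24)

# Broadcast (days, 24, 1) * (days, 1, stations) -> (days, 24, stations).
hourly = shares[:, :, None] * df_daily.to_numpy()[:, None, :]

# One timestamp per (day, hour) pair, day varying slowest, to match
# the row order produced by the reshape below.
hourly_index = pd.DatetimeIndex(
    [day + pd.Timedelta(hours=h) for day in df_daily.index for h in hours],
    name='date')

df_hourly = pd.DataFrame(hourly.reshape(-1, df_daily.shape[1]),
                         index=hourly_index, columns=df_daily.columns)
```

Here df_hourly has one row per hour and one column per station, so the sample data gives the (96, 2) shape asked for in the question. If the weekend profile should not apply, replacing `shares` with `np.tile(weekday_share, (len(df_daily), 1))` would reproduce the single-profile behaviour for every station.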
