How to serialize irregular pandas hourly timestamps?

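Problem description

Question:

How do I convert a pandas Series with values like ['2020-11-11 00:00:00', ..., '2020-11-13 12:00:00'] into the number of hours since some epoch?

Details with sample code:

I have a dataframe like this: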

     date_day            date_hour  value date_day_timestamp
0  2020-11-11  2020-11-11 00:00:00      1         2020-11-11
1  2020-11-12  2020-11-11 12:00:00      2         2020-11-12
2  2020-11-14  2020-11-12 00:00:00      4         2020-11-14
3  2020-11-15  2020-11-12 12:00:00      2         2020-11-15
4  2020-11-21  2020-11-13 00:00:00      3         2020-11-21
5  2020-11-22  2020-11-13 12:00:00      2         2020-11-22

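The column date_day_timestamp is a Series of timestamps whose first observation is Timestamp('2020-11-11 00:00:00'), of type pandas._libs.tslibs.timestamps.Timestamp. It is built from the column date_day (whose observations are of type str) with pd.to_datetime. From it I compute a day serial using df['date_day_serial'] = [(d-datetime.datetime(1970,1,1)).days for d in df['date_day_timestamp']]

This gives me a pandas Series of integers (numpy.int64) with the values [18577, 18578, 18580, 18581, 18587, 18588], which is a serialized representation of the dates counted in days from 1970-01-01.

How can I get a similar column based on the number of hours since 1970-01-01, or some other epoch? In the data sample I have two observations per day, at 00:00:00 and 12:00:00, but I would like this to work for other times of day as well, even when the hours are irregular from day to day.

I naively replaced days with hours in df['date_day_serial'] = [(d-datetime.datetime(1970,1,1)).hours for d in df['date_day_timestamp']], but this raises AttributeError: 'Timedelta' object has no attribute 'hours'. If we compute a single timedelta, timedelta = df['date_day_timestamp'].iloc[1]-datetime.datetime(1970,1,1), the result is just Timedelta('18578 days 00:00:00'), and dir(timedelta) lists no hours attribute. Not many hours ready to grab there...

So how can I calculate this some other way? Is there something I can replace (d-datetime.datetime(1970,1,1)).hours with in the setup below to get what I want?

Here is my latest attempt: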

df['date_hour_serial'] = [(d-datetime.datetime(1970,1,1)).hours for d in df['date_hour_timestamp']]

Full code:

import pandas as pd
import numpy as np
import datetime

df=pd.DataFrame({'date_day': {0: '2020-11-11',
                      1: '2020-11-12',
                      2: '2020-11-14',
                      3: '2020-11-15',
                      4: '2020-11-21',
                      5: '2020-11-22'},
                 'date_hour': {0: '2020-11-11 00:00:00',
                      1: '2020-11-11 12:00:00',
                      2: '2020-11-12 00:00:00',
                      3: '2020-11-12 12:00:00',
                      4: '2020-11-13 00:00:00',
                      5: '2020-11-13 12:00:00'},
                 'value': {0: 1, 1: 2, 2: 4, 3: 2, 4: 3, 5: 2}})

df['date_day_timestamp'] = pd.to_datetime(df['date_day'])
df['date_day_serial'] = [(d-datetime.datetime(1970,1,1)).days for d in df['date_day_timestamp']]

df['date_hour_timestamp'] =  pd.to_datetime(df['date_hour'])
timedelta = df['date_day_timestamp'].iloc[1]-datetime.datetime(1970,1,1)

df

# the next line raises an error
#df['date_hour_serial'] = [(d-datetime.datetime(1970,1,1)).hours for d in df['date_hour_timestamp']]

Tags: python, pandas

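Solution

I think the simplest approach is to convert the timedeltas to days with Series.dt.days, or to seconds with Series.dt.total_seconds, and then get any other unit by multiplying or dividing by a constant such as 24, 60 or 3600, because .hours and .minutes are not implemented for timedeltas: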

df['date_hour'] = pd.to_datetime(df['date_hour'])

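# note: the epoch used here is 1971-01-01; use '1970-01-01' to count from the Unix epoch
s = df['date_hour'].sub(pd.Timestamp('1971-01-01'))
# whole days only, converted to hours (the time of day is dropped)
df['date_hour_serial1'] = s.dt.days.mul(24)
# total elapsed hours, including the time of day
df['date_hour_serial2'] = s.dt.total_seconds().div(3600)
print(df)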
     date_day           date_hour  value  date_hour_serial1  date_hour_serial2
0  2020-11-11 2020-11-11 00:00:00      1             437088           437088.0
1  2020-11-12 2020-11-11 12:00:00      2             437088           437100.0
2  2020-11-14 2020-11-12 00:00:00      4             437112           437112.0
3  2020-11-15 2020-11-12 12:00:00      2             437112           437124.0
4  2020-11-21 2020-11-13 00:00:00      3             437136           437136.0
5  2020-11-22 2020-11-13 12:00:00      2             437136           437148.0
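
As a possible variation on the above, here is a minimal sketch that returns integer hours counted from 1970-01-01 and also shows a drop-in replacement for the list comprehension in the question. The sample timestamps are hypothetical (chosen to include irregular times of day), and the column name date_hour_serial_loop is made up for illustration.

```python
import datetime

import pandas as pd

# Hypothetical sample with irregular times of day (not the question's data).
df = pd.DataFrame({
    "date_hour": [
        "2020-11-11 00:00:00",
        "2020-11-11 12:00:00",
        "2020-11-12 07:00:00",
        "2020-11-13 23:00:00",
    ]
})
df["date_hour_timestamp"] = pd.to_datetime(df["date_hour"])

epoch = pd.Timestamp("1970-01-01")

# Vectorized: floor-dividing a timedelta Series by a one-hour Timedelta
# gives the number of whole hours elapsed since the epoch.
df["date_hour_serial"] = (df["date_hour_timestamp"] - epoch) // pd.Timedelta(hours=1)

# Element-wise, in the style of the question: Timedelta has total_seconds(),
# even though it has no .hours attribute.
df["date_hour_serial_loop"] = [
    int((d - datetime.datetime(1970, 1, 1)).total_seconds() // 3600)
    for d in df["date_hour_timestamp"]
]

print(df[["date_hour", "date_hour_serial", "date_hour_serial_loop"]])
```

With the 1970-01-01 epoch, the first row comes out as 445848, which is 18577 days times 24 and so agrees with the day serial shown in the question. Floor division truncates any partial hour; if fractional hours are wanted, dividing total_seconds() by 3600 as in the answer above keeps them.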
