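python - Resampling timestamps and filling in values
Problem description
I have a CSV file with two columns. The first column holds timestamps at roughly 5-minute resolution, and the second column holds values, like this: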
time,values
2021-07-30 00:00:00,0.9667
2021-07-30 00:03:54,0.5663
..
..
..
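Note that the second row is at 3 minutes 54 seconds. I am trying to resample the timestamp column to exactly 1-minute resolution and then fill in the values column as follows: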
time,values
2021-07-30 00:00:00,0.9667
2021-07-30 00:01:00,0.9667
2021-07-30 00:02:00,0.9667
2021-07-30 00:03:00,0.9667
2021-07-30 00:04:00,0.5663
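For reference, the target output above matches a forward fill onto a regular 1-minute grid: each grid timestamp takes the most recent original value at or before it. The following is only a minimal sketch of that idea, assuming the two sample rows shown above are loaded from an in-memory string (the names `csv_text`, `grid`, and `resampled` are illustrative, not from my script):

```python
import io
import pandas as pd

# The two sample rows from the question, inlined so the sketch is self-contained.
csv_text = """time,values
2021-07-30 00:00:00,0.9667
2021-07-30 00:03:54,0.5663
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"], index_col="time")

# Regular 1-minute grid covering the data, rounded outward to whole minutes.
grid = pd.date_range(df.index[0].floor("min"), df.index[-1].ceil("min"), freq="min")

# For each grid timestamp, take the last original row at or before it.
resampled = df.reindex(grid, method="ffill")
resampled.index.name = "time"

print(resampled)
```

With these two rows the grid runs from 00:00:00 to 00:04:00, and the value switches to 0.5663 at 00:04:00 because 00:03:54 is the latest sample at that point.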
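My approach
I was able to create the one-minute timestamp column. The next step is to assign the values in the values column according to the new timestamp resolution. My idea was to compute the time difference between consecutive timestamp rows, store the result in a new column, and then append each value to the values column according to that difference. For example, if the time difference is 3, I would take the first value in the values column and append it 3 times, and so on. Here is a snippet of the result: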
,time,real-power,hours_min
0,2021-07-30 00:00:00,0.9667,0
1,2021-07-30 00:03:54,0.5663,00:03:54
2,2021-07-30 00:08:51,0.9887,00:04:57
3,2021-07-30 00:13:53,0.23334,00:05:02
I extracted the minutes from the hours_min column, which gave:
,time,real-power,hours_min
0,2021-07-30 00:00:00,0.9667,0.0
1,2021-07-30 00:03:54,0.5663,3.0
2,2021-07-30 00:08:51,0.9887,4.0
3,2021-07-30 00:13:53,0.23334,5.0
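The one-minute timestamp file has 1440 rows. When I appended the values to the values column, I got only 1319 values. I later found that the problem is that when I extracted the minutes I did not account for the seconds, which caused the mismatch in the number of values.
Here is my attempt: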
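To illustrate where the missing rows go, here is a small sketch using only the four sample timestamps from the snippet above (the variable names are illustrative). It compares the sum of the truncated per-row minutes with the elapsed time actually covered:

```python
import pandas as pd

# The four sample timestamps from the snippet above.
times = pd.Series(pd.to_datetime([
    "2021-07-30 00:00:00",
    "2021-07-30 00:03:54",
    "2021-07-30 00:08:51",
    "2021-07-30 00:13:53",
]))

gaps = times.diff().dropna()  # 00:03:54, 00:04:57, 00:05:02

# Keeping only the minutes component drops the seconds of every gap.
truncated_total = gaps.dt.components["minutes"].sum()

# The elapsed time actually covered, in minutes.
exact_minutes = gaps.dt.total_seconds().sum() / 60

print(truncated_total)          # 12
print(round(exact_minutes, 2))  # 13.88
```

Each gap loses up to 59 seconds, so over a full day of roughly 5-minute samples the repeat counts add up to noticeably fewer than 1440.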
# 2021/07/28
# The script converts a 5-minute timestamp csv file to a 1-minute timestamp csv file. The timestamp resolution in the input csv file
# is not consistent. The idea is to read the minutes from the first two timestamp rows, subtract them, and the result will be the number of rows
# that will be added between the first two timestamp rows.
# Caveat:
# To read minutes from timestamp columns, we have to use the dt accessor. The dt accessor can be used only when the timestamp column is
# datetime object like or Timestamp object like. Therefore, we will
#################################################################################
#################################################################################
#################################################################################
import pandas as pd
from datetime import datetime
from datetime import timedelta
import numpy as np
#################################################################################
########################## Create Timestamp Column #############################
#################################################################################
# Function creates timestamp column.
# Credits: Stackoverflow
def create_timestamp(length): # When calling this function, pass the length of the timestamp in minutes. (1 day = 1440 minutes)
    time_str = '2021-07-30 00:00:00' # starting date.
    date_format_str = '%Y-%m-%d %H:%M:%S' # timestamp format
    given_time = datetime.strptime(time_str, date_format_str)
    timestamp = []
    for minute in range(length): # one entry per minute; length=1440 covers a full day
        final_time = given_time + timedelta(minutes=minute)
        final_time_str = final_time.strftime('%Y-%m-%d %H:%M:%S')
        timestamp.append(final_time_str)
    df = pd.DataFrame(timestamp) # create a dataframe for the new time stamp
    # df.to_csv("one_minute.csv")
    return df
#################################################################################
########################## WRITE TO CSV #####################
#################################################################################
def write_data(data, file_name):
    data.to_csv(file_name, index=False, header=False)
#################################################################################
########################## Extract minutes #####################
#################################################################################
def extract_minutes(df):
    df2 = pd.read_csv("data-2.csv", parse_dates=True) # Type DataFrame
    df2['time'] = pd.to_datetime(df2['time'], errors='coerce') # converted column type to datetime-like object
    # print(df2['time'].dtype) # Double-check the type.
    df2['hours_min'] = df2['time'].diff() # Subtract previous row from current row and assign result in new column.
    df2['hours_min'] = df2['hours_min'].astype(str).str.split('0 days ').str[-1] # remove 0 days from hours_min column
    df2['hours_min'] = pd.to_datetime(df2['hours_min'])
    # Get only minutes and convert NaN values to zeroes. Note: .dt.minute discards the seconds.
    df2['hours_min'] = df2['hours_min'].dt.minute.fillna(0)
    # df2['hours_min'] = pd.to_datetime(df2['hours_min'])
    # df2['hours_min'] = df2['hours_min'].dt.minute.fillna(0)
    df2.to_csv('check_minutes.csv')
    return df2
#################################################################################
####### repeat power values to match new timestamp resolution #########
#################################################################################
def read_power_vals(df, df2):
    new_pow_values = []
    # df2.set_index(['time'])['real-power'].repeat(df['hours_min'].astype(int)).reset_index()
    # df2['hours_min'].astype(int)
    # print(df2['real-power'].repeat(df2['hours_min']).reset_index())
    # print(df.iloc[:10])
    # print(df2)
    # dff = pd.DataFrame(np.repeat(df2['real-power'].values,3,axis=0))
    # print(dff)
    # print(df2)
    # p_vals = df2['real-power'].to_list()
    # minutes = df2['hours_min'].to_list()
    # counter = 0
    # for i,k in zip(p_vals,minutes):
    #     # print(i,k)
    #     new_pow_values.append(i)
    #     print(f'this is counter {counter}.\n This is i {i}\n And this is k {k}\n This is the array {new_pow_values}\n')
    #     # counter = counter + 1
    #     if counter == k:
    #         i = i + 1
    #         # k = k + 1
    #         counter = 0
    #     else:
    #         new_pow_values.append(i)
    #         counter = counter + 1
    #     if counter == 5:
    #         break
    # print(p_vals,len(p_vals))
    # print(minutes,len(minutes))
    # product = []
    # j = 0
    # for i in minutes:
    #     product.extend(int(i) * [p_vals[j]])
    #     # print(int(i) * [p_vals[j]], "should be of length: ",int(i))
    #     j+=1
    # print(product,len(product))
    # print(minutes[-10:])
    # print(len(product))
    # print(len(df))
    # print(len(new_pow_values))
x = create_timestamp(1440)
# write_data(x,'trial.csv')
y = extract_minutes(x)
z = read_power_vals(x,y)
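I guess what I am asking is: is there an efficient way to do this? I do not think my approach will get me what I want. Are there any alternatives?
Thanks, everyone.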