Filling rows of a df under a condition

Problem description

I have a resampled df:

import pandas as pd
import numpy as np

nat = np.datetime64('NaT')


df = pd.DataFrame({"Time": [nat, nat, nat, '2020-04-09 06:45:38.559871', '2020-04-09 06:45:38.559871', nat, nat, nat, '2020-04-09 06:50:16.268515', '2020-04-09 06:50:16.268515'],
              "Power": [0, 0, 0, 4200, 4200, 0, 0, 0, 4200, 4200],
              "Total Energy": [5300, 5300, 5300, 5500, 5600, 5600, 5600, 5600, 5900, 6100],
              "ID": ['-', '-', '-', 1, 1, '-', '-', '-', 2, 2],
              "Energy": [0, 0, 0, 200, 300, 0, 0, 0, 300, 500]},
              index=pd.date_range(start = "2020-04-09 6:45", periods = 10, freq = 'min'))

df['Time'] = pd.to_datetime(df['Time'])
df['Power'] = pd.to_numeric(df['Power'], errors = 'coerce')
df['Total Energy'] = pd.to_numeric(df['Total Energy'], errors = 'coerce')
df['ID'] = pd.to_numeric(df['ID'], errors = 'coerce')
df['Energy'] = pd.to_numeric(df['Energy'], errors = 'coerce')

df

Output:

                                          Time  Power   Total Energy      ID    Energy
2020-04-09 06:45:00                        NaT      0           5300     NaN         0
2020-04-09 06:46:00                        NaT      0           5300     NaN         0
2020-04-09 06:47:00                        NaT      0           5300     NaN         0
2020-04-09 06:48:00 2020-04-09 06:45:38.559871   4200           5500     1.0       200
2020-04-09 06:49:00 2020-04-09 06:45:38.559871   4200           5600     1.0       300
2020-04-09 06:50:00                        NaT      0           5600     NaN         0
2020-04-09 06:51:00                        NaT      0           5600     NaN         0
2020-04-09 06:52:00                        NaT      0           5600     NaN         0
2020-04-09 06:53:00 2020-04-09 06:50:16.268515   4200           5900     2.0       300
2020-04-09 06:54:00 2020-04-09 06:50:16.268515   4200           6100     2.0       500

I have to fill the rows where df['Time'] == NaT and df.index is at or after the following df['Time'] value (rounded up to the minute), as shown below.

This is the desired result:

                                          Time  Power   Total Energy      ID    Energy
2020-04-09 06:45:00                        NaT      0           5300     NaN         0
2020-04-09 06:46:00 2020-04-09 06:45:38.559871      0           5300     1.0         0
2020-04-09 06:47:00 2020-04-09 06:45:38.559871   6000           5400     1.0       100
2020-04-09 06:48:00 2020-04-09 06:45:38.559871   4200           5500     1.0       200
2020-04-09 06:49:00 2020-04-09 06:45:38.559871   4200           5600     1.0       300
2020-04-09 06:50:00                        NaT      0           5600     NaN         0
2020-04-09 06:51:00 2020-04-09 06:50:16.268515      0           5600     2.0         0
2020-04-09 06:52:00 2020-04-09 06:50:16.268515   9000           5750     2.0       150
2020-04-09 06:53:00 2020-04-09 06:50:16.268515   4200           5900     2.0       300
2020-04-09 06:54:00 2020-04-09 06:50:16.268515   4200           6100     2.0       500

The df['Time'] column may also be changed to the rounded values:

                                          Time  Power   Total Energy      ID    Energy
2020-04-09 06:45:00                        NaT      0           5300     NaN         0
2020-04-09 06:46:00        2020-04-09 06:46:00      0           5300     1.0         0
2020-04-09 06:47:00        2020-04-09 06:46:00   6000           5400     1.0       100
2020-04-09 06:48:00        2020-04-09 06:46:00   4200           5500     1.0       200
2020-04-09 06:49:00        2020-04-09 06:46:00   4200           5600     1.0       300
2020-04-09 06:50:00                        NaT      0           5600     NaN         0
2020-04-09 06:51:00        2020-04-09 06:51:00      0           5600     2.0         0
2020-04-09 06:52:00        2020-04-09 06:51:00   9000           5750     2.0       150
2020-04-09 06:53:00        2020-04-09 06:51:00   4200           5900     2.0       300
2020-04-09 06:54:00        2020-04-09 06:51:00   4200           6100     2.0       500

Thanks for your help :)

Edit

For rounding df['Time'] I found this:

df['Time'] = df['Time'].dt.ceil('1min')
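
As a quick check of what dt.ceil does here, the sketch below rounds the two timestamps from the example up to the next full minute. The series name `times` is only illustrative and is not part of the original code.

```python
import pandas as pd

# Illustrative series: the two event timestamps from the example, with a gap between them.
times = pd.Series(pd.to_datetime([
    "2020-04-09 06:45:38.559871",
    None,
    "2020-04-09 06:50:16.268515",
]))

# ceil rounds each timestamp up to the next minute boundary; missing values stay NaT.
rounded = times.dt.ceil("1min")
print(rounded)
```

The first value becomes 06:46:00 and the last 06:51:00, which are the values shown in the rounded variant of the desired result.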

Edit 2

To adjust the df['Time'] column I proceeded as follows:

dates = df['Time'].unique()
for date in dates:
    for index, row in df.iterrows():
        # When the index label equals a rounded event time, write it into 'Time'
        if index == date:
            df.loc[index, 'Time'] = date
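
The nested loop above compares every index label with every unique time. A vectorized sketch of the same idea, assuming df['Time'] has already been rounded to the minute, could look like this (the small frame and the names `event_times` and `is_event_start` are illustrative only):

```python
import pandas as pd

# Small stand-in for the resampled frame: a minute index and an already-rounded 'Time' column.
idx = pd.date_range("2020-04-09 06:45", periods=5, freq="min")
df = pd.DataFrame(
    {"Time": pd.to_datetime([None, None, None, "2020-04-09 06:46", "2020-04-09 06:46"])},
    index=idx,
)

# Rows whose index label equals one of the non-null rounded event times.
event_times = df["Time"].dropna().unique()
is_event_start = df.index.isin(event_times)

# Write the index label into 'Time' for exactly those rows.
df.loc[is_event_start, "Time"] = df.index[is_event_start]
print(df)
```

Only the 06:46 row is changed here; the rows before and after it keep NaT until a later fill step handles them.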
            

I don't yet know how to get the df['ID'] value that belongs to the corresponding df['Time']. I also filled the rows as follows:

#scheme for filling the nan-values
s = df['Time'].ffill()
x = df['Time'].bfill()
g = df['Time'].mask(s.eq(x), s)

#Filling time
df['Time'] = df['Time'].groupby(g).ffill()

#Filling ID
df['ID'] = df['ID'].groupby(df['Time']).bfill()

Output:

                                          Time  Power   Total Energy      ID    Energy
2020-04-09 06:45:00                        NaT      0           5300     NaN         0
2020-04-09 06:46:00        2020-04-09 06:46:00      0           5300     1.0         0
2020-04-09 06:47:00        2020-04-09 06:46:00      0           5300     1.0         0
2020-04-09 06:48:00        2020-04-09 06:46:00   4200           5500     1.0       200
2020-04-09 06:49:00        2020-04-09 06:46:00   4200           5600     1.0       300
2020-04-09 06:50:00                        NaT      0           5600     NaN         0
2020-04-09 06:51:00        2020-04-09 06:51:00      0           5600     2.0         0
2020-04-09 06:52:00        2020-04-09 06:51:00      0           5600     2.0         0
2020-04-09 06:53:00        2020-04-09 06:51:00   4200           5900     2.0       300
2020-04-09 06:54:00        2020-04-09 06:51:00   4200           6100     2.0       500

Still missing: the values of df['Power'], df['Energy'] and df['Total Energy'] must be calculated and changed as described above.

Tags: python, pandas, dataframe

Solution


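Given your desired output and what you told me in the comments, I did the following. Note that the Energy step relies on df['Time'] already being rounded up to the minute (the dt.ceil('1min') from your edit); with the unrounded timestamps the interpolated values come out differently.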

time_bfill = df['Time'].bfill()
df['Time2'] = df['Time'].mask(df.index.to_series().ge(time_bfill), time_bfill)

id_bfill = df['ID'].bfill()
df['ID2'] = df['ID'].mask(df.index.to_series().ge(time_bfill), id_bfill)

df['Energy2'] = df['Energy'].mask((df['Time'].isnull()) & (df['Time2'].notna()) & (df.index.to_series().ne(df['Time2'])), np.nan)
df['Energy3'] = df['Energy2'].interpolate(limit_direction='both', limit_area='inside')

df['Power2'] = df['Power'].mask(df['Power'] == 0, 60 * df['Energy3'])
df['Total Energy2'] = df['Total Energy'].mask(df['Power'] == 0, df['Total Energy'] + df['Energy3'])

df

and got this DataFrame:

[Image: DataFrame after the calculation]
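
Because the screenshot is not reproduced here, the following self-contained sketch rebuilds the example frame, applies the same steps, and prints the final columns. It is a sketch under two assumptions: df['Time'] is first rounded up with dt.ceil('1min'), as in the question's edit, and the intermediate results are kept in local variables (`time2`, `id2`, `energy3`, `power2`, `total2`, all illustrative names) instead of extra columns.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-04-09 06:45", periods=10, freq="min")
t1, t2 = "2020-04-09 06:45:38.559871", "2020-04-09 06:50:16.268515"
df = pd.DataFrame(
    {
        "Time": pd.to_datetime([None, None, None, t1, t1, None, None, None, t2, t2]),
        "Power": [0, 0, 0, 4200, 4200, 0, 0, 0, 4200, 4200],
        "Total Energy": [5300, 5300, 5300, 5500, 5600, 5600, 5600, 5600, 5900, 6100],
        "ID": [np.nan, np.nan, np.nan, 1, 1, np.nan, np.nan, np.nan, 2, 2],
        "Energy": [0, 0, 0, 200, 300, 0, 0, 0, 300, 500],
    },
    index=idx,
)

# Assumption: event times are rounded up to the minute first (the question's dt.ceil step).
df["Time"] = df["Time"].dt.ceil("1min")

idx_s = df.index.to_series()
time_bfill = df["Time"].bfill()
at_or_after_event = idx_s.ge(time_bfill)

# Fill Time and ID backwards, but only on rows at or after the rounded event time.
time2 = df["Time"].mask(at_or_after_event, time_bfill)
id2 = df["ID"].mask(at_or_after_event, df["ID"].bfill())

# Blank out Energy on filled rows that are not the event's first minute, then interpolate.
to_interpolate = df["Time"].isna() & time2.notna() & idx_s.ne(time2)
energy3 = df["Energy"].mask(to_interpolate, np.nan).interpolate(
    limit_direction="both", limit_area="inside"
)

# Where Power was 0, derive it from the per-minute Energy and add Energy to Total Energy.
power2 = df["Power"].mask(df["Power"] == 0, 60 * energy3)
total2 = df["Total Energy"].mask(df["Power"] == 0, df["Total Energy"] + energy3)

result = pd.DataFrame(
    {"Time": time2, "Power": power2, "Total Energy": total2, "ID": id2, "Energy": energy3}
)
print(result)
```

On the example data this gives the Power, Total Energy, ID and Energy values from the desired result in the question, with Time in its rounded form.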

Look at the columns with the highest suffix. I left the others in so you can see the intermediate steps.

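You can adapt the code to avoid some of the intermediate columns, but be careful: in some cases, columns generated later still need the original values.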

A quick way to get rid of the intermediate columns is to run this at the end:

df[['Time', 'Power', 'Total Energy', 'ID', 'Energy']] = df[['Time2', 'Power2', 'Total Energy2', 'ID2', 'Energy3']]
df.drop(['Time2', 'Power2', 'Total Energy2', 'ID2', 'Energy3', 'Energy2'], axis=1, inplace=True)
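
An alternative sketch for the same cleanup is to drop the original and intermediate columns and rename the final ones. The tiny frame and the names `final_to_original` and `intermediate` below are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Tiny stand-in frame: one original column, one intermediate and one final suffixed column.
df = pd.DataFrame({
    "Energy": [0, 0, 200],
    "Energy2": [0.0, None, 200.0],
    "Energy3": [0.0, 100.0, 200.0],
})

# Hypothetical mapping from each final suffixed column to the name it should replace.
final_to_original = {"Energy3": "Energy"}
intermediate = ["Energy2"]

# Drop the old and intermediate columns, then give the final columns their original names.
cleaned = (
    df.drop(columns=list(final_to_original.values()) + intermediate)
      .rename(columns=final_to_original)
)
print(cleaned)
```

Unlike the in-place assignment, this returns a new frame, and the renamed columns keep the position of the suffixed columns, so the column order may differ.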

df

and you get:

[Image: DataFrame after the replacement]

