Data cleaning - ifelse statement

Problem description

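I am trying to clean the following dataframe so that there is only one value per date. The data changes daily depending on the date. So if today is 1 January 2020, the data would show the following: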

            Value   Type
01/01/2010  38.3    Forecast
01/01/2020  31.85   Actual
01/01/2020  6.45    Delta
02/01/2010  31.08   Actual
03/01/2020  29      Forecast
04/01/2020  23.4    Forecast
05/01/2020  24.5    Forecast
06/01/2020  19.4    Forecast
07/01/2020  21.1    Forecast
08/01/2020  22.3    Forecast
09/01/2020  25.6    Forecast

I need to clean this dataframe so that if a date has an "Actual" value, that value is used; otherwise the forecast value is used.

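I have been using the code below to extract the forecasts, but that loses the accuracy of the "Actual" values where they are available, and it would also exclude D+1, because that date has no forecast value.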

sel = ['Forecast'] 
df = df.loc[df['Type'].isin(sel)]

The end result would look like this:


            Value   Type
01/01/2020  31.85   Actual
02/01/2010  31.08   Actual
03/01/2020  29      Forecast
04/01/2020  23.4    Forecast
05/01/2020  24.5    Forecast
06/01/2020  19.4    Forecast
07/01/2020  21.1    Forecast
08/01/2020  22.3    Forecast
09/01/2020  25.6    Forecast

Any help is much appreciated!

Tags: python, pandas, datetime, data-cleaning

Solution


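Let's try pd.Categorical. Note that in the frame used here, the column holding Actual/Forecast/Delta is named 'Forecast' (it is 'Type' in the question).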

df['Forecast'] = pd.Categorical(df['Forecast'],
               ['Actual','Forecast','Delta'],
              ordered=True)

print(df['Forecast']) # looks the same but take note of the order at the bottom.

0     Forecast
1       Actual
2        Delta
3       Actual
4     Forecast
5     Forecast
6     Forecast
7     Forecast
8     Forecast
9     Forecast
10    Forecast
Name: Forecast, dtype: category
Categories (3, object): [Actual < Forecast < Delta]


#assuming your date column is called date.
df.sort_values('Forecast').drop_duplicates(subset=['date'],keep='first')

         date  Value  Forecast
1  2020-01-01  31.85    Actual
3  2010-02-01  31.08    Actual
0  2010-01-01  38.30  Forecast
4  2020-03-01  29.00  Forecast
5  2020-04-01  23.40  Forecast
6  2020-05-01  24.50  Forecast
7  2020-06-01  19.40  Forecast
8  2020-07-01  21.10  Forecast
9  2020-08-01  22.30  Forecast
10 2020-09-01  25.60  Forecast
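
For reference, here is a minimal self-contained sketch of the same idea. It makes some assumptions that are not in the original post: the column names "date", "Value" and "Type", a small made-up sample with every date in 2020, and dates written day-first (dd/mm/yyyy). It also drops the Delta rows up front, so that a date with only a Delta row cannot slip into the result, and it sorts by date so the output stays in chronological order.

```python
import pandas as pd

# Made-up sample modelled on the question; column names are assumptions.
df = pd.DataFrame({
    "date": ["01/01/2020", "01/01/2020", "01/01/2020",
             "02/01/2020", "03/01/2020", "04/01/2020"],
    "Value": [38.3, 31.85, 6.45, 31.08, 29.0, 23.4],
    "Type": ["Forecast", "Actual", "Delta", "Actual", "Forecast", "Forecast"],
})

# Assumption: the dates are day-first (dd/mm/yyyy).
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Keep only the row types that are allowed in the final result.
df = df[df["Type"].isin(["Actual", "Forecast"])].copy()

# An ordered categorical makes "Actual" sort before "Forecast".
df["Type"] = pd.Categorical(df["Type"],
                            categories=["Actual", "Forecast"],
                            ordered=True)

# Sort by date, then by priority, and keep the first row for each date.
result = (
    df.sort_values(["date", "Type"])
      .drop_duplicates(subset="date", keep="first")
      .reset_index(drop=True)
)

print(result)
```

If the real data is month-first instead, only the format string passed to pd.to_datetime needs to change.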
