python - Reindexing a Pandas DataFrame
Problem description
I have a pandas DataFrame. I want to reindex the date column separately for each name.
date value1 value2 name
0 1992-08-27 07:30:00 28.0 NaN A
1 1992-08-27 08:00:00 28.2 27.0 A
2 1992-08-27 09:00:00 28.8 27.5 A
3 1992-08-27 09:30:00 29.0 NaN A
4 1992-08-27 10:30:00 29.6 NaN A
5 1992-08-27 11:00:00 29.8 27.0 A
6 1992-08-27 11:30:00 30.0 27.0 A
7 1992-08-27 08:00:00 29.2 29.0 B
8 1992-08-27 09:30:00 30.0 37.0 B
9 1992-08-27 10:30:00 24.6 37.0 B
10 1992-08-27 11:00:00 24.8 37.0 B
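For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the date column is meant to be datetime64 rather than strings, which the printed table alone does not show.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above.
# Assumption: "date" holds real datetimes, not strings.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "1992-08-27 07:30:00", "1992-08-27 08:00:00",
                "1992-08-27 09:00:00", "1992-08-27 09:30:00",
                "1992-08-27 10:30:00", "1992-08-27 11:00:00",
                "1992-08-27 11:30:00", "1992-08-27 08:00:00",
                "1992-08-27 09:30:00", "1992-08-27 10:30:00",
                "1992-08-27 11:00:00",
            ]
        ),
        "value1": [28.0, 28.2, 28.8, 29.0, 29.6, 29.8, 30.0,
                   29.2, 30.0, 24.6, 24.8],
        "value2": [np.nan, 27.0, 27.5, np.nan, np.nan, 27.0, 27.0,
                   29.0, 37.0, 37.0, 37.0],
        "name": ["A"] * 7 + ["B"] * 4,
    }
)
print(df)
```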
I want to reindex the pandas DataFrame on the date column for each name.
This is what I am doing:
import datetime
import pandas as pd
s_date = datetime.datetime(1992, 8, 27, 7)
e_date = datetime.datetime(1992, 8, 27, 12)
# Half-hourly grid covering the whole period
df_time = pd.date_range(start=s_date, end=e_date,
                        freq='30min').to_frame(index=False, name='date')
df_time.date = pd.to_datetime(df_time.date)
df = pd.merge(df, df_time, on=['date'], how='outer')
The DataFrame I expect looks like this:
date value1 value2 name
1992-08-27 07:00:00 NaN NaN A
1992-08-27 07:30:00 28.0 NaN A
1992-08-27 08:00:00 28.2 27.0 A
1992-08-27 08:30:00 28.2 27.0 A
1992-08-27 09:00:00 28.8 27.5 A
1992-08-27 09:30:00 29.0 NaN A
1992-08-27 10:00:00 29.0 NaN A
1992-08-27 10:30:00 29.6 NaN A
1992-08-27 11:00:00 29.8 27.0 A
1992-08-27 11:30:00 30.0 27.0 A
1992-08-27 12:00:00 30.0 27.0 A
1992-08-27 07:00:00 NaN NaN B
1992-08-27 07:30:00 28.0 NaN B
1992-08-27 08:00:00 29.2 29.0 B
1992-08-27 08:30:00 28.2 27.0 B
1992-08-27 09:00:00 28.8 27.5 B
1992-08-27 09:30:00 30.0 37.0 B
1992-08-27 10:00:00 29.6 37.0 B
1992-08-27 10:30:00 24.6 37.0 B
1992-08-27 11:00:00 24.8 37.0 B
1992-08-27 11:30:00 30.0 27.0 B
1992-08-27 12:00:00 30.0 27.0 B
What exactly am I doing wrong?
Solution
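As for what goes wrong in the attempt above: an outer merge on date alone only adds the timestamps that are missing from the whole frame, and it adds each of them once, with no name attached. It never creates one row per name and timestamp. A minimal sketch with a tiny stand-in frame (the values and the shortened time range are illustrative, not the full data from the question):

```python
import pandas as pd

# Tiny stand-in for the question's frame (illustrative values).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1992-08-27 07:30", "1992-08-27 08:00", "1992-08-27 08:00"]
        ),
        "value1": [28.0, 28.2, 29.2],
        "name": ["A", "A", "B"],
    }
)

# Half-hourly grid, built the same way as in the question (shortened range).
df_time = pd.date_range(
    "1992-08-27 07:00", "1992-08-27 08:30", freq="30min"
).to_frame(index=False, name="date")

merged = pd.merge(df, df_time, on="date", how="outer")
print(merged)

# Timestamps absent from the whole frame show up once, without a name.
extra = merged[merged["name"].isna()]
print(extra)
```

Here B still has a single row after the merge: 07:30 already exists for A, so the merge sees nothing to add for that timestamp.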
You can use the complete function from pyjanitor to expose the missing values:
Create a dictionary containing the full datetime range
new_dates = {"date": lambda df: pd.date_range("1992-08-27 07:00:00",
                                              "1992-08-27 12:00:00",
                                              freq="30min")
            }
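Pass the new_dates variable to complete:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
# If this raises on your pyjanitor version, pass the dictionary directly:
# df.complete(new_dates, by='name')
df.complete([new_dates], by='name')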
name date value1 value2
0 A 1992-08-27 07:00:00 NaN NaN
1 A 1992-08-27 07:30:00 28.0 NaN
2 A 1992-08-27 08:00:00 28.2 27.0
3 A 1992-08-27 08:30:00 NaN NaN
4 A 1992-08-27 09:00:00 28.8 27.5
5 A 1992-08-27 09:30:00 29.0 NaN
6 A 1992-08-27 10:00:00 NaN NaN
7 A 1992-08-27 10:30:00 29.6 NaN
8 A 1992-08-27 11:00:00 29.8 27.0
9 A 1992-08-27 11:30:00 30.0 27.0
10 A 1992-08-27 12:00:00 NaN NaN
11 B 1992-08-27 07:00:00 NaN NaN
12 B 1992-08-27 07:30:00 NaN NaN
13 B 1992-08-27 08:00:00 29.2 29.0
14 B 1992-08-27 08:30:00 NaN NaN
15 B 1992-08-27 09:00:00 NaN NaN
16 B 1992-08-27 09:30:00 30.0 37.0
17 B 1992-08-27 10:00:00 NaN NaN
18 B 1992-08-27 10:30:00 24.6 37.0
19 B 1992-08-27 11:00:00 24.8 37.0
20 B 1992-08-27 11:30:00 NaN NaN
21 B 1992-08-27 12:00:00 NaN NaN
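complete is just an abstraction over pandas functions that makes this kind of task easy (it also helps when the index has duplicates). You can skip it and stick to pandas alone:
Create an index containing the full datetime range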
new_index = pd.date_range("1992-08-27 07:00:00",
                          "1992-08-27 12:00:00",
                          freq="30min")
new_index = new_index.rename("date")
Run groupby, and use apply to reindex each group.
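(df
 .set_index("date")  # assumes df["date"] is already datetime64
 .groupby("name")
 .apply(lambda df: df.reindex(new_index))
 # the group key is already in the index; some pandas versions
 # do not pass it to apply, so ignore it if it is absent
 .drop(columns="name", errors="ignore")
 .reset_index()
)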
name date value1 value2
0 A 1992-08-27 07:00:00 NaN NaN
1 A 1992-08-27 07:30:00 28.0 NaN
2 A 1992-08-27 08:00:00 28.2 27.0
3 A 1992-08-27 08:30:00 NaN NaN
4 A 1992-08-27 09:00:00 28.8 27.5
5 A 1992-08-27 09:30:00 29.0 NaN
6 A 1992-08-27 10:00:00 NaN NaN
7 A 1992-08-27 10:30:00 29.6 NaN
8 A 1992-08-27 11:00:00 29.8 27.0
9 A 1992-08-27 11:30:00 30.0 27.0
10 A 1992-08-27 12:00:00 NaN NaN
11 B 1992-08-27 07:00:00 NaN NaN
12 B 1992-08-27 07:30:00 NaN NaN
13 B 1992-08-27 08:00:00 29.2 29.0
14 B 1992-08-27 08:30:00 NaN NaN
15 B 1992-08-27 09:00:00 NaN NaN
16 B 1992-08-27 09:30:00 30.0 37.0
17 B 1992-08-27 10:00:00 NaN NaN
18 B 1992-08-27 10:30:00 24.6 37.0
19 B 1992-08-27 11:00:00 24.8 37.0
20 B 1992-08-27 11:30:00 NaN NaN
21 B 1992-08-27 12:00:00 NaN NaN
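If you would rather avoid groupby and apply, another pandas-only option is to build every (name, date) pair with MultiIndex.from_product and reindex once. This is a sketch rather than part of the answer above; it uses a small stand-in frame with a shortened time range, and it assumes each (name, date) pair occurs at most once, since reindex needs a unique index.

```python
import pandas as pd

# Small stand-in for the question's frame (first rows of each name).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1992-08-27 07:30", "1992-08-27 08:00",
             "1992-08-27 08:00", "1992-08-27 09:30"]
        ),
        "value1": [28.0, 28.2, 29.2, 30.0],
        "name": ["A", "A", "B", "B"],
    }
)

# Shortened half-hourly grid for the example.
new_index = pd.date_range(
    "1992-08-27 07:00", "1992-08-27 09:30", freq="30min", name="date"
)

# Every (name, date) combination that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [df["name"].unique(), new_index], names=["name", "date"]
)

out = (
    df.set_index(["name", "date"])  # must be unique for reindex to work
    .reindex(full_index)            # missing pairs become NaN rows
    .reset_index()
)
print(out)
```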
Then you can use ffill or fillna, depending on your criteria.
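For the filling step, one thing to watch is that a plain ffill on the whole frame would carry the last value of A into the first rows of B. A minimal sketch of a per-name forward fill, using a hypothetical slice of the reindexed result (the frame name out and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the reindexed result, with gaps as NaN.
out = pd.DataFrame(
    {
        "name": ["A", "A", "A", "B", "B", "B"],
        "date": list(
            pd.date_range("1992-08-27 07:30", periods=3, freq="30min")
        ) * 2,
        "value1": [28.0, 28.2, np.nan, np.nan, 29.2, np.nan],
    }
)

# Forward-fill within each name so A's values never leak into B.
filled = out.copy()
filled["value1"] = out.groupby("name")["value1"].ffill()
print(filled)

# A constant fill is the simpler alternative.
zero_filled = out.fillna({"value1": 0.0})
print(zero_filled)
```

The leading gap for B stays NaN after the forward fill, because there is no earlier value within that group to carry forward.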