How do I reindex month and year columns to insert missing data?

Problem description

Consider the following DataFrame:

import pandas as pd

df = pd.read_csv("data.csv")
print(df)
  Category  Year     Month  Count1  Count2
0        a  2017  December       5       9
1        a  2018   January       3       5
2        b  2017   October       7       6
3        b  2017  November       4       1
4        b  2018     March       3       3

I would like to get this:

   Category  Year     Month  Count1  Count2
0         a  2017   October               
1         a  2017  November              
2         a  2017  December       5       9
3         a  2018   January       3       5
4         a  2018  February              
5         a  2018     March              
6         b  2017   October       7       6
7         b  2017  November       4       1
8         b  2017  December              
9         b  2018   January              
10        b  2018  February              
11        b  2018     March       3       3

So far, I have tried this:

months = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6, "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12}
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1, format="%Y%m%d")
date_min = df["Date"].min()
date_max = df["Date"].max()
new_index = pd.MultiIndex.from_product([df["Category"].unique(), pd.date_range(date_min, date_max, freq="M")], names=["Category", "Date"])
df = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month_name()
df = df[["Category", "Year", "Month", "Count1", "Count2"]]

In the resulting DataFrame, the last month (March) is missing, and all "Count1" and "Count2" values are NaN.
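
Both symptoms are consistent with a frequency mismatch, which the following minimal sketch illustrates. The `Date` column built above holds first-of-month timestamps, whereas the `"M"` alias passed to `pd.date_range` means month end. The sketch uses `pd.offsets.MonthEnd()` in place of the `"M"` string (newer pandas versions spell that alias `"ME"`), and the two boundary dates are hard-coded from the example data rather than read from `data.csv`.

```python
import pandas as pd

date_min = pd.Timestamp("2017-10-01")  # earliest Date in the example data
date_max = pd.Timestamp("2018-03-01")  # latest Date in the example data

# Month-end frequency, which is what the "M" alias stands for
month_ends = pd.date_range(date_min, date_max, freq=pd.offsets.MonthEnd())
# Month-start frequency
month_starts = pd.date_range(date_min, date_max, freq="MS")

print(month_ends.strftime("%Y-%m-%d").tolist())
# ['2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31', '2018-02-28']
print(month_starts.strftime("%Y-%m-%d").tolist())
# ['2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01', '2018-02-01', '2018-03-01']
```

None of the month-end dates equals a first-of-month date, so `reindex` finds no matching rows and fills every count with NaN. March is absent because its month end (2018-03-31) falls after `date_max`.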

Tags: python, pandas

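Solution

Because you need to fill in the categories as well as the missing dates, this is a little more involved. One solution is to build a separate DataFrame for each category and then concatenate them.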

# Build a first-of-month timestamp from the Month and Year columns
df['Date'] = pd.to_datetime('1 '+df.Month.astype(str)+' '+df.Year.astype(str), format='%d %B %Y')

# One row per month start between the earliest and latest Date
df_ix = pd.Series(1, index=df.Date.sort_values()).resample('MS').first().reset_index()

df_list = []
for cat in df.Category.unique():
    # Right-merge onto the full month list so missing months show up as NaN rows
    df_temp = (df.query('Category==@cat')
                 .merge(df_ix, on='Date', how='right')
                 .get(['Date','Category','Count1','Count2'])
                 .sort_values('Date')
        )
    df_temp['Category'] = cat
    df_temp = df_temp.fillna(0)
    # Replace the columns outright so that the integer dtype is kept
    df_temp[['Count1', 'Count2']] = df_temp[['Count1', 'Count2']].astype(int)
    df_list.append(df_temp)

df2 = pd.concat(df_list, ignore_index=True)
df2['Month'] = df2.Date.dt.month_name()
df2['Year'] = df2.Date.dt.year
df2.drop('Date', axis=1)
# returns:
   Category  Count1  Count2     Month  Year
0         a       0       0   October  2017
1         a       0       0  November  2017
2         a       5       9  December  2017
3         a       3       5   January  2018
4         a       0       0  February  2018
5         a       0       0     March  2018
6         b       7       6   October  2017
7         b       4       1  November  2017
8         b       0       0  December  2017
9         b       0       0   January  2018
10        b       0       0  February  2018
11        b       3       3     March  2018
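
As an alternative to the per-category loop, the `reindex` attempt from the question can probably be repaired by generating month starts (`"MS"`) for the new index. The following is a sketch under stated assumptions, not the answer's method: the example data is rebuilt inline instead of being read from `data.csv`, the names `full_index` and `result` are illustrative, and missing months are filled with 0 to match the output above.

```python
import pandas as pd

# Example data from the question, rebuilt inline so the sketch is self-contained
df = pd.DataFrame({
    "Category": ["a", "a", "b", "b", "b"],
    "Year": [2017, 2018, 2017, 2017, 2018],
    "Month": ["December", "January", "October", "November", "March"],
    "Count1": [5, 3, 7, 4, 3],
    "Count2": [9, 5, 6, 1, 3],
})

# First-of-month timestamp for every row
df["Date"] = pd.to_datetime(
    "1 " + df["Month"] + " " + df["Year"].astype(str), format="%d %B %Y"
)

# Every (Category, month start) pair; "MS" lines up with the first-of-month dates
full_index = pd.MultiIndex.from_product(
    [df["Category"].unique(),
     pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")],
    names=["Category", "Date"],
)

result = (
    df.set_index(["Category", "Date"])[["Count1", "Count2"]]
      .reindex(full_index, fill_value=0)  # rows for missing months get 0
      .reset_index()
)
result["Year"] = result["Date"].dt.year
result["Month"] = result["Date"].dt.month_name()
result = result[["Category", "Year", "Month", "Count1", "Count2"]]
print(result)
```

Leaving out `fill_value=0` keeps the missing months as NaN, which is closer to the blank cells shown in the question.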
