ValueError: cannot reindex from a duplicate axis when shifting a column in Pandas

Problem description

Given a dataframe df with a date index, as follows:

              value
2017-03-31      NaN
2017-04-01  27863.7
2017-04-02  27278.5
2017-04-03  27278.5
2017-04-04  27278.5
             ...
2021-10-27      NaN
2021-10-28      NaN
2021-10-29      NaN
2021-10-30      NaN
2021-10-31      NaN

I can shift the value column forward by one year using df['value'].shift(freq=pd.DateOffset(years=1))

Out:

2018-03-31        NaN
2018-04-01    27863.7
2018-04-02    27278.5
2018-04-03    27278.5
2018-04-04    27278.5
               ...   
2022-10-27        NaN
2022-10-28        NaN
2022-10-29        NaN
2022-10-30        NaN
2022-10-31        NaN

But when I use it to replace the original values with df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:

ValueError: cannot reindex from a duplicate axis

Since the code below runs without problems, I think the issue is caused by the NaNs in the value column:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df

df.B = df.B.shift(freq=pd.DateOffset(years=1))

I also tried df['value'].shift(freq=relativedelta(years=+1)), but it raises: pandas.errors.NullFrequencyError: Cannot shift with no freq

Could someone help me solve this issue? Sincere thanks.

Tags: python-3.x, pandas, numpy, date

Answer


"Since the code below runs without problems, I think the issue is caused by the NaNs in the value column"

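No, I don't think so. It is more likely because your second sample (2013-01-01 through 2014-12-21) does not contain a leap day (February 29), while your real data does.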
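To see why a leap day matters, here is a minimal sketch (the variable names are only illustrative): adding pd.DateOffset(years=1) to February 29 cannot land on a February 29 in a non-leap year, so it falls back to February 28, the same label that the shifted February 28 already occupies.

```python
import pandas as pd

one_year = pd.DateOffset(years=1)

# Both dates map to the same label in the following (non-leap) year.
feb28 = pd.Timestamp("2020-02-28") + one_year
feb29 = pd.Timestamp("2020-02-29") + one_year

print(feb28)  # 2021-02-28 00:00:00
print(feb29)  # 2021-02-28 00:00:00
```

Any date range that contains a February 29 will therefore produce one duplicated index label per leap day after a one-year shift.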

Reproducible error with a date range that includes a leap day (February 29):

# 2018 (365 days), 2019 (365 days) and 2020 (366 days, includes 2020-02-29)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))

df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
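The error itself is raised when pandas aligns the shifted Series with the frame's index during assignment. The following is a small sketch on made-up data (frame and dup are hypothetical names, not from the question) that triggers the same kind of failure; the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# A frame with a unique date index.
frame = pd.DataFrame(
    {"B": [1, 2, 3]},
    index=pd.to_datetime(["2021-02-27", "2021-02-28", "2021-03-01"]),
)

# A Series whose index contains the same label twice.
dup = pd.Series(
    [41, 96],
    index=pd.to_datetime(["2021-02-28", "2021-02-28"]),
    name="B",
)

try:
    # Assignment reindexes `dup` to frame.index, which fails on duplicates.
    frame["B"] = dup
    raised = False
except ValueError as exc:
    raised = True
    print(type(exc).__name__, exc)
```

Calling shift(...) alone does not fail because it only builds a new Series; the alignment step happens only when the result is written back into the frame.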

The following example works:

# 2017, 2018 and 2019 (365 days each) plus 2020-01-01: no February 29 in the range
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
                  index=dates, columns=list('ABC'))

df.B = df.B.shift(freq=pd.DateOffset(years=1))

Take a look at value_counts:

# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28    2  # The duplicated index
2020-12-29    1
2021-01-04    1
2021-01-03    1
2021-01-02    1
             ..
2020-01-07    1
2020-01-08    1
2020-01-09    1
2020-01-10    1
2021-12-31    1
Length: 1095, dtype: int64


# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01    1
2019-12-30    1
2020-01-05    1
2020-01-04    1
2020-01-03    1
             ..
2019-01-07    1
2019-01-08    1
2019-01-09    1
2019-01-10    1
2021-01-01    1
Length: 1096, dtype: int64
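As an alternative to scanning the value_counts output, the duplicated labels can be listed directly with Index.duplicated. This is a sketch on the 2018 -> 2020 range, using a simple counter Series (s) instead of the random data above so that it is deterministic.

```python
import numpy as np
import pandas as pd

# Same date range as the failing example: 2018-01-01 through 2020-12-31.
dates = pd.date_range("20180101", periods=365 * 3 + 1)
s = pd.Series(np.arange(len(dates)), index=dates, name="B")

shifted = s.shift(freq=pd.DateOffset(years=1))

# keep=False marks every occurrence of a repeated label, not just the later ones.
dup_labels = shifted.index[shifted.index.duplicated(keep=False)].unique()

print(shifted.index.is_unique)  # False
print(dup_labels)               # only 2021-02-28
```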

Solution

The solution is to remove the duplicated index label, "2021-02-28" in our case, by using resample('D') with an aggregate function such as first, last, min, max, mean, sum, or a custom one:

>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28    41
2021-02-28    96
Name: B, dtype: int64

>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
        .resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T

       2021-02-28
first        41.0
last         96.0
min          41.0
max          96.0
mean         68.5
sum         137.0

# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()

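Note that you can replace .resample(...).func with .loc[lambda x: ~x.index.duplicated()], which keeps the first row for each duplicated label (pass keep='last' to duplicated() to keep the last one instead).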
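A minimal end-to-end sketch of the index-based variant, assuming the same 2018 -> 2020 sample frame as above; keep="last" is chosen here only to mirror the resample('D').last() example, and shifted and deduped are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(2021)
dates = pd.date_range("20180101", periods=365 * 3 + 1)
df = pd.DataFrame(
    np.random.randint(0, 100, size=(len(dates), 3)),
    index=dates,
    columns=list("ABC"),
)

shifted = df.B.shift(freq=pd.DateOffset(years=1))

# Drop repeated labels, keeping the last row for each one.
deduped = shifted.loc[lambda x: ~x.index.duplicated(keep="last")]

# The index is unique now, so the assignment can align without error.
df.B = deduped
print(df.B.loc["2021-02-27":"2021-03-01"])
```

Unlike resample('D'), this does not insert rows for calendar days that are missing from the shifted index. After the assignment, the dates in 2018 and 2020-02-29 hold NaN in column B because no shifted row carries those labels.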

