首页 > 解决方案 > 从重复轴重新索引

问题描述

我有以下代码:

import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime

TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()

start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

最后一行data = data.reindex(full_dates)给了我以下错误:

ValueError: cannot reindex from a duplicate axis

我想要做的是填充缺失的日期并重新索引该列。


正如@jezrael 所提到的“问题是DatetimeIndex 中的重复值,所以这里不能使用reindex”

我之前使用过相同的代码,效果很好。好奇为什么它在这种情况下不起作用

import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
import numpy as np

stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')

full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

除了提供者之外,代码是相同的,但是这个有效,而上面的无效?

标签: pythonpandasvalueerror

解决方案


所以问题是重复的值DatetimeIndex,所以reindex不能在这里使用。

可能的解决方案是通过所有值DataFrame.join与 helper 一起使用:DataFrame

data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
df = pd.DataFrame({'date':full_dates}).join(data, on='date')
print (df)
           date  bidopen  bidclose  bidhigh  bidlow  askopen  askclose  \
0    2015-01-01      NaN       NaN      NaN     NaN      NaN       NaN   
1    2015-01-02   2.9350     2.947   3.0910   2.860   2.9450     2.957   
2    2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN   
3    2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN   
4    2015-01-05   2.9470     2.912   3.1710   2.871   2.9570     2.922   
        ...      ...       ...      ...     ...      ...       ...   
1797 2019-12-03   2.3890     2.441   2.5115   2.371   2.3970     2.449   
1798 2019-12-04   2.3455     2.392   2.3970   2.341   2.3535     2.400   
1798 2019-12-04   2.4410     2.406   2.4645   2.370   2.4490     2.414   
1799 2019-12-05   2.4060     2.421   2.4650   2.399   2.4140     2.429   
1800 2019-12-06      NaN       NaN      NaN     NaN      NaN       NaN   

      askhigh  asklow  tickqty  
0         NaN     NaN      NaN  
1       3.101  2.8700  12688.0  
2         NaN     NaN      NaN  
3         NaN     NaN      NaN  
4       3.181  2.8810  21849.0  
      ...     ...      ...  
1797    2.519  2.3785  36679.0  
1798    2.406  2.3505   5333.0  
1798    2.473  2.3780  74881.0  
1799    2.473  2.4070  29238.0  
1800      NaN     NaN      NaN  

[1802 rows x 10 columns]

但我认为下一个处理应该是有问题的(因为重复的索引),所以DataFrame.resample在字典中使用聚合函数的天数:

df = data.resample('D').agg({'bidopen': 'first', 
                             'bidclose': 'last',
                             'bidhigh': 'max', 
                             'bidlow': 'min', 
                             'askopen': 'first', 
                             'askclose': 'last',
                             'askhigh': 'max', 
                             'asklow': 'min', 
                             'tickqty':'sum'})

print (df)
            bidopen  bidclose  bidhigh  bidlow  askopen  askclose  askhigh  \
date                                                                         
2015-01-02   2.9350    2.9470   3.0910   2.860   2.9450    2.9570    3.101   
2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2015-01-05   2.9470    2.9120   3.1710   2.871   2.9570    2.9220    3.181   
2015-01-06   2.9120    2.9400   2.9510   2.807   2.9220    2.9500    2.961   
            ...       ...      ...     ...      ...       ...      ...   
2019-12-01      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
2019-12-02   2.3505    2.3455   2.3670   2.292   2.3590    2.3535    2.375   
2019-12-03   2.3890    2.4410   2.5115   2.371   2.3970    2.4490    2.519   
2019-12-04   2.3455    2.4060   2.4645   2.341   2.3535    2.4140    2.473   
2019-12-05   2.4060    2.4210   2.4650   2.399   2.4140    2.4290    2.473   

            asklow  tickqty  
date                         
2015-01-02  2.8700    12688  
2015-01-03     NaN        0  
2015-01-04     NaN        0  
2015-01-05  2.8810    21849  
2015-01-06  2.8170    17955  
           ...      ...  
2019-12-01     NaN        0  
2019-12-02  2.3000    31173  
2019-12-03  2.3785    36679  
2019-12-04  2.3505    80214  
2019-12-05  2.4070    29238  

[1799 rows x 9 columns]

推荐阅读