python - Reindexing from a duplicate axis
Question
I have the following code:
import datetime

import fxcmpy  # missing from the original snippet; needed for fxcmpy.fxcmpy below
import pandas as pd
TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()
start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)
The last line, data = data.reindex(full_dates), gives me the following error:
ValueError: cannot reindex from a duplicate axis
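What I want to do is fill in the missing dates and reindex the data.
As @jezrael pointed out, "the problem is duplicate values in the DatetimeIndex, so reindex cannot be used here."
I have used the same code before and it worked fine. I am curious why it does not work in this case: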
import datetime

import pandas as pd
from pandas_datareader import data as web
stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)
Apart from the data provider, the code is the same, yet this version works and the one above does not. Why?
Solution
The problem is that the DatetimeIndex contains duplicate values, so reindex cannot be used here.
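To see why the index ends up with duplicates, here is a minimal sketch on made-up data (the timestamps and prices are illustrative and do not come from the FXCM feed). It assumes that two candles fall on the same calendar day once the time of day is dropped by normalize(), which is one plausible way to arrive at the error; the exact wording of the ValueError differs between pandas versions.

```python
import pandas as pd

# Made-up candles: the last two land on the same calendar day after normalize().
idx = pd.to_datetime(["2019-12-03 22:00", "2019-12-04 03:00", "2019-12-04 22:00"])
data = pd.DataFrame({"bidclose": [2.441, 2.392, 2.406]}, index=idx)
data = data.set_index(data.index.normalize())

# Diagnose: is the index unique, and which rows share a label?
print(data.index.is_unique)
dupes = data[data.index.duplicated(keep=False)]
print(dupes)

# reindex refuses to work on an index with duplicate labels.
full_dates = pd.date_range("2019-12-01", "2019-12-06")
try:
    data.reindex(full_dates)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Running the same is_unique and duplicated checks on the real data shows which dates are affected before a fix is chosen.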
A possible solution is to use DataFrame.join with a helper DataFrame that contains all the dates:
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
df = pd.DataFrame({'date':full_dates}).join(data, on='date')
print (df)
date bidopen bidclose bidhigh bidlow askopen askclose \
0 2015-01-01 NaN NaN NaN NaN NaN NaN
1 2015-01-02 2.9350 2.947 3.0910 2.860 2.9450 2.957
2 2015-01-03 NaN NaN NaN NaN NaN NaN
3 2015-01-04 NaN NaN NaN NaN NaN NaN
4 2015-01-05 2.9470 2.912 3.1710 2.871 2.9570 2.922
... ... ... ... ... ... ...
1797 2019-12-03 2.3890 2.441 2.5115 2.371 2.3970 2.449
1798 2019-12-04 2.3455 2.392 2.3970 2.341 2.3535 2.400
1798 2019-12-04 2.4410 2.406 2.4645 2.370 2.4490 2.414
1799 2019-12-05 2.4060 2.421 2.4650 2.399 2.4140 2.429
1800 2019-12-06 NaN NaN NaN NaN NaN NaN
askhigh asklow tickqty
0 NaN NaN NaN
1 3.101 2.8700 12688.0
2 NaN NaN NaN
3 NaN NaN NaN
4 3.181 2.8810 21849.0
... ... ...
1797 2.519 2.3785 36679.0
1798 2.406 2.3505 5333.0
1798 2.473 2.3780 74881.0
1799 2.473 2.4070 29238.0
1800 NaN NaN NaN
[1802 rows x 10 columns]
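The repeated row label 1798 in the output above shows that the join keeps both rows for 2019-12-04 instead of raising. The same behaviour can be sketched on a small made-up frame (the values are illustrative, and the name joined is only used for this example):

```python
import pandas as pd

# Made-up data with a duplicated date in the index.
idx = pd.to_datetime(["2019-12-03", "2019-12-04", "2019-12-04", "2019-12-05"])
data = pd.DataFrame({"bidclose": [2.441, 2.392, 2.406, 2.421]}, index=idx)

# Helper DataFrame with every calendar day, left-joined against data's index.
full_dates = pd.date_range("2019-12-02", "2019-12-06")
joined = pd.DataFrame({"date": full_dates}).join(data, on="date")
print(joined)
```

Missing days become NaN rows, while the duplicated day appears twice, so the result has one more row than there are calendar days.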
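However, I think any further processing would be problematic because of the duplicated index values, so it is better to use DataFrame.resample by day, with the aggregation functions passed in a dictionary: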
df = data.resample('D').agg({'bidopen': 'first',
                             'bidclose': 'last',
                             'bidhigh': 'max',
                             'bidlow': 'min',
                             'askopen': 'first',
                             'askclose': 'last',
                             'askhigh': 'max',
                             'asklow': 'min',
                             'tickqty': 'sum'})
print (df)
bidopen bidclose bidhigh bidlow askopen askclose askhigh \
date
2015-01-02 2.9350 2.9470 3.0910 2.860 2.9450 2.9570 3.101
2015-01-03 NaN NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN NaN
2015-01-05 2.9470 2.9120 3.1710 2.871 2.9570 2.9220 3.181
2015-01-06 2.9120 2.9400 2.9510 2.807 2.9220 2.9500 2.961
... ... ... ... ... ... ...
2019-12-01 NaN NaN NaN NaN NaN NaN NaN
2019-12-02 2.3505 2.3455 2.3670 2.292 2.3590 2.3535 2.375
2019-12-03 2.3890 2.4410 2.5115 2.371 2.3970 2.4490 2.519
2019-12-04 2.3455 2.4060 2.4645 2.341 2.3535 2.4140 2.473
2019-12-05 2.4060 2.4210 2.4650 2.399 2.4140 2.4290 2.473
asklow tickqty
date
2015-01-02 2.8700 12688
2015-01-03 NaN 0
2015-01-04 NaN 0
2015-01-05 2.8810 21849
2015-01-06 2.8170 17955
... ...
2019-12-01 NaN 0
2019-12-02 2.3000 31173
2019-12-03 2.3785 36679
2019-12-04 2.3505 80214
2019-12-05 2.4070 29238
[1799 rows x 9 columns]
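As a small self-contained sketch of the resample approach, the following uses made-up candles with a reduced set of columns (the names daily and extended are only for this example). It also illustrates a point that is an observation rather than part of the answer: resample spans only the first to the last observed date, so a reindex afterwards is one way to extend the result to the full start to end range, and it is safe at that stage because the daily index is unique.

```python
import pandas as pd

# Made-up candles: 2019-12-04 appears twice, 2019-12-05 is missing.
idx = pd.to_datetime(["2019-12-03", "2019-12-04", "2019-12-04", "2019-12-06"])
data = pd.DataFrame(
    {
        "bidopen": [2.3890, 2.3455, 2.4410, 2.4060],
        "bidhigh": [2.5115, 2.3970, 2.4645, 2.4650],
        "tickqty": [36679, 5333, 74881, 29238],
    },
    index=idx,
)

# One row per day: duplicates are merged, missing days become NaN (or 0 for sum).
daily = data.resample("D").agg(
    {"bidopen": "first", "bidhigh": "max", "tickqty": "sum"}
)
print(daily)

# The daily index is unique, so reindex now works for a wider date range.
extended = daily.reindex(pd.date_range("2019-12-01", "2019-12-07"))
print(extended)
```

Whether 'first', 'last', 'max', 'min' and 'sum' are the right way to merge two candles of the same day depends on what the duplicated rows actually represent in the feed.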