python - How to speed up downloading data from the Quandl/SHARADAR API
Problem description
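I have built a small download manager to fetch data for the SHARADAR tables in Quandl.
This works well, but downloads of larger files (up to 2 GB for 10 years of data) are very slow.
I tried using asyncio, but that did not speed up the downloads. This may be because Quandl does not allow concurrent downloads. Have I made a mistake in my code, or do I have to accept this limitation from Quandl?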
import asyncio
import math
import time
import pandas as pd
import quandl
import update

def segment_dates(table, date_start, date_end):
    # Determine the number of days per asyncio loop. This is the max size of the
    # range of data divided by the size of the files in 100 MB chunks.
    # Reduce this number for smaller, more frequent downloads.
    total_days = 40
    # Number of days per download should be:
    sizer = math.ceil(total_days / update.sharadar_tables[table][2])
    # Number of days between start and end.
    date_diff = date_end - date_start
    loop_count = int(math.ceil(date_diff.days / sizer))
    sd = date_start
    sync_li = []
    for _ in range(loop_count):
        ed = sd + pd.Timedelta(days=sizer)
        if ed > date_end:
            ed = date_end
        sync_li.append((sd, ed))
        sd = ed + pd.Timedelta(days=1)
    return sync_li

async def get_data(table, kwarg):
    """
    Using the table name and kwargs retrieves the most current data.
    :param table: Name of table to update.
    :param kwarg: Dictionary containing the parameters to send to Quandl.
    :return dataframe: Pandas dataframe containing latest data for the table.
    """
    return quandl.get_table("SHARADAR/" + table.upper(), paginate=True, **kwarg)

async def main():
    table = "SF1"
    # Name of the column that has the date field for this particular table.
    date_col = update.sharadar_tables[table][0]
    date_start = pd.to_datetime("2020-03-15")
    date_end = pd.to_datetime("2020-04-01")
    apikey = "API Key"
    quandl.ApiConfig.api_key = apikey
    # Get a list of (start, end) date pairs, one per download.
    times = segment_dates(table, date_start, date_end)
    wait_li = []
    for t in times:
        kwarg = {date_col: {"gte": t[0].strftime("%Y-%m-%d"), "lte": t[1].strftime("%Y-%m-%d")}}
        # Schedule on the running loop instead of relying on a global `loop`.
        wait_li.append(asyncio.create_task(get_data(table, kwarg)))
    await asyncio.wait(wait_li)
    return wait_li

if __name__ == "__main__":
    starter = time.time()
    try:
        loop = asyncio.get_event_loop()
        res = loop.run_until_complete(main())
        for r in res:
            df = r.result()
            print(df.shape)
            print(df.head())
    except Exception as exc:
        # Chain the original exception so the real cause is not lost.
        raise ValueError("error") from exc
    finally:
        # loop.close()
        print("Finished in {}".format(time.time() - starter))
Solution
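One thing worth checking before blaming Quandl: a coroutine only yields to the event loop at an `await`. The `get_data` coroutine in the question never awaits anything, and `quandl.get_table` appears to be an ordinary synchronous call, so the tasks would run one after another no matter how they are scheduled. The sketch below illustrates the difference with a hypothetical stand-in, `fake_download`, which just sleeps; it is not the Quandl API. The other names (`blocking_coroutine`, `threaded_coroutine`, `timed`, `compare`) are likewise made up for illustration.

```python
import asyncio
import time


def fake_download(chunk_id, delay=0.2):
    """Hypothetical stand-in for a blocking call such as quandl.get_table."""
    time.sleep(delay)
    return chunk_id


async def blocking_coroutine(chunk_id):
    # Declared async but never awaits: the event loop is blocked for the
    # whole call, so tasks built from this run strictly one after another.
    return fake_download(chunk_id)


async def threaded_coroutine(chunk_id):
    # Hand the blocking call to the default thread pool and await the result,
    # which lets the other tasks make progress in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, fake_download, chunk_id)


async def timed(worker, n=4):
    # Run n workers "concurrently" and measure the wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(*(worker(i) for i in range(n)))
    return results, time.perf_counter() - start


def compare(n=4):
    serial_results, serial_time = asyncio.run(timed(blocking_coroutine, n))
    threaded_results, threaded_time = asyncio.run(timed(threaded_coroutine, n))
    return serial_results, serial_time, threaded_results, threaded_time


if __name__ == "__main__":
    _, serial_time, _, threaded_time = compare()
    print(f"blocking coroutines: {serial_time:.2f}s")
    print(f"executor threads:    {threaded_time:.2f}s")
```

Applied to the question's code, the same idea would mean wrapping the `quandl.get_table` call in `run_in_executor` (or `asyncio.to_thread` on Python 3.9+) inside `get_data`. Whether that actually shortens the download still depends on how Quandl treats parallel requests for one API key, which this sketch cannot show.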