How to speed up downloading data from the Quandl/SHARADAR API

Problem description

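I have built a small download manager to fetch data for the SHARADAR tables in Quandl.

This works well, but downloads are very slow for the larger files (up to 2 GB for 10 years of data).

I tried using asyncio, but this did not speed up the downloads. This may be because Quandl does not allow concurrent downloads. Am I making a mistake in my code, or do I have to accept this limitation from Quandl?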

import asyncio
import math
import time

import pandas as pd
import quandl

# Local project module (not shown here) that defines sharadar_tables.
import update

def segment_dates(table, date_start, date_end):

    # Determine the number of days per asyncio loop. Determined by the max size of the
    # range of data divided by the size of the files in 100 mb chunks.
    # reduce this number for smaller more frequent downloads.
    total_days = 40

    # Number of days per download should be:
    sizer = math.ceil(total_days / update.sharadar_tables[table][2])

    # Number of days between start and end.
    date_diff = date_end - date_start

    loop_count = int(math.ceil(date_diff.days / sizer))
    sd = date_start
    sync_li = []
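    for _ in range(loop_count):
        # Each segment covers sizer + 1 days, so loop_count can overshoot;
        # stop before appending a range that starts after date_end.
        if sd > date_end:
            break
        ed = sd + pd.Timedelta(days=sizer)
        if ed > date_end:
            ed = date_end
        sync_li.append((sd, ed))
        sd = ed + pd.Timedelta(days=1)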

    return sync_li


async def get_data(table, kwarg):
    """
    Using the table name and kwargs retrieves the most current data.
    :param table: Name of table to update.
    :param kwarg: Dictionary containing the parameters to send to Quandl.
    :return dataframe: Pandas dataframe containing latest data for the table.
    """
    return quandl.get_table("SHARADAR/" + table.upper(), paginate=True, **kwarg)


async def main():

    table = "SF1"

    # Name of the column that has the date field for this particular table.
    date_col = update.sharadar_tables[table][0]

    date_start = pd.to_datetime("2020-03-15")
    date_end = pd.to_datetime("2020-04-01")

    apikey = "API Key"
    quandl.ApiConfig.api_key = apikey

    # Get a list containing the times start and end for loops.
    times = segment_dates(table, date_start, date_end)

    wait_li = []
    for t in times:
        kwarg = {date_col: {"gte": t[0].strftime("%Y-%m-%d"), "lte": t[1].strftime("%Y-%m-%d")}}
        # Schedule on the running loop instead of relying on a module-level `loop`.
        wait_li.append(asyncio.create_task(get_data(table, kwarg)))

    await asyncio.wait(wait_li)
    return wait_li

if __name__ == "__main__":
    starter = time.time()
    loop = asyncio.new_event_loop()
    try:
        res = loop.run_until_complete(main())
        for r in res:
            df = r.result()
            print(df.shape)
            print(df.head())
    finally:
        # Always release the loop; any exception propagates with its
        # original traceback instead of being replaced by ValueError("error").
        loop.close()

    print("Finished in {}".format(time.time() - starter))

Tags: python, concurrency, download, python-asyncio, quandl

Solution
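
No answer text survives on this page, so what follows is only a sketch of one likely cause, not a confirmed fix. quandl.get_table is an ordinary blocking function. Declaring get_data as async def without awaiting anything inside it means each task runs to completion before the next one starts, so the requests never overlap. One common workaround is to hand each blocking call to a worker thread. In the example below, blocking_fetch is a hypothetical stand-in for the Quandl call (it only sleeps), and fetch_all and run_demo are illustrative names; asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_fetch(segment, delay=0.2):
    """Hypothetical stand-in for a blocking call such as quandl.get_table."""
    time.sleep(delay)  # simulates waiting on the network
    return segment


async def fetch_all(segments, delay=0.2):
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the waits overlap instead of running back to back.
    jobs = [asyncio.to_thread(blocking_fetch, s, delay) for s in segments]
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*jobs)


def run_demo(segments, delay=0.2):
    """Fetch all segments and report the wall-clock time taken."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(segments, delay))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_demo(["a", "b", "c", "d"])
    print(results, "fetched in {:.2f} s".format(elapsed))
```

Applied to the code in the question, the body of get_data would become something like: return await asyncio.to_thread(quandl.get_table, "SHARADAR/" + table.upper(), paginate=True, **kwarg). Whether this shortens the real downloads depends on whether Quandl throttles concurrent requests for the account in use, which is not established here. If it does, the limit applies regardless of how the client is written.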
