Python library for reproducible remote data access with file caching

Problem description

In my data analysis I often use an xlsx or csv file from a remote location (a URL). I want my code to be reproducible and understandable, so ideally the file would be downloaded from within my Python code, keeping the URL in the script. However, the script would then download the file on every run, which takes too long. So my question is: is there a Python library that automatically downloads and caches files, so that I can use URLs in my code like this:

from remotecaching import r_url

f = open(r_url("https://domain.tld/resource.csv"))

In this example, r_url downloads the file (if it is not already in the local cache) and returns the path to the cached file.
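
To make the desired behaviour concrete, here is a minimal standard-library sketch of what such a function might look like. The name r_url comes from the example above, while the cache_dir parameter and its default location are assumptions made only for illustration. It handles no HTTP errors, cache expiry, or concurrent access.

```python
import hashlib
import urllib.request
from pathlib import Path


def r_url(url, cache_dir=Path.home() / ".cache" / "remotecaching"):
    """Return a local path for `url`, downloading it only on a cache miss.

    `cache_dir` and its default value are hypothetical, chosen for illustration.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # The SHA-256 of the URL gives a stable, filesystem-safe file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    if not target.exists():
        # Download under a temporary name first, so an interrupted transfer
        # never leaves a partial file under the final cache name.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        partial.replace(target)

    return target
```

Because the cache key depends only on the URL, a later change to the remote file is not picked up until the cached copy is deleted.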

Snakemake has a similar system (https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html), which however cannot be used outside of the Snakemake ecosystem.

Tags: python, file, caching

Solution


I wrote a simple wrapper that does what I was looking for. It stores the downloaded files in the XDG cache directory:

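import hashlib
import os
import subprocess
import tempfile
import zipfile
from pathlib import Path
from urllib.parse import urlparse

import requests
# save_cache_path is assumed to come from the pyxdg package (pip install pyxdg)
from xdg.BaseDirectory import save_cache_path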

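# Per-application directory inside the XDG cache directory (usually ~/.cache)
DATA_DIR = Path(save_cache_path('yourdatadirname'))
DATA_DIR.mkdir(parents=True, exist_ok=True)


def hash(s):
    '''
    Returns the SHA-256 hex digest of a str or bytes value; used as the cache file name.
    '''
    if isinstance(s, str):
        s = s.encode('utf-8')
    return hashlib.sha256(s).hexdigest()
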

def delete_cached_file(url):
    '''
    Helper function to remove the cached copy of a URL, e.g. after a corrupted or incomplete download.
    '''
    filename = DATA_DIR / hash(url)

    os.remove(filename)


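def ssh_file(url):
    '''
    Copies a remote file via scp (if not cached) and provides the file path;
    url is an scp source such as user@host:/path/to/file
    '''
    filename = DATA_DIR / hash(url)

    if not filename.exists():
        # check=True raises CalledProcessError if scp fails
        subprocess.run(['scp', url, str(filename)], check=True)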

    return filename


def http_file(url, zip_extract_name=None):
    '''
    Automatically downloads a URL (if not cached) and provides the file path.
    If zip_extract_name is given, the download is treated as a ZIP archive and
    only the member with that name is extracted and cached.
    :zip_extract_name: name of the file to extract from the ZIP archive
    '''
    filename = DATA_DIR / hash(url)

    if not filename.exists():
        r = requests.get(url, allow_redirects=True)
        # Fail on HTTP errors instead of caching an error page
        r.raise_for_status()

        if zip_extract_name:
            # Spool the archive to a temporary file that is removed on close
            with tempfile.TemporaryFile() as zf:
                zf.write(r.content)
                with zipfile.ZipFile(zf, 'r') as zipped_file:
                    data = zipped_file.read(zip_extract_name)
        else:
            data = r.content
        with open(filename, 'wb') as f:
            f.write(data)

    return filename
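
The DATA_DIR above relies on save_cache_path, which appears to be the function of that name from the third-party pyxdg package. If that dependency is not wanted, the same location can be resolved with the standard library alone. The following is a rough sketch, not part of the original answer. The helper name xdg_cache_dir is hypothetical, and the fallback to ~/.cache follows the XDG Base Directory specification.

```python
import os
from pathlib import Path


def xdg_cache_dir(app_name):
    """Return (and create) the per-application cache directory.

    $XDG_CACHE_HOME is used when it is set and non-empty; otherwise the
    default ~/.cache applies, as described in the XDG Base Directory spec.
    """
    base = os.environ.get("XDG_CACHE_HOME") or os.path.join(Path.home(), ".cache")
    path = Path(base) / app_name
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With this helper, the directory setup would read DATA_DIR = xdg_cache_dir('yourdatadirname'), and the rest of the wrapper stays unchanged.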
