python - Python library for reproducible remote data access with file caching
问题描述
In my data analysis I often use an xlsx or csv file from a remote location (a URL). I want that my code is reproducible and understandable so the best would be to download the file in my Python code such that the URL is contained in my script, however running my script it would download the file each time which takes too long. So my question is: Is there a Python library that automatically downloads and caches files, so I can use URLs in my code like so
from remotecaching import r_url
f = open(r_url("https://domain.tld/resource.csv"))
In this example r_url downloads the file (if it's not in the local cache) and returns the file path to the cached file.
Snakemake has a similar system (https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html) which is however unusable outside of the snakemake ecosystem.
解决方案
I wrote a simple wrapper which does what I was looking for. It uses the XDG Cache directory to store the downloaded files
import hashlib
import os
from urllib.parse import urlparse
import zipfile
import tempfile
import subprocess
from pathlib import Path
import pandas as pd
import requests
DATA_DIR = Path(save_cache_path('yourdatadirname'))
if not DATA_DIR.exists():
os.mkdir(DATA_DIR)
def hash(s):
try:
return hashlib.sha256(s).hexdigest()
except TypeError:
return hashlib.sha256(s.encode('utf-8')).hexdigest()
def delete_cached_file(url):
'''
Helper function to delete an erroneous file or something like that..
'''
filename = DATA_DIR / hash(url)
os.remove(filename)
def ssh_file(url):
filename = DATA_DIR / hash(url)
if not filename.exists():
subprocess.run(['scp', url, filename])
return filename
def http_file(url, zip_extract_name=None):
'''
Automatically downloads a URL (if not cached) and provides the file path;
Also tries to automatically unzip files (only works with ZIP files containing a single file with the correct naming..)
:zip_extract_name: extract the specified filename from a ZIP file
'''
filename = DATA_DIR / hash(url)
if not filename.exists():
r = requests.get(url, allow_redirects=True)
url_filename = os.path.basename(urlparse(url).path)
inner_url_filename, ext = os.path.splitext(filename)
if zip_extract_name:
# if ext == '.zip':
zf = tempfile.NamedTemporaryFile(delete=False)
zf.write(r.content)
zf.close()
with zipfile.ZipFile(zf.name, 'r') as zipped_file:
data = zipped_file.read(zip_extract_name)
else:
data = r.content
with open(filename, 'wb') as f:
f.write(data)
return filename
推荐阅读
- azure - 如何使用 Azure CLI 创建 Azure 搜索索引?
- php - 隐藏运输方式价格并仅显示运输方式标题
- java - 修改 CONNECT 请求的 User Agent 头
- amazon-web-services - Traefik + 让我们在 AWS Lightsail 上加密
- recaptcha - 在哪里可以找到谷歌验证码配置
- javascript - 如何在 React 中设置后备值?
- java - 尝试在内存映射中训练模型时出现 JVM 错误。DL4J
- automation - 如何阻止木偶点击事件发生两次
- c# - 是否也可以在 Nunit Allure Steps 中显示参数,就像在 Java 中一样?
- android - 绑定和_绑定有什么区别?