pyspark - How to read a Kaggle zip file dataset in Databricks
Problem description
I want to read a zip file dataset from Kaggle, but the download does not give me a readable archive:
import urllib.request
urllib.request.urlretrieve("https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants/downloads/zomato-bangalore-restaurants.zip", "/tmp/zomato-bangalore-restaurants.zip")
Then I run a shell script to extract the file:
%sh
unzip /tmp/zomato-bangalore-restaurants.zip
tail -n +2 zomato-bangalore-restaurants.csv > temp.csv
rm zomato-bangalore-restaurants.csv
Then I get an error:
Archive: /tmp/zomato-bangalore-restaurants.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of /tmp/zomato-bangalore-restaurants.zip or
/tmp/zomato-bangalore-restaurants.zip.zip, and cannot find /tmp/zomato-bangalore-restaurants.zip.ZIP, period.
tail: cannot open 'zomato-bangalore-restaurants.csv' for reading: No such file or directory
rm: cannot remove 'zomato-bangalore-restaurants.csv': No such file or directory
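Solution
Note: the attempt to download the file from Kaggle is blocked because you are not logged in, so the file saved to /tmp is not a valid zip archive.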
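One way to confirm this before calling unzip is to inspect what was actually saved. The following is a minimal sketch, not part of the original answer: the helper name `describe_download` is hypothetical, and the idea that the blocked request returns an HTML page (such as a login page) is an assumption rather than something the error output proves.

```python
import zipfile


def describe_download(path):
    """Return 'zip', 'html', or 'unknown' for a downloaded file (hypothetical helper)."""
    # is_zipfile looks for the zip end-of-central-directory record,
    # the same structure that unzip reported as missing.
    if zipfile.is_zipfile(path):
        return "zip"
    # Assumption: a blocked download may have saved an HTML page instead.
    with open(path, "rb") as f:
        head = f.read(512).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

Calling `describe_download("/tmp/zomato-bangalore-restaurants.zip")` on the file from the question would show whether the download produced an archive at all.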
Here is a script that logs in and downloads all the datasets of a competition.
from os import mkdir, remove
from os.path import exists
from shutil import rmtree
import zipfile

from requests import post


def purge_all_downloads(db_full_path):
    # Remove all the downloaded datasets
    if exists(db_full_path):
        rmtree(db_full_path)


def datasets_are_available_locally(db_full_path, datasets):
    # Return True only if all the competition datasets are available locally in Databricks CE
    if not exists(db_full_path):
        return False
    for df in datasets:
        # Assumes every dataset file ends with the '.csv' extension
        if not exists(db_full_path + df + '.csv'):
            return False
    return True


def remove_zip_files(db_full_path, datasets):
    # Delete the downloaded archives once they have been extracted
    for df in datasets:
        remove(db_full_path + df + '.csv.zip')


def unzip(db_full_path, datasets):
    # Extract every archive into db_full_path, then delete the archives
    for df in datasets:
        with zipfile.ZipFile(db_full_path + df + '.csv.zip', 'r') as zf:
            zf.extractall(db_full_path)
    remove_zip_files(db_full_path, datasets)


def download_datasets(competition, db_full_path, datasets, username, password):
    # Download the competition datasets if they are not available locally.
    # db_full_path must end with a path separator, because file names are appended to it directly.
    if datasets_are_available_locally(db_full_path, datasets):
        print('All the competition datasets have been downloaded, extracted and are ready for you !')
        return

    purge_all_downloads(db_full_path)
    mkdir(db_full_path)
    kaggle_info = {'UserName': username, 'Password': password}

    for df in datasets:
        url = (
            'https://www.kaggle.com/account/login?ReturnUrl=' +
            '/c/' + competition + '/download/' + df + '.csv.zip'
        )
        response = post(url, data=kaggle_info, stream=True)

        # Write the data to a local file; binary mode is required for zip content
        with open(db_full_path + df + '.csv.zip', 'wb') as f:
            for chunk in response.iter_content(chunk_size=512 * 1024):
                if chunk:
                    f.write(chunk)

    # Extract the competition data
    unzip(db_full_path, datasets)
    print('done !')
For more details, see "Download the competition datasets directly".
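Once a valid archive is on disk, the shell steps from the question (unzip, then `tail -n +2` to drop the header line) can also be done in Python. This is only a sketch under stated assumptions: the function name `extract_without_header` is hypothetical, and the caller has to supply the real name of the CSV member inside the archive, which the question does not confirm.

```python
import zipfile


def extract_without_header(zip_path, member, out_path):
    """Write one CSV member of a zip to out_path without its first line."""
    # Fail early with a clear message instead of the unzip error from the question.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(zip_path + " is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zf:
        with zf.open(member) as src, open(out_path, "wb") as dst:
            src.readline()  # skip the header line, like `tail -n +2`
            for line in src:
                dst.write(line)
    return out_path
```

The resulting file plays the role of `temp.csv` in the question's shell script. The archive member is streamed line by line, so the whole CSV is never held in memory.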
Hope this helps.