How can I successfully download (fetch) scikit-learn's real-world datasets?

Problem description

I am a beginner with scikit-learn. I ran the following code to download the "20 newsgroups text dataset" from sklearn.datasets (the code is shown at https://scikit-learn.org/stable/datasets/real_world.html):

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

It returns the following error.

OSError                                   Traceback (most recent call last)
<ipython-input-17-ade32d7dd81b> in <module>
      1 from sklearn.datasets import fetch_20newsgroups
----> 2 newsgroups_train = fetch_20newsgroups(subset='train')

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~\anaconda3\lib\site-packages\sklearn\datasets\_twenty_newsgroups.py in fetch_20newsgroups(data_home, subset, categories, shuffle, random_state, remove, download_if_missing, return_X_y)
    257             logger.info("Downloading 20news dataset. "
    258                         "This may take a few minutes.")
--> 259             cache = _download_20newsgroups(target_dir=twenty_home,
    260                                            cache_path=cache_path)
    261         else:

~\anaconda3\lib\site-packages\sklearn\datasets\_twenty_newsgroups.py in _download_20newsgroups(target_dir, cache_path)
     73 
     74     logger.info("Downloading dataset from %s (14 MB)", ARCHIVE.url)
---> 75     archive_path = _fetch_remote(ARCHIVE, dirname=target_dir)
     76 
     77     logger.debug("Decompressing %s", archive_path)

~\anaconda3\lib\site-packages\sklearn\datasets\_base.py in _fetch_remote(remote, dirname)
   1195     checksum = _sha256(file_path)
   1196 
-> 1197     if remote.checksum != checksum:
   1198         raise IOError("{} has an SHA256 checksum ({}) "
   1199                       "differing from expected ({}), "

OSError: C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz has an SHA256 checksum (cb5c6e663e59b628d9016d3cb2a3992ad38811d846c04561c3fbfa58badcb1f7) differing from expected (8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610), file may be corrupted.

The downloaded file (C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz) is only 1 KB, but the actual archive is about 14 MB (http://qwone.com/~jason/20Newsgroups/).
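A 1 KB file in place of a 14 MB archive suggests that something other than the archive was saved, for example an error or block page. As a minimal sketch for checking this, the helper below (the name describe_download is hypothetical, not part of scikit-learn) reports the file size, its SHA256 digest, and whether it starts with the gzip magic bytes. The expected digest is copied from the error message above.

```python
import hashlib
from pathlib import Path

# Expected digest, copied from the OSError message in the traceback.
EXPECTED_SHA256 = "8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610"


def describe_download(path, head_bytes=200):
    """Summarize a downloaded file so a truncated or replaced archive is easy to spot.

    Hypothetical helper for illustration; it only reads the file.
    """
    data = Path(path).read_bytes()
    return {
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        # Every gzip file starts with the two magic bytes 0x1f 0x8b.
        "looks_like_gzip": data[:2] == b"\x1f\x8b",
        # The first bytes often reveal an HTML error page.
        "head": data[:head_bytes],
    }


# Example usage (path taken from the error message):
# info = describe_download(
#     r"C:\Users\owner\scikit_learn_data\20news_home\20news-bydate.tar.gz"
# )
# print(info["size_bytes"], info["looks_like_gzip"])
# print(info["sha256"] == EXPECTED_SHA256)
# print(info["head"])
```

If looks_like_gzip is False and head contains HTML, the download was most likely intercepted before it reached the real file.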

Why does the fetch (download) fail, and how can I successfully download the file with 'fetch_20newsgroups'?

My operating system is Windows 10.

Thank you very much.

Tags: python, scikit-learn, download, sha256

Solution


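I found the cause. Our company blocks Amazon websites for security reasons, so the download failed. The 20 newsgroups text dataset is probably stored on Amazon, and the scikit-learn module fetches the data from there. Our company's block message shows that "s3-eu-west-1.amazonaws.com/pfigshare-u-files" and "s3-eu-west-1.amazonaws.com/" are blocked.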

Thanks to Kota Mori. Your answer gave me a hint. The URL is "https://ndownloader.figshare.com/files/5975967"; if I paste it into a web browser, the address changes to "https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/5975967/20newsbydate.tar.gz?..." and a "blocked" image is displayed.
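The same redirect check can be done from Python instead of a browser. This is only a sketch: resolve_final_url and its opener parameter are hypothetical names introduced here, and it relies on urllib.request.urlopen following redirects by default. On a filtered network the call may instead raise an error or return the filter's own page, so treat the result as a hint and not as proof.

```python
import urllib.request
from urllib.parse import urlsplit


def resolve_final_url(url, opener=urllib.request.urlopen, timeout=30):
    """Return (final_url, final_host) after any HTTP redirects.

    Hypothetical helper: `opener` defaults to urllib.request.urlopen and is
    only a parameter so the function can be exercised without network access.
    """
    with opener(url, timeout=timeout) as response:
        # geturl() gives the URL of the resource actually retrieved,
        # which differs from `url` if a redirect was followed.
        final_url = response.geturl()
    return final_url, urlsplit(final_url).hostname


# Example usage (needs network access; the URL is the one quoted above):
# final_url, host = resolve_final_url(
#     "https://ndownloader.figshare.com/files/5975967"
# )
# print(host)
```

If the reported host is one that the company filter blocks, as in my case, the download has to be done from a network where that host is reachable, or the host has to be unblocked.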

