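python-3.x - HTTPError: HTTP Error 403: Forbidden, or None returned when defining headers, while downloading CSV files from links in Python 3
Problem description
Please advise how I can download CSV files from https://www.hesa.ac.uk with Python 3.
The CSV file links I scraped: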
csv_link = [
    '/data-and-analysis/finances/table-2.csv',
    '/data-and-analysis/finances/table-3.csv',
    '/data-and-analysis/finances/table-3s.csv',
    '/data-and-analysis/finances/table-4.csv',
    '/data-and-analysis/finances/table-9.csv',
    '/data-and-analysis/finances/table-10.csv',
]
My download code:
import wget
for link in csv_link:
    full_link = 'https://www.hesa.ac.uk' + link
    # The inner print() shows the URL; the outer print() then shows its return value, None
    print(print(full_link))
    wget.download(full_link)
I get a 403 error:
https://www.hesa.ac.uk/data-and-analysis/finances/table-2.csv
None
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-7-6d016e0bdd56> in <module>
3 full_link = 'https://www.hesa.ac.uk' + link
4 print(print(full_link))
----> 5 wget.download(full_link)
6
/usr/local/lib/python3.7/dist-packages/wget.py in download(url, out, bar)
524 else:
525 binurl = url
--> 526 (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
527 filename = detect_filename(url, out, headers)
528 if outdir:
/usr/lib/python3.7/urllib/request.py in urlretrieve(url, filename, reporthook, data)
245 url_type, path = splittype(url)
246
--> 247 with contextlib.closing(urlopen(url, data)) as fp:
248 headers = fp.info()
249
/usr/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
220 else:
221 opener = _opener
--> 222 return opener.open(url, data, timeout)
223
224 def install_opener(opener):
/usr/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
529 for processor in self.process_response.get(protocol, []):
530 meth = getattr(processor, meth_name)
--> 531 response = meth(req, response)
532
533 return response
/usr/lib/python3.7/urllib/request.py in http_response(self, request, response)
639 if not (200 <= code < 300):
640 response = self.parent.error(
--> 641 'http', request, response, code, msg, hdrs)
642
643 return response
/usr/lib/python3.7/urllib/request.py in error(self, proto, *args)
567 if http_err:
568 args = (dict, 'default', 'http_error_default') + orig_args
--> 569 return self._call_chain(*args)
570
571 # XXX probably also want an abstract factory that knows when it makes
/usr/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
501 for handler in handlers:
502 func = getattr(handler, meth_name)
--> 503 result = func(*args)
504 if result is not None:
505 return result
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
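If I modify my code to send a header, I get None and a warning:
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: DeprecationWarning: AppURLopener style of invoking requests is deprecated. Use newer urlopen functions/methods
  This is separate from the ipykernel package so we can avoid doing imports until
import urllib.request
# FancyURLopener is deprecated; subclassing it only to override the User-Agent string
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.69 Safari/537.36"
urllib._urlopener = AppURLopener()
for link in csv_link:
    full_link = 'https://www.hesa.ac.uk' + link
    # As before, the outer print() prints None, the return value of the inner print()
    print(print(full_link))
    urllib._urlopener.retrieve(full_link)
Please advise how to change my code so that I can download my files. I would also really like to understand what the correct way is to download files from scraped links in Python 3 using Jupyter Notebooks.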
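Solution
I got it working with the help of os.system. I am still looking for the proper way to do this, but for now this code solves my problem.
import os
for link in csv_link:
    full_url = 'https://www.hesa.ac.uk' + link
    # Shell out to the wget command-line tool (it must be installed on the system)
    os.system('wget ' + full_url)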
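A possible way to stay inside Python is to pass the header through urllib.request.Request and read the response with urlopen, which avoids the deprecated FancyURLopener. This is only a sketch, and it assumes the 403 is triggered by the default Python User-Agent; the fact that the command-line wget succeeds hints at that, but it is not confirmed. The helper names (build_request, local_name, download) and the constants are illustrative and not from the original post; the User-Agent string is the one from the question.

```python
import os
import shutil
import urllib.request
from urllib.parse import urljoin, urlsplit

BASE_URL = 'https://www.hesa.ac.uk'
# Assumption: a browser-like User-Agent is enough for the server to accept the request.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/47.0.2526.69 Safari/537.36')


def build_request(link, base_url=BASE_URL, user_agent=USER_AGENT):
    """Return a Request for base_url + link that carries a User-Agent header."""
    full_url = urljoin(base_url, link)
    return urllib.request.Request(full_url, headers={'User-Agent': user_agent})


def local_name(link):
    """Derive a local file name from the last path segment of the link."""
    return os.path.basename(urlsplit(link).path)


def download(link, dest_dir='.'):
    """Download one file and return the path it was saved to."""
    request = build_request(link)
    target = os.path.join(dest_dir, local_name(link))
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses such as 403
    with urllib.request.urlopen(request, timeout=30) as response, \
            open(target, 'wb') as out:
        shutil.copyfileobj(response, out)
    return target
```

In a notebook cell, looping over csv_link and calling download(link) for each entry would then save the files into the working directory. If the server still answers 403 with this header, the block is based on something other than the User-Agent and this sketch will not help.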