python - Web scraping in Python with urllib: urllib.error.HTTPError: HTTP Error 404: Not Found
Question
I got the code for a URL extractor from GitHub. I am new to Python. I tried using requests instead of urllib, as suggested in some answers, but couldn't quite figure it out.
    import re
    import sys
    import urllib.request
    import argparse
    from utils import check_date, check_lang

    def main(argv):
        """
        Fetch the wikipedia dump page and extract the (list of) file(s) containing
        the full revision history of all articles and output it to a single file
        for the next step.
        :param argv: commandline parameters for the execution
        :return:
        """
        parser = argparse.ArgumentParser()
        parser.add_argument('-o', '--outputfile', action="store", dest='outpufile',
                            required=True,
                            help='Full path to output the url list into')
        parser.add_argument('-l', '--lang', action="store", dest='lang',
                            required=True, type=check_lang,
                            help='Two-letter language tag to fetch')
        parser.add_argument('-d', '--date', action="store", dest='date',
                            required=True, type=check_date,
                            help='The exact wikipedia archive date (YYYYMMDD)')
        args = parser.parse_args()
        baseurl = 'https://dumps.wikimedia.org'
        # Download page
        response = urllib.request.urlopen(baseurl + "/" + args.lang + "wiki/"
                                          + args.date + "/")
        page = response.read()
        # Fetch the matching url for the complete page edit history in bz2 format
        linkpattern = r"(\/" + args.lang + r"wiki\/" + args.date + r"\/" + \
                      args.lang + "wiki-" + args.date + \
                      r"-pages-meta-history\d{0,3}\.xml(-p\d{1,9}p\d{1,9})?\.bz2)"
        matches = re.findall(linkpattern, page.decode("UTF-8"))
        # Write them to the output file
        cpt = 0
        with open(args.outpufile, "w+") as f:
            for m in matches:
                cpt += 1
                f.write(baseurl + m[0] + "\n")
        print(str(cpt) + " url(s) generated")

    if __name__ == "__main__":
        main(sys.argv[1:])
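I get the following error. Is there a reason urllib isn't working? If I have to use requests instead, how should I go about it? Is there a way to solve the problem without using requests?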
    Traceback (most recent call last):
      File "url_extractor.py", line 54, in <module>
        main(sys.argv[1:])
      File "url_extractor.py", line 32, in main
        response = urllib.request.urlopen(baseurl + "/" + args.lang + "wiki/"+ args.date + "/")
      File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.6/urllib/request.py", line 532, in open
        response = meth(req, response)
      File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.6/urllib/request.py", line 570, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 404: Not Found
Solution
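A 404 means the server has no page at the URL that was built. The request itself went through and urllib is behaving as designed, so switching to requests would get the same answer from the server. The likely cause (an assumption, since the arguments passed to the script are not shown) is that the `--lang` and `--date` values do not match a dump directory that currently exists on dumps.wikimedia.org. Below is a minimal sketch that stays with urllib, prints the URL it tried, and handles the 404 explicitly. The names `build_dump_url`, `fetch_page` and the `opener` parameter are hypothetical, introduced here only for illustration.

```python
import urllib.error
import urllib.request

BASE_URL = "https://dumps.wikimedia.org"


def build_dump_url(lang, date):
    """Build the dump index URL the same way the script in the question does."""
    return f"{BASE_URL}/{lang}wiki/{date}/"


def fetch_page(url, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the server answers 404.

    `opener` is a hypothetical hook (not part of the original script) that
    defaults to urllib.request.urlopen and can be swapped out for testing.
    """
    try:
        with opener(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # The URL does not exist on the server: report it instead of crashing.
            print(f"No dump found at {url}; check the language tag and the date.")
            return None
        # Any other HTTP error is unexpected here, so let it propagate.
        raise


if __name__ == "__main__":
    # Example values only; substitute the language and date you actually need.
    page = fetch_page(build_dump_url("en", "20200101"))
    if page is not None:
        print(f"Fetched {len(page)} characters")
```

Opening the printed URL in a browser is a quick way to confirm whether a dump directory exists for that language and date.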