Web scraping in Python with urllib: urllib.error.HTTPError: HTTP Error 404: Not Found

Problem description

I got this URL-extractor code from GitHub. I'm very new to Python. I tried using requests instead of urllib, as suggested in some answers, but couldn't quite figure it out.

import re
import sys
import urllib.request
import argparse
from utils import check_date, check_lang
def main(argv):
    """
    Fetch the wikipedia dump page and extract the (list of) file(s) containing
    the full revision history of all articles and output it to a single file
    for the next step.
    :param argv: commandline parameters for the execution
    :return:
    """

    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--outputfile', action="store", dest='outputfile',
                        required=True,
                        help='Full path to output the url list into')
    parser.add_argument('-l', '--lang', action="store", dest='lang',
                        required=True, type=check_lang,
                        help='Two-letter language tag to fetch')
    parser.add_argument('-d', '--date', action="store", dest='date',
                        required=True, type=check_date,
                        help='The exact wikipedia archive date (YYYYMMDD)')
    args = parser.parse_args()
    baseurl = 'https://dumps.wikimedia.org'

    # Download page
    response = urllib.request.urlopen(baseurl + "/" + args.lang + "wiki/"
                                      + args.date + "/")
    page = response.read()

    # Fetch the matching url for the complete page edit history in bz2 format
    linkpattern = r"(\/" + args.lang + r"wiki\/" + args.date + r"\/" + \
                  args.lang + "wiki-" + args.date + \
                  r"-pages-meta-history\d{0,3}\.xml(-p\d{1,9}p\d{1,9})?\.bz2)"
    matches = re.findall(linkpattern, page.decode("UTF-8"))

    # Write them to the output file
    cpt = 0
    with open(args.outputfile, "w+") as f:
        for m in matches:
            cpt += 1
            f.write(baseurl + m[0] + "\n")

    print(str(cpt) + " url(s) generated")


if __name__ == "__main__":
    main(sys.argv[1:])
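For what it's worth, the link pattern in this script can be exercised offline against a sample anchor tag, which makes it easier to see what a match tuple looks like. The snippet below is a minimal sketch; the `lang`/`date` values and the HTML fragment are invented for illustration:

```python
import re

lang, date = "en", "20240301"  # hypothetical values for illustration
linkpattern = (r"(\/" + lang + r"wiki\/" + date + r"\/"
               + lang + "wiki-" + date
               + r"-pages-meta-history\d{0,3}\.xml(-p\d{1,9}p\d{1,9})?\.bz2)")
# A fabricated fragment of a dump index page
sample = '<a href="/enwiki/20240301/enwiki-20240301-pages-meta-history1.xml-p1p873.bz2">'
matches = re.findall(linkpattern, sample)
# Because the pattern has two groups, findall returns tuples;
# group 0 of each tuple is the full relative path the script writes out.
print(matches[0][0])
```

This is why the original script writes `m[0]` rather than `m`: with multiple capturing groups, `re.findall` yields one tuple per match.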

I get the following error. Is there any reason urllib doesn't work here? If I have to use requests instead, how would I go about it? And is there a way to fix this without using requests?

Traceback (most recent call last):
  File "url_extractor.py", line 54, in <module>
    main(sys.argv[1:])
  File "url_extractor.py", line 32, in main
    response = urllib.request.urlopen(baseurl + "/" + args.lang + "wiki/"+ args.date + "/")
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Tags: python, urllib, urllib2

Solution
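A 404 means dumps.wikimedia.org answered but has no page at the URL the script built, so the problem is almost certainly the language/date combination rather than urllib itself: Wikimedia only keeps full-history dumps for a handful of recent run dates, and the date must match a listed run exactly. Switching to requests would produce exactly the same 404. One way to diagnose it is to print the URL that failed and check it in a browser. A minimal sketch with plain urllib (the helper names here are invented, not part of the original script):

```python
import urllib.error
import urllib.request

def dump_index_url(lang, date, baseurl="https://dumps.wikimedia.org"):
    """Build the dump index URL so it can be printed and checked by hand."""
    return baseurl + "/" + lang + "wiki/" + date + "/"

def fetch_dump_page(lang, date):
    """Return the index page bytes, or None after reporting the failing URL."""
    url = dump_index_url(lang, date)
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.HTTPError as e:
        print("HTTP", e.code, "for", url, "- does this language/date dump exist?")
        return None
```

If the printed URL does not open in a browser either, pick a valid date from the index page at dumps.wikimedia.org for that language. With requests the behaviour is equivalent: `requests.get(url)` returns a response object, and calling its `raise_for_status()` method raises on a 404 just as urllib does.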
