Error reading HTML with Python urllib

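Question

I'm having trouble opening a link with urllib in Python. My code is meant to take an article link (the variable `url`), open it (`page = urlopen(url)`), read the HTML from the site (`html_bytes = page.read()`), decode it (the variable `html`), and print the decoded content.

Here is my code: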


    from urllib.request import urlopen

    url = "https://www.wsj.com/articles/peloton-says-wait-times-are-down-to-pre-pandemic-levels-11620334234?mod=hp_lista_pos4"

    page = urlopen(url)
    html_bytes = page.read()
    html = html_bytes.decode("utf-8")
    print(html)

Here is the error:

    File "c:/Users/Stras/VeraitasBot/urlopentest.py", line 5, in <module>
        page = urlopen(url)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 531, in open
        response = meth(req, response)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 640, in http_response
        response = self.parent.error(
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 569, in error
        return self._call_chain(*args)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 502, in _call_chain
        result = func(*args)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.2800.0_x64__qbz5n2kfra8p0\lib\urllib\request.py", line 649, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden

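This code opens most links and fetches the HTML from sites such as The New York Times, Fox, and CNN, but whenever I try to pull HTML from sites like WSJ I get this error, as in the example above.

Does anyone know a way to fetch pages consistently from all sites, or how to fix this error? Thanks.

Tags: python, html, web-scraping, urllib, htmldecode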

Solution
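
A 403 from `urlopen` often means the server is rejecting the request itself rather than the URL. By default urllib identifies itself with a `Python-urllib/3.x` User-Agent, and some sites refuse requests that carry it. The following is a minimal sketch, assuming that is the cause here; it sends a browser-like User-Agent and catches `HTTPError` so one blocked site does not stop the script. The names `BROWSER_USER_AGENT`, `build_request`, and `fetch_html` are hypothetical helpers, not part of urllib, and the User-Agent string is only an illustrative value. Sites with paywalls or stricter bot detection may still answer 403.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumption: an illustrative browser-like User-Agent; any realistic value may work.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Wrap the URL in a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the decoded HTML of the page, or None if the server returns an HTTP error."""
    request = build_request(url)
    try:
        with urlopen(request, timeout=timeout) as page:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = page.headers.get_content_charset() or "utf-8"
            return page.read().decode(charset, errors="replace")
    except HTTPError as err:
        # Report the status (for example 403) instead of crashing.
        print(f"{url} returned HTTP {err.code}: {err.reason}")
        return None
```

Calling `fetch_html(url)` with the `url` from the question then returns the page text when the request is accepted and `None` when the site still refuses it, so the caller can skip that article and continue with the next one.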

