Web scraping problem: some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER

Problem description

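I am trying to scrape a website with urllib and BeautifulSoup (Python 3.9), but I keep getting the same error message, "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER", and the output is full of special characters like these: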

��T�w?.��m����%�%z��%�H=S��$S�YYyi�ABD�x�!%��f36��\�Y�j�46f ����I��9��!D��������������������b7�3�8��JnH�t���m�Bm���< ;����,�zR�m��A�g��{�XF%��&)�6zy��'�)a�Fo�����N叔,��~?w�w � ���7z�Y6N������Q��ƣA��,p�8��/��W��q�$ ���#e�J7�#� 5�X�z�Ȥ� &q��8 ��H"����I0������͂8ZY}J�m��c}&5e��? "/>[�7X�?NF4r���[k��6�X ?��VV��H�J$j�6h��e�C��]<�V��z D����"d�nje��{���+YL��*�X? a����m��������MNn�+��1=b$�N�4p�0����/�h�'�?�,�[��V��$�D ��Z��+�?�x�X�g����

I have read several threads about this problem, but could not find a solution that works in my case. Here is my code:

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.fnac.com"
hdr = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Accept": "*/*",
        "Accept-Encoding" : "gzip, deflate, br",
        "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
        "Connection" : "keep-alive"}
req = urllib.request.Request(url, headers=hdr)

page = urllib.request.urlopen(req)

if page.getcode() == 200:
    soup = BeautifulSoup(page, "html.parser", from_encoding="utf-8")
    #divs = soup.findAll('div')
    #href = [i['href'] for i in soup.findAll('a', href=True)]
    print(soup)

else:
    print("failed!")

I tried changing the encoding to ASCII or ISO-8859-(1...9), but the problem stays the same.

Thanks for your help :)

Tags: python-3.x, web-scraping

Solution


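Remove Accept-Encoding from the HTTP headers. With that header the server is allowed to send a compressed body (gzip, deflate or Brotli), and urllib.request does not decompress responses on its own, so BeautifulSoup receives compressed bytes that no text encoding can decode: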

import urllib.request  # "import urllib" alone is not guaranteed to load the request submodule
from bs4 import BeautifulSoup

url = "https://www.fnac.com"
hdr = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Accept": "*/*",
    # "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
}
req = urllib.request.Request(url, headers=hdr)

page = urllib.request.urlopen(req)

if page.getcode() == 200:
    soup = BeautifulSoup(page, "html.parser", from_encoding="utf-8")
    # divs = soup.findAll('div')
    # href = [i['href'] for i in soup.findAll('a', href=True)]
    print(soup)

else:
    print("failed!")

Output:


<!DOCTYPE html>

<html class="no-js" lang="fr-FR">
<head><meta charset="utf-8"/> <!-- entry: inline-kameleoon -->


...
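
If you would rather keep a compressed transfer, the alternative is to decompress the body yourself before parsing it. The sketch below is only an illustration, not part of the original answer: decode_body is a hypothetical helper name, the sample page is made up, and it covers gzip and deflate only (Brotli, "br", needs a third-party package, so it should be left out of Accept-Encoding in that case). It runs offline and also shows why changing the text encoding never helped: the compressed bytes are not text in any encoding.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Undo the HTTP Content-Encoding of a response body (gzip/deflate only).

    Hypothetical helper for illustration; not part of urllib or bs4.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # "deflate" is normally zlib-wrapped data.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding in ("", "identity"):
        return raw
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


# Offline demonstration with a made-up page instead of a live request.
html = '<html><head><meta charset="utf-8"/></head><body>é</body></html>'.encode("utf-8")
compressed = gzip.compress(html)

# Treating the compressed bytes as text is what yields U+FFFD characters.
garbled = compressed.decode("utf-8", errors="replace")

# Decompressing first gives back bytes that decode cleanly.
restored = decode_body(compressed, "gzip").decode("utf-8")

print("\ufffd" in garbled)
print(restored)
```

With the code from the answer, this would presumably be used as decode_body(page.read(), page.headers.get("Content-Encoding", "")) with "Accept-Encoding": "gzip, deflate" in the request headers, and the resulting bytes passed to BeautifulSoup in place of page.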
