Beautiful Soup encoding error: character maps to undefined (Python)

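Problem description

I wrote a script that is supposed to retrieve HTML pages from a site and update their contents. The following function looks for a certain file on my system, then attempts to open and edit it: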

def update_sn(files_to_update, sn, table, title):
    paths = files_to_update['files']
    print('updating the sn')
    try:
        sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
        notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]

    except Exception:
        print('no sns were found')
        pass

    new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
    new_sn_number = sn

    htm_text = open(sn_htm, 'rb').read().decode('cp1252')
    content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S) 
    minus_content = htm_text.replace(content[0], '')
    table_soup = BeautifulSoup(table, 'html.parser')
    new_soup = BeautifulSoup(minus_content, 'html.parser')
    head_title = new_soup.title.string.replace_with(new_sn_number)
    new_soup.link.insert_after(table_soup.div.next)

    with open(new_path_name, "w+") as file:
        result = str(new_soup)
        try:
            file.write(result)
        except Exception:
            print('Met exception.  Changing encoding to cp1252')
            try:
                file.write(result('cp1252'))
            except Exception:
                print('cp1252 did\'nt work.  Changing encoding to utf-8')
                file.write(result.encode('utf8'))
                try:
                    print('utf8 did\'nt work.  Changing encoding to utf-16')
                    file.write(result.encode('utf16'))
                except Exception:
                    pass

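This works in most cases, but sometimes it fails to write. At that point the exception handling kicks in, and I try every encoding I can think of without success: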

updating the sn
Met exception.  Changing encoding to cp1252
cp1252 did'nt work.  Changing encoding to utf-8
Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
    file.write(result)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scraper.py", line 79, in <module>
    get_latest(entries[0], int(num), entries[1])
  File "scraper.py", line 56, in get_latest
    update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
    file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes

Can anyone give me any pointers on how to better handle HTML data that may have inconsistent encodings?

Tags: python, html, encoding, beautifulsoup

Solution


In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:

TypeError: write() argument must be str, not bytes

If you want to write bytes, you should open the file in binary mode.
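As a minimal sketch of the difference between the two modes, the snippet below writes the same encoded payload first to a text-mode file and then to a binary-mode file. The temporary path and the sample text are assumptions made for illustration only.

```python
import os
import tempfile

payload = 'Somé téxt'.encode('utf-8')  # str.encode returns bytes
path = os.path.join(tempfile.mkdtemp(), 'my.file')  # throwaway path for the demo

# Text mode accepts only str, so writing bytes raises TypeError.
text_mode_error = None
try:
    with open(path, 'w') as file:
        file.write(payload)
except TypeError as error:
    text_mode_error = str(error)
    print(text_mode_error)

# Binary mode accepts bytes and writes them unchanged.
with open(path, 'wb') as file:
    file.write(payload)

with open(path, 'rb') as file:
    written = file.read()
```

Reading the file back in binary mode returns exactly the bytes that were written, with no implicit encoding step in between.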

BeautifulSoup detects the document's encoding (if the input is bytes) and converts it to a string automatically. We can access the detected encoding through .original_encoding and use it to encode the content when writing to the file. For example,

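from bs4 import BeautifulSoup

# Pass bytes so that BeautifulSoup detects the encoding itself.
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
# original_encoding is None when the input was already a str.
encoding = soup.original_encoding or 'utf-8'
print(encoding)
# ascii

# Open in binary mode, because str.encode returns bytes.
with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))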

In order for this to work you should pass your html as bytes to BeautifulSoup, so don't decode the response content.

If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.

data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']

with open('my.file', 'wb+') as file:
    for encoding in encodings:
        try:
            file.write(data.encode(encoding))
            break
        except UnicodeEncodeError:
            print(encoding + ' failed.')
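One detail of the loop above is that the order of the list matters: UTF-8 can encode essentially any Python string, so every candidate listed after it is never reached. The sketch below wraps the same idea in a hypothetical helper, encode_with_fallback (not part of any library), that also reports which encoding succeeded, so the file can later be read back with the same one. The candidate order, with UTF-8 last as a catch-all, is an assumption.

```python
def encode_with_fallback(text, encodings):
    """Return (encoded_bytes, encoding_used) for the first encoding that works."""
    for encoding in encodings:
        try:
            return text.encode(encoding), encoding
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the text, try the next one
    raise ValueError('none of the candidate encodings could encode the text')


# Narrow encodings first, UTF-8 last as the catch-all (assumed order).
candidates = ['ascii', 'cp1252', 'utf-8']

legacy_bytes, legacy_encoding = encode_with_fallback('Somé téxt', candidates)
wide_bytes, wide_encoding = encode_with_fallback('日本語', candidates)
print(legacy_encoding, wide_encoding)
```

Here the accented text fits in cp1252, while the Japanese text falls through to UTF-8.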

Alternatively, you could open the file in text mode and set the encoding in open (instead of encoding the content yourself). Note that the built-in open in Python 2 does not accept an encoding argument; io.open does.
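A minimal sketch of the text-mode approach follows. It passes an explicit encoding to open so the write no longer depends on the platform default (cp1252 in the traceback above). The second half assumes the output is HTML that must stay in cp1252, in which case the standard 'xmlcharrefreplace' error handler turns characters that do not fit into numeric character references. The paths and sample text are assumptions made for illustration.

```python
import os
import tempfile

text = 'Somé ☃'  # the snowman has no cp1252 equivalent
folder = tempfile.mkdtemp()

# Option 1: write as UTF-8, which can represent every character here.
utf8_path = os.path.join(folder, 'utf8.html')
with open(utf8_path, 'w', encoding='utf-8') as file:
    file.write(text)

with open(utf8_path, 'r', encoding='utf-8') as file:
    round_trip = file.read()

# Option 2: keep cp1252, replacing unencodable characters with
# numeric character references, which HTML parsers understand.
legacy_path = os.path.join(folder, 'legacy.html')
with open(legacy_path, 'w', encoding='cp1252', errors='xmlcharrefreplace') as file:
    file.write(text)

with open(legacy_path, 'rb') as file:
    legacy_raw = file.read()

print(legacy_raw)
```

The second option only makes sense for markup, since a plain-text consumer would see the reference literally rather than the original character.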

