python - Beautiful Soup 的编码错误:字符映射到未定义(Python)
问题描述
我编写了一个脚本,该脚本应该从站点中检索 html 页面并更新其内容。以下函数在我的系统上查找某个文件,然后尝试打开并编辑它:
def update_sn(files_to_update, sn, table, title):
paths = files_to_update['files']
print('updating the sn')
try:
sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]
except Exception:
print('no sns were found')
pass
new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
new_sn_number = sn
htm_text = open(sn_htm, 'rb').read().decode('cp1252')
content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S)
minus_content = htm_text.replace(content[0], '')
table_soup = BeautifulSoup(table, 'html.parser')
new_soup = BeautifulSoup(minus_content, 'html.parser')
head_title = new_soup.title.string.replace_with(new_sn_number)
new_soup.link.insert_after(table_soup.div.next)
with open(new_path_name, "w+") as file:
result = str(new_soup)
try:
file.write(result)
except Exception:
print('Met exception. Changing encoding to cp1252')
try:
file.write(result('cp1252'))
except Exception:
print('cp1252 did\'nt work. Changing encoding to utf-8')
file.write(result.encode('utf8'))
try:
print('utf8 did\'nt work. Changing encoding to utf-16')
file.write(result.encode('utf16'))
except Exception:
pass
这在大多数情况下都有效,但有时它无法写入,此时异常开始,我尝试了所有可行的编码但没有成功:
updating the sn
Met exception. Changing encoding to cp1252
cp1252 did'nt work. Changing encoding to utf-8
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
file.write(result)
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scraper.py", line 79, in <module>
get_latest(entries[0], int(num), entries[1])
File "scraper.py", line 56, in get_latest
update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes
谁能给我任何关于如何更好地处理可能具有不一致编码的html数据的指示?
解决方案
In your code you open the file in text mode, but then you attempt to write bytes (str.encode
returns bytes) and so Python throws an exception:
TypeError: write() argument must be str, not bytes
If you want to write bytes, you should open the file in binary mode.
BeautifulSoup detects the document’s encoding (if it is bytes) and converts it to string automatically. We can access the encoding with .original_encoding
, and use it to encode the content when writting to file. For example,
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
encoding = soup.original_encoding or 'utf-8'
print(encoding)
#ascii
with open('my.file', 'wb+') as file:
file.write(data.encode(encoding))
In order for this to work you should pass your html as bytes to BeautifulSoup
, so don't decode the response content.
If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.
data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']
with open('my.file', 'wb+') as file:
for encoding in encodings:
try:
file.write(data.encode(encoding))
break
except UnicodeEncodeError:
print(encoding + ' failed.')
Alternatively, you could open the file in text mode and set the encoding in open
(instead of encoding the content), but note that this option is not available in Python2.
推荐阅读
- git - 更新分支而不删除从它发出的拉取请求
- highcharts - 如何在高图中显示垂直线,如下图所示?
- html - 视频未显示在自定义 Firebase 域上
- python - 如何从 Pivot Table() 结果在 python 中绘制条形图?
- json - 来自文件的 Json 未第一次加载
- python - Showerror不想出现,但是代码好像没有错误
- javascript - 如何从窗口中删除事件监听器?
- excel - 活动单元格值其他工作簿
- java - 从 main 调用 opencsv 时抛出异常,并且存在 module-info.java
- python - 有没有办法在说出单词后提取文本?