Web scraping Python program returns "'charmap' codec can't encode character"

Question

import bs4 as bs
import urllib.request
import re
import os
from colorama import Fore, Back, Style, init

init()

def highlight(word):
    if word in keywords:
        return Fore.RED + str(word) + Fore.RESET
    else:
        return str(word)

for newurl in newurls:
    url = urllib.request.urlopen(newurl)
    soup1 = bs.BeautifulSoup(url, 'lxml')
    paragraphs = soup1.findAll('p')
    print(Fore.GREEN + soup1.h2.text + Fore.RESET)
    print('')
    for paragraph in paragraphs:
        if paragraph is not None:
            textpara = paragraph.text.strip().split(' ')
            colored_words = list(map(highlight, textpara))
            print(" ".join(colored_words).encode("utf-8")) #encode("utf-8")
        else:
            pass

I supply the lists of keywords and URLs to look at. After running a few keywords against the URLs, I get output like this:

b'\x1b[31mthe desired \x1b[31mmystery corners \x1b[31mthe differential . 
\x1b[31mthe back \x1b[31mpretends to be \x1b[31mthe'

I removed the encode("utf-8") and got an encoding error:

Traceback (most recent call last):
  File "C:\Users\resea\Desktop\Python Projects\Try 3.py", line 52, in <module>
    print(" ".join(colored_words)) #encode("utf-8")
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 41, in write
    self.__convertor.write(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 162, in write
    self.write_and_convert(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 190, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 195, in write_plain_text
    self.wrapped.write(text[start:end])
  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 23: character maps to <undefined>

Where am I going wrong?

Tags: python-3.x, url, web-scraping, beautifulsoup, encode

Answer


I know what I'm about to suggest is more of a workaround than a "solution", but I have been frustrated time and again by having to "encode this" or "encode that" for all kinds of odd characters, sometimes successfully, but often not.

Depending on the type of text behind each newurl, the range of problematic characters is probably limited. So I deal with them case by case: every time I hit one of these errors, I do this:

import unicodedata
unicodedata.name('\u2019')

In your case, you get this:

'RIGHT SINGLE QUOTATION MARK'
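The same lookup can be run over a whole batch of suspect characters at once (the characters below are just illustrative examples, not ones taken from your pages):

```python
import unicodedata

# Print the code point and official Unicode name for each suspect character.
for ch in ['\u2019', '\u201c', '\u201d']:
    print('U+%04X: %s' % (ord(ch), unicodedata.name(ch)))
```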

The good old, pesky right single quotation mark... So next, as suggested here, I simply replace it with a similar-looking character that doesn't raise the error; in your case

colored_words = [highlight(word).replace(u"\u2019", "'") for word in textpara] # or some other replacement character

should work. Rinse and repeat every time this error shows up. Admittedly not the most elegant solution, but after a while all the possible odd characters in your newurl pages will have been caught and the errors will stop.
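Along those lines, the rinse-and-repeat replacements can be collected into one small helper, so each newly discovered character only costs a new dictionary entry. A minimal sketch; the mapping below is only an example, not an exhaustive list:

```python
# Assumed starter mapping of problem characters to safe ASCII stand-ins;
# extend it every time unicodedata.name() identifies a new offender.
REPLACEMENTS = {
    '\u2019': "'",   # RIGHT SINGLE QUOTATION MARK
    '\u2018': "'",   # LEFT SINGLE QUOTATION MARK
    '\u201c': '"',   # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',   # RIGHT DOUBLE QUOTATION MARK
}

def sanitize(text):
    """Replace every known problem character before printing."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

print(sanitize("it\u2019s a \u201ctest\u201d"))
```

You would then call sanitize() on each paragraph's text before splitting and highlighting it.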
