UTF-8 encoding problem with Python on Windows

Problem description

I am using Python to process files on Windows, and I get errors such as "UnicodeEncodeError: surrogates not allowed".

Sample text from the document:

Ten states led by Texas Attorney General Ken Paxton (R) filed an antitrust lawsuit against 
Google on Wednesday, alleging the tech giant illegally sought to suppress competition and 
reap massive profits from targeted advertisements placed across the Web.

The lawsuit — filed in a Texas federal court and backed exclusively by Republicans — strikes 
at the heart of Google’s lucrative business in connecting those who seek to buy online ads 
with the websites that sell them. Paxton and his GOP allies contend that Google relied on a 
mix of improper tactics to force its ad tools on publishers and solidify its pole position as a 
“middleman” in the invisible transactions that power much of the Web.

Online advertising is expected to generate $42 billion in revenue this year for Google, 
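which captures a third of all digital ad spending, according to an October projection from 
eMarketer. Google’s enormous reach led the attorneys general of Texas and other states to 
conclude in their lawsuit that the tech giant has essentially built “the largest electronic 
trading market in existence,” operating an advertising system not unlike trading on a 
stock exchange.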

Code 1:

return_doc.to_csv(path, index=False)

Error 1: UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9d' in position 168: surrogates not allowed

Code 2:

return_doc.to_csv(path, index=False, encoding='cp1252')

Error 2: UnicodeEncodeError: 'charmap' codec can't encode character '\udc9d' in position 168: character maps to <undefined>

Code 3:

return_doc.to_csv(path, index=False, encoding='ISO 8859-15')

Error 3: UnicodeEncodeError: 'charmap' codec can't encode character '\u201d' in position 14: character maps to <undefined>

I also tried Code 4:

return_doc.to_csv(path, index=False, encoding='cp1252', errors='replace')

This runs without an error, but the text is changed from

“The actions harm every person in America,” Paxton said in a video statement preceding the 
case, which asked a judge to consider “structural” remedies that could theoretically include 
forcing a breakup of the company.

to

“The actions harm every person in America,�? Paxton said in a video statement preceding 
the case, which asked a judge to consider “structural�? remedies that could 
theoretically include forcing a breakup of the company.

This is not what I want.

Please suggest a solution that raises no errors and leaves the text unchanged.

Tags: python, encoding, window

Solution


When stdio is a console, Python uses UTF-8 by default. But when stdio is redirected (for example to a file or a pipe), Python uses the ANSI code page encoding.
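
To see which case applies on a given machine, the encodings Python has selected can be inspected directly. The following is a minimal sketch; `describe_encodings` is a hypothetical helper name used only for illustration.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python picked for stdout and for text files."""
    return {
        # Encoding of the standard output stream (None if there is no stream).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default encoding used by open() when no encoding= is passed.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

Running the script once in a console and once with its output redirected to a file makes it possible to compare the two situations.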

You can enable UTF-8 mode so that UTF-8 is the default text encoding. See https://docs.python.org/3/using/windows.html#utf-8-mode for reference.
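
UTF-8 mode can be switched on with `python -X utf8` or by setting the environment variable `PYTHONUTF8=1`. The sketch below illustrates why the default encoding may matter here. It rests on an assumption the question does not confirm: that the source bytes are UTF-8 and were decoded somewhere upstream as cp1252 with the `surrogateescape` error handler, which would produce exactly the lone surrogate `'\udc9d'` seen in the error messages. The sample string and the helper `repair` are illustrative only.

```python
# Sketch only: assumes the original bytes are UTF-8 and were mis-decoded
# upstream as cp1252 with errors="surrogateescape" (not confirmed by the question).
original = "in America,\u201d Paxton said"   # contains a right double quotation mark
raw = original.encode("utf-8")               # the quote becomes bytes E2 80 9D

# Byte 0x9D has no cp1252 mapping, so surrogateescape turns it into '\udc9d'.
mis_decoded = raw.decode("cp1252", errors="surrogateescape")


def repair(text):
    """Undo the assumed cp1252 mis-decoding and decode the bytes as UTF-8."""
    return text.encode("cp1252", errors="surrogateescape").decode("utf-8")


try:
    mis_decoded.encode("utf-8")              # what writing the data as UTF-8 attempts
except UnicodeEncodeError as exc:
    print("encoding failed:", exc.reason)

print(repair(mis_decoded) == original)
```

If that assumption holds, decoding the input as UTF-8 in the first place avoids the problem, either globally through UTF-8 mode or per call by passing `encoding="utf-8"` when the file is read (for example to `open()` or `pd.read_csv`).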

