python - UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>
Question
I'm new to Python. I have a .txt file (size: 15,259 KB). I want to load the file and process it, but I keep getting the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>".
import nltk
from nltk import FreqDist
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Read the datasets
path = "C:\\tmp\\FILENAME.txt"
dataset = {}
dataset_raw = {}
allFeatures = set()
tot_articles = 0
articles_count = {}
N = {}  # Number of articles in each corpus

# Note: `categories` is not defined in the code as posted.
for category in categories:
    fileName = path
    f = open(fileName, 'r')  # no encoding given, so the platform default is used
    text = ''
    text_raw = ''
    lines = f.readlines()
    tot_articles += len(lines)
    articles_count[category] = len(lines)
    dataset_raw[category] = list(map(lambda line: line.lower(), lines))
    for line in lines:
        text += line.replace('\n', ' ').lower()
        text_raw = line.lower()
    f.close()  # the posted code had `f.close`, which never actually closes the file
    N[category] = len(lines)
    tokens = nltk.word_tokenize(text)
    dataset[category] = nltk.Text(tokens)
Here is the error I get:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-14-222e94b75803> in <module>
14 text = ''
15 text_raw = ''
---> 16 lines=(f.readlines())
17 tot_articles+=len(lines)
18 articles_count[category] = len(lines)
~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>
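Solution
The traceback shows the file being decoded by cp1252.py, which means open() fell back to the Windows default encoding. Byte 0x9d has no character assigned in cp1252, so decoding fails as soon as that byte is read. The file is most likely stored in a different encoding, commonly UTF-8, so try specifying the encoding when opening the file.
For example: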
f=open(fileName,'r', encoding="utf8")
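To see why the explicit encoding helps, here is a minimal, self-contained sketch that reproduces the error on a small temporary file and then reads it correctly. The sample text and the temporary file are made up for illustration, and the sketch assumes the real file is UTF-8, which the question does not confirm. The last read uses errors="replace", a possible fallback when the true encoding is unknown, at the cost of losing the undecodable characters.

```python
import os
import tempfile

# Hypothetical sample text: the right double quotation mark (U+201D) is
# encoded in UTF-8 as the bytes E2 80 9D, so the file contains byte 0x9d.
sample = "He said \u201chello\u201d\n"

# Write the sample as raw UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    tmp_path = tmp.name

try:
    # Decoding as cp1252 fails because 0x9d has no mapping in that code page.
    try:
        with open(tmp_path, "r", encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding with the encoding the file was written in succeeds.
    with open(tmp_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Fallback sketch: keep reading, but undecodable bytes become U+FFFD.
    with open(tmp_path, "r", encoding="cp1252", errors="replace") as f:
        lossy = f.read()
finally:
    os.remove(tmp_path)
```

Using a with block also closes the file automatically, so an explicit f.close() call is no longer needed. If UTF-8 raises its own UnicodeDecodeError on the real file, the file is in some other encoding and that encoding's name should be passed instead.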