Pandas ignoring the encoding set in read_csv?

Question

Using Linux, Pandas 1.0.1 and Python 3.6, I am getting a strange error in production:


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/opt/app-root/src/import_validation/validate_csv.py", line 275, in run
    validate(temp_csv, self.query_id)
  File "/opt/app-root/src/import_validation/validate_csv.py", line 263, in validate
    pandas.read_csv(path, encoding='latin1', sep=sep)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1136, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1253, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1268, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1458, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 12: invalid continuation byte

As you can see in the traceback, I have already set the encoding to latin1:

pandas.read_csv(path, encoding='latin1', sep=sep)

Why does pandas try to decode as UTF-8 when I have specified latin1 as the encoding? I tried other aliases for latin1 and got the same result.

Any idea why pandas seems to ignore my encoding setting?

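Edit: Removed the comment about Windows behaving differently. The same error happens there; I was just cheating when passing the file and not passing it in the same way.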

Tags: python, python-3.x, pandas, parsing

Solution


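The problem was too many layers of abstraction. I had a wrapper that tried to decompress the file if its name ended in 'gz', so what I handed to pandas was not a path but a temporary file. That file object already carried its own encoding setting, and pandas then ignored the encoding I passed to read_csv. The fix is either to pass the encoding to the temporary file or, as I did, to pass the original path to pandas, since it handles decompressing the file automatically.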
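To illustrate the first option (putting the encoding on the file handle itself), here is a minimal sketch using only the standard library. The helper `open_text`, the sample file name, and the sample data are made up for illustration and are not the original wrapper; the point is just that byte 0xe5 decodes cleanly as latin1 but raises the same kind of UnicodeDecodeError when treated as UTF-8.

```python
import csv
import gzip
import os
import tempfile


def open_text(path, encoding):
    """Hypothetical wrapper: open a plain or gzip-compressed file as text.

    The encoding is applied when the handle is created, so whatever
    consumes the handle receives already-decoded strings.
    """
    if path.endswith(".gz"):
        # "rt" = text mode; gzip.open decodes with the given encoding.
        return gzip.open(path, "rt", encoding=encoding, newline="")
    return open(path, "r", encoding=encoding, newline="")


# Build a small latin1-encoded, gzip-compressed CSV to experiment with.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(gz_path, "wb") as f:
    # "\xe5" is the character U+00E5, stored as the single byte 0xe5 in latin1.
    f.write("id;name\n1;H\xe5kon\n".encode("latin1"))

# Decoding the same bytes as UTF-8 fails, much like in the traceback.
try:
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# With the encoding passed to the handle, the rows decode correctly.
with open_text(gz_path, "latin1") as f:
    rows = list(csv.reader(f, delimiter=";"))

print(utf8_failed, rows)
```

read_csv accepts file-like objects, so a text handle opened this way can be given to it in place of a path. The simpler route, and the one I took, is to drop the wrapper and call pandas.read_csv with the original .gz path and encoding='latin1'; the default compression='infer' selects gzip from the file extension.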

