首页 > 解决方案 > 将 TSV 文件中的列加载到 python 列表中

问题描述

我想将“类别”列中的值加载到熊猫 df 中,这是我的 tsv 文件:

Tagname   text  category
j245qzx_8   hamburger toppings   f
h833uio_7   side of fries   f
d423jin_2   milkshake combo   d

这是我的代码:

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryColumn.append(line)

但是我得到了该行的UnicodeDecodeErrordf = pd.read_csv(f, sep='\t')并且我的代码停在那里:

File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte

任何想法为什么或如何解决这个问题?我的 tsv 中似乎没有任何特殊字符,所以我不确定是什么原因造成的或该怎么做。

标签: pythonpandasdataframecsv

解决方案


修复

只需阅读此 SO,我想我就知道出了什么问题。

您将使用 Python 获取文件句柄open()并将其传递给 Pandas 的read_csv(). open()确定文件的编码。

因此,尝试在 中设置编码open(),如下所示:

with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryColumn.append(line)

或者,根本不使用open()

df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]

categoryList = []
for line in categoryColumn:
    categoryColumn.append(line)

一些背景故事

x89在您的示例末尾回显,然后运行 ​​Python 的chardetect实用程序,这表明它是 Window-1252:

% echo -e '\x89' >> sample.csv

% cat sample.csv 
Tagname text    category
j245qzx_8       hamburger toppings      f
h833uio_7       side of fries   f
d423jin_2       milkshake combo d
�

% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect

% chardetect sample.csv 
sample.csv: Windows-1252 with confidence 0.73

推荐阅读