unknown_codepage_21010 when reading a zipped Excel file in Python 3

Problem description

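from io import BytesIO
from zipfile import ZipFile

import pandas as pd
import requests
import xlrd

# Download the archive and open its first entry in memory
url = 'http://47.97.204.47/syl/bk20200416.zip'
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
entry = zip_file.namelist()[0]
file = zip_file.open(entry)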

# This works
my_xls = xlrd.open_workbook(file_contents=zip_file.read(entry), encoding_override="gb2312")
my_xls.sheet_names()

# This doesn't work!
df = pd.read_excel(file, encoding_override='gb2312')

The last line raises an error:


> LookupError: unknown encoding: unknown_codepage_21010 ERROR ***
> codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError:
> unknown encoding: unknown_codepage_21010
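As the message shows, xlrd turned the codepage recorded in the file (21010) into the name 'unknown_codepage_21010', and Python's codec registry has no codec with that name. A minimal sketch of the lookup that fails, using only the standard library (the helper name `is_known_encoding` is made up for illustration):

```python
import codecs


def is_known_encoding(name: str) -> bool:
    """Return True if Python's codec registry can resolve `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# The name xlrd derived from the file's codepage is not a real codec ...
print(is_known_encoding("unknown_codepage_21010"))  # False
# ... while the encoding passed via encoding_override is.
print(is_known_encoding("gb2312"))  # True
```

This is why supplying encoding_override="gb2312" to xlrd avoids the error: xlrd then decodes with a codec that exists instead of the derived name.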

Do you know how to pass encoding_override through to the xlrd engine when calling pandas.read_excel?

I checked the pandas source code, and it does not seem to pass encoding_override on to xlrd:

def load_workbook(self, filepath_or_buffer):
    from xlrd import open_workbook

    if hasattr(filepath_or_buffer, "read"):
        data = filepath_or_buffer.read()
        return open_workbook(file_contents=data)
    else:
        return open_workbook(filepath_or_buffer)

Alternatively, I could use xlrd.open_workbook directly, but I don't know how to convert an xlrd.book.Book into a DataFrame.

Tags: python, pandas

Solution


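from io import BytesIO
from zipfile import ZipFile
import pandas as pd
import requests
import xlrd

url = 'http://47.97.204.47/syl/bk20200416.zip'
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
entry = zip_file.namelist()[0]
file_contents = zip_file.read(entry)
book = xlrd.open_workbook(file_contents=file_contents, encoding_override="gb2312")
xls_file = pd.ExcelFile(book)  # an xlrd Book is accepted as io
df = xls_file.parse(0)  # first sheet as a DataFrame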

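pd.ExcelFile and pd.read_excel can accept an xlrd Book as their argument. So building the Book first with xlrd.open_workbook (where encoding_override can be set) and then passing it to ExcelFile does the trick.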
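If the installed pandas version does not accept a Book, the question's other idea (converting the xlrd sheet by hand) is also workable. The following is only a sketch: the helper name `sheet_to_dataframe` is hypothetical, and it assumes the first row of the sheet holds the column names. It relies on the `nrows` attribute and `row_values()` method of an xlrd sheet.

```python
import pandas as pd


def sheet_to_dataframe(sheet) -> pd.DataFrame:
    """Build a DataFrame from an xlrd sheet.

    Assumption: row 0 contains the column names and the
    remaining rows contain the data.
    """
    # Collect every row of the sheet as a plain Python list
    rows = [sheet.row_values(i) for i in range(sheet.nrows)]
    if not rows:
        return pd.DataFrame()
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


# Usage with the Book built above (not executed here):
# df = sheet_to_dataframe(book.sheet_by_index(0))
```

Unlike the pandas reader, this sketch does no type conversion beyond what xlrd returns, so dates stored as Excel serial numbers would still need separate handling.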

Read the docstring and comments in the pandas source for more details:

class ExcelFile:
    """
    Class for parsing tabular excel sheets into DataFrame objects.
    Uses xlrd. See read_excel for more documentation

    Parameters
    ----------
    io : string, path object (pathlib.Path or py._path.local.LocalPath),
        file-like object or xlrd workbook
        If a string or path object, expected to be a path to xls or xlsx file.
    engine : string, default None
        If io is not a buffer or path, this must be set to identify io.
        Acceptable values are None or ``xlrd``.
    """

    from pandas.io.excel._odfreader import _ODFReader
    from pandas.io.excel._openpyxl import _OpenpyxlReader
    from pandas.io.excel._xlrd import _XlrdReader

    _engines = {"xlrd": _XlrdReader, "openpyxl": _OpenpyxlReader, "odf": _ODFReader}

    def __init__(self, io, engine=None):
        if engine is None:
            engine = "xlrd"
        if engine not in self._engines:
            raise ValueError("Unknown engine: {engine}".format(engine=engine))

        self.engine = engine
        # could be a str, ExcelFile, Book, etc.
        self.io = io
        # Always a string
        self._io = _stringify_path(io)

        self._reader = self._engines[engine](self._io)
