Read file of any extension using python

Problem description

I am trying to read the contents of various files. Some of them may have a docx, pdf, or xlsx extension.

I tried this code:

for path in paths:
    print(open(path, "r", encoding="utf8").read())

but it gave me the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-22-db6ea654fe14> in <module>
      1 for path in paths:
----> 2     print(open(path, "r", encoding="utf8").read())

~\AppData\Local\Programs\Python\Python38\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte

Tags: python, file, encoding

Solution


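There is no single function that can read and expose the contents of a file with any extension. You will need to handle each extension accordingly.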
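To see why a plain text read fails on these files: PDF, docx, and xlsx are binary formats, so their bytes are generally not valid UTF-8. The sketch below peeks at the first bytes of a file to guess its container type. The function name `sniff_format` and the labels it returns are made up for illustration; the signatures themselves (`%PDF` for PDF, `PK\x03\x04` for the ZIP container that docx and xlsx use) are standard.

```python
def sniff_format(path):
    """Guess a coarse file kind from the first four bytes (hypothetical helper)."""
    with open(path, "rb") as binary_file:  # binary mode: no decoding is attempted
        head = binary_file.read(4)
    if head == b"%PDF":
        return "pdf"
    if head == b"PK\x03\x04":
        # .docx and .xlsx are ZIP archives of XML parts, so they share this signature
        return "zip-container"
    return "unknown"
```

This only tells you which kind of handler is needed. It does not extract any text by itself.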

Some libraries can help you read particular file formats, so I suggest using them.

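import PyPDF2  # third-party: pip install PyPDF2

for path in paths:
    if path.endswith(".pdf"):
        with open(path, "rb") as pdf_file:
            # PdfReader is the name in PyPDF2 >= 2.0; older releases call it PdfFileReader
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            for page in pdf_reader.pages:
                print(page.extract_text())

    elif path.endswith(".docx"):
        pass  # handle the Word case here
    elif path.endswith(".xlsx"):
        pass  # handle the Excel case here
    else:  # default: treat the file as UTF-8 text
        try:
            with open(path, "r", encoding="utf8") as text_file:
                print(text_file.read())
        except (UnicodeDecodeError, OSError):
            print(f"Could not read file {path}")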
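One possible way to fill in the Word and Excel cases is a small lookup table from extension to reader function. This is only a sketch under a few assumptions: it uses PyPDF2 (2.0 or later), python-docx, and openpyxl, none of which are confirmed by the question beyond PyPDF2, and the names `read_any`, `READERS`, and the `read_*` helpers are hypothetical. The third-party imports are done inside each helper so that a missing library only affects its own format.

```python
from pathlib import Path


def read_pdf(path):
    # Assumes PyPDF2 >= 2.0 (pip install PyPDF2), which provides PdfReader.
    from PyPDF2 import PdfReader
    reader = PdfReader(str(path))
    # extract_text() may yield nothing useful for scanned, image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_docx(path):
    # Assumes python-docx (pip install python-docx).
    import docx
    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def read_xlsx(path):
    # Assumes openpyxl (pip install openpyxl).
    from openpyxl import load_workbook
    workbook = load_workbook(str(path), read_only=True, data_only=True)
    lines = []
    for sheet in workbook.worksheets:
        for row in sheet.iter_rows(values_only=True):
            # Join the cells of each row with tabs; empty cells become "".
            lines.append("\t".join("" if cell is None else str(cell) for cell in row))
    workbook.close()
    return "\n".join(lines)


def read_text(path):
    # Default case: assume the file is UTF-8 encoded text.
    with open(path, "r", encoding="utf8") as text_file:
        return text_file.read()


# Extension -> reader function; anything not listed falls back to read_text.
READERS = {".pdf": read_pdf, ".docx": read_docx, ".xlsx": read_xlsx}


def read_any(path):
    """Return the file's text, or None if it could not be read."""
    reader = READERS.get(Path(path).suffix.lower(), read_text)
    try:
        return reader(path)
    except (UnicodeDecodeError, OSError, ImportError) as exc:
        print(f"Could not read file {path}: {exc}")
        return None
```

With this in place the original loop becomes `for path in paths: print(read_any(path))`. Errors raised by the libraries themselves for corrupt files are not caught here, so you may want to widen the `except` clause for your own data.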
