Why am I getting this error in Python PDFMiner: TypeError: can only concatenate str (not "bytes") to str

Problem description

I'm new to Python and am trying to use PDFMiner to convert a PDF to a text file, but every time I get this error: TypeError: can only concatenate str (not "bytes") to str

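I'm confused because the error message seems to say the error originates in a file inside the pdfminer package. I know there are other questions here about this error message, but I couldn't work out my problem from them. That is probably mostly because I'm a beginner and don't know what their code is doing, but also because my problem seems to come from files that belong to PDFMiner itself.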

I am running this code:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def get_pdf_file_content(path_to_pdf):
    resource_manager = PDFResourceManager(caching=True)
    out_text = StringIO()
    laParams = LAParams()
    text_converter = TextConverter(resource_manager, out_text, laparams=laParams)
    fp = open(path_to_pdf, 'rb')
    interpreter = PDFPageInterpreter(resource_manager, text_converter)
    for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password="", caching=True, check_extractable=True):
        interpreter.process_page(page)

    text = out_text.getvalue()

    fp.close()
    text_converter.close()
    out_text.close()

    return text

path_to_pdf = "C:\\files\\raw\\AZO - CALLSTREET REPORT  AutoZone, Inc.(AZO), Q1 2002 Earnings Call, 5-December-2001 10 00 AM ET - 05-Dec-01.pdf"
print(get_pdf_file_content(path_to_pdf))

I get this error message:

  File "<stdin>", line 1, in <module>
  File "<stdin>", line 8, in get_pdf_file_content
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfpage.py", line 122, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 599, in _initialize_password
    handler = factory(docid, param, password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 300, in __init__
    self.init()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 307, in init
    self.init_key()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 320, in init_key
    self.key = self.authenticate(self.password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 368, in authenticate
    key = self.authenticate_user_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 374, in authenticate_user_password
    key = self.compute_encryption_key(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 351, in compute_encryption_key
    password = (password + self.PASSWORD_PADDING)[:32]  # 1
TypeError: can only concatenate str (not "bytes") to str

Tags: python, python-3.x, pdf, pdfminer

Solution


You have two options here:

1) Pass the password as bytes, which gives:

for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password=b"", caching=True, check_extractable=True):
    interpreter.process_page(page)

(Note the b before the quotes that define the password.)

2) Drop that argument

The password parameter is not mandatory (it has a default value), so if you don't need it you can leave it out. You end up with:

for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, caching=True, check_extractable=True):
    interpreter.process_page(page)
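
Both options work for the same reason: the last frame of the traceback evaluates `password + self.PASSWORD_PADDING`, and Python 3 refuses to concatenate a `str` with `bytes`. The sketch below reproduces that behaviour without pdfminer installed. `PASSWORD_PADDING` here is only a stand-in with the right type (the library's real constant has a different value), and `pad_password` and `to_password_bytes` are hypothetical helpers written for this illustration, not part of pdfminer's API.

```python
# Stand-in for pdfminer's internal padding constant. The real value differs;
# only its type (bytes) matters for this demonstration.
PASSWORD_PADDING = b"\x00" * 32


def pad_password(password):
    """Mimic the failing line from the traceback: pad, then cut to 32 items."""
    return (password + PASSWORD_PADDING)[:32]


def to_password_bytes(password):
    """Hypothetical helper: accept str or bytes and always return bytes."""
    if isinstance(password, str):
        # latin-1 is an assumption; use whatever encoding your PDFs expect.
        return password.encode("latin-1")
    return password


# Option 1: a bytes password concatenates cleanly with the bytes padding.
padded = pad_password(b"")
print(type(padded).__name__, len(padded))

# The call from the question: a str password cannot be joined to bytes.
try:
    pad_password("")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", exc)

# Converting first makes a str password behave like option 1.
padded_from_str = pad_password(to_password_bytes("secret"))
print(padded_from_str[:6])
```

If your PDFs really are password-protected and the password arrives as a `str`, converting it once before calling `PDFPage.get_pages` (as `to_password_bytes` does here) is one way to stay on option 1. This applies to the pdfminer version shown in the traceback, so check what type your installed version expects.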
