How to fix - TypeError: int() argument must be a string, a bytes-like object or a number, not 'PSKeyword'?

Problem description

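I am trying to extract text from PDF files with pdfminer, but I run into this problem with certain files only. The code works fine on some PDFs and raises this error on others. Here is my code (copied from another thread on this forum):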

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)
        
        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text

if __name__ == '__main__':
    print(extract_text_from_pdf('test.pdf'))

This is the error I get:

Traceback (most recent call last):
  File "pdf.py", line 28, in <module>
    print(extract_text_from_pdf('test.pdf'))
  File "pdf.py", line 13, in extract_text_from_pdf
    for page in PDFPage.get_pages(fh,
  File "C:\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdfminer\pdfpage.py", line 129, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "C:\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdfminer\pdfdocument.py", line 566, in __init__
    xref.load(parser)
  File "C:\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdfminer\pdfdocument.py", line 195, in load
    (_, obj) = parser.nextobject()
  File "C:\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdfminer\psparser.py", line 616, in nextobject
    self.do_keyword(pos, token)
  File "C:\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdfminer\pdfparser.py", line 79, in do_keyword
    (objid, genno) = (int(objid), int(genno))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'PSKeyword'

I have been trying to find a solution to this problem, without success so far. Any help is appreciated. Thanks a lot.

Tags: python-3.x, pdfminer

Solution
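The traceback shows the failure happening while the document's cross-reference data is loaded (xref.load, then parser.nextobject, then do_keyword), where the parser calls int() on a value that turned out to be a keyword token. That suggests the affected files have a structure this pdfminer version does not expect, rather than a bug in the calling code. This is not a confirmed fix. As a diagnostic first step, here is a minimal sketch that runs an extractor over several files and records which ones fail, so the problem PDFs can be isolated and inspected. The helper name extract_many and the file names in the usage comment are hypothetical, and nothing here changes how pdfminer parses a file.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, Optional[str]], Dict[str, str]]:
    """Run `extract` on every path; return (texts, failures).

    `extract_many` is a hypothetical helper, not part of pdfminer.
    """
    texts: Dict[str, Optional[str]] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:
            # Deliberately broad: this is a diagnostic pass, so any error
            # (including the TypeError above) is recorded, not re-raised.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return texts, failures


# Usage with the function from the question (file names are placeholders):
# texts, failures = extract_many(['a.pdf', 'b.pdf'], extract_text_from_pdf)
# for path, error in failures.items():
#     print(path, '->', error)
```

Once the failing files are known, it may be worth checking whether they open in a different PDF reader or whether a newer release of the library handles them. Both are suggestions to try, not verified solutions.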

