Python - pypdf2 extractText() not working

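Problem description

I am trying to extract text from a PDF so that I can edit it afterwards, but no text is extracted. The page count and the title elements are displayed correctly; only extractText() does not work.

-Here is my code-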

import PyPDF2 as o

# File object
pdfFileObj = open('answkt.pdf', 'rb')

# Reader object
pdfReader = o.PdfFileReader(pdfFileObj)

# Number of pages
print(pdfReader.numPages)

# Page object
pageObj = pdfReader.getPage(0)

# Extract text
print(pageObj.extractText())

# Close the file
pdfFileObj.close()

Tags: python, pdf, pypdf2

Solution


I found a solution on YouTube and wanted to share the code with you. Enjoy!

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def pdf2txt(inPDFfile, outTXTFile):
    """Extract the text of every page in inPDFfile and write it to outTXTFile."""
    resMgr = PDFResourceManager()
    retData = io.StringIO()  # in-memory buffer that collects the extracted text
    txtConverter = TextConverter(resMgr, retData, laparams=LAParams())
    interpreter = PDFPageInterpreter(resMgr, txtConverter)

    # Open the PDF in binary mode; the with-block closes it when done.
    with open(inPDFfile, 'rb') as inFile:
        for page in PDFPage.get_pages(inFile):
            interpreter.process_page(page)
    txtConverter.close()

    # Write as UTF-8 so non-ASCII characters do not raise an encoding error.
    with open(outTXTFile, 'w', encoding='utf-8') as f:
        f.write(retData.getvalue())


inPDFfile = "Resume.pdf"    # path to your PDF
outTXTFile = "sample.txt"   # any output file name you like
pdf2txt(inPDFfile, outTXTFile)
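
If you want to keep the original library as the first choice and only switch when it returns nothing, here is a rough sketch of that idea. It is not from the video; the function names (extract_with_pypdf, extract_with_pdfminer, first_nonempty_text) are made up for illustration. It assumes the newer pypdf package (where the reader is PdfReader and the page method is extract_text()) and pdfminer.six (which has a shorter high-level extract_text() helper) are installed.

```python
def extract_with_pypdf(path):
    """Extract text with pypdf (assumption: `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily, only when this extractor runs

    reader = PdfReader(path)
    # extract_text() can come back empty for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_with_pdfminer(path):
    """Extract text with pdfminer.six (assumption: `pip install pdfminer.six`)."""
    from pdfminer.high_level import extract_text

    return extract_text(path)


def first_nonempty_text(path, extractors):
    """Try each extractor in order and return the first non-blank result."""
    for extract in extractors:
        text = extract(path)
        if text and text.strip():
            return text
    return ""  # every extractor came back empty
```

You would call it as first_nonempty_text("answkt.pdf", [extract_with_pypdf, extract_with_pdfminer]). If both come back empty, the PDF may simply have no text layer (for example a scanned document), in which case neither library can pull text out of it.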
 
