Problem description

I am trying to write a script that reads the names on certificates I have scanned into an input folder.

from PIL import Image 
import pytesseract 
import sys 
from pdf2image import convert_from_path 
import os 

# Path to Input
inputfiles = os.listdir('/home/morningdew72/Documents/Python/Input')

# Convert PDF to Image
for certificate in inputfiles:
    PDF_file = certificate
    certificate_image = convert_from_path(PDF_file)

# Create Output Text File
    outfile = "outpage.txt"

# Convert Image to Text
    f = open(outfile, "w") 
    text = str(((pytesseract.image_to_string(Image.open(certificate_image))))) 
    text = text.replace('-\n', '')     
    f.write(text) 

    f.close()

I can't get past this error:

Traceback (most recent call last):
  File "/home/morningdew72/anaconda3/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 425, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pdftotextapp.py", line 13, in <module>
    certificate_image = convert_from_path(PDF_file)
  File "/home/morningdew72/anaconda3/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 89, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/home/morningdew72/anaconda3/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 435, in pdfinfo_from_path
    "Unable to get page count.\n%s" % err.decode("utf8", "ignore")
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'testpdf2.pdf': No such file or directory.

My end goal is to parse the text written to the output file and match employee names to their competency files. What am I missing in this step? Thanks.

Tags: python-3.x, pdftotext

Solution
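
The last line of the traceback points at the likely cause: os.listdir() returns bare file names such as 'testpdf2.pdf', not full paths, so convert_from_path() looks for the file in the current working directory instead of the Input folder. Two further problems would surface once that is fixed: convert_from_path() returns a list of PIL images (one per page), so they should be passed to pytesseract.image_to_string() directly rather than through Image.open(), and opening "outpage.txt" with mode "w" inside the loop overwrites the text of every previous certificate. Below is a minimal sketch of one way to restructure the script. The function names, the filter on the .pdf extension, and the one-text-file-per-certificate layout are my own assumptions, not part of the original script.

```python
import os


def list_pdf_paths(input_dir):
    """Return the full paths of the PDF files in input_dir, sorted by name."""
    return sorted(
        os.path.join(input_dir, name)  # os.listdir() yields bare names only
        for name in os.listdir(input_dir)
        if name.lower().endswith(".pdf")  # assumption: skip non-PDF files
    )


def pdf_to_text(pdf_path):
    """OCR every page of one PDF and return the combined text."""
    # Imported lazily so the path helper above works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    return text.replace("-\n", "")  # rejoin words hyphenated across lines


def convert_all(input_dir, output_dir):
    """Write one .txt file per certificate (hypothetical output layout)."""
    os.makedirs(output_dir, exist_ok=True)
    for pdf_path in list_pdf_paths(input_dir):
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        out_path = os.path.join(output_dir, stem + ".txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(pdf_to_text(pdf_path))
```

With this in place, something like convert_all('/home/morningdew72/Documents/Python/Input', 'Output') should produce Output/testpdf2.txt and so on, where 'Output' is just a placeholder directory name. Keeping one text file per certificate should also make the later step of matching names to competency files easier, since each output file maps back to exactly one scan.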
