Extract text from a scanned PDF without saving the scan as a new image file

Question

I want to extract text from a scanned PDF.
My "test" code is as follows:

from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image

converted_scan = convert_from_path('test.pdf', 500)

for i in converted_scan:
    i.save('scan_image.png', 'png')
    
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))

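I would like to know whether there is a way to extract the image content directly from the converted_scan object, without saving the scan as a new "physical" image file on disk.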

Basically, I would like to skip this part:

for i in converted_scan:
    i.save('scan_image.png', 'png')

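I have several thousand scans to extract text from. Although the generated image files are not particularly large, their size is not negligible either, and writing them all to disk feels like overkill.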

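Edit

Based on this post, here is a slightly different, more compact approach than the one in Colonder's answer. For .pdf files with many pages, it may be worth adding a progress bar to each loop, for example with the tqdm module.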

from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io

infile = 'my_file.pdf'
tool = pyocr.get_available_tools()[0]
req_image = []
txt = ''

# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
    for i in req_image:
        content = tool.image_to_string(
            p_img.open(io.BytesIO(i)),
            lang = tool.get_available_languages()[0],
            builder = pyocr.builders.TextBuilder()
        )
        txt += content

# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
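
The progress bar mentioned above could look roughly like this. It is only a sketch: ocr_with_progress is a name made up for this illustration, the ocr argument stands for whatever OCR call you use (for example a small wrapper around tool.image_to_string), and tqdm is assumed to be installed, with a plain pass-through fallback if it is not.

```python
from typing import Callable, Iterable, List

try:
    from tqdm import tqdm  # third-party; assumed installed via `pip install tqdm`
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback: no progress bar, just pass the iterable through."""
        return iterable


def ocr_with_progress(pages: Iterable, ocr: Callable[[object], str]) -> List[str]:
    """Run `ocr` on every page, showing a progress bar when tqdm is available.

    `ocr_with_progress` and `ocr` are hypothetical names for this sketch.
    """
    pages = list(pages)  # a list has a length, so tqdm can display a total
    return [ocr(page) for page in tqdm(pages, desc="OCR", unit="page")]
```

The returned list holds one string per page, so the pages can be joined or written out separately afterwards.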

Tags: python, ocr

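Answer

Update, May 2021
I realized that although pdf2image simply calls a subprocess, there is no need to save the images before running OCR on them. You can simply do the following (pytesseract works as the OCR library as well):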

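import pyocr
import pyocr.builders
from pdf2image import convert_from_path

tool = pyocr.get_available_tools()[0]     # first available OCR tool
lang = tool.get_available_languages()[0]  # first available language

for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())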
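
Since the question uses pytesseract, the same idea might be sketched with it as follows. The helper names ocr_pages and pdf_to_text are invented for this illustration, and the sketch assumes that pdf2image, pytesseract, poppler and tesseract are all installed. The point is only that convert_from_path hands back PIL images in memory, which image_to_string accepts directly.

```python
from typing import Callable, Iterable


def ocr_pages(pages: Iterable, ocr: Callable[[object], str]) -> str:
    """OCR each in-memory page image and join the page texts with newlines.

    `ocr_pages` is a hypothetical helper; `ocr` is any image -> str callable.
    """
    return "\n".join(ocr(page) for page in pages)


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """OCR a scanned PDF without writing intermediate image files yourself."""
    # Imported lazily so ocr_pages stays usable without these packages.
    from pdf2image import convert_from_path
    from pytesseract import image_to_string

    # convert_from_path returns a list of PIL images held in memory.
    return ocr_pages(convert_from_path(pdf_path, dpi), image_to_string)
```

Calling pdf_to_text("test.pdf") would then replace both the save loop and the Image.open call from the question.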

Edit: you can also try pdftotext.
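
One caveat on pdftotext: it reads text that is already embedded in the PDF and does not perform OCR, so it only helps if the scans already carry a text layer. The sketch below assumes the suggestion refers to the pdftotext command-line tool from poppler and that it is on the PATH. pdftotext_command and extract_text_layer are names made up for this example.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the command line; "-" as the output file means "write to stdout"."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text_layer(pdf_path: str) -> str:
    """Return whatever text layer pdftotext finds in the PDF.

    Hypothetical helper: assumes the poppler `pdftotext` binary is installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

A reasonable pattern is to try extract_text_layer first and fall back to OCR only when the result is empty after .strip().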

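pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing except call subprocesses. The script below should do what you want, but you will also need wand and pyocr (I think this is a matter of preference, so feel free to use any library you like to extract the text).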

import sys
from io import BytesIO
from typing import Iterator

from PIL import Image as Pimage
from wand.image import Image as Wimage

import pyocr
import pyocr.builders

def _convert_pdf2jpg(in_file_path: str, resolution: int = 300) -> Iterator[Pimage.Image]:
    """
    Convert each page of a PDF file to an in-memory JPEG image

    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: generator yielding one PIL Image per page
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
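
Note that the loop above overwrites txt on every page. For the "several thousand scans" case from the question, a batch driver might look like the sketch below. ocr_directory and collapse_blank_lines are hypothetical names, and pdf_to_text stands for any function that returns the full text of one PDF (for example one built on the loop above).

```python
import re
from pathlib import Path
from typing import Callable, List


def collapse_blank_lines(text: str) -> str:
    """Replace every run of newlines with a single newline."""
    return re.sub(r"\n+", "\n", text)


def ocr_directory(folder: str, pdf_to_text: Callable[[str], str]) -> List[Path]:
    """OCR every *.pdf in `folder` and write the text next to it as *.txt.

    `pdf_to_text` is an assumed callable: PDF path in, extracted text out.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        out_path = pdf_path.with_suffix(".txt")
        text = collapse_blank_lines(pdf_to_text(str(pdf_path)))
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Only the .txt outputs touch the disk here; no page images are saved.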
