python - Extract text from a scanned PDF without saving the scan as a new image file
Question
I want to extract text from a scanned PDF.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
    i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))
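Is there a way to extract the image content directly from the converted_scan object, without saving the scan as a new "physical" image file on disk?
Basically, I want to skip this part: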
for i in converted_scan:
    i.save('scan_image.png', 'png')
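I have several thousand scans to extract text from. The generated image files are not particularly heavy, but their size is not negligible either, and writing them all to disk feels like overkill.
Edit
Based on this post, here is a slightly different, more compact approach than Colonder's answer. For .pdf files with many pages, it may be worth adding a progress bar to each loop, for example with the tqdm module.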
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io
infile = 'my_file.pdf'
# pyocr returns the available OCR tools in recommended order; use the first one
tool = pyocr.get_available_tools()[0]
req_image = []
txt = ''
# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        # keep each page as PNG bytes in memory instead of writing it to disk
        req_image.append(img_page.make_blob('png'))
for i in req_image:
    content = tool.image_to_string(
        p_img.open(io.BytesIO(i)),
        lang = tool.get_available_languages()[0],
        builder = pyocr.builders.TextBuilder()
    )
    txt += content
# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
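The tqdm suggestion above could be sketched roughly as follows. The helper name ocr_with_progress and its ocr_page argument are hypothetical and not part of the original script. The fallback definition is only there so the sketch still runs when tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): no progress bar, just iterate.
    def tqdm(iterable, **kwargs):
        return iterable


def ocr_with_progress(page_images, ocr_page):
    """Apply ocr_page to every item and return the concatenated text.

    page_images: any iterable of page images (or image blobs)
    ocr_page: hypothetical callable that turns one page into a string
    """
    texts = []
    # tqdm wraps the iterable and reports progress as the loop advances
    for image in tqdm(page_images, desc="OCR", unit="page"):
        texts.append(ocr_page(image))
    return "".join(texts)
```

With the script above, ocr_page could be a small function that opens the blob with p_img.open(io.BytesIO(blob)) and passes it to tool.image_to_string.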
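Answer
Update, May 2021
I realized that although pdf2image simply calls a subprocess, you don't have to save the images to disk in order to OCR them afterwards. You can simply do the following (pytesseract also works as the OCR library):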
import pyocr.builders
from pdf2image import convert_from_path
# tool and lang are obtained from pyocr, as in the full script further down
for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
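For the pytesseract variant mentioned above, a minimal sketch might look like this. The function name ocr_pdf_pages and the convert and ocr parameters are assumptions added for illustration. They default to pdf2image.convert_from_path and pytesseract.image_to_string and only exist so the function can be tried with stand-ins.

```python
def ocr_pdf_pages(pdf_path, dpi=300, convert=None, ocr=None):
    """Return the OCR text of each page as a list, without writing images to disk."""
    if convert is None:
        # Imported lazily so the sketch can be exercised without pdf2image
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr
    # convert_from_path returns PIL images held in memory, which
    # image_to_string accepts directly; no intermediate PNG is needed
    return [ocr(page) for page in convert(pdf_path, dpi=dpi)]
```

Calling ocr_pdf_pages('test.pdf', 500) would correspond to the test code in the question, minus the save step. Note that all pages are held in memory at once, which may matter for very long PDFs.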
Edit: you can also try the pdftotext library.
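A rough sketch of the pdftotext route, assuming the pdftotext Python package is installed. It reads an existing text layer rather than performing OCR, so pages of a pure image scan will usually come back empty. The helper names and the load_pdf parameter are hypothetical.

```python
def embedded_text_pages(pdf_path, load_pdf=None):
    """Return the embedded text of each page as a list of strings."""
    if load_pdf is None:
        # Imported lazily; pdftotext.PDF takes a binary file object
        # and can be iterated page by page
        import pdftotext
        load_pdf = pdftotext.PDF
    with open(pdf_path, "rb") as pdf_file:
        return list(load_pdf(pdf_file))


def pages_needing_ocr(page_texts):
    """Indices of pages with no embedded text, i.e. candidates for OCR."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]
```

One possible use is to check for a text layer first and fall back to OCR only for the pages that pages_needing_ocr reports.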
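pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call subprocesses. The script below should do what you want, but you also need the wand library and pyocr (I think this is a matter of preference, so feel free to use any library you like to extract the text).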
from PIL import Image as Pimage
from wand.image import Image as Wimage
import sys
from io import BytesIO
from typing import Iterator
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Iterator[Pimage.Image]:
    """
    Convert PDF file to JPG, one page at a time
    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: generator yielding one PIL Image per page
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
# Each page is OCRed straight from memory; txt holds the text of the current page
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
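The loop above keeps only the text of the most recent page in txt. If the goal is one text file per PDF, as in the question, the pages could be collected and written out along these lines. save_ocr_text and ocr_page are hypothetical names introduced for this sketch.

```python
from pathlib import Path


def save_ocr_text(pdf_path, page_images, ocr_page):
    """OCR every page image and write the combined text next to the PDF.

    page_images: iterable of in-memory page images
    ocr_page: hypothetical callable that returns the text of one page
    """
    out_path = Path(pdf_path).with_suffix(".txt")
    texts = [ocr_page(image) for image in page_images]
    # One page per block, separated by a single newline
    out_path.write_text("\n".join(texts), encoding="utf-8")
    return out_path
```

With the script above, page_images could be _convert_pdf2jpg("some_pdf.pdf") and ocr_page a small wrapper around tool.image_to_string with the chosen lang and builder.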