PyTesseract OCR gives poor results on documents

Problem description

I am using PyTesseract for OCR. Here are my images:

[Image 1]

[Image 2]

The code I use for OCR:

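import cv2
import pytesseract as tess  # assumed: `tess` is an alias for pytesseract

# getSkewAngle and rotateImage are helper functions defined elsewhere (not shown).

def pre_process(img):
    """Apply preprocessing to an image"""
    angle = getSkewAngle(img)
    img = rotateImage(img, -1.0 * angle)
    # Upscale 2x, convert to grayscale, then binarize with a fixed threshold
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)[1]
    return thresh

def get_text_from_image(path):
    """Retrieve text data from image"""
    img = cv2.imread(path)

    img = pre_process(img)

    text = tess.image_to_string(
        img, lang='eng', config='--psm 6 --oem 3'
    )
    return text

# path: location of the input image file
print(get_text_from_image(path))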

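How can I improve my OCR results? I tried different psm values, different thresholding kernel options, and different parameters for threshold, but with no success. I read that Tesseract performs better at DPI > 300, but I am not sure what DPI means in this case, since I am using opencv. The problem is that I do not know what kind of document I will get as input. It could be any type of document, so I am trying to keep the parameters and methods in my code balanced so that it performs equally well on any document type. In addition, the input may be scanned images, pdfs, or files with the extensions jpg and png.

Here are the OCR results: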


---------------------------------------FIRST IMAGE----------------------------------------------
‘ “\1yos/e7 11:03 813. 884 0863 LORILLARD TAMPA +++ GREENSBOR_ _——@ 002/003
Retail Excel Progress Report

Submission for: Distribution by/to:

July 34 (3 OM to RSM 1st of Month

August29 () To: BW. Caldarella RSM to RW.C. 40th

September 30 ( ) ce: 0.0.8.

October 31 (X) From: Kent B. Mills

Nevember 28 ( )

December 30 ( ) Area: § Reglon: 17

Acceptance/Response: What is the retailers response to Lorillard's Excel

Merchandising pian?

Payment” was not In glace. The chains where we were using the "Flex Payment”

system we have not heen as successful. The P.O.S, requirements of the P-1 Plan

ith OIC nies is. diff tai
Independents:
it PA ising Is bei

Addith . . g Zia fighting PM Exc! PM/RUR

Hardware Evaluation/Effectiveness; Comment on the assembly of displays and

application of shields:

The displays are easily assembled and durable. Some questions have been raised

sanceming the inahility to be flush with the counter and/or up against the register.

Pennanant Advertising Evaluation/Effectiveness/Acceptance: (P-U/P-5 & C-5 |

Plans Only:

vai og

eR co
nN
nN
on
eal
Cc
t=


-----------------------------------SECOND IMAGE-----------------------------------------------

@@ BRITISH

@@ COUNCIL
Questions 36—40
Complete the summary below.

Choose NO MORE THAN TWO WORDS from the passage for each answer.

Write your answers in boxes 36-40 on your answer sheet.

Sobotka argues that big business and users of helium need to help look after helium
stocks because 36 .................... Will not be encouraged through buying and selling
alone. Richardson believes that the 37 .................... needs to be withdrawn, as the
U.S. provides most of the world's helium. He argues that higher costs would mean
people have

38 ..............0:+2+. tO USE the resource many times over.

People should need a 39 .................... to access helium that we still have.
Furthermore, a 40 .................... should ensure that helium is used carefully.
14

Tags: python, image-processing, ocr, tesseract, python-tesseract

Solution
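
On the DPI question: an image loaded with OpenCV is only a pixel array, so DPI has meaning only relative to the physical size of the page it shows. The following is a minimal sketch, not a verified fix for the images above. It assumes the scan covers a full US Letter page (8.5 in wide, which is a guess), and `estimate_scale_for_dpi` is a hypothetical helper name. It derives the resize factor from a target DPI instead of hard-coding `fx=2, fy=2`:

```python
def estimate_scale_for_dpi(width_px, page_width_in=8.5, target_dpi=300):
    """Return the resize factor that brings a full-page scan to target_dpi.

    Assumes the image spans the whole page width. page_width_in is a guess
    (8.5 in = US Letter) and must be adjusted for other paper sizes.
    """
    if width_px <= 0 or page_width_in <= 0:
        raise ValueError("width_px and page_width_in must be positive")
    effective_dpi = width_px / page_width_in  # pixels per inch of paper
    return target_dpi / effective_dpi


# Example: a 1275-pixel-wide scan of a Letter page is 150 DPI,
# so it would need to be doubled to reach 300 DPI.
scale = estimate_scale_for_dpi(1275)

# The factor could then replace the fixed fx=2, fy=2 in pre_process:
# img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
```

If the physical page size is unknown, this estimate is only as good as the guessed width, so it may be worth checking the height of the text in pixels as well.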

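A fixed threshold of 128 is unlikely to suit every document type, because the brightness of paper and ink varies between scans. One common alternative is Otsu's method, which picks the threshold from the image histogram. In OpenCV it is available by passing `cv2.THRESH_BINARY + cv2.THRESH_OTSU` to `cv2.threshold` with a threshold argument of 0. The pure-Python sketch below only illustrates what the method computes (`otsu_threshold` is a hypothetical name, and the pixel values are made up), and it has not been tested on the images above:

```python
def otsu_threshold(pixels):
    """Return the threshold (0..255) that maximizes between-class variance.

    pixels: iterable of integer gray levels in 0..255.
    Pixels with value <= threshold form one class, the rest the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue          # no pixels in the lower class yet
        if w1 == 0:
            break             # no pixels left in the upper class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# A dim scan: dark ink around level 30, grayish paper around level 100.
# A fixed threshold of 128 would turn every pixel black, while the
# histogram-based threshold falls between the two groups.
dim_scan = [30] * 60 + [100] * 40
t = otsu_threshold(dim_scan)
binary = [255 if p > t else 0 for p in dim_scan]
```

For pages with uneven lighting, a single global threshold may still fail, and a locally adaptive threshold such as `cv2.adaptiveThreshold` could be tried instead.
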
