python - PyTesseract OCR gives poor results on documents
Problem description
I am using PyTesseract for OCR. Here are my images:
The code I use for OCR:
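import cv2
import pytesseract as tess  # assumption: `tess` is an alias for pytesseract


def pre_process(img):
    """Apply preprocessing to an image"""
    # getSkewAngle and rotateImage are helper functions defined elsewhere (not shown)
    angle = getSkewAngle(img)
    img = rotateImage(img, -1.0 * angle)
    # Upscale 2x so small glyphs cover more pixels
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarize with a fixed global cutoff of 128
    thresh = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)[1]
    return thresh


def get_text_from_image(path):
    """Retrieve text data from image"""
    img = cv2.imread(path)
    img = pre_process(img)
    text = tess.image_to_string(
        img, lang='eng', config='--psm 6 --oem 3'
    )
    return text


get_text_from_image(path)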
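One thing that stands out in pre_process is the fixed cutoff of 128, which cannot suit both dark and faded scans. For comparison, here is a minimal pure-Python sketch of Otsu's method, which derives the cutoff from the histogram of the image itself. The helper name otsu_threshold and the sample pixel values are made up for illustration; this is not a claim that it resolves the results shown further down.

```python
def otsu_threshold(pixels):
    """Return the cutoff t (0-255) that maximises between-class variance.

    `pixels` is a flat list of grayscale values; values <= t form one
    class and values > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is already in the lower class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical faded scan: gray "ink" at 150 on a light page at 230.
faded = [150] * 100 + [230] * 100
t = otsu_threshold(faded)
fixed = [255 if p > 128 else 0 for p in faded]   # fixed cutoff: ink is lost
adaptive = [255 if p > t else 0 for p in faded]  # Otsu cutoff: ink survives
```

In OpenCV the same idea is available without a hand-written loop by passing cv2.THRESH_BINARY + cv2.THRESH_OTSU as the flag to cv2.threshold, in which case the threshold argument is ignored.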
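How can I improve my OCR results? I tried different psm values, different kernel options for thresholding, and different parameters to threshold, but without success. I read that Tesseract performs better at DPI > 300, but I am not sure what DPI means in this case, since I am working with opencv. The problem is that I don't know what kind of files I will get as input. It could be any kind of document, so I need well-balanced parameters and methods in my code so that it performs equally well on any document type. In addition, the input may be pdfs or scanned images with a jpg or png extension.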
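On the DPI point: an image loaded with opencv is only a pixel array, so DPI matters only as the ratio between pixel size and the physical page size. A minimal sketch of that arithmetic follows, assuming the page width in inches is known or can be guessed. The helper names (estimate_dpi, scale_for_target_dpi, tesseract_config) are hypothetical, and the sketch assumes a Tesseract build that accepts the --dpi option (4.x and later, as far as I know).

```python
def estimate_dpi(width_px, page_width_in):
    """Estimate scan resolution from pixel width and physical page width."""
    if page_width_in <= 0:
        raise ValueError("page_width_in must be positive")
    return width_px / page_width_in


def scale_for_target_dpi(source_dpi, target_dpi=300):
    """Return the resize factor that brings an image up to target_dpi.

    Never returns less than 1.0, so images that are already dense
    enough are left at their original size.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)


def tesseract_config(psm=6, oem=3, dpi=300):
    """Build a config string that also tells Tesseract the image DPI."""
    return f"--psm {psm} --oem {oem} --dpi {dpi}"


# Example: a US Letter page (8.5 in wide) scanned to 1275 px across.
dpi = estimate_dpi(1275, 8.5)        # 150 DPI
factor = scale_for_target_dpi(dpi)   # upscale 2x to reach 300 DPI
config = tesseract_config()
```

The factor could replace the hard-coded fx=2, fy=2 in cv2.resize, and the config string could be passed as the config argument of tess.image_to_string. For pdfs the DPI is chosen at rasterization time rather than estimated afterwards.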
Here are the OCR results:
---------------------------------------FIRST IMAGE----------------------------------------------
‘ “\1yos/e7 11:03 813. 884 0863 LORILLARD TAMPA +++ GREENSBOR_ _——@ 002/003
Retail Excel Progress Report
Submission for: Distribution by/to:
July 34 (3 OM to RSM 1st of Month
August29 () To: BW. Caldarella RSM to RW.C. 40th
September 30 ( ) ce: 0.0.8.
October 31 (X) From: Kent B. Mills
Nevember 28 ( )
December 30 ( ) Area: § Reglon: 17
Acceptance/Response: What is the retailers response to Lorillard's Excel
Merchandising pian?
Payment” was not In glace. The chains where we were using the "Flex Payment”
system we have not heen as successful. The P.O.S, requirements of the P-1 Plan
ith OIC nies is. diff tai
Independents:
it PA ising Is bei
Addith . . g Zia fighting PM Exc! PM/RUR
Hardware Evaluation/Effectiveness; Comment on the assembly of displays and
application of shields:
The displays are easily assembled and durable. Some questions have been raised
sanceming the inahility to be flush with the counter and/or up against the register.
Pennanant Advertising Evaluation/Effectiveness/Acceptance: (P-U/P-5 & C-5 |
Plans Only:
vai og
eR co
nN
nN
on
eal
Cc
t=
-----------------------------------SECOND IMAGE-----------------------------------------------
@@ BRITISH
@@ COUNCIL
Questions 36—40
Complete the summary below.
Choose NO MORE THAN TWO WORDS from the passage for each answer.
Write your answers in boxes 36-40 on your answer sheet.
Sobotka argues that big business and users of helium need to help look after helium
stocks because 36 .................... Will not be encouraged through buying and selling
alone. Richardson believes that the 37 .................... needs to be withdrawn, as the
U.S. provides most of the world's helium. He argues that higher costs would mean
people have
38 ..............0:+2+. tO USE the resource many times over.
People should need a 39 .................... to access helium that we still have.
Furthermore, a 40 .................... should ensure that helium is used carefully.
14
Solution