Searching for text in unstructured data

Problem description

I have written some Python code that reads text from images. The images are invoices.

from PIL import Image
import pytesseract as tess

# Point pytesseract at the local Tesseract executable (Windows install path).
tess.pytesseract.tesseract_cmd = r'C:\Users\Me\AppData\Local\Tesseract-OCR\tesseract.exe'

# Load the invoice image and run OCR on it.
img = Image.open('C:/Users/Me/Desktop/PM/Invoice Formats/TestInv.png')
text = tess.image_to_string(img)
print(text)

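The code outputs the text of the invoice. I have multiple invoices in different formats. Can anyone help me extract the invoice number, invoice date, and invoice amount from this unstructured text?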

For a few of the invoices, the extracted text looks something like this. For the others it is different.

ABC Manufacturing Corporation





Invoice 1111 HHH BBB
‘MyCity, AB'11111-111'
(111)111-1111
My exporter details
\xyz.com
Page: 1 of 2
invoice No, b123456
Date: 01/02/2019,
‘My Oil Products My Bill-To No. 3333
PO Box 1234, Account Number.: 12345
sdlfjsdlf slsdo

Invoice Summary

Delivery Terms:
Payment Terms:
Contact:

DELIVERY POINT
Net 20 days date of invoice
MY NAME

111-111-1111

111-111-1111
abc@xyz.com
Copies of Invoices and Delivery Notes are available on
my url/ check site/ here.

Hf you have any, further questions relating to, your Invoice,
lease contact MY NAME immediately on
111111111







Quantity - Price uni





1000 KG KM = 1000M — KG = Kilogram
Hours Litre M3 = Cubic meter
EA = Each) Normal Cubic Meter
Pounds 7OF, 1atm)











Product Price |
Product Price 1000.28
Net value 1000.28
Total to be paid INR 80000.28

Thanks in advance.

Tags: python, ocr

Solution


Let me show you an example that extracts the date; you can then extrapolate from it to extract the other fields:

# Take everything after the 'Date: ' label, then cut it off at the next comma.
date = text.split('Date: ')[1].split(',')[0]
print(date)

01/02/2019
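The same idea can be extended to the invoice number and the amount. The following is only a minimal sketch under stated assumptions: it assumes the labels are worded as in the sample text from the question (`invoice No`, `Date:`, `Total to be paid`), and the `PATTERNS` table and the `extract_invoice_fields` helper are hypothetical names introduced for illustration. It uses regular expressions rather than `str.split`, so a label that is missing from a given invoice yields `None` instead of raising an `IndexError`.

```python
import re

# Hypothetical patterns, tuned to the sample OCR text in the question.
# Invoices with a different layout will likely need their own patterns.
PATTERNS = {
    # "invoice No, b123456": tolerate OCR punctuation between label and value
    "invoice_number": re.compile(r"invoice\s*no\W*(\w+)", re.IGNORECASE),
    # "Date: 01/02/2019,": a slash-separated date following the label
    "invoice_date": re.compile(r"date\W*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE),
    # "Total to be paid INR 80000.28": optional 3-letter currency code
    "invoice_amount": re.compile(
        r"total\s+to\s+be\s+paid\s+(?:[A-Z]{3}\s+)?([\d,]+\.\d{2})",
        re.IGNORECASE,
    ),
}


def extract_invoice_fields(text):
    """Return a dict with the first match for each field, or None if absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# A few lines taken from the OCR output shown in the question.
sample = """invoice No, b123456
Date: 01/02/2019,
Net value 1000.28
Total to be paid INR 80000.28"""

print(extract_invoice_fields(sample))
# {'invoice_number': 'b123456', 'invoice_date': '01/02/2019', 'invoice_amount': '80000.28'}
```

In practice you would pass the `text` returned by `tess.image_to_string(img)` to `extract_invoice_fields`. For invoices in other formats, one option is to keep a separate pattern table per format and try each in turn until the required fields are found.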

