Node.js - Parse raw text to JSON using RegEx

Question

I'm still new to Node.js and am currently developing a small app for my kitchen. The app scans receipts and uses OCR to extract the data; for the OCR step I'm using the ocr-space web API. Afterwards I need to parse the raw text into a JSON structure and send it to my database. I also tested this receipt with AWS Textract, which gave me an even poorer result. Currently I'm struggling with the parsing part using RegEx in Node.js.

Here is my JSON structure which I use to parse the receipt data:

receipt = {
    title: 'title of receipt',
    items: [
        'item1',
        'item2',
        'item3'
    ],
    preparation: 'preparation text'
}

Since most receipts have an items part followed by a preparation part, my general approach so far is to split the text at the keywords that introduce each part.

This approach doesn't work if these keywords are missing. Take, for example, the following receipt, which I'm struggling to parse into my JSON structure. The receipt is in German and contains no corresponding keywords ('items' or 'Zutaten', 'preparation' or 'Zubereitung').

The following information from the raw text is necessary:

Do you have any hints or tips on how to get closer to a solution? Or do you have any other ideas for handling such situations?

Quinoa-Brot
30 g Chiasamen
350 g Quinoa
70 ml Olivenöl
1/2 TL Speisenatron
1 Prise Salz
Saft von 1/2 Zitrone
1 Handvoll Sonnenblumenkerne
30 g Schwarzkümmelsamen
1 Chiasamen mit 100 ml Wasser
verrühren und 30 Minuten quel-
len lassen. Den Ofen auf 200 oc
vorheizen, eine kleine Kastenform
mit Backpapier auslegen.
2 Quinoa mit der dreifachen
Menge Wasser in einen Topf ge-
ben, einmal aufkochen und dann
3 Minuten köcheln lassen - die
Quinoa wird so nur teilweise ge-
gegart. In ein Sieb abgießen, kalt
abschrecken und anschließend
gut abtropfen lassen. 

The lines are separated by a \n newline character.

The parsed receipt should look like this:

receipt = {
    title: 'Quinoa-Brot',
    items: [
        '30 g Chiasamen',
        '350 g Quinoa',
        '70 ml Olivenöl',
        '1/2 TL Speisenatron',
        '1 Prise Salz',
        'Saft von 1/2 Zitrone',
        '1 Handvoll Sonnenblumenkerne',
        '30 g Schwarzkümmelsamen',
    ],
    preparation: '1 Chiasamen mit 100 ml Wasser verrühren und 30 Minuten quellen lassen. Den Ofen auf 200 oc vorheizen, eine kleine Kastenform mit Backpapier auslegen. 2 Quinoa mit der dreifachen Menge Wasser in einen Topf geben, einmal aufkochen und dann 3 Minuten köcheln lassen - die Quinoa wird so nur teilweise gegegart. In ein Sieb abgießen, kalt abschrecken und anschließend gut abtropfen lassen.'
}

Tags: node.js, json, regex, parsing, ocr

Answer


Pattern-matching solutions like RegExp don't sound suitable for this sort of categorization problem. You might want to consider a machine-learning approach that differentiates between ingredients and instructions. This can be done by training a model on a number of labeled recipes (the more the better), or without labels by clustering the text line by line (k-means, etc.).

If you need to stick to RegExp for some reason, you could try keeping track of repeated words. It is a weak methodology, but ingredient names (Chiasamen, Quinoa) will be referenced again in the instructions, so you can match across multiple lines to find where the same word is repeated later on:

(?<=\b| )([^ ]+)(?= |$).+(\1)

If you run this in a loop, plus some extra logic, you can find ingredient-instruction pairs and work through the document with that silhouette information.
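A minimal sketch of the repeated-word idea in Node.js follows. It is an adaptation and not the exact pattern above: it assumes a Node.js version that supports lookbehind and Unicode property escapes, it only looks at words of four or more letters, and the names `REPEATED_WORD`, `findRepeatedWords`, and `sampleText` are made up for illustration.

```javascript
// Adapted backreference pattern: match every word of 4+ letters that occurs
// again later in the text. The `s` flag lets `.` cross line breaks, and `u`
// enables \p{L} so that umlauts count as letters.
const REPEATED_WORD =
  /(?<!\p{L})(\p{L}{4,})(?!\p{L})(?=.*(?<!\p{L})\1(?!\p{L}))/gsu;

// Hypothetical helper: maps a line index to the words on that line
// which reappear somewhere further down in the text.
function findRepeatedWords(text) {
  const hits = new Map();
  for (const match of text.matchAll(REPEATED_WORD)) {
    // Count the newlines before the match to get its line index.
    const lineIndex = text.slice(0, match.index).split('\n').length - 1;
    if (!hits.has(lineIndex)) hits.set(lineIndex, []);
    hits.get(lineIndex).push(match[1]);
  }
  return hits;
}

// Shortened version of the OCR text from the question.
const sampleText = [
  'Quinoa-Brot',
  '30 g Chiasamen',
  '350 g Quinoa',
  '1 Prise Salz',
  '1 Chiasamen mit 100 ml Wasser',
  'verrühren und 30 Minuten quellen lassen.',
  '2 Quinoa mit der dreifachen Menge Wasser',
].join('\n');

console.log(findRepeatedWords(sampleText));
```

On this sample the sketch flags the ingredient lines "30 g Chiasamen" and "350 g Quinoa", but it also flags the title (because of "Quinoa") and the instruction line "1 Chiasamen mit 100 ml Wasser" (because of "Wasser"), and it misses "1 Prise Salz". That illustrates why the repeated-word signal needs extra logic around it.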

You might also be able to take advantage of the fact that ingredient lines contain numeric data, or unit words like "piece(s)", "sticks", or "leaves", which you could store in a dictionary. That can enrich the word-boundary matches.
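The dictionary idea could be sketched roughly as below. This is only an illustration under several assumptions: the `UNITS` list is a made-up German unit dictionary, the first non-empty line is taken as the title, and the ingredient block is assumed to end at the last line that starts with a quantity plus a known unit. The function name `parseReceipt` is hypothetical as well.

```javascript
// Hypothetical unit dictionary (German); extend it for your own recipes.
const UNITS = ['g', 'kg', 'ml', 'l', 'TL', 'EL', 'Prise', 'Handvoll', 'Stück'];

// A line "looks like" an ingredient if it starts with a quantity
// (30, 1/2, 1,5 ...) followed by one of the known units.
const STARTS_WITH_QUANTITY = new RegExp(
  `^\\d+(?:[.,/]\\d+)?\\s+(?:${UNITS.join('|')})\\s`,
  'u'
);

function parseReceipt(rawText) {
  const lines = rawText.split('\n').map((l) => l.trim()).filter(Boolean);
  // Assumption: the first non-empty line is the title.
  const [title, ...rest] = lines;

  // Assumption: the ingredient block ends at the last line that starts with
  // quantity + unit; lines in between (e.g. "Saft von ...") are kept as items.
  let lastItem = -1;
  rest.forEach((line, i) => {
    if (STARTS_WITH_QUANTITY.test(line)) lastItem = i;
  });

  const items = rest.slice(0, lastItem + 1);
  const preparation = rest
    .slice(lastItem + 1)
    .join('\n')
    .replace(/(\p{L})-\n(?=\p{Ll})/gu, '$1') // re-join words hyphenated at a line break
    .replace(/\n/g, ' ');

  return { title, items, preparation };
}

// Shortened version of the OCR text from the question.
const ocrText = [
  'Quinoa-Brot',
  '30 g Chiasamen',
  '1/2 TL Speisenatron',
  'Saft von 1/2 Zitrone',
  '1 Handvoll Sonnenblumenkerne',
  '1 Chiasamen mit 100 ml Wasser',
  'verrühren und 30 Minuten quel-',
  'len lassen.',
].join('\n');

console.log(parseReceipt(ocrText));
```

The heuristic breaks down as soon as a wrapped instruction line happens to start with something like "100 ml Wasser", so it is best treated as one signal among several, not as a complete parser.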

I would reconsider using RegExp here at all...

