How can I correct invalid JSON containing extraneous quotes returned by an OCR engine?

Question

Our OCR engine returns its results as JSON data: {"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0}

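Note that the value of "WordText" contains a double quote after a backslash. When I parse it with json.loads, I get an "Expecting delimiter" error. The OCR engine produces a lot of these errors whenever it encounters a double quote in the text. There does not seem to be any way to change the OCR output, so I need to write post-processing code to correct these errors.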

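I would be happy to strip any double quote that does not come right after a colon or right before a comma, but I do not know how to do that efficiently in Python or with a regular expression.

Does anyone have suggestions or tools for cleaning up this kind of JSON problem?

Tags: python, json, regex, ocr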

Answer

Does this question help with the extra escaping?

Dump to JSON adds additional double quotes and escaping of quotes

This may not be perfect (using two regex patterns feels a bit crude to me), but for the given JSON...

{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}

...this code...

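import re
from io import StringIO

import pandas as pd

# matches two consecutive quotes, or quote-backslash-quote; replace with: "
pattern1 = re.compile(r'(?i)(\"\"|\"\\\")')
# matches a quote between two word characters, e.g. V"L; replace with: \1\3
pattern2 = re.compile(r'(?i)(\w)(\")(\w)')

# this is not a raw string, so Python reads \" as a plain " and the first
# record reaches the regex as ""*EET"
data = '''
[{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}]
'''

data = pattern1.sub(r'"', data)
data = pattern2.sub(r'\1\3', data)

# load it into a pandas DataFrame just to prove it is valid
# (recent pandas versions want literal JSON wrapped in StringIO)
df = pd.read_json(StringIO(data))

print(df)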

Output...

  WordText  Left  Top  Height  Width
0     *EET    88  153       7     21
1     4512     1   94       7     24
2      IVL    98  135       6     13

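It may also be worth looking at the extra-escaping link at the start of this answer to see whether the problem originates there. This may be useful too...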

How to remove backslash from JSON file
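
The two patterns above delete the stray quotes, so a quote that the OCR engine really saw is lost. If you would rather keep it, one option is to anchor on the keys instead. The following is only a sketch under a strong assumption that is not confirmed by the question: every record has "WordText" immediately followed by "Left", one record per line. The helper name escape_wordtext is hypothetical.

```python
import json
import re

# Assumption: "WordText" is always directly followed by the "Left" key, and a
# record never spans more than one line (the dot does not match newlines).
WORDTEXT = re.compile(r'"WordText":\s*"(.*?)"\s*,\s*"Left"')


def escape_wordtext(raw):
    """Re-encode each WordText value so quotes inside it are properly escaped."""
    def fix(match):
        text = match.group(1)
        # Undo any escaping the engine already applied, so nothing is escaped twice.
        text = text.replace('\\"', '"')
        # json.dumps adds the surrounding quotes and escapes the inner ones.
        return '"WordText":' + json.dumps(text) + ',"Left"'
    return WORDTEXT.sub(fix, raw)


# Raw string, so the backslash in the first record is kept exactly as the engine sent it.
raw = r'''
[{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}]
'''

records = json.loads(escape_wordtext(raw))
for record in records:
    print(record["WordText"], record["Left"])
```

This keeps the quote characters in the recognized text, at the cost of depending on the key order. If the engine ever changes the order of the keys, the pattern would need to be adjusted.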

Update:

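Here is new code containing a sample of corrupted JSON that is repaired by two regex patterns. I do not have your JSON, but it shows that regular expressions should help with the kinds of corruption described so far. I have commented the code to help explain it.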

Code:

import re
from io import StringIO

import pandas as pd

# compile a pattern to match "\"text" OR ""text", which needs replacing with a single double quote
pattern1 = re.compile(r'(?i)(\"\"|\"\\\")')

# compile a second pattern to match "te"xt", where the inner quote needs to be removed
pattern2 = re.compile(r'(?i)\b(\")\b')

# if this was the input (good_data) it would work without any clean up
good_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["Erik", "Daniel", "Michael", "Sven",
                "Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731",
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012",
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Management", "IT", "HR",
                      "Finance", "IT", "Management", "IT"],
        "Sex":[ "M", "M", "M",
              "M", "M", "F", "F", "F"]}
'''

# copied good_data and corrupted it with "\"Erik", ""Gary", and "Mana"gement"
# (not a raw string, so Python reads \" as a plain " and "\"Erik" becomes ""Erik")
bad_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
        "Name":["\"Erik", "Daniel", "Michael", "Sven",
                ""Gary", "Carol","Lisa", "Elisabeth" ],
        "Salary":["723.3", "515.2", "621", "731",
                  "844.15","558", "642.8", "732.5" ],
        "StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
                     "6/11/2013", "3/27/2011","5/21/2012",
                     "7/30/2013", "6/17/2014"],
        "Department":[ "IT", "Management", "IT", "HR",
                      "Finance", "IT", "Mana"gement", "IT"],
        "Sex":[ "M", "M", "M",
              "M", "M", "F", "F", "F"]}
'''

# run the bad_data through the find and replace for the two patterns
# the first one finds mistakes such as "\"text" OR ""text" and replaces them with a single double quote
bad_data = pattern1.sub(r'"', bad_data)

# the second pattern finds a double quote on its own in the middle of a word, like "te"xt", and removes it
bad_data = pattern2.sub(r'', bad_data)

# read the fixed bad_data into a pandas DataFrame to check it is valid
# (recent pandas versions want literal JSON wrapped in StringIO)
df = pd.read_json(StringIO(bad_data))

# print out the df
print(df)

Output:

   Sub_ID       Name  Salary   StartDate  Department Sex
0       1       Erik  723.30    1/1/2011          IT   M
1       2     Daniel  515.20   7/23/2013  Management   M
2       3    Michael  621.00  12/15/2011          IT   M
3       4       Sven  731.00   6/11/2013          HR   M
4       5       Gary  844.15   3/27/2011     Finance   M
5       6      Carol  558.00   5/21/2012          IT   F
6       7       Lisa  642.80   7/30/2013  Management   F
7       8  Elisabeth  732.50   6/17/2014          IT   F
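
One limitation worth checking before using these patterns on real data: the first pattern rewrites every pair of consecutive quotes, so a field that is legitimately empty ("") would be damaged by the cleanup. The sketch below illustrates this with the standard json module instead of pandas. The helper names clean and is_valid_json and the one-line samples are hypothetical, made up for this illustration.

```python
import json
import re

# Same two patterns as in the updated code above.
pattern1 = re.compile(r'(?i)(\"\"|\"\\\")')
pattern2 = re.compile(r'(?i)\b(\")\b')


def clean(raw):
    """Apply both substitutions in the same order as the answer does."""
    return pattern2.sub('', pattern1.sub('"', raw))


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The kinds of corruption the answer targets are repaired...
repaired = clean('{"Name":[""Gary","Mana"gement"]}')
print(repaired, is_valid_json(repaired))

# ...but a valid record with an empty string is broken by the same cleanup.
empty_value = '{"WordText":"","Left":1.0}'
print(is_valid_json(empty_value), is_valid_json(clean(empty_value)))
```

If the OCR engine can return empty WordText values, it may be safer to run the cleanup only on records that fail to parse in the first place.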

If you comment out the regex substitutions...

bad_data = pattern1.sub(r'"', bad_data)
bad_data = pattern2.sub(r'', bad_data)

...and let pandas read the bad JSON...

ValueError: Unexpected character found when decoding array value (2)

...which is expected.
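
The pandas message does not say where the input went wrong. When you need to find the offending quote, the standard json module can help, because json.JSONDecodeError carries the message (msg) and the character offset (pos) of the failure. This is a minimal sketch; the helper name describe_json_error and the shortened one-record sample are hypothetical.

```python
import json


def describe_json_error(text):
    """Return (message, offset) for the first syntax error, or None if text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.pos
    return None


# Shortened record in the style of the question, with an unescaped quote inside the value.
broken = '{"WordText":"IV"L","Left":98.0}'

problem = describe_json_error(broken)
if problem is not None:
    message, offset = problem
    print(message, "at offset", offset)
    # Show the input up to the point where the parser gave up.
    print(broken[:offset] + " <-- parser stopped here")
```

The reported offset is where the parser gave up, which is just after the stray quote rather than on it, so look one or two characters to the left when inspecting the input.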

