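python - How can incorrect JSON containing extraneous quotes be corrected when it is returned by an OCR engine?
Problem description
Our OCR engine returns its results as JSON data: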
{"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0}
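Note that the value of "WordText" contains a double quote after a backslash. When I parse it with json.loads, I get an "Expecting delimiter" error. The OCR engine produces a lot of these errors whenever it encounters a double quote in the text. There does not seem to be any way to change the OCR output, so I need to write post-processing code to correct these errors.
I would be happy to eliminate any double quote that does not come after a colon or before a comma, but I don't know how to do that efficiently in Python or with a regular expression.
Does anyone have a suggestion or a tool that can clean up this kind of JSON problem?
Solution
Does this help with the extra escapes...
This may not be perfect (using two regex patterns feels a bit crude to me), but for the given JSON...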
{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}
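This code...
import re
from io import StringIO
import pandas as pd
# matches "" or "\" (a doubled or backslash-escaped quote at the start of a value)
pattern1 = re.compile(r'(?i)(\"\"|\"\\\")') # replace with: "
# matches a double quote sitting between two word characters, as in IV"L
pattern2 = re.compile(r'(?i)(\w)(\")(\w)') # replace with: \1\3
# raw string, so the backslash in \" reaches the regex instead of being consumed by Python
data = r'''
[{"WordText":"\"*EET", "Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
{"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
{"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}]
'''
data = pattern1.sub(r'"', data)
data = pattern2.sub(r'\1\3', data)
# load it into a pandas dataframe just to prove it is valid
# (wrapped in StringIO because recent pandas versions deprecate passing a literal JSON string)
df = pd.read_json(StringIO(data))
print(df)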
Output...
  WordText  Left  Top  Height  Width
0     *EET    88  153       7     21
1     4512     1   94       7     24
2      IVL    98  135       6     13
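Perhaps take a look at the extra-escapes link at the start of the answer and see whether the problem lies there. This may also be useful...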
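The two substitutions above can also be checked without pandas. The following is only a minimal sketch: the helper name `clean_ocr_json` and the constant names are hypothetical, and it assumes the only damage is the three kinds shown in the sample data. It uses the standard-library `json.loads` to confirm that the raw text fails to parse and the cleaned text succeeds.

```python
import json
import re

# Same two ideas as the patterns in the answer, written without the extra escapes.
DOUBLED_QUOTE = re.compile(r'""|"\\"')   # ""text"  or  "\"text"
INNER_QUOTE = re.compile(r'(\w)"(\w)')   # te"xt


def clean_ocr_json(text):
    """Hypothetical helper: apply both substitutions and return the cleaned text."""
    text = DOUBLED_QUOTE.sub('"', text)
    return INNER_QUOTE.sub(r'\1\2', text)


# Raw string so that the backslash in \" is kept exactly as the OCR engine sent it.
raw = r'''
[{"WordText":"\"*EET","Left":88.0,"Top":153.0,"Height":7.0,"Width":21.0},
 {"WordText":""4512","Left":1.0,"Top":94.0,"Height":7.0,"Width":24.0},
 {"WordText":"IV"L","Left":98.0,"Top":135.0,"Height":6.0,"Width":13.0}]
'''

# The uncleaned text is rejected by the parser.
try:
    json.loads(raw)
    raw_is_valid = True
except json.JSONDecodeError:
    raw_is_valid = False

# The cleaned text parses into a list of dicts.
records = json.loads(clean_ocr_json(raw))
words = [record["WordText"] for record in records]
print(raw_is_valid, words)
```

Validating with `json.loads` right after cleaning makes it easy to log or skip any record that the patterns fail to repair.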
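**Update:**
Here is new code containing an example of broken JSON that is repaired by the two regex patterns. I don't have your JSON, but it shows that regular expressions should help with the kinds of damage described so far. I have commented the code to help explain it.
Code: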
import re
from io import StringIO
import pandas as pd
# compile a pattern to match "\"text" OR ""text", where the leading pair needs replacing with a single double quote
pattern1 = re.compile(r'(?i)(\"\"|\"\\\")')
# compile a second pattern to match "te"xt", where the inner double quote needs to be removed
pattern2 = re.compile(r'(?i)\b(\")\b')
# if this was the input (good_data) it would work without any clean up
good_data = '''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
"Name":["Erik", "Daniel", "Michael", "Sven",
"Gary", "Carol","Lisa", "Elisabeth" ],
"Salary":["723.3", "515.2", "621", "731",
"844.15","558", "642.8", "732.5" ],
"StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
"6/11/2013", "3/27/2011","5/21/2012",
"7/30/2013", "6/17/2014"],
"Department":[ "IT", "Management", "IT", "HR",
"Finance", "IT", "Management", "IT"],
"Sex":[ "M", "M", "M",
"M", "M", "F", "F", "F"]}
'''
# copied good_data and corrupted it with "\"Erik", ""Gary", and "Mana"gement"
# (raw string, so the backslash in \" is not consumed by Python)
bad_data = r'''
{"Sub_ID":["1","2","3","4","5","6","7","8" ],
"Name":["\"Erik", "Daniel", "Michael", "Sven",
""Gary", "Carol","Lisa", "Elisabeth" ],
"Salary":["723.3", "515.2", "621", "731",
"844.15","558", "642.8", "732.5" ],
"StartDate":[ "1/1/2011", "7/23/2013", "12/15/2011",
"6/11/2013", "3/27/2011","5/21/2012",
"7/30/2013", "6/17/2014"],
"Department":[ "IT", "Management", "IT", "HR",
"Finance", "IT", "Mana"gement", "IT"],
"Sex":[ "M", "M", "M",
"M", "M", "F", "F", "F"]}
'''
# run the bad_data through the find and replace for the two patterns
# first one finds such mistakes as "\"text" OR ""text" and replaces them with a single double quote
bad_data = pattern1.sub(r'"', bad_data)
# second pattern finds a double quote on its own in the middle of a word like "te"xt" and removes it
bad_data = pattern2.sub(r'', bad_data)
# read the fixed bad_data into a pandas dataframe to check it's valid
# (wrapped in StringIO because recent pandas versions deprecate passing a literal JSON string)
df = pd.read_json(StringIO(bad_data))
# print out the df
print(df)
Output:
   Sub_ID       Name  Salary   StartDate  Department  Sex
0       1       Erik  723.30    1/1/2011          IT    M
1       2     Daniel  515.20   7/23/2013  Management    M
2       3    Michael  621.00  12/15/2011          IT    M
3       4       Sven  731.00   6/11/2013          HR    M
4       5       Gary  844.15   3/27/2011     Finance    M
5       6      Carol  558.00   5/21/2012          IT    F
6       7       Lisa  642.80   7/30/2013  Management    F
7       8  Elisabeth  732.50   6/17/2014          IT    F
If you comment out the regex substitutions...
bad_data = pattern1.sub(r'"', bad_data)
bad_data = pattern2.sub(r'', bad_data)
...and let pandas read the bad JSON...
ValueError: Unexpected character found when decoding array value (2)
...which is expected.
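The rule proposed in the question (drop every double quote that is not next to JSON punctuation) can be sketched without regular expressions. This is an illustration under stated assumptions, not a tested tool: the function name `strip_stray_quotes` is hypothetical, and the sketch assumes the recognized text never has a quote directly beside a comma, colon, brace, or bracket. Unlike replacing every `""` with `"`, it leaves legitimate empty string values such as `"WordText":""` intact.

```python
import json

# A quote is treated as structural if the nearest non-space character before it
# can start a string, or the nearest non-space character after it can end one.
OPENERS = set('{[,:')
CLOSERS = set('}],:')


def strip_stray_quotes(text):
    """Hypothetical helper: remove double quotes that are not structural."""
    out = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        # Find the nearest non-whitespace neighbours on both sides.
        j = i - 1
        while j >= 0 and text[j].isspace():
            j -= 1
        k = i + 1
        while k < n and text[k].isspace():
            k += 1
        prev_ch = text[j] if j >= 0 else ''
        next_ch = text[k] if k < n else ''
        if prev_ch in OPENERS or next_ch in CLOSERS:
            out.append(ch)   # structural quote: keep it
        elif out and out[-1] == '\\':
            out.pop()        # stray \" : drop the backslash together with the quote


    return ''.join(out)


raw = r'''
[{"WordText":"\"*EET", "Left":88.0},
 {"WordText":""4512","Left":1.0},
 {"WordText":"IV"L","Left":98.0},
 {"WordText":"","Left":5.0}]
'''

records = json.loads(strip_stray_quotes(raw))
words = [record["WordText"] for record in records]
print(words)  # prints ['*EET', '4512', 'IVL', '']
```

The trade-off is that a quote which the OCR engine really did read next to a comma or colon inside the text would be misclassified as structural, so the result should still be validated with `json.loads`.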