Can't seem to use more than one -c argument for tesseract

Question

I'm just using tesseract through bash scripting. I've finally come up with all the settings that recognize my text for my images nearly perfectly; however, I can't seem to use all of the options together. My command is as follows:

$ tesseract infile.tif outputbase --psm 6 -c tosp_min_sane_kn_sp=0.0;tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-+&/\

I need the whitelist, because tesseract is picking up some lowercase characters, strange characters (such as yen sign), and other oddities. My images do not contain those characters, and since my document is quite simple I figured it would just be easier to whitelist the ones that do exist. Additionally, the image is in a "table" format (without any lines or borders), and tesseract only picks up the large spaces (which separate columns) and not individual spaces in between words within a column. Setting the tosp value to 0 seemed to fix that problem.

Now the issue is that tesseract won't process both of those -c arguments at the same time, but the man page explicitly states that you can use multiple -c arguments!

I've also tried to work around in the following way:

my_config_file
tosp_min_sane_kn_sp 0.0
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-+&/\

$ tesseract infile.tif outputbase --psm 6 my_config_file

The config file is saved in the correct directory, but again only one of the options will work at a time. If both options are in the config file, it seems like it ignores the tosp_min_sane_kn_sp 0.0. If I remove one, then the other works.

I'm pulling out my hair here, and I'm about to just work around this issue by running the OCR twice and then merging the two files with an awk script. I really don't want to do that, however, because it's obviously less efficient, and I don't like the idea of relying on awk when the OCR output isn't guaranteed to be formatted 100% the way my awk script would have to assume.

Please help!

EDIT:

I forgot to mention that I have indeed tried to pass multiple -c options. Rather than guessing at various field separators between variables, a semicolon made the most sense to me, because I understand that tesseract is written in C++, which uses semicolons to signify the end of a statement. I know C++ isn't interpreted, but it just seemed to make sense. Now I'm digressing . . .

Additionally, I've tried the advice of putting the whitelist in quotation marks, but that has made no difference. I was really excited because that didn't even occur to me, but it doesn't seem that tesseract even recognizes quotations even if I run that one -c argument by itself.

Tags: bash, ocr, tesseract

Answer


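You can't pass multiple settings to a single -c option, and certainly not separated by semicolons. I don't have tesseract, but I'm pretty sure you need to pass a separate -c option for each config variable you want to set: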

tesseract infile.tif outputbase --psm 6 -c tosp_min_sane_kn_sp=0.0 -c 'tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-+&/\'

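(I also wrapped the second variable setting in single quotes so the shell won't try to interpret the & or the backslash. Without the quotes, the backslash would escape the newline, so the next line would be treated as a continuation of this one.)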
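If it helps to check the quoting before involving tesseract at all, the arguments can be loaded into the positional parameters and printed one per line. This is only a minimal sketch: the variable name `whitelist` is my own, and the real tesseract call is commented out so the snippet runs without tesseract or infile.tif present.

```shell
# Single quotes keep & and the trailing backslash literal.
whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-+&/\'

# Load the arguments into the positional parameters: one -c per variable.
set -- infile.tif outputbase --psm 6 \
  -c tosp_min_sane_kn_sp=0.0 \
  -c "tessedit_char_whitelist=${whitelist}"

# Show exactly what tesseract would receive, one argument per line.
printf '<%s>\n' "$@"

# Uncomment to run for real (requires tesseract and infile.tif):
# tesseract "$@"
```

Each `<...>` line in the output is one argument, so the whitelist should show up as a single argument ending in a backslash.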

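Explanation of the original problem: when the shell sees a semicolon (and it's not quoted or escaped), it treats it as a command separator. It therefore treats that line as two completely separate commands (with the next line merged in because of the backslash):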

tesseract infile.tif outputbase --psm 6 -c tosp_min_sane_kn_sp=0.0
tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-+&/ <whatever's on the next line of the file>

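The first runs tesseract with a single -c option, and the second creates a shell variable named tessedit_char_whitelist. (Strictly speaking, the unquoted & is a control operator too, so the shell would split the second part once more at that point and run the assignment in the background.) Even if you quoted or escaped the semicolon so that it got passed through to tesseract, I suspect tesseract wouldn't treat it as a separator the way you want.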
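The splitting described above can be reproduced without tesseract. In this sketch, `show_args` is a hypothetical stand-in for the program, and `a` and `b` are made-up variable names; it only illustrates how the shell divides the line, not how tesseract parses what it receives.

```shell
# Hypothetical stand-in for tesseract: prints each argument it receives in <>.
show_args() { printf '<%s>' "$@"; echo; }

# Unquoted semicolon: the command ends at the ";", so show_args only sees
# "-c" and "a=1". "b=2" then runs as a separate shell variable assignment.
unquoted=$(show_args -c a=1;b=2; echo "b=$b")

# Quoted semicolon: the whole string reaches the program as one argument.
quoted=$(show_args -c 'a=1;b=2')

echo "$unquoted"   # prints <-c><a=1> and then b=2 on a second line
echo "$quoted"     # prints <-c><a=1;b=2>
```

In the quoted case the program receives the literal string a=1;b=2, and what it does with the semicolon is then entirely up to the program.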

