Hello everyone,
I've been using the free, open-source Tesseract libraries to perform OCR on JPG documents. Typically there is a free-standing scanner somewhere in the facility into which people feed tons of documents. The scanner swallows these documents like the Cookie Monster and saves them as .jpg files in a directory that an FWH program reads from. Using Tesseract, the FWH app reads each .jpg, performs OCR, and detects an "Encounter" number printed somewhere on the document; this number is the unique key to a record in a .dbf table and is detected with a regular expression. The .jpg and the OCR'd text of the document are then saved to the corresponding record in the table, in a corresponding Memo field.
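To make the detection step concrete, here is a minimal sketch in Python (the real app is written in FWH, so this is only an illustration). The label format and digit count in the pattern are assumptions, not the actual expression used, and `find_encounter` is a hypothetical helper name. The OCR call itself is left out; the function takes the text that the OCR step produced.

```python
import re

# Assumed format (hypothetical): the word "Encounter", an optional
# "No." or "#", an optional colon, then 6 to 10 digits.
ENCOUNTER_RE = re.compile(
    r"Encounter\s*(?:No\.?|#)?\s*:?\s*(\d{6,10})",
    re.IGNORECASE,
)

def find_encounter(ocr_text):
    """Return the encounter number found in the OCR text, or None."""
    match = ENCOUNTER_RE.search(ocr_text)
    return match.group(1) if match else None
```

A clean read such as `Encounter #: 00123456` yields the key, while a read where a digit was misrecognized (for example `0012E456`) returns `None`, which is the failure case described next.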
All this works pretty well, except in 10% of the documents, where Tesseract doesn't OCR with the needed precision and so the "Encounter" number can't be completely extracted. I've tried the Tesseract training tools with some success, but there is still a percentage of documents where the OCR isn't good enough for no apparent reason.
Is there anyone here using any other OCR library with better results?
I don't need a scanning library. I already have one, and it is not needed for this use: the scanner "knows" how to scan into .jpg and save to a directory. What I need is to read a .jpg and convert it to text with a higher degree of accuracy.
Oh, and one last thing: the font on these documents isn't always the same, and I have no way of controlling that.
Thank you,
Reinaldo.