FiveTech Software tech support forums

by **reinaldocrespo** » Mon Aug 04, 2014 1:29 pm

Hello everyone;

I wonder if anyone here has any experience implementing an OCR engine for their scanned documents. I've been using the free DOSADI EzTw32.dll lib to scan documents. I now need to add OCR.

There is an open-source C++ OCR library maintained by Google called Tesseract. It seems popular among Visual Studio developers in the Windows world. http://code.google.com/p/tesseract-ocr/ ... eract_3.0x

Can someone share their experience or any ideas how to use it from (x)Harbour?

Best regards,

Reinaldo.

by **Gale FORd** » Mon Aug 04, 2014 2:06 pm

EZTwain has built-in support for Transym OCR Software's TOCR. I have used it and it is pretty straightforward, especially if you are using EZTwain already.
It is not free so that might be a problem.

by **reinaldocrespo** » Tue Aug 05, 2014 1:02 pm

Thank you for the feedback, Gale. It looks like an alternative and the price isn't bad if it works. Do you have any code that you can share?

Reinaldo.

by **Gale FORd** » Tue Aug 05, 2014 3:56 pm

Actually I have a complete test scan app on the fivewin contributions section.
Here is the link: https://code.google.com/p/fivewin-contributions/downloads/detail?name=testscancomplete.zip&can=2&q=gale

One of the things it can do is let you band (rubber-band select) an area on screen and have it OCR that area to text.
You will need a valid license for it to work, either the demo version or a purchased one.
Here is a simpler sample:

```
#define EZOCR_ENGINE_NONE    0   // 'null' OCR engine - turns off OCR.
#define EZOCR_ENGINE_TRANSYM 1   // TOCR engine by Transym Ltd.

static cDirectory := ""   // folder holding the .tif files, with trailing backslash; empty = current directory
static aFiles
static lRunBarcode
static lRunOCR

// Change the values below to designate area to convert/OCR/Barcode
static nAreaStartX := 1434
static nAreaStartY := 369
static nAreaWidth  := 578
static nAreaHeight := 72
static nAreaFill   := -1

function testscan
   local lIsAvailable, nResult
   local nLen

   lRunOcr := .t.
   lRunBarcode := .f.

   lIsAvailable := TW_Avail()
   if lIsAvailable
      // Replace license information with your own information!!!
      // nLicenseNo is not defined in this sample; supply your own license number.
      TW_UniversalLicense( 'My Company', nLicenseNo )
   endif

   nResult := TW_LoadSourceManager()
   if nResult = -1
      ? 'Error loading source manager'
      wait
      quit
   endif

   nLen := adir( cDirectory+"*.tif" )
   if nLen = 0
      ? "No tif files to check"
      quit
   endif
   aFiles := array( nLen )
   adir( cDirectory+"*.tif", aFiles )

   gdf_barcode()
return nil

function gdf_barcode()
   local nReturn, hDib, nCounter, cTestFile
   local hDibOcr
   local cBarValue
   local nOCREngine
   local cFilename
   local nCountFiles
   local cText, nNoBarCodes

   if .not. lRunOcr .and. .not. lRunBarcode
      ? 'No files processed'
      return nil
   endif

   for nCountFiles := 1 to len( aFiles )
      cFileName := cDirectory+aFiles[ nCountFiles ]
      if empty( cFileName ) .or. !file( cFileName )
         return .f.
      endif
      hDib := TW_LOADFROMFILENAME( cFileName )
      tracelog( cFileName, TW_DibWidth(hDib), TW_DibHeight(hDib) )
      cBarValue := ''
      ? '------------------------- '+aFiles[ nCountFiles ]

      if lRunOCR .and. hDib > 0
         if .not. TW_OCR_SELECTENGINE(EZOCR_ENGINE_TRANSYM) > 0
            ? "Dosadi OCR component is not installed"
         else
            //? 'OCR Engine available'
            nOCREngine := TW_OCR_SELECTEDENGINE()
            //? TW_OCR_ENGINENAME( nOCREngine )
            nReturn := 0
            // Copy only the zone of interest, then OCR just that zone
            hDibOcr := TW_DIBREGIONCOPY( hDib, ;
                                         nAreaStartX, ;
                                         nAreaStartY, ;
                                         nAreaWidth , ;
                                         nAreaHeight, ;
                                         nAreaFill )
            nReturn := TW_OCR_RECOGNIZEDIB( hDibOcr )
            //nReturn := TW_OCR_RECOGNIZEDIBZONE( hDib, 100, 100, 578, 72 )
            //nReturn := TW_OCR_RECOGNIZEDIBZONE( hDib, 1434, 369, 578, 72 )
            cText := ''
            do case
            case nReturn > 0
               cText := TW_OCR_Text()
            case nReturn = -1
               ? 'OCR Services or selected engine not available'
            case nReturn = -3
               ? 'Image handle is null or void'
            case nReturn = -5
               ? 'Internal error or OCR engine returned error'
            endcase
            ? 'OCR Text: '
            if .not. empty( cText )
               //memowrit( 'test.txt', cText )
               /*
               ? at( chr(12), cText )
               ? asc( substr( cText, 33, 1) )
               ? asc( substr( cText, 34, 1))
               ? asc( substr( cText, 35, 1))
               */
               //? Token( cText, chr(10), 3 )
               //? memoline( cText, 3 )
               ?? ctext
            endif
            TW_Free( hDibOcr )
         endif
      endif

      if lRunBarcode .and. hDib > 0
         if .not. TW_Barcode_Avail()
            ? "Dosadi barcode component is not installed"
         else
            nNoBarCodes := Tw_BARCODE_RECOGNIZE(hDib,-1,-1)
            if nNoBarCodes > 0
               for nCounter := 1 to nNoBarCodes
                  ? 'Barcode'+str(nCounter,1)+": "+Tw_BARCODE_TEXT( nCounter-1 )
               next
               //::cBarValue := ::oScanner:BC_Text(0)
               //::cBarValue := alltrim(::cBarValue)
            else
               ? ' Barcode: None'
            endif
         endif
      endif

      if hDib > 0
         TW_Free( hDib )
      endif
   next
   hDib := nil
return nil

#pragma BEGINDUMP

#include <windows.h>
#include "eztwain.h"
#include "hbapi.h"

/*--------- Top-Level Calls -------------------------------------*/

HB_FUNC( TW_UNIVERSALLICENSE )
{
   TWAIN_UniversalLicense( hb_parc(1), hb_parni(2) );
   hb_ret();
}

HB_FUNC( TW_ACQUIRE ) // hWnd
{
   hb_retnl( ( LONG )TWAIN_Acquire( ( HWND ) hb_parnl( 1 ) ) );
}

HB_FUNC( TW_FREE ) // hDib
{
   DIB_Free( ( HANDLE ) hb_parnl( 1 ) );
   hb_ret();
}

HB_FUNC( TW_SELECTIMAGESOURCE ) // hWnd
{
   hb_retni( TWAIN_SelectImageSource( ( HWND ) hb_parnl( 1 ) ) );
}

HB_FUNC( TW_ACQUIRETOCLIPBOARD ) // hWnd, nPixTypes
{
   hb_retl( TWAIN_AcquireToClipboard( ( HWND ) hb_parnl( 1 ), (unsigned) hb_parni(2) ) );
}

HB_FUNC( TW_ACQUIREMEMORY ) // hWnd
{
   hb_retnl( ( LONG )TWAIN_AcquireMemory( ( HWND ) hb_parnl( 1 ) ) );
}

HB_FUNC( TW_ACQUIRETOFILENAME ) // hWnd, cFileName
{
   hb_retni( TWAIN_AcquireToFilename( ( HWND ) hb_parnl( 1 ), hb_parc( 2 ) ) );
}

HB_FUNC( TW_ACQUIREMULTIPAGEFILE ) // hWnd, cFileName
{
   hb_retni( TWAIN_AcquireMultipageFile( ( HWND ) hb_parnl( 1 ), hb_parc( 2 ) ) );
}

HB_FUNC( TW_ACQUIREFILE ) // hWnd, nFF, cFileName
{
   hb_retni( TWAIN_AcquireFile( ( HWND ) hb_parnl( 1 ), hb_parni( 2 ) ,hb_parc( 3 ) ) );
}

//--------- Basic TWAIN Inquiries

HB_FUNC( TW_AVAIL )
{
   hb_retl( TWAIN_IsAvailable() );
}

HB_FUNC( TW_EASYVERSION)
{
   hb_retni( TWAIN_EasyVersion() );
}

HB_FUNC( TW_STATE )
{
   hb_retni( TWAIN_State() );
}

HB_FUNC( TW_SOURCENAME )
{
   hb_retc( TWAIN_SourceName() );
}

HB_FUNC( TW_GETSOURCENAME ) // pzName
{
   TWAIN_GetSourceName( (LPSTR) hb_parc( 1 ) );
   hb_ret();
}

//--------- DIB handling utilities ---------
// Note: the originals called hb_parni() on the result, which discards it;
// hb_retni() is what actually returns the value to the Harbour caller.

HB_FUNC( TW_DIBWJPG ) // hDib, cName
{
   hb_retni( DIB_WriteToJpeg( ( HANDLE ) hb_parnl(1), hb_parc( 2 ) ) );
}

HB_FUNC( TW_DIBWBMP ) // hDib, cName
{
   hb_retni( DIB_WriteToBmp( ( HANDLE ) hb_parnl(1), hb_parc( 2 ) ) );
}

HB_FUNC( TW_DIBTOFILENAME ) // hDib, cName
{
   hb_retni( TWAIN_WriteToFilename( ( HANDLE ) hb_parnl(1), hb_parc( 2 ) ) );
}

//--------- File Read/Write

HB_FUNC( TW_ISJPG )
{
   hb_retl( TWAIN_IsJpegAvailable() );
}

HB_FUNC( TW_SETSAVEFORMAT )
{
   hb_retni( TWAIN_SetSaveFormat( hb_parni( 1 ) ) );
}

HB_FUNC( TW_GETSAVEFORMAT )
{
   hb_retni( TWAIN_GetSaveFormat() );
}

HB_FUNC( TW_SETJPEGQUALITY ) // nQuality 1...100
{
   TWAIN_SetJpegQuality( hb_parni( 1 ) );
   hb_ret();
}

HB_FUNC( TW_GETJPEGQUALITY )
{
   hb_retni( TWAIN_GetJpegQuality() );
}

HB_FUNC( TW_WRITENATIVETOFILENAME )
{
   hb_retni( TWAIN_WriteNativeToFilename( (HANDLE) hb_parnl(1), hb_parc(2) ));
}

HB_FUNC( TW_LOADNATIVEFROMFILENAME )
{
   hb_retnl( (LONG) TWAIN_LoadNativeFromFilename( hb_parc( 1 ) ) );
}

HB_FUNC( TW_LOADFROMFILENAME )
{
   hb_retnl( (LONG) DIB_LoadFromFilename( hb_parc( 1 ) ) );
}

HB_FUNC( TW_DIBSCALEDCOPY ) // hDib, cName
{
   hb_retnl( (LONG) DIB_ScaledCopy( ( HANDLE ) hb_parnl( 1 ), hb_parni( 2 ), hb_parni( 3 ) ) );
}

HB_FUNC( TW_DIBCOPY ) // hDib, cName
{
   hb_retnl( (LONG) DIB_Copy( ( HANDLE ) hb_parnl( 1 )) );
}

HB_FUNC( TW_DIBWIDTH )
{
   hb_retni( DIB_Width( ( HANDLE ) hb_parnl(1) ) );
}

HB_FUNC( TW_DIBHEIGHT )
{
   hb_retni( DIB_Height( (HANDLE) hb_parnl(1) ));
}

//--------- Global Options ----------------------------------------------

HB_FUNC( TW_SETMULTITRANSFER )
{
   TWAIN_SetMultiTransfer( hb_parni( 1 ) );
   hb_ret();
}

HB_FUNC( TW_GETMULTITRANSFER )
{
   hb_retni( TWAIN_GetMultiTransfer() );
}

HB_FUNC( TW_SETHIDEUI ) // nHide
{
   TWAIN_SetHideUI( hb_parni( 1) );
   hb_ret();
}

HB_FUNC( TW_GETHIDEUI )
{
   hb_retni( TWAIN_GetHideUI() );
}

HB_FUNC( TW_DISABLEPARENT )
{
   TWAIN_DisableParent( hb_parni( 1 ) );
   hb_ret();
}

HB_FUNC( TW_GETDISABLEPARENT )
{
   hb_retni( TWAIN_GetDisableParent() );
}

HB_FUNC( TW_REGISTERAPP )
{
   TWAIN_RegisterApp( hb_parni(1),hb_parni(2),hb_parni(3),hb_parni(4),
                      hb_parc(5), hb_parc(6), hb_parc(7), hb_parc(8) );
   hb_ret();
}

HB_FUNC( TW_SETAPPTITLE )
{
   TWAIN_SetAppTitle( hb_parc( 1 ) );
   hb_ret();
}

HB_FUNC( TW_SETAPPLICATIONKEY )
{
   TWAIN_SetApplicationKey( hb_parni( 1 ) );
   hb_ret();
}

//--------- TWAIN State Control ---------------------------------------

HB_FUNC( TW_LOADSOURCEMANAGER )
{
   hb_retni( TWAIN_LoadSourceManager() );
}

HB_FUNC( TW_OPENSOURCEMANAGER ) // hWnd
{
   hb_retni( TWAIN_OpenSourceManager( ( HWND ) hb_parnl( 1 ) ) );
}

HB_FUNC( TW_OPENDEFAULTSOURCE )
{
   hb_retni( TWAIN_OpenDefaultSource() );
}

HB_FUNC( TW_GETSOURCELIST )
{
   hb_retni( TWAIN_GetSourceList() );
}

HB_FUNC( TW_GETNEXTSOURCENAME )
{
   hb_retni( TWAIN_GetNextSourceName( ( char * ) hb_parc( 1 ) ) );
}

HB_FUNC( TW_GETDEFAULTSOURCENAME )
{
   hb_retni( TWAIN_GetDefaultSourceName( ( char * ) hb_parc( 1 ) ));
}

HB_FUNC( TW_OPENSOURCE )
{
   hb_retni( TWAIN_OpenSource( hb_parc( 1 ) ) );
}

HB_FUNC( TW_ENABLESOURCE ) // hWnd
{
   hb_retni( TWAIN_EnableSource( ( HWND ) hb_parnl( 1 ) ) );
}

HB_FUNC( TW_DISABLESOURCE )
{
   hb_retni( TWAIN_DisableSource( ) );
}

HB_FUNC( TW_CLOSESOURCE )
{
   hb_retni( TWAIN_CloseSource() );
}

HB_FUNC( TW_CLOSESOURCEMANAGER )
{
   hb_retni( TWAIN_CloseSourceManager( (HWND) hb_parnl( 1 ) ) );
}

HB_FUNC( TW_UNLOADSOURCEMANEGER )
{
   hb_retni( TWAIN_UnloadSourceManager() );
}

//--------- High-level Capability Negotiation Functions --------------
// These functions should only be called in State 4 (TWAIN_SOURCE_OPEN)

HB_FUNC( TW_GETCURRENTUNITS )
{
   hb_retni( TWAIN_GetCurrentUnits() );
}

HB_FUNC( TW_SETCURRENTUNITS ) // nUnits
{
   hb_retni( TWAIN_SetCurrentUnits( hb_parni( 1 ) ) );
}

HB_FUNC( TW_GETBITDEPTH )
{
   hb_retni( TWAIN_GetBitDepth() );
}

HB_FUNC( TW_SETBITDEPTH )
{
   hb_retni( TWAIN_SetBitDepth( hb_parni( 1 ) ) );
}

HB_FUNC( TW_GETPIXELTYPE )
{
   hb_retni( TWAIN_GetPixelType() );
}

HB_FUNC( TW_SETCURRENTPIXELTYPE ) // nBits
{
   hb_retni( TWAIN_SetCurrentPixelType( hb_parni( 1 ) ) );
}

HB_FUNC( TW_GETCURRENTRESOLUTION )
{
   hb_retnd( TWAIN_GetCurrentResolution());
}

HB_FUNC( TW_GETYRESOLUTION )
{
   hb_retnd( TWAIN_GetYResolution());
}

HB_FUNC( TW_SETCURRENTRESOLUTION ) // dRes
{
   hb_retni( TWAIN_SetCurrentResolution( hb_parnd( 1 ) ) );
}

HB_FUNC( TW_SETXRESOLUTION )
{
   hb_retni( TWAIN_SetXResolution( hb_parnd( 1 ) ) );
}

HB_FUNC( TW_SETYRESOLUTION )
{
   hb_retni( TWAIN_SetYResolution( hb_parnd( 1 ) ) );
}

HB_FUNC( TW_SETCONTRATS ) //dCon
{
   hb_retni( TWAIN_SetContrast( hb_parnd( 1 ) ) ); // -1000....+1000
}

HB_FUNC( TW_SETBRIGHTNESS ) //dBri
{
   hb_retni( TWAIN_SetBrightness( hb_parnd( 1 ) ) ); // -1000....+1000
}

HB_FUNC( TW_SETTHRESHOLD )
{
   hb_retni( TWAIN_SetThreshold( hb_parnd( 1 ) ) );
}

HB_FUNC( TW_GETCURRENTTHRESHOLD )
{
   hb_retnd( TWAIN_GetCurrentThreshold() );
}

HB_FUNC( TW_SETXFERMECH )
{
   hb_retni( TWAIN_SetXferMech( hb_parni( 1 ) ) );
}

HB_FUNC( TW_XFERMECH )
{
   hb_retni( TWAIN_XferMech() );
}

HB_FUNC( TW_SUPPORTSFILEXFER )
{
   hb_retni( TWAIN_SupportsFileXfer() );
}

HB_FUNC( TW_SETPAPERSIZE ) // nTypePaper
{
   hb_retni( TWAIN_SetPaperSize( hb_parni( 1 ) ) );
}

//-------- Document Feeder ---------------------------------

HB_FUNC( TW_HASFEEDER )
{
   hb_retl( TWAIN_HasFeeder() );
}

HB_FUNC( TW_ISFEEDERSELECTED )
{
   hb_retl( TWAIN_IsFeederSelected() );
}

HB_FUNC( TW_SELECTFEEDER )
{
   hb_retl( TWAIN_SelectFeeder( hb_parl( 1 ) ) );
}

HB_FUNC( TW_ISAUTOFEEDON )
{
   hb_retl( TWAIN_IsAutoFeedOn() );
}

HB_FUNC( TW_SETAUTOFEDD )
{
   hb_retl( TWAIN_SetAutoFeed( hb_parl( 1 ) ) );
}

HB_FUNC( TW_ISFEEDERLOADED )
{
   hb_retl( TWAIN_IsFeederLoaded() );
}

//-------- Duplex Scanning ------------------------------------------

HB_FUNC( TW_GETDUPLEXSUPPORT )
{
   hb_retni( TWAIN_GetDuplexSupport() );
}

HB_FUNC( TW_ENABLEDUPLEX )
{
   hb_retni( TWAIN_EnableDuplex( hb_parni( 1 ) ) );
}

HB_FUNC( TW_ISDUPLEXENABLED )
{
   hb_retl( TWAIN_IsDuplexEnabled() );
}

//--------- Other 'exotic' capabilities --------

HB_FUNC( TW_HASCONTROLLABLEUI )
{
   hb_retni( TWAIN_HasControllableUI() );
}

HB_FUNC( TW_SETINDICATORS )
{
   hb_retni( TWAIN_SetIndicators( hb_parl( 1 ) ) );
}

HB_FUNC( TW_COMPRESSION )
{
   hb_retni( TWAIN_Compression() );
}

HB_FUNC( TW_SETCOMPRESSION )
{
   hb_retni( TWAIN_SetCompression( hb_parni( 1 ) ) );
}

HB_FUNC( TW_TILED )
{
   hb_retl( TWAIN_Tiled() );
}

HB_FUNC( TW_SETTILED )
{
   hb_retni( TWAIN_SetTiled( hb_parl( 1 ) ) );
}

HB_FUNC( TW_PLANARCHUNKY )
{
   hb_retni( TWAIN_PlanarChunky() );
}

HB_FUNC( TW_SETPLANARCHUNKY )
{
   hb_retni( TWAIN_SetPlanarChunky( hb_parni( 1 ) ) );
}

HB_FUNC( TW_PIXELFLAVOR )
{
   hb_retni( TWAIN_PixelFlavor() );
}

HB_FUNC( TW_SETPIXELFLAVOR )
{
   hb_retni( TWAIN_SetPixelFlavor( hb_parni( 1 ) ) );
}

HB_FUNC( TW_SETLIGHTPATH )
{
   hb_retni( TWAIN_SetLightPath( hb_parl( 1 ) ) );
}

HB_FUNC( TW_SETAUTOBRIGHT )
{
   hb_retni( TWAIN_SetAutoBright( hb_parl( 1 ) ) );
}

HB_FUNC( TW_SETGAMMA )
{
   hb_retni( TWAIN_SetGamma( hb_parnd( 1 ) ) );
}

HB_FUNC( TW_SETSHADOW )
{
   hb_retni( TWAIN_SetShadow( hb_parnd( 1 ) ) );
}

HB_FUNC( TW_SETHIGHLIGHT )
{
   hb_retni( TWAIN_SetHighlight( hb_parnd( 1 ) ) );
}

//--------- Barcode --------

HB_FUNC( TW_BARCODE_AVAIL )
{
   hb_retl( BARCODE_IsAvailable() );
}

HB_FUNC( TW_BARCODE_SELENGINE )
{
   hb_retni( BARCODE_SelectedEngine() );
}

HB_FUNC( TW_BARCODE_TEXT )
{
   hb_retc( BARCODE_Text( hb_parni( 1 ) ));
}

HB_FUNC( TW_BARCODE_TYPE )
{
   hb_retni( BARCODE_Type( hb_parni( 1 ) ));
}

HB_FUNC( TW_BARCODE_GETTEXT )
{
   BARCODE_GetText( hb_parni( 1 ), (LPSTR) hb_parc(2) );
   hb_ret();
}

HB_FUNC( TW_BARCODE_RECOGNIZE )
{
   hb_retni( BARCODE_Recognize( (HANDLE) hb_parnl(1),hb_parni( 2 ),hb_parni( 3 ) ) );
}

//--------- OCR --------

HB_FUNC( TW_OCR_SELECTENGINE ) // nEngine
{
   hb_retni( OCR_SelectEngine( hb_parni( 1 ) ) );
}

HB_FUNC( TW_OCR_SELECTEDENGINE )
{
   hb_retni( OCR_SelectedEngine() );
}

HB_FUNC( TW_OCR_ENGINENAME )
{
   hb_retc( OCR_EngineName( hb_parni( 1 ) ) );
}

HB_FUNC( TW_OCR_TEXT )
{
   hb_retc( OCR_Text() );
}

HB_FUNC( TW_OCR_RECOGNIZEDIB ) // hDib
{
   hb_retni( OCR_RecognizeDib( (HANDLE) hb_parnl(1) ) );
}

/*
HB_FUNC( TW_OCR_RECOGNIZEDIBZONE ) // hDib, Left, Top, Width, Height
{
   hb_retni( OCR_RecognizeDibZone( (HANDLE) hb_parnl(1), hb_parni( 2 ), hb_parni( 3 ), hb_parni( 4 ), hb_parni( 5 ) ) );
}
*/

HB_FUNC( TW_DIBREGIONCOPY ) // hDib, Left, Top, Width, Height
{
   hb_retnl( (LONG) DIB_RegionCopy( (HANDLE) hb_parnl(1), hb_parni( 2 ), hb_parni( 3 ), hb_parni( 4 ), hb_parni( 5 ), hb_parni( 6 ) ) );
}

//--------- Image Layout (Region of Interest) --------

HB_FUNC( TW_SETIMAGELAYOUT ) // left, top, right, bottom
{
   hb_retni( TWAIN_SetImageLayout( hb_parnd( 1 ),hb_parnd( 2 ),hb_parnd( 3 ),hb_parnd( 4 ) ) );
}

HB_FUNC( TW_RESETIMAGELAYOUT )
{
   TWAIN_ResetImageLayout();
   hb_ret();
}

HB_FUNC( TW_GETIMAGELAYOUT )
{
   double L,T,R,B;
   int nRet;
   nRet = TWAIN_GetImageLayout( &L,&T,&R,&B );
   hb_stornd( L, 1 );
   hb_stornd( T, 2 );
   hb_stornd( R, 3 );
   hb_stornd( B, 4 );
   hb_retni( nRet );
}

HB_FUNC( TW_GETDEFAULTIMAGELAYOUT )
{
   double L,T,R,B;
   int nRet;
   nRet = TWAIN_GetDefaultImageLayout( &L, &T, &R, &B);
   hb_stornd( L, 1 );
   hb_stornd( T, 2 );
   hb_stornd( R, 3 );
   hb_stornd( B, 4 );
   hb_retni( nRet );
}

//HANDLE EZTAPI DIB_GetFromClipboard(void);
HB_FUNC( TW_DIB_GETFROMCLIPBOARD )
{
   hb_retnl( (LONG) DIB_GetFromClipboard() );
}

// so as not to depend on Five(Win), and Harbour can work all on its own ;)
HB_FUNC ( GETACTIVEWINDOW )
{
   hb_retnl( ( LONG ) GetActiveWindow() );
}

HB_FUNC( TW_SETPIXELTYPE )
{
   hb_retni( TWAIN_SetPixelType( hb_parni( 1 ) ) );
}

#pragma ENDDUMP
```

by **reinaldocrespo** » Tue Mar 03, 2015 10:54 pm

Hey Gale;

Don't you have to purchase Dosadi EzTwain Pro ($999) for the OCR to work?

For example, the function TW_OCR_SELECTENGINE(EZOCR_ENGINE_TRANSYM) is not in the free version of the API. Am I missing something here?

Thank you for your help,

Reinaldo.

by **reinaldocrespo** » Thu Mar 02, 2017 7:46 pm

Gale FORd wrote: EZTwain has built-in support for Transym OCR Software's TOCR. I have used it and it is pretty straightforward, especially if you are using EZTwain already.
It is not free so that might be a problem.

Hey Gale;

What exactly do I need to purchase to get OCR? Is it the Transym OCR license or Dosadi pro ver 4?

I'm only interested in extracting text from scanned .tiff pages. The scanning is already done by the time I get the .tiff page images.

Please help.

Reinaldo.

by **frose** » Fri Mar 03, 2017 8:31 am

I have tested Tesseract from the command line for german, german-frak and english with exciting results. Language data is available for over 100 languages!
With version 3.03 and higher, Tesseract can produce searchable PDFs from images!
On this site https://github.com/UB-Mannheim/tesseract/wiki you'll find Windows binaries for versions 3.05 and 4.00.
With pdfimages (part of xpdfbin) and/or ImageMagick and pdftk you can handle all the necessary jobs before and after the OCR process.

by **reinaldocrespo** » Fri Mar 03, 2017 1:59 pm

Hi Frank;

I'm currently using the Tesseract API from xHarbour. I'm processing thousands of scanned .tif documents. Results are about 80% accurate, and I need better than that. For Tesseract to be more accurate on the type of documents I'm OCRing, I would need to change the PSM (page segmentation mode) to 3, which is the default from the command line. Changing the PSM to 3 from the API causes the OCR engine to break with a runtime error. It might work a few times, but after a number of runs it breaks, causing my Harbour program to stop working.

Just FYI: these documents contain a unique identifier that matches a customer's account number in the database. In this way the documents are automatically indexed and saved into the customer's file without human intervention. Thousands of documents are fed into a commercial scanner each day, and they end up stored in a blob field with the customer's account number in another indexed char field. 80% accuracy means that 20% of the account numbers weren't read, and thus we need a human to open these documents and attach them to the correct customer.
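The step of pulling the identifier out of the recognized text could be sketched in C roughly as follows. This is only an illustration: the thread does not say what the account number looks like, so the sketch assumes it is a run of exactly `digits` decimal digits, and `find_account_number` is a hypothetical helper name, not part of Tesseract or EZTwain.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Find the first run of exactly `digits` consecutive decimal digits in
 * `text` and copy it into `out` (which must hold at least digits + 1 bytes).
 * Returns true on success, false when no such run exists.
 * ASSUMPTION: the account number is a fixed-length digit run; adapt the
 * rule to the real identifier format.
 */
bool find_account_number(const char *text, size_t digits, char *out)
{
    size_t len = strlen(text);
    size_t i = 0;

    while (i < len) {
        if (!isdigit((unsigned char)text[i])) {
            i++;
            continue;
        }
        /* Measure the whole digit run starting at position i. */
        size_t start = i;
        while (i < len && isdigit((unsigned char)text[i]))
            i++;
        if (i - start == digits) {
            memcpy(out, text + start, digits);
            out[digits] = '\0';
            return true;
        }
    }
    return false;
}
```

Requiring the exact length means a run that is too short or too long, as often happens when a digit is dropped or split by the OCR, is rejected and the page can be routed to manual indexing instead of being filed under the wrong customer.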

If you are interested in how to use the Tesseract API from (x)Harbour, I will gladly provide source samples for you to try. I'd love to solve the problem of not being able to change the PSM to 3 for more accuracy with my documents. Maybe you can help.

Reinaldo.

by **Gale FORd** » Fri Mar 03, 2017 5:16 pm

I would like to see how you do it.
Is the identifier in a certain area of the documents? In the past I have cut out that area of the document and performed OCR on just that area.
That needs much less memory and produces far fewer invalid text conversions.
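Before cutting out a zone it helps to make sure the rectangle really lies inside the page, since scans are not always the same size. A minimal C sketch of that check, assuming pixel coordinates with the origin at the top-left corner (the `Zone` type and `clip_zone` function are hypothetical helpers, not EZTwain calls):

```c
#include <stdbool.h>

/* A rectangular zone in pixel coordinates (hypothetical helper type). */
typedef struct {
    int x, y;          /* top-left corner */
    int width, height; /* size in pixels  */
} Zone;

/*
 * Clip `zone` so that it lies entirely inside an image of
 * img_w x img_h pixels. Returns false when no part of the zone is inside
 * the image, in which case the caller should skip OCR for that page.
 */
bool clip_zone(Zone *zone, int img_w, int img_h)
{
    int left   = zone->x < 0 ? 0 : zone->x;
    int top    = zone->y < 0 ? 0 : zone->y;
    int right  = zone->x + zone->width;
    int bottom = zone->y + zone->height;

    if (right > img_w)
        right = img_w;
    if (bottom > img_h)
        bottom = img_h;

    if (right <= left || bottom <= top)
        return false;

    zone->x = left;
    zone->y = top;
    zone->width = right - left;
    zone->height = bottom - top;
    return true;
}
```

The clipped values would then be the ones passed on to whatever region-copy or zone-OCR call is in use.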

by **frose** » Mon Mar 06, 2017 8:03 am

Hi Reinaldo,

here are my experiences with Tesseract so far; perhaps they help in some cases:

bins

https://github.com/UB-Mannheim/tesseract/wiki

http://stackoverflow.com/questions/5083492/tesseract-and-tiff-format-spp-not-in-set-1-3

on the fly

Assuming that the API is using the installed Tesseract version, there should be no difference between using the API and the CL (command line), unless there is a bug somewhere in the API :wink:

Frank

by **reinaldocrespo** » Mon Mar 06, 2017 4:25 pm

Try using Tesseract with a PSM of 6 and then compare the results with PSM 3. Use different types of documents and mix images with text. You will see the difference in the results. I cannot control the type of documents being fed to Tesseract. Scanning is an automated procedure done on a large scale. These documents happen to be .tiff files; again, no choice, I just have to deal with .tiff files.

When using the API and setting the PSM to 3, Tesseract breaks. I've tried versions 3, 3.02, 3.03, 3.04, and 3.05. Same results.

Here is how to use the API:

```
handle := TessBaseAPICreate()

//abort if english traindata file can't be found locally.
IF TessBaseAPIInit3( handle, NIL, "eng" ) != 0
   RETURN NIL
ENDIF

...

//page segmentation mode can be set via API call TessBaseAPISetPageSegMode(), or by
//setting variable "tessedit_pageseg_mode", or by reading from config file. Possible values:
//1 -Automatic page segmentation with OSD
//3 -Fully automatic page segmentation, but no OSD (command-line default)
//TessBaseAPIReadConfigFile( handle, "tessapi_config" )
//TessBaseAPISetVariable( handle, "tessedit_pageseg_mode", "3" )
TessBaseAPISetPageSegMode( handle, 3 )

//print all tesseract ocr engine internal variables to file tesseract.log on cur dir.
IF lDebug ; TessBaseAPIPrintVariablesToFile( handle, "tesseract.log" ) ;ENDIF

//Open input image with leptonica library API pixRead
IF lDebug ; logfile( "trace.log", { "pixread file", cfile } ) ;ENDIF
img := pixRead( ALLTRIM( cPath ) + cFile )

IF lDebug ; logfile( "trace.log", { "TessBaseAPISetImage2", cfile } ) ;ENDIF
TessBaseAPISetImage2( handle, img )

//Recognize is called from GetUTF8Text but it doesn't hurt to call before and
//makes debugging easier. Program freezes when executing TessBaseAPIRecognize() only
//when PageSegMode is changed above.
IF lDebug ; logfile( "trace.log", { "TessBaseAPIRecognize ", cfile } ) ;ENDIF

//program freezes here but only when pageSeg_Mode is changed.
IF TessBaseAPIRecognize( handle, Nil ) <> 0 ; LOOP ;ENDIF

//if TessBaseAPIRecognize above is commented then program will freeze when executing
//TessBaseAPIGetUTF8Text(). Recognize is called internally from GetUTF8Text so we know the
//problem is at Recognize.
IF lDebug ; logfile( "trace.log", { "TessBaseAPIGetUTF8Text", cfile } ) ;ENDIF
cText := STRTRAN( TessBaseAPIGetUTF8Text( handle ), CHR( 10 ), CRLF )

...

TessDeleteText( cText )
pixDestroy( img )

...

TessBaseAPIEnd( handle )
TessBaseAPIDelete( handle )
```
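For reference, the numeric PSM values discussed here map onto Tesseract's `PageSegMode` enum. The small C lookup below lists values 0 to 10 as documented for the 3.0x command line; it is a standalone sketch for logging or debugging (the `psm_description` name is made up for this example) and does not call Tesseract itself. Later releases add further modes that are not listed.

```c
#include <stddef.h>

/*
 * Descriptions of Tesseract page segmentation modes 0-10, indexed by the
 * numeric value passed to TessBaseAPISetPageSegMode() or to -psm on the
 * command line.
 */
static const char *const PSM_DESCRIPTIONS[] = {
    "PSM_OSD_ONLY: orientation and script detection (OSD) only",
    "PSM_AUTO_OSD: automatic page segmentation with OSD",
    "PSM_AUTO_ONLY: automatic page segmentation, but no OSD, or OCR",
    "PSM_AUTO: fully automatic page segmentation, but no OSD",
    "PSM_SINGLE_COLUMN: a single column of text of variable sizes",
    "PSM_SINGLE_BLOCK_VERT_TEXT: a single uniform block of vertically aligned text",
    "PSM_SINGLE_BLOCK: a single uniform block of text",
    "PSM_SINGLE_LINE: a single text line",
    "PSM_SINGLE_WORD: a single word",
    "PSM_CIRCLE_WORD: a single word in a circle",
    "PSM_SINGLE_CHAR: a single character"
};

/* Returns the description for `mode`, or NULL when it is not in the table. */
const char *psm_description(int mode)
{
    int count = (int)(sizeof PSM_DESCRIPTIONS / sizeof PSM_DESCRIPTIONS[0]);

    if (mode < 0 || mode >= count)
        return NULL;
    return PSM_DESCRIPTIONS[mode];
}
```

Mode 3 is what the command-line tool uses by default. As far as I know, a freshly initialized API handle defaults to mode 6 instead, which would explain why the same image can give different results from the command line and from the API.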

It would be nice if we could debug and fix the problem with Tesseract PSM 3 from API. I'm at a point where I'm considering placing the order for a commercial OCR engine called Transym to replace Tesseract. My customer demands more accurate OCR and for the type of document they are processing I haven't been able to get Tesseract to do any better.

Reinaldo.

by **frose** » Tue Mar 07, 2017 8:29 am

Hi Reinaldo,

but you can convert the TIFFs to PNG, JPG, or another picture format before OCRing!?
For this you can use ImageMagick like this:

Frank

by **reinaldocrespo** » Tue Mar 07, 2017 3:06 pm

It would make a lot more sense to fix or find how to use Tesseract API, don't you think?

Reinaldo.

by **frose** » Wed Mar 08, 2017 7:24 am

Hi Reinaldo,

not for me, I don't want to deal with those big, obsolete multi-page picture file monsters. :wink:

My standard viewer can't even show me what is in those files.

On the other hand, I need no API. I have had good experiences with CL tools such as GhostScript, calibre (epub-convert/epub-meta), ImageMagick, NirCmd, SumatraPDF, DXFView, image2pdf, PDFtk, xpdfbin (pdftotext, pdfimages,...). Oops, quite a lot, I wasn't aware of it.

So, my recommendation is: for test purposes, split one of the affected TIFFs, OCR (with the API) all the single-page TIFFs, PNGs or PPM/PBMs (build one searchable PDF) and compare the results. Then decide on the next steps.
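One way to put a number on "compare the results" is the edit distance between the text from two OCR runs of the same page. The C sketch below uses the classic Levenshtein distance; that metric is my own suggestion for illustration and is not something either Tesseract or the tools above provide, and `edit_distance` is a hypothetical helper name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Levenshtein edit distance between two strings: the minimum number of
 * single-character insertions, deletions and substitutions needed to turn
 * `a` into `b`. Uses one row of the DP table at a time.
 * Returns -1 if memory cannot be allocated.
 */
long edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);

    if (row == NULL)
        return -1;

    /* Distance from the empty prefix of `a` to each prefix of `b`. */
    for (size_t j = 0; j <= lb; j++)
        row[j] = j;

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0]; /* value of the cell up and to the left */
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t up = row[j]; /* value of the cell directly above */
            size_t best = diag + (a[i - 1] == b[j - 1] ? 0 : 1);

            if (up + 1 < best)
                best = up + 1;
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }

    long result = (long)row[lb];
    free(row);
    return result;
}
```

A distance of 0 means both runs produced identical text; a small distance on the line that holds the identifier usually points at a single misread character, such as the letter O in place of a zero.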

Good luck, Frank

OCR for scanned documents
