OCR for scanned documents


Postby reinaldocrespo » Mon Aug 04, 2014 1:29 pm

Hello everyone;

I wonder if anyone here has any experience implementing an OCR engine for their scanned documents. I've been using the free DOSADI EzTw32.dll lib to scan documents. I now need to add OCR.

There is an open-source C++ OCR library maintained by Google called Tesseract; it seems popular among Visual Studio developers in the Windows world. http://code.google.com/p/tesseract-ocr/ ... eract_3.0x

Can someone share their experience or any ideas on how to use it from (x)Harbour?

Best regards,



Reinaldo.
reinaldocrespo
 
Posts: 979
Joined: Thu Nov 17, 2005 5:49 pm
Location: Fort Lauderdale, FL

Re: OCR for scanned documents

Postby Gale FORd » Mon Aug 04, 2014 2:06 pm

EZTwain has built-in support for Transym OCR Software's TOCR. I have used it and it is pretty straightforward, especially if you are already using EZTwain.
It is not free so that might be a problem.
Gale FORd
 
Posts: 663
Joined: Mon Dec 05, 2005 11:22 pm
Location: Houston

Re: OCR for scanned documents

Postby reinaldocrespo » Tue Aug 05, 2014 1:02 pm

Thank you for the feedback, Gale. It looks like an alternative and the price isn't bad if it works. Do you have any code that you can share?

Reinaldo.

Re: OCR for scanned documents

Postby Gale FORd » Tue Aug 05, 2014 3:56 pm

Actually I have a complete test scan app on the fivewin contributions section.
Here is the link: https://code.google.com/p/fivewin-contributions/downloads/detail?name=testscancomplete.zip&can=2&q=gale

One of the things it can do is let you rubber-band an area on screen and have it OCR that area to text.
You will need a valid license for it to work, either the demo version or a purchased one.
Here is a simpler sample:
Code:

#define EZOCR_ENGINE_NONE     0 // ‘null’ OCR engine - turns off OCR.
#define EZOCR_ENGINE_TRANSYM  1 // TOCR engine by Transym Ltd.

static cDirectory := ".\"   // folder holding the .tif files, with trailing backslash; adjust as needed
static aFiles
static lRunBarcode
static lRunOCR
// Change the values below to designate area to convert/OCR/Barcode
static nAreaStartX   := 1434
static nAreaStartY   := 369
static nAreaWidth    := 578
static nAreaHeight   := 72
static nAreaFill     := -1

function testscan
   local lIsAvailable, nResult
   local nLen

   lRunOcr := .t.
   lRunBarcode := .f.

   lIsAvailable := TW_Avail()
   if lIsAvailable
      // Replace license information with your own information!!!
      TW_UniversalLicense( 'My Company', nLicenseNo )
   endif
   nResult := TW_LoadSourceManager()
   if nResult = -1
      ? 'Error loading source manager'
      wait
      quit
   endif
   nLen := adir( cDirectory+"*.tif" )
   if nLen = 0
      ? "No tif files to check"
      quit
   endif
   aFiles := array( nLen )
   adir( cDirectory+"*.tif", aFiles )
   gdf_barcode()
return nil

function gdf_barcode()
   local nReturn, hDib, nCounter, cTestFile
   local hDibOcr
   local cBarValue
   local nOCREngine
   local cFilename
   local nCountFiles
   local cText, nNoBarCodes

   if .not. lRunOcr .and. .not. lRunBarcode
      ? 'No files processed'
      return nil
   endif
   for nCountFiles := 1 to len( aFiles )
      cFileName := cDirectory+aFiles[ nCountFiles ]
      if empty( cFileName ) .or. !file( cFileName )
         loop   // skip a missing file instead of aborting the whole batch
      endif
      hDib := TW_LOADFROMFILENAME( cFileName )
      tracelog( cFileName, TW_DibWidth(hDib), TW_DibHeight(hDib) )
      cBarValue := ''
      ? '------------------------- '+aFiles[ nCountFiles ]
      if lRunOCR .and. hDib > 0
         if .not. TW_OCR_SELECTENGINE(EZOCR_ENGINE_TRANSYM) > 0
            ? "Dosadi OCR component is not installed"
         else
            //? 'OCR Engine available'
            nOCREngine := TW_OCR_SELECTEDENGINE()
            //? TW_OCR_ENGINENAME( nOCREngine )
            nReturn := 0
            hDibOcr := TW_DIBREGIONCOPY( hDib, ;
               nAreaStartX, ;
               nAreaStartY, ;
               nAreaWidth , ;
               nAreaHeight, ;
               nAreaFill   )
            nReturn := TW_OCR_RECOGNIZEDIB( hDibOcr )
            //nReturn := TW_OCR_RECOGNIZEDIBZONE( hDib, 100, 100, 578, 72 )
            //nReturn := TW_OCR_RECOGNIZEDIBZONE( hDib, 1434, 369, 578, 72 )
            cText := ''
            do case
               case nReturn > 0
                  cText := TW_OCR_Text()
               case nReturn = -1
                  ? 'OCR Services or selected engine not available'
               case nReturn = -3
                  ? 'Image handle is null or invalid'
               case nReturn = -5
                  ? 'Internal error or OCR engine returned error'
            endcase
            ? 'OCR Text: '
            if .not. empty( cText )
               //memowrit( 'test.txt', cText )
               /*
               ? at( chr(12), cText )
               ? asc( substr( cText, 33, 1) )
               ? asc( substr( cText, 34, 1))
               ? asc( substr( cText, 35, 1))
               */

               //? Token( cText, chr(10), 3 )
               //? memoline( cText, 3 )
               ?? ctext
            endif
            TW_Free( hDibOcr )
         endif
      endif
      if lRunBarcode .and. hDib > 0
         if .not. TW_Barcode_Avail()
             ? "Dosadi barcode component is not installed"
         else
            nNoBarCodes := Tw_BARCODE_RECOGNIZE(hDib,-1,-1)
            if nNoBarCodes > 0
               for nCounter := 1 to nNoBarCodes
                  ? 'Barcode'+str(nCounter,1)+": "+Tw_BARCODE_TEXT( nCounter-1 )
               next
                //::cBarValue := ::oScanner:BC_Text(0)
                //::cBarValue := alltrim(::cBarValue)
            else
                ? ' Barcode: None'
            endif
         endif
      endif
      if hDib > 0
         TW_Free( hDib )
      endif
   next
   hDib := nil
return nil

#pragma BEGINDUMP

 #include <windows.h>
 #include "eztwain.h"
 #include "hbapi.h"


/*--------- Top-Level Calls -------------------------------------*/

 HB_FUNC( TW_UNIVERSALLICENSE )
 {
 TWAIN_UniversalLicense( hb_parc(1), hb_parni(2) );
 hb_ret();
 }

 HB_FUNC( TW_ACQUIRE )  // hWnd
 {
  hb_retnl( ( LONG )TWAIN_Acquire( ( HWND ) hb_parnl( 1 ) ) );
 }

 HB_FUNC( TW_FREE )     // hDib
 {
  DIB_Free( ( HANDLE ) hb_parnl( 1 ) );
  hb_ret();
 }

 HB_FUNC( TW_SELECTIMAGESOURCE )  // hWnd
 {
  hb_retni( TWAIN_SelectImageSource( ( HWND ) hb_parnl( 1 ) ) );
 }

 HB_FUNC( TW_ACQUIRETOCLIPBOARD )  // hWnd, nPixTypes
 {
  hb_retl( TWAIN_AcquireToClipboard( ( HWND ) hb_parnl( 1 ), (unsigned) hb_parni(2) ) );
 }

 HB_FUNC( TW_ACQUIREMEMORY )  // hWnd
 {
  hb_retnl( ( LONG )TWAIN_AcquireMemory( ( HWND ) hb_parnl( 1 ) ) );
 }

 HB_FUNC( TW_ACQUIRETOFILENAME )  // hWnd, cFileName
 {
  hb_retni( TWAIN_AcquireToFilename( ( HWND ) hb_parnl( 1 ), hb_parc( 2 ) ) );
 }
 HB_FUNC( TW_ACQUIREMULTIPAGEFILE )  // hWnd, cFileName
 {
  hb_retni( TWAIN_AcquireMultipageFile( ( HWND ) hb_parnl( 1 ), hb_parc( 2 ) ) );
 }


 HB_FUNC( TW_ACQUIREFILE )  // hWnd, nFF, cFileName
 {
  hb_retni( TWAIN_AcquireFile( ( HWND ) hb_parnl( 1 ), hb_parni( 2 ) ,hb_parc( 3 ) ) );
 }


//--------- Basic TWAIN Inquiries

 HB_FUNC( TW_AVAIL )
 {
  hb_retl( TWAIN_IsAvailable()  );
 }


 HB_FUNC( TW_EASYVERSION)
 {
  hb_retni( TWAIN_EasyVersion()  );
 }

 HB_FUNC( TW_STATE )
 {
  hb_retni( TWAIN_State() );
 }

 HB_FUNC( TW_SOURCENAME )
 {
  hb_retc( TWAIN_SourceName() );
 }

 HB_FUNC( TW_GETSOURCENAME )   // pzName
 {
  TWAIN_GetSourceName( (LPSTR) hb_parc( 1 ) );
  hb_ret();
 }


//--------- DIB handling utilities ---------

 // The three writers below must return their result with hb_retni();
 // hb_parni() only reads a parameter, so the result was being discarded.
 HB_FUNC( TW_DIBWJPG ) // hDib, cName
 {
  hb_retni( DIB_WriteToJpeg( ( HANDLE ) hb_parnl(1), hb_parc( 2 ) ) );
 }

 HB_FUNC( TW_DIBWBMP ) // hDib, cName
 {
  hb_retni( DIB_WriteToBmp( ( HANDLE ) hb_parnl(1), hb_parc( 2 ) ) );
 }

 HB_FUNC( TW_DIBTOFILENAME ) // hDib, cName
 {
  hb_retni( TWAIN_WriteToFilename( ( HANDLE ) hb_parnl(1), hb_parc( 2 ) ) );
 }



//--------- File Read/Write

 HB_FUNC( TW_ISJPG )
 {
  hb_retl( TWAIN_IsJpegAvailable() );
 }

 HB_FUNC( TW_SETSAVEFORMAT )
 {
  hb_retni( TWAIN_SetSaveFormat( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_GETSAVEFORMAT )
 {
  hb_retni( TWAIN_GetSaveFormat() );
 }

 HB_FUNC( TW_SETJPEGQUALITY ) // nQuality 1...100
 {
  TWAIN_SetJpegQuality( hb_parni( 1 ) );
  hb_ret();
 }

 HB_FUNC( TW_GETJPEGQUALITY )
 {
  hb_retni( TWAIN_GetJpegQuality() );
 }

 HB_FUNC( TW_WRITENATIVETOFILENAME )
 {
  hb_retni( TWAIN_WriteNativeToFilename( (HANDLE) hb_parnl(1), hb_parc(2) ));
 }

 HB_FUNC( TW_LOADNATIVEFROMFILENAME )
 {
  hb_retnl( (LONG) TWAIN_LoadNativeFromFilename( hb_parc( 1 ) ) );
 }
 HB_FUNC( TW_LOADFROMFILENAME )
 {
  hb_retnl( (LONG) DIB_LoadFromFilename( hb_parc( 1 ) ) );
 }
 HB_FUNC( TW_DIBSCALEDCOPY ) // hDib, nWidth, nHeight
 {
  hb_retnl( (LONG) DIB_ScaledCopy( ( HANDLE ) hb_parnl( 1 ), hb_parni( 2 ), hb_parni( 3 ) ) );
 }
 HB_FUNC( TW_DIBCOPY ) // hDib
 {
  hb_retnl( (LONG) DIB_Copy( ( HANDLE )  hb_parnl( 1 )) );
 }
 HB_FUNC( TW_DIBWIDTH ) // hDib
 {
  // hb_retni(), not hb_parni(): the width has to be returned to the caller
  hb_retni( DIB_Width( ( HANDLE ) hb_parnl(1) ) );
 }
 HB_FUNC( TW_DIBHEIGHT )
 {
  hb_retni( DIB_Height( (HANDLE) hb_parnl(1) ));
 }



//--------- Global Options ----------------------------------------------
 HB_FUNC( TW_SETMULTITRANSFER )
 {
  TWAIN_SetMultiTransfer( hb_parni( 1 ) );
  hb_ret();
 }

 HB_FUNC( TW_GETMULTITRANSFER )
 {
  hb_retni( TWAIN_GetMultiTransfer() );
 }

 HB_FUNC( TW_SETHIDEUI  ) // nHide
 {
  TWAIN_SetHideUI( hb_parni( 1) );
  hb_ret();
 }

 HB_FUNC( TW_GETHIDEUI  )
 {
  hb_retni( TWAIN_GetHideUI() );
 }


 HB_FUNC( TW_DISABLEPARENT )
 {
  TWAIN_DisableParent( hb_parni( 1 ) );
  hb_ret();
 }

 HB_FUNC( TW_GETDISABLEPARENT )
 {
  hb_retni( TWAIN_GetDisableParent() );
 }

 HB_FUNC( TW_REGISTERAPP )
 {

 TWAIN_RegisterApp( hb_parni(1),hb_parni(2),hb_parni(3),hb_parni(4),
                    hb_parc(5), hb_parc(6),
                    hb_parc(7), hb_parc(8) );
 hb_ret();
 }

 HB_FUNC( TW_SETAPPTITLE )
 {
  TWAIN_SetAppTitle( hb_parc( 1 ) );
  hb_ret();
 }

 HB_FUNC( TW_SETAPPLICATIONKEY )
 {
  TWAIN_SetApplicationKey( hb_parni( 1 ) );
  hb_ret();
 }




//--------- TWAIN State Control ---------------------------------------

 HB_FUNC( TW_LOADSOURCEMANAGER )
 {
  hb_retni( TWAIN_LoadSourceManager() );
 }

 HB_FUNC( TW_OPENSOURCEMANAGER )  // hWnd
 {
  hb_retni( TWAIN_OpenSourceManager( ( HWND ) hb_parnl( 1 ) ) );
 }

 HB_FUNC( TW_OPENDEFAULTSOURCE )
 {
  hb_retni( TWAIN_OpenDefaultSource() );
 }

 HB_FUNC( TW_GETSOURCELIST )
 {
  hb_retni( TWAIN_GetSourceList() );
 }

 HB_FUNC( TW_GETNEXTSOURCENAME )
 {
  hb_retni( TWAIN_GetNextSourceName( ( char * ) hb_parc( 1 ) ) );
 }

 HB_FUNC( TW_GETDEFAULTSOURCENAME )
 {
  hb_retni( TWAIN_GetDefaultSourceName( ( char * ) hb_parc( 1 ) ));
 }

 HB_FUNC( TW_OPENSOURCE )
 {
  hb_retni( TWAIN_OpenSource( hb_parc( 1 ) ) );
 }

 HB_FUNC( TW_ENABLESOURCE )       // hWnd
 {
  hb_retni( TWAIN_EnableSource( ( HWND ) hb_parnl( 1 ) ) );
 }

 HB_FUNC( TW_DISABLESOURCE )
 {
  hb_retni( TWAIN_DisableSource( ) );

 }

 HB_FUNC( TW_CLOSESOURCE )
 {
  hb_retni( TWAIN_CloseSource() );
 }

 HB_FUNC( TW_CLOSESOURCEMANAGER )
 {
  hb_retni( TWAIN_CloseSourceManager( (HWND) hb_parnl( 1 ) ) );
 }

 HB_FUNC( TW_UNLOADSOURCEMANEGER )
 {
  hb_retni( TWAIN_UnloadSourceManager() );
 }



//--------- High-level Capability Negotiation Functions --------------
// These functions should only be called in State 4 (TWAIN_SOURCE_OPEN)

 HB_FUNC( TW_GETCURRENTUNITS )
 {
  hb_retni( TWAIN_GetCurrentUnits() );
 }

 HB_FUNC( TW_SETCURRENTUNITS ) // nUnits
 {
  hb_retni( TWAIN_SetCurrentUnits( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_GETBITDEPTH )
 {
  hb_retni( TWAIN_GetBitDepth() );
 }

 HB_FUNC( TW_SETBITDEPTH )
 {
  hb_retni( TWAIN_SetBitDepth( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_GETPIXELTYPE )
 {
  hb_retni( TWAIN_GetPixelType() );
 }

 HB_FUNC( TW_SETCURRENTPIXELTYPE )  // nBits
 {
  hb_retni( TWAIN_SetCurrentPixelType( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_GETCURRENTRESOLUTION )
 {
  hb_retnd( TWAIN_GetCurrentResolution());
 }

 HB_FUNC( TW_GETYRESOLUTION )
 {
  hb_retnd( TWAIN_GetYResolution());
 }

 HB_FUNC( TW_SETCURRENTRESOLUTION )  // dRes
 {
  hb_retni( TWAIN_SetCurrentResolution( hb_parnd( 1 ) ) );
 }

 HB_FUNC( TW_SETXRESOLUTION )
 {
  hb_retni( TWAIN_SetXResolution( hb_parnd( 1 ) ) );
 }

 HB_FUNC( TW_SETYRESOLUTION )
 {
  hb_retni( TWAIN_SetYResolution( hb_parnd( 1 ) ) );
 }

 HB_FUNC( TW_SETCONTRATS ) //dCon
 {
  hb_retni( TWAIN_SetContrast( hb_parnd( 1 ) ) ); // -1000....+1000
 }

 HB_FUNC( TW_SETBRIGHTNESS ) //dBri
 {
  hb_retni( TWAIN_SetBrightness( hb_parnd( 1 ) ) ); // -1000....+1000
  }

 HB_FUNC( TW_SETTHRESHOLD )
 {
  hb_retni( TWAIN_SetThreshold( hb_parnd( 1 ) ) );
 }

 HB_FUNC( TW_GETCURRENTTHRESHOLD )
 {
  hb_retnd( TWAIN_GetCurrentThreshold() );
 }

 HB_FUNC( TW_SETXFERMECH )
 {
  hb_retni( TWAIN_SetXferMech( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_XFERMECH )
 {
  hb_retni( TWAIN_XferMech() );
 }

 HB_FUNC( TW_SUPPORTSFILEXFER )
 {
  hb_retni( TWAIN_SupportsFileXfer() );
 }

 HB_FUNC( TW_SETPAPERSIZE ) // nTypePaper
 {
  hb_retni( TWAIN_SetPaperSize( hb_parni( 1 ) ) );
 }

//-------- Document Feeder ---------------------------------

 HB_FUNC( TW_HASFEEDER )
 {
  hb_retl( TWAIN_HasFeeder() );
 }

 HB_FUNC( TW_ISFEEDERSELECTED )
 {
  hb_retl( TWAIN_IsFeederSelected() );
 }

 HB_FUNC( TW_SELECTFEEDER )
 {
  hb_retl( TWAIN_SelectFeeder( hb_parl( 1 ) ) );
 }

 HB_FUNC( TW_ISAUTOFEEDON )
 {
  hb_retl( TWAIN_IsAutoFeedOn() );
 }

 HB_FUNC( TW_SETAUTOFEDD )
 {
  hb_retl( TWAIN_SetAutoFeed( hb_parl( 1 ) ) );
 }

 HB_FUNC( TW_ISFEEDERLOADED )
 {
  hb_retl( TWAIN_IsFeederLoaded() );
 }

//-------- Duplex Scanning ------------------------------------------
 HB_FUNC( TW_GETDUPLEXSUPPORT )
 {
  hb_retni( TWAIN_GetDuplexSupport() );
 }

 HB_FUNC( TW_ENABLEDUPLEX )
 {
  hb_retni( TWAIN_EnableDuplex( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_ISDUPLEXENABLED )
 {
  hb_retl( TWAIN_IsDuplexEnabled() );
 }

//--------- Other 'exotic' capabilities --------

 HB_FUNC( TW_HASCONTROLLABLEUI )
 {
  hb_retni( TWAIN_HasControllableUI() );
 }

 HB_FUNC( TW_SETINDICATORS )
 {
  hb_retni( TWAIN_SetIndicators( hb_parl( 1 ) ) );
 }

 HB_FUNC( TW_COMPRESSION )
 {
  hb_retni( TWAIN_Compression() );
 }

 HB_FUNC( TW_SETCOMPRESSION )
 {
  hb_retni( TWAIN_SetCompression( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_TILED )
 {
  hb_retl( TWAIN_Tiled() );
 }

 HB_FUNC( TW_SETTILED )
 {
  hb_retni( TWAIN_SetTiled( hb_parl( 1 ) ) );
 }

 HB_FUNC( TW_PLANARCHUNKY )
 {
  hb_retni( TWAIN_PlanarChunky() );
 }

 HB_FUNC( TW_SETPLANARCHUNKY )
 {
  hb_retni( TWAIN_SetPlanarChunky( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_PIXELFLAVOR )
 {
  hb_retni( TWAIN_PixelFlavor() );
 }

 HB_FUNC( TW_SETPIXELFLAVOR )
 {
  hb_retni( TWAIN_SetPixelFlavor( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_SETLIGHTPATH )
 {
  hb_retni( TWAIN_SetLightPath( hb_parl( 1 ) ) );
 }

 HB_FUNC( TW_SETAUTOBRIGHT )
 {
  hb_retni( TWAIN_SetAutoBright( hb_parl( 1 ) ) );
 }

 HB_FUNC( TW_SETGAMMA )
 {
  hb_retni( TWAIN_SetGamma( hb_parnd( 1 ) ) );
 }

 HB_FUNC( TW_SETSHADOW )
 {
  hb_retni( TWAIN_SetShadow( hb_parnd( 1 ) ) );
 }

 HB_FUNC( TW_SETHIGHLIGHT )
 {
  hb_retni( TWAIN_SetHighlight( hb_parnd( 1 ) ) );
 }

//--------- Barcode --------

 HB_FUNC( TW_BARCODE_AVAIL )
 {
  hb_retl( BARCODE_IsAvailable()  );
 }

 HB_FUNC( TW_BARCODE_SELENGINE )
 {
  hb_retni( BARCODE_SelectedEngine()  );
 }

 HB_FUNC( TW_BARCODE_TEXT )
 {
  hb_retc( BARCODE_Text( hb_parni( 1 ) ));
 }

 HB_FUNC( TW_BARCODE_TYPE )
 {
  hb_retni( BARCODE_Type( hb_parni( 1 ) ));
 }

 HB_FUNC( TW_BARCODE_GETTEXT )
 {
  BARCODE_GetText( hb_parni( 1 ), (LPSTR) hb_parc(2) );
  hb_ret();
 }


 HB_FUNC( TW_BARCODE_RECOGNIZE )
 {
  hb_retni( BARCODE_Recognize( (HANDLE) hb_parnl(1),hb_parni( 2 ),hb_parni( 3 ) ) );
 }

//--------- OCR     --------

 HB_FUNC( TW_OCR_SELECTENGINE ) // nEngine
 {
  hb_retni( OCR_SelectEngine( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_OCR_SELECTEDENGINE )
 {
  hb_retni( OCR_SelectedEngine() );
 }

 HB_FUNC( TW_OCR_ENGINENAME )
 {
  hb_retc( OCR_EngineName( hb_parni( 1 ) ) );
 }

 HB_FUNC( TW_OCR_TEXT )
 {
  hb_retc( OCR_Text() );
 }

 HB_FUNC( TW_OCR_RECOGNIZEDIB ) // hDib
 {
  hb_retni( OCR_RecognizeDib( (HANDLE) hb_parnl(1) ) );
 }
/*
 HB_FUNC( TW_OCR_RECOGNIZEDIBZONE ) // hDib, Left, Top, Width, Height
 {
  hb_retni( OCR_RecognizeDibZone( (HANDLE) hb_parnl(1), hb_parni( 2 ), hb_parni( 3 ), hb_parni( 4 ), hb_parni( 5 ) ) );
 }
 */


 HB_FUNC( TW_DIBREGIONCOPY ) // hDib, Left, Top, Width, Height, Fill
 {
  hb_retnl( (LONG) DIB_RegionCopy( (HANDLE) hb_parnl(1), hb_parni( 2 ), hb_parni( 3 ), hb_parni( 4 ), hb_parni( 5 ), hb_parni( 6 ) ) );
 }

//--------- Image Layout (Region of Interest) --------


 HB_FUNC( TW_SETIMAGELAYOUT )   // left, top, right, bottom
 {
  hb_retni( TWAIN_SetImageLayout( hb_parnd( 1 ),hb_parnd( 2 ),hb_parnd( 3 ),hb_parnd( 4 ) ) );
 }


 HB_FUNC( TW_RESETIMAGELAYOUT )
 {
  TWAIN_ResetImageLayout();
  hb_ret();
 }

 HB_FUNC( TW_GETIMAGELAYOUT )
 {
  double L,T,R,B;
  int nRet;

  nRet = TWAIN_GetImageLayout( &L,&T,&R,&B );

  hb_stornd( L, 1 );
  hb_stornd( T, 2 );
  hb_stornd( R, 3 );
  hb_stornd( B, 4 );
  hb_retni( nRet );
 }

 HB_FUNC( TW_GETDEFAULTIMAGELAYOUT )
 {
  double L,T,R,B;
  int nRet;

  nRet = TWAIN_GetDefaultImageLayout( &L, &T, &R, &B);

  hb_stornd( L, 1 );
  hb_stornd( T, 2 );
  hb_stornd( R, 3 );
  hb_stornd( B, 4 );
  hb_retni( nRet );
 }

//HANDLE EZTAPI DIB_GetFromClipboard(void);
HB_FUNC( TW_DIB_GETFROMCLIPBOARD )
{
 hb_retnl( (LONG) DIB_GetFromClipboard() );
}

// so as not to depend on FiveWin, and so Harbour can work on its own ;)
 HB_FUNC ( GETACTIVEWINDOW )
 {
  hb_retnl( ( LONG ) GetActiveWindow() );
 }

 HB_FUNC( TW_SETPIXELTYPE )
 {
  hb_retni( TWAIN_SetPixelType( hb_parni( 1 ) ) );
 }

#pragma ENDDUMP

 

Re: OCR for scanned documents

Postby reinaldocrespo » Tue Mar 03, 2015 10:54 pm

Hey Gale;

Don't you have to purchase Dosadi EzTwain Pro ($999) for the OCR to work?

For example, the function TW_OCR_SELECTENGINE(EZOCR_ENGINE_TRANSYM) is not in the free version of the API. Am I missing something here?

Thank you for your help,


Reinaldo.

Re: OCR for scanned documents

Postby reinaldocrespo » Thu Mar 02, 2017 7:46 pm

Gale FORd wrote: EZTwain has built-in support for Transym OCR Software's TOCR. I have used it and it is pretty straightforward, especially if you are already using EZTwain.
It is not free so that might be a problem.


Hey Gale;

What exactly do I need to purchase to get OCR? Is it the Transym OCR license or Dosadi pro ver 4?

I'm only interested in extracting text from scanned .tiff pages. The scanning is already done by the time I get the .tiff page images.

Please help.


Reinaldo.

Re: OCR for scanned documents

Postby frose » Fri Mar 03, 2017 8:31 am

I have tested Tesseract from the command line for German, German Fraktur, and English with exciting results. Language data is available for over 100 languages!
With version 3.03 and higher, Tesseract can produce searchable PDFs from images!
On this site https://github.com/UB-Mannheim/tesseract/wiki you'll find Windows binaries for versions 3.05 and 4.00.
With pdfimages (part of xpdfbin), ImageMagick, and PDFtk you can handle all the necessary steps before and after the OCR process.
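
For anyone who wants to drive the command-line tool from their own program, here is a minimal sketch in C (it could sit inside a #pragma BEGINDUMP section) that only assembles the command string for a searchable-PDF run. The function name build_tesseract_pdf_cmd and the file names are made up for illustration. The command form `tesseract <image> <outputbase> -l <lang> pdf` is the usual syntax for version 3.03 and higher, but please check it against the version you have installed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes
 *     tesseract "<image>" "<outbase>" -l <lang> pdf
 * into buf. Returns 0 on success, -1 if buf is too small. */
int build_tesseract_pdf_cmd(char *buf, size_t size,
                            const char *image, const char *outbase,
                            const char *lang)
{
    int n;

    assert(buf != NULL && image != NULL && outbase != NULL && lang != NULL);

    /* Paths are quoted so that names containing spaces survive the shell. */
    n = snprintf(buf, size, "tesseract \"%s\" \"%s\" -l %s pdf",
                 image, outbase, lang);

    /* snprintf reports the length it wanted; treat truncation as an error. */
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string can be handed to system() or to whatever process launcher the application already uses. The sketch does not escape quote characters inside file names, so it assumes the names are under your control.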
Windows 11 Pro 22H2 22621.1848
Microsoft (R) Windows (R) Resource Compiler Version 10.0.10011.16384
Harbour 3.2.0dev (r2008190002)
FWH 23.10 x86
frose
 
Posts: 392
Joined: Tue Mar 10, 2009 11:54 am
Location: Germany, Rietberg

Re: OCR for scanned documents

Postby reinaldocrespo » Fri Mar 03, 2017 1:59 pm

Hi Frank;

I'm currently using the Tesseract API from xHarbour. I'm processing thousands of scanned .tif documents. Results are about 80% accurate, and I need better than that. For Tesseract to be more accurate with the type of documents I'm OCRing, I would need to change the PSM (page segmentation mode) to 3, which is the default from the command line. Changing the PSM to 3 from the API causes the OCR engine to break with a runtime error. It might work a few times, but after a number of runs it breaks, causing my Harbour program to stop working.

Just FYI: these documents contain a unique identifier that matches a customer's account number in the database. In this way the documents are automatically indexed and saved into the customer's file without human intervention. Thousands of documents are fed into a commercial scanner each day, and they end up stored in a blob field, with the customer's account number in another indexed char field. 80% accuracy means that 20% of the account numbers weren't read, and thus we need a human to open these documents and attach them to the correct customer.
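
To illustrate the indexing step described above, here is a rough C sketch that pulls an identifier out of the OCR text. The thread does not say what the account numbers look like, so the rule used here (the first run of exactly N decimal digits) and the name find_account_number are assumptions for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: the account number is the first run of exactly
 * `ndigits` decimal digits in the OCR text. Copies it into `out`
 * (which must hold ndigits + 1 bytes). Returns 1 if found, 0 if not. */
int find_account_number(const char *text, size_t ndigits, char *out)
{
    size_t i = 0, len;

    assert(text != NULL && out != NULL && ndigits > 0);
    len = strlen(text);

    while (i < len) {
        size_t start;

        if (!isdigit((unsigned char)text[i])) {
            i++;
            continue;
        }
        /* Measure the whole digit run so longer numbers are not cut short. */
        start = i;
        while (i < len && isdigit((unsigned char)text[i]))
            i++;
        if (i - start == ndigits) {
            memcpy(out, text + start, ndigits);
            out[ndigits] = '\0';
            return 1;
        }
    }
    return 0;
}
```

A return value of 0 would mark the document for manual filing, which corresponds to the 20% that currently need a human.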

If you are interested in how to use the Tesseract API from (x)Harbour, I will gladly provide source samples for you to try. I'd love to solve the problem of not being able to change the PSM to 3 for more accuracy with my documents. Maybe you can help.


Reinaldo.

Re: OCR for scanned documents

Postby Gale FORd » Fri Mar 03, 2017 5:16 pm

I would like to see how you do it.
Is the identifier in a certain area of the documents? In the past I have cut out that area of the document and performed OCR on just that area.
That needs much less memory and produces far fewer misrecognized characters.

Re: OCR for scanned documents

Postby frose » Mon Mar 06, 2017 8:03 am

Hi Reinaldo,

here are my experiences with Tesseract so far; perhaps they help in some cases:

    - I want direct output of searchable PDF, not only hOCR, which is possible with version 3.03 and higher.
    - I want to run Tesseract on Windows machines, so I was looking for Windows binaries of version 3.03 or higher.
    - I found them here: https://github.com/UB-Mannheim/tesseract/wiki
    - My first and last experience with TIFF was this: 'Error in pixReadFromTiffStream: spp not in set {1,3,4}'
    - So I changed to PNG with good results, as recommended here: http://stackoverflow.com/questions/5083492/tesseract-and-tiff-format-spp-not-in-set-1-3: 'Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.'
    - Later on, I changed the picture splitting to pdfimages, which generates ppm/pbm and optionally jpg pictures on the fly.
    - I got good OCR results, perhaps due to this fact: 'pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.'
    - So there is normally no problem with PDFs coming directly from the scanner, but be careful with scanned and reworked PDFs :roll:
    - To keep the PDF readable, I negate (color inversion) the pbm pictures.
Assuming that the API is using the installed Tesseract version, there should be no difference between using the API and the command line, unless there is a bug somewhere in the API :wink:

Frank

Re: OCR for scanned documents

Postby reinaldocrespo » Mon Mar 06, 2017 4:25 pm

Try using Tesseract with a PSM of 6 and then compare the results with PSM 3. Use different types of documents and mix images with text. You will see the difference in the results. I cannot control the type of documents being fed to Tesseract. Scanning is an automated procedure done on a large scale. These documents happen to be .tiff; again, I have no choice but to deal with .tiff files.

When using the API and setting the PSM to 3, Tesseract breaks. I've tried versions 3, 3.02, 3.03, 3.04, and 3.05, with the same results.
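
For reference, here is a small C lookup of the page segmentation mode numbers being discussed, using the enum names from Tesseract 3.x for modes 0 to 10. Mode 3 (PSM_AUTO) is the command-line default, while the API documentation gives PSM_SINGLE_BLOCK (6) as its default. The helper names psm_name and psm_from_name are made up for this sketch and are not part of Tesseract.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Page segmentation modes 0..10 as numbered in Tesseract 3.x. */
static const char *const psm_names[] = {
    "PSM_OSD_ONLY",               /* 0: orientation and script detection only */
    "PSM_AUTO_OSD",               /* 1: automatic segmentation with OSD */
    "PSM_AUTO_ONLY",              /* 2: automatic segmentation, no OSD or OCR */
    "PSM_AUTO",                   /* 3: fully automatic, no OSD */
    "PSM_SINGLE_COLUMN",          /* 4 */
    "PSM_SINGLE_BLOCK_VERT_TEXT", /* 5 */
    "PSM_SINGLE_BLOCK",           /* 6: single uniform block of text */
    "PSM_SINGLE_LINE",            /* 7 */
    "PSM_SINGLE_WORD",            /* 8 */
    "PSM_CIRCLE_WORD",            /* 9 */
    "PSM_SINGLE_CHAR"             /* 10 */
};

#define PSM_COUNT (sizeof psm_names / sizeof psm_names[0])

/* Returns the enum name for a mode number, or NULL if out of range. */
const char *psm_name(int mode)
{
    if (mode < 0 || (size_t)mode >= PSM_COUNT)
        return NULL;
    return psm_names[mode];
}

/* Returns the mode number for an enum name, or -1 if unknown. */
int psm_from_name(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < PSM_COUNT; i++)
        if (strcmp(psm_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

Logging psm_name() next to the value passed to TessBaseAPISetPageSegMode() makes it easier to confirm which mode a failing run actually used.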

Here is how to use the API:

Code:


   handle := TessBaseAPICreate()

   //abort if english traindata file can't be found locally.
   IF TessBaseAPIInit3( handle, NIL, "eng" ) != 0    
       RETURN NIL
   ENDIF

...
         //page segmentation mode can be set via API call TessBaseAPISetPageSegMode(), or by
         //setting variable "tessedit_pageseg_mode", or by reading from config file. Possible values include:
         //1 - Automatic page segmentation with OSD
         //3 - Fully automatic page segmentation, but no OSD (command-line default)
         //6 - Assume a single uniform block of text (API default)

         //TessBaseAPIReadConfigFile( handle, "tessapi_config" )
         //TessBaseAPISetVariable( handle, "tessedit_pageseg_mode", "3" )
         TessBaseAPISetPageSegMode( handle, 3 )

         //print all tesseract ocr engine internal variables to file tesseract.log on cur dir.
         IF lDebug ; TessBaseAPIPrintVariablesToFile( handle, "tesseract.log" )  ;ENDIF


         //Open input image with leptonica library API pixRead
         IF lDebug ; logfile( "trace.log", { "pixread file", cfile } ) ;ENDIF
         img := pixRead( ALLTRIM( cPath ) + cFile )

         IF lDebug ; logfile( "trace.log", { "TessBaseAPISetImage2", cfile } ) ;ENDIF
         TessBaseAPISetImage2( handle, img )

         //Recognize is called from GetUTF8Text but it doesn't hurt to call before and
         //makes debugging easier.  Program freezes when executing TessBaseAPIRecognize() only
         //when PageSegMode is changed above.
         IF lDebug ; logfile( "trace.log", { "TessBaseAPIRecognize ", cfile } ) ;ENDIF
         //program freezes here but only when pageSeg_Mode is changed.
         //free the image before skipping to the next file, otherwise it leaks
         IF TessBaseAPIRecognize( handle, Nil ) <> 0  ; pixDestroy( img ) ; LOOP   ;ENDIF

         //if TessBaseAPIRecognize above is commented then program will freeze when executing
         //TessBaseAPIGetUTF8Text().  Recognize is called internally from GetUTF8Text so we know the
         //problem is at Recognize.
         IF lDebug ; logfile( "trace.log", { "TessBaseAPIGetUTF8Text", cfile } ) ;ENDIF
         cText := STRTRAN( TessBaseAPIGetUTF8Text( handle ), CHR( 10 ), CRLF )
...
         TessDeleteText(  cText )
         pixDestroy( img )

...
   TessBaseAPIEnd( handle )
   TessBaseAPIDelete( handle )

 


It would be nice if we could debug and fix the problem with Tesseract PSM 3 from API. I'm at a point where I'm considering placing the order for a commercial OCR engine called Transym to replace Tesseract. My customer demands more accurate OCR and for the type of document they are processing I haven't been able to get Tesseract to do any better.


Reinaldo.

Re: OCR for scanned documents

Postby frose » Tue Mar 07, 2017 8:29 am

Hi Reinaldo,

but couldn't you convert the TIFFs to PNG, JPG, or another image format before OCRing?
For this you can use ImageMagick like this:

    - split a multi-page file: magick.exe <FileName>.tiff -scene 1 <FileName>%d.tif
    - convert to png: magick.exe mogrify -format png *.tiff
Frank

Re: OCR for scanned documents

Postby reinaldocrespo » Tue Mar 07, 2017 3:06 pm

It would make a lot more sense to fix or find how to use Tesseract API, don't you think?

Reinaldo.

Re: OCR for scanned documents

Postby frose » Wed Mar 08, 2017 7:24 am

Hi Reinaldo,

not for me; I don't want to deal with those big, obsolete multi-page picture file monsters. :wink: Even my standard viewer can't show me what is in those files.

On the other hand, I don't need an API. I have had good experiences with command-line tools such as GhostScript, calibre (epub-convert/epub-meta), ImageMagick, NirCmd, SumatraPDF, DXFView, image2pdf, PDFtk, and xpdfbin (pdftotext, pdfimages, ...). Oops, that's quite a lot; I wasn't aware of it. 8)

So my recommendation is: for test purposes, split one of the affected TIFFs, OCR (with the API) all the single-page TIFFs, PNGs, or ppm/pbm files (building one searchable PDF), and compare the results. Then decide on the next steps.

Good luck, Frank