Extract Text

Post by Jeff Barnes » Tue Mar 15, 2016 7:45 pm

I am looking for a way to extract text from a text file and was wondering if something already exists to do what I need.

I was looking for something like:

cText := TextExtract( cItem1, cItem2)

cItem1 and cItem2 would be the text that is around the text I am after.

Text Example from the file:
<system_name>MYPC101</system_name>

So I would like to do something like this:
cText := TextExtract( "<system_name>", "</system_name>" )

And it would return cText as "MYPC101"

Any ideas?
Thanks,
Jeff Barnes

(FWH 16.11, xHarbour 1.2.3, Bcc730)

Re: Extract Text

Post by cnavarro » Tue Mar 15, 2016 8:04 pm

Cristobal Navarro
There are two kinds of people: those who make you waste your time and those who make you lose track of time
The secret of happiness is not in doing what you like, but in liking what you do

Re: Extract Text

Post by rhlawek » Tue Mar 15, 2016 10:49 pm

Jeff,

Here is a function I wrote a long time ago to do exactly this. It returns what it finds in an array of strings, not a single string. I suppose this can be optimized somewhat, but other than pre-allocating the array to avoid a bunch of AAdd() calls, I've simply never had the need to improve on it. If I were going to optimize anything, the first step would be to keep track of the offset into the string instead of trimming the front off the input strings after each match.

As is, I often return an array of 15,000 to 20,000 strings at a time, parsing various logs and XML files. Some of the logs are quite large, 50+ MB. Large logs mean large memory allocations. Still, I typically just read the entire file into cInputString and process it all in one pass. I do have a version that finds the first instance of matching tags and returns that single instance in a string, but I hardly ever use that version.

As written it creates a local upper case copy of the input string and the tags and does an upper case match, but it returns what it finds in the original case.

Code:

#if ! defined( DEFAULT_MAX_RECORDS )
#define DEFAULT_MAX_RECORDS   20000
#endif

FUNCTION BETWEENTAGSARRAY( cStartTag, cEndTag, cInputString, lIncludeTags )

   LOCAL nStartPoint, nEndPoint
   LOCAL nRecords := 00, nFetchLength := 00, aFoundText := Array( DEFAULT_MAX_RECORDS )
   LOCAL cMDML
   LOCAL cInputStringUpper := Upper( cInputString )
   LOCAL cStartTagUpper    := Upper( cStartTag    )
   LOCAL cEndTagUpper      := Upper( cEndTag      )
   
   hb_Default( @lIncludeTags, .F. )
   
   DO WHILE .T.

      // Find the starting point of the starting tag.
      nStartPoint := At( cStartTagUpper, cInputStringUpper )
      IF nStartPoint > 00

         // Adjust starting point to end of starting tag
         nStartPoint += Len( cStartTagUpper )

         // If the first tag is found strip off string up to and including the starting tag itself
         cInputStringUpper := SubStr( cInputStringUpper, nStartPoint )
         cInputString      := SubStr( cInputString,      nStartPoint )

         // Find the starting point of the second tag, beginning from end of first tag.
         nEndPoint := At( cEndTagUpper, cInputStringUpper )
         IF nEndPoint > 00

            // If the second tag is found, the text to fetch is everything before it.
            nFetchLength := nEndPoint - 1

            IF lIncludeTags
               cMDML := cStartTag + LTrim( SubStr( cInputString, 01, nFetchLength ) ) + cEndTag
            ELSE
               cMDML := LTrim( SubStr( cInputString, 01, nFetchLength ) )
            ENDIF

            IF ++nRecords <= DEFAULT_MAX_RECORDS
               aFoundText[ nRecords ] := cMDML
            ELSE
               // IF we get here it is gonna be oh so slow.
               AAdd( aFoundText, cMDML )
            ENDIF

            // clip off the front of the string then loop to find the next
            cInputStringUpper := SubStr( cInputStringUpper, nFetchLength + 01 )
            cInputString      := SubStr( cInputString,      nFetchLength + 01 )

         ELSE
            EXIT
         ENDIF
      ELSE
         EXIT
      ENDIF
   ENDDO
   IF nRecords < DEFAULT_MAX_RECORDS
      aFoundText := ASize( aFoundText, nRecords )
   ENDIF

   RETURN ( aFoundText )
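
The optimization mentioned before the listing (keeping an offset into the string instead of trimming the front off after each match) could look roughly like the sketch below. It is not Robb's code: BetweenTagsByOffset() is a hypothetical name, and it assumes Harbour's hb_At( cSearch, cString, nStart ). Under xHarbour, At() itself should accept the same optional start argument. Unlike the original, it resumes searching after the end tag and does not pre-allocate the result array.

```harbour
// Sketch only: BetweenTagsByOffset() is a hypothetical name, not part of FWH or Harbour.
// Assumes Harbour's hb_At( cSearch, cString, nStart ).

PROCEDURE Main()

   // Two sample lines built from the example in the first post
   LOCAL cLog := "<system_name>MYPC101</system_name>" + hb_eol() + ;
                 "<SYSTEM_NAME>MYPC102</SYSTEM_NAME>"
   LOCAL cItem

   FOR EACH cItem IN BetweenTagsByOffset( "<system_name>", "</system_name>", cLog )
      ? cItem
   NEXT

   RETURN

FUNCTION BetweenTagsByOffset( cStartTag, cEndTag, cInput )

   LOCAL aFound   := {}
   LOCAL cUpper   := Upper( cInput )      // upper-case copy used only for matching
   LOCAL cStartUp := Upper( cStartTag )
   LOCAL cEndUp   := Upper( cEndTag )
   LOCAL nPos     := 1                    // offset where the next search begins
   LOCAL nStart, nEnd

   DO WHILE ( nStart := hb_At( cStartUp, cUpper, nPos ) ) > 0

      // Move to the first character after the start tag
      nStart += Len( cStartUp )

      nEnd := hb_At( cEndUp, cUpper, nStart )
      IF nEnd == 0
         EXIT                             // start tag without a matching end tag
      ENDIF

      // Take the text from the original string so its case is preserved
      AAdd( aFound, LTrim( SubStr( cInput, nStart, nEnd - nStart ) ) )

      // Continue after the end tag; the input strings are never copied
      nPos := nEnd + Len( cEndUp )
   ENDDO

   RETURN aFound
```

For the two sample lines this should return { "MYPC101", "MYPC102" }. Because the input is never re-copied, memory use stays close to two copies of the file (the original and the upper-case one), which matters with 50+ MB logs.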
 


Robb

Re: Extract Text

Post by James Bott » Tue Mar 15, 2016 11:43 pm

Jeff,

See: FWH\samples\xmlreader.prg

This is a sample XML document reader.

James

Re: Extract Text

Post by rhlawek » Wed Mar 16, 2016 2:36 am

A lot of what I pull out of logs is XML, but it typically gets written as a single line in the log, not as a clean XML document. That is actually why I wrote this function, and also why it has a switch to leave the tags in place as part of the returned strings or not. With XML I typically want the tags, but with other raw logs I do not. I do use the TXMLDocument class, which is used in samples\xmlreader.prg, to parse the XML after it is extracted.
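
To illustrate the tags switch on a single match, here is a minimal sketch of a "first instance only" variant, close to the TextExtract() asked for in the first post. FirstBetweenTags() is a hypothetical name and the log line is made up for the example. It is not the version Robb mentions having. It uses only At(), SubStr(), Upper() and Len(), so it should compile under both Harbour and xHarbour.

```harbour
// Sketch only: FirstBetweenTags() is a hypothetical helper, not part of FWH.

PROCEDURE Main()

   // Made-up log line with the XML fragment embedded in other text
   LOCAL cLine := "host=<system_name>MYPC101</system_name> status=ok"

   ? FirstBetweenTags( "<system_name>", "</system_name>", cLine )        // tags stripped
   ? FirstBetweenTags( "<system_name>", "</system_name>", cLine, .T. )   // tags kept

   RETURN

FUNCTION FirstBetweenTags( cStartTag, cEndTag, cInput, lIncludeTags )

   LOCAL cUpper := Upper( cInput )        // case-insensitive matching
   LOCAL nStart := At( Upper( cStartTag ), cUpper )
   LOCAL nEnd, cText

   IF nStart == 0
      RETURN ""                           // start tag not found
   ENDIF

   // First character after the start tag
   nStart += Len( cStartTag )

   // Look for the end tag only in the text that follows the start tag
   nEnd := At( Upper( cEndTag ), SubStr( cUpper, nStart ) )
   IF nEnd == 0
      RETURN ""                           // end tag not found
   ENDIF

   // Take the text from the original string so its case is preserved
   cText := SubStr( cInput, nStart, nEnd - 1 )

   // Treat a missing or non-logical fourth argument as .F.
   IF ValType( lIncludeTags ) == "L" .AND. lIncludeTags
      cText := cStartTag + cText + cEndTag
   ENDIF

   RETURN cText
```

With the tags kept, the result is a small well-formed fragment that can be handed to an XML parser. With the tags stripped, it is just the value ("MYPC101" here).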

Re: Extract Text

Post by Jeff Barnes » Wed Mar 16, 2016 5:02 pm

Thanks Robb. With some slight fine tuning (less than 5 minutes) it does exactly what I need :)

James, I couldn't look at the sample xmlreader.prg as I don't seem to have it in my samples folder.
Maybe my FWH version doesn't include it.
Thanks,
Jeff Barnes

Re: Extract Text

Post by Antonio Linares » Wed Mar 16, 2016 9:19 pm

Jeff,

FWH\samples\xmlreader.prg

Code:
// Simple example for a generic XML reader

#include "FiveWin.ch"

function Main()
   
   local hFile    := FOpen( "test.xml" )
   Local oXmlDoc  := TXmlDocument():New( hFile )
   Local oXmlIter := TXmlIterator():New( oXmlDoc:oRoot ), oTagActual

   while .T.
      oTagActual = oXmlIter:Next()
      If oTagActual != nil
         MsgInfo( oTagActual:cName, oTagActual:cData )
         HEval( oTagActual:aAttributes, { | cKey, cValue | MsgInfo( cKey, cValue ) } )
      Else
         Exit
      Endif
   End

   FClose( hFile )

return nil

Regards,

Antonio Linares
www.fivetechsoft.com
