MS Word to .TXT

Postby Colin Wisbey » Thu Oct 04, 2007 11:23 pm

How to convert a MS Word file to a .TXT (straight text) file, using my own code?

If I was doing it manually, it would be the equivalent of

(a) starting MS Word, opening a doc file, then "SAVE AS .." .TXT then closing MS Word

(b) then running my app against the resulting text file.

I am desperate for suggestions on how to automate step (a) in such a way that the user never sees a MS Word screen.

I need my app to do everything, i.e. the user simply selects a DOC file (via cGetFile("*.DOC", "Select a file to assess", ,)), then my app needs to convert it to straight text and analyse the words in the text file.

The resulting text file is not later converted back to a MS Word DOC file or anything involving MS Word.

I don't want to shell out to a 3rd party Word->Txt converter if I can avoid it. The app is to be a low-priced mass-market product that users run in their own networked environments, and I shall never have any technical contact with any user. All users are expected to have MS Word but, beyond that, I want to keep my app as self-contained as possible (i.e. without having to register an OCX, change registry entries or anything like that). The users have only the most basic PC skills.

I've no experience with using OLE or interacting with a MS product so I'd be very grateful for any coding suggestions.

Many thanks in advance.
Colin Wisbey

Postby Antonio Linares » Thu Oct 04, 2007 11:34 pm

Colin,

Just a quick idea:
Have you tried locating the text inside the DOC file by brute force?

At( cToFind, MemoRead( "file.doc" ) )

If you place some marker text at the beginning and at the end, you can extract whatever lies in between using SubStr().
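
A minimal sketch of that marker idea, in case it helps. ExtractBetween() is a hypothetical helper name, and the marker strings in the usage line are made up for illustration. It assumes the text is stored in the file as single-byte characters, which is not guaranteed for every DOC file.

```harbour
// Sketch only: ExtractBetween() is a hypothetical helper, not a FiveWin function.
// Returns the text found between cStart and cEnd inside cText,
// or "" if either marker is missing.
FUNCTION ExtractBetween( cText, cStart, cEnd )

   LOCAL nFrom := At( cStart, cText )
   LOCAL cRest, nTo

   IF nFrom == 0
      RETURN ""
   ENDIF

   // Skip past the start marker, then look for the end marker in the remainder
   cRest := SubStr( cText, nFrom + Len( cStart ) )
   nTo   := At( cEnd, cRest )

   IF nTo == 0
      RETURN ""
   ENDIF

   RETURN Left( cRest, nTo - 1 )
```

Usage would be something like ExtractBetween( MemoRead( "file.doc" ), "<<BEGIN>>", "<<END>>" ), with both markers typed into the document beforehand.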
regards, saludos

Antonio Linares
www.fivetechsoft.com

Postby Adolfo » Fri Oct 05, 2007 12:13 am

Which compiler are you using: Harbour, xHarbour.org or xHb.com?

Depending on that, you may be able to use tActivex to open the doc and save it as a TXT file.

Just an idea.

From Chile
Adolfo
;-) Hee, hee, hee... good stuff... "all you need is code"

http://www.xdata.cl - Desarrollo Inteligente
----------
Asus TUF F15, 32GB Ram, 1 TB NVME M.2, 1 TB SSD, GTX 1650

Postby Adolfo » Fri Oct 05, 2007 12:17 am

Sorry... I meant tOle...

If you are using xhb.com, I have some simple samples that you can modify to fulfil your needs.

I've tried opening MS Word documents with fopen and reading their contents. It worked with some documents, but if a document contained any kind of font styling, borders or different colors, it was a mess, because the control characters in the file made it almost impossible to read.



Postby Colin Wisbey » Fri Oct 05, 2007 1:06 am

Antonio,
Thanks for responding so promptly.

Unfortunately your suggestion doesn't apply to my app. My app never knows what text is in the doc. It could be anything, and the doc could contain just a few words or over 1 million words.

The purpose of my app is to take all of the raw text (i.e. stripped of any graphics, word-processing formatting etc) and then analyse what words are in the file and how they are used. It is something of an "English literacy test" analyser.

Postby Colin Wisbey » Fri Oct 05, 2007 1:13 am

Adolfo,

That sounds like the sort of approach I guess I need to take.

I am using FW7.01 and xHb.com (but compiled with Borland).

The MS Word doc files could contain anything, including graphics. I need the ability to extract all the text and only the text.

Could you be so kind as to either post some code or email it to me at

cwisbey@optusnet.com.au

Many thanks for your offer of assistance.

Re: MS Word to .TXT

Postby Enrico Maria Giordano » Fri Oct 05, 2007 7:11 am

Colin Wisbey wrote: How to convert a MS Word file to a .TXT (straight text) file, using my own code?


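Code:
#define wdFormatDOSText 4   // WdSaveFormat value for MS-DOS text


FUNCTION MAIN()

    // Start a Word instance through OLE automation
    LOCAL oWord := CREATEOBJECT( "Word.Application" )

    // Open the source document
    LOCAL oDoc := oWord:Documents:Open( "e:\xharbour\test.doc" )

    // Save a plain-text copy under a new name
    oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )

    // Shut Word down again
    oWord:Quit()

    RETURN NIL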
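
Since the original question asked that the user never sees a Word screen, here is a slightly more defensive sketch of the same approach. DocToTxt() is a hypothetical helper name; the Visible and DisplayAlerts properties and the SaveChanges arguments are part of the Word object model, but treat the error handling as one possible arrangement rather than the required one.

```harbour
#define wdFormatDOSText     4
#define wdDoNotSaveChanges  0
#define wdAlertsNone        0

// Sketch only: DocToTxt() is a hypothetical helper name.
// Returns .T. if Word wrote the text file, .F. if anything failed.
FUNCTION DocToTxt( cDocFile, cTxtFile )

   LOCAL oWord, oDoc
   LOCAL lOk := .F.

   TRY
      oWord := CreateObject( "Word.Application" )
      oWord:Visible       := .F.            // keep the Word window hidden
      oWord:DisplayAlerts := wdAlertsNone   // suppress Word's own dialogs

      oDoc := oWord:Documents:Open( cDocFile )
      oDoc:SaveAs( cTxtFile, wdFormatDOSText )
      oDoc:Close( wdDoNotSaveChanges )
      lOk := .T.
   CATCH
      lOk := .F.
   END

   // Always try to quit Word so no hidden WINWORD.EXE is left running
   IF oWord != NIL
      TRY
         oWord:Quit( wdDoNotSaveChanges )
      CATCH
      END
   ENDIF

   RETURN lOk
```

Pass full paths for both files: Word resolves relative names against its own current directory, not the application's.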


EMG

Postby Colin Wisbey » Fri Oct 05, 2007 8:43 am

Enrico,

That works perfectly. I couldn't be happier. Biggest thanks!!!!!

Col

Postby Antonio Linares » Fri Oct 05, 2007 9:02 am

Great :-)

Thanks Master Enrico
regards, saludos


Re: MS Word to .TXT

Postby ask » Fri Oct 05, 2007 10:28 am

I think you should use oWord:Documents:Add( " ... instead of oWord:Documents:Open, because Open may cause "read only" problems. The Add method opens the file like a template, so Word creates a new document based on it and you should avoid those problems.
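
A different way to sidestep the read-only issue, not the one suggested above, is to keep Documents:Open and ask for a read-only open explicitly. This sketch relies on the third positional parameter of Documents.Open being ReadOnly; the file paths are the ones from Enrico's example.

```harbour
#define wdFormatDOSText     4
#define wdDoNotSaveChanges  0

FUNCTION Main()

   LOCAL oWord := CreateObject( "Word.Application" )
   LOCAL oDoc

   // Documents.Open( FileName, ConfirmConversions, ReadOnly, ... )
   // .F. = do not show the conversion dialog, .T. = open read-only
   oDoc := oWord:Documents:Open( "e:\xharbour\test.doc", .F., .T. )

   // Saving under a different name leaves the original file untouched
   oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )
   oDoc:Close( wdDoNotSaveChanges )

   oWord:Quit()

   RETURN NIL
```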

regards,

A.S.K

MS Word to .HTM filtered

Postby Marc Vanzegbroeck » Mon Aug 18, 2008 6:55 pm

Enrico,

Do you know how to save the Word file as "Web Page, Filtered"?

Thanks,
Marc

Postby Marc Vanzegbroeck » Mon Aug 18, 2008 7:40 pm

Hi,

Please ignore this message, I found it :D

Marc

Postby Antonio Linares » Tue Aug 19, 2008 6:53 am

Marc,

Would you mind sharing the solution with us? Thanks! :-)
regards, saludos


Postby gkuhnert » Tue Aug 19, 2008 8:28 am

These save formats (WdSaveFormat constants) are available in Word:

Code:
Name                             Value  Description
wdFormatDocument                 0      Microsoft Office Word format.
wdFormatDOSText                  4      Microsoft DOS text format.
wdFormatDOSTextLineBreaks        5      Microsoft DOS text with line breaks preserved.
wdFormatEncodedText              7      Encoded text format.
wdFormatFilteredHTML             10     Filtered HTML format.
wdFormatHTML                     8      Standard HTML format.
wdFormatRTF                      6      Rich text format (RTF).
wdFormatTemplate                 1      Word template format.
wdFormatText                     2      Microsoft Windows text format.
wdFormatTextLineBreaks           3      Windows text format with line breaks preserved.
wdFormatUnicodeText              7      Unicode text format.
wdFormatWebArchive               9      Web archive format.
wdFormatXML                      11     Extensible Markup Language (XML) format.
wdFormatDocument97               0      Microsoft Word 97 document format.
wdFormatDocumentDefault          16     Word default document file format. For Microsoft Office Word 2007, this is the DOCX format.
wdFormatPDF                      17     PDF format.
wdFormatTemplate97               1      Word 97 template format.
wdFormatXMLDocument              12     XML document format.
wdFormatXMLDocumentMacroEnabled  13     XML document format with macros enabled.
wdFormatXMLTemplate              14     XML template format.
wdFormatXMLTemplateMacroEnabled  15     XML template format with macros enabled.
wdFormatXPS                      18     XPS format.


and can be used in the same way Enrico wrote:

Code:
oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )
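
Marc did not post what he found, but going by the list above, saving as filtered HTML would presumably look like the sketch below. The output file name is made up for illustration; the value 10 is taken from the list.

```harbour
#define wdFormatFilteredHTML 10

FUNCTION Main()

   LOCAL oWord := CreateObject( "Word.Application" )
   LOCAL oDoc  := oWord:Documents:Open( "e:\xharbour\test.doc" )

   // Same call as for text output, only the format constant changes
   oDoc:SaveAs( "e:\xharbour\NewDocument.htm", wdFormatFilteredHTML )

   oWord:Quit()

   RETURN NIL
```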


You can find the documentation here: http://msdn.microsoft.com/en-us/library/bb238158.aspx
Best Regards,

Gilbert Kuhnert
CTO Software GmbH
http://www.ctosoftware.de

Postby Antonio Linares » Tue Aug 19, 2008 9:22 am

Gilbert,

Thanks! :-)
regards, saludos


