MS Word to .TXT
-
- Posts: 56
- Joined: Mon Jul 03, 2006 2:34 am
MS Word to .TXT
How can I convert an MS Word file to a .TXT (plain text) file using my own code?
If I were doing it manually, it would be the equivalent of:
(a) starting MS Word, opening a DOC file, choosing "Save As..." .TXT, then closing MS Word
(b) then running my app against the resulting text file.
I am desperate for suggestions on how to automate step (a) in such a way that the user never sees an MS Word screen.
I need my app to do everything, i.e. the user simply selects a DOC file (via cGetFile( "*.DOC", "Select a file to assess" )), then my app converts it to plain text and analyses the words in the text file.
The resulting text file is not later converted back to a MS Word DOC file or anything involving MS Word.
I don't want to shell out to a 3rd party Word->Txt converter if I can avoid it. The app is to be a low-priced mass market product in their own networked environment and I shall never have any technical contact with any user. All users are expected to have MS Word but, beyond that, I want to keep my app as self-contained as possible. (i.e. without having to register an OCX or change registry entries or anything like that). The users have only the most basic PC skills.
I've no experience with using OLE or interacting with a MS product so I'd be very grateful for any coding suggestions.
Many thanks in advance.
Colin Wisbey
- Antonio Linares
- Site Admin
- Posts: 42511
- Joined: Thu Oct 06, 2005 5:47 pm
- Location: Spain
- Has thanked: 31 times
- Been thanked: 73 times
- Contact:
Which compiler are you using: Harbour, xHarbour.org, or xHb.com?
Depending on that, you may be able to use tActivex to open the doc and save it as a TXT file.
Just an idea.
From Chile
Adolfo
http://www.xdata.cl - Desarrollo Inteligente
----------
Asus TUF F15, 32GB Ram, 2 * 1 TB NVME M.2, GTX 1650
Sorry... I meant tOle...
If you are using xhb.com, I have some easy samples that you can modify to fulfil your needs.
I've tried opening MS Word documents with fopen and reading their content. It worked with some documents, but if the document contained any kind of lettering, borders, or different colors, it was a mess, because the control characters in the file made reading it almost impossible.
From Chile
Adolfo
-
- Posts: 56
- Joined: Mon Jul 03, 2006 2:34 am
Antonio,
Thanks for responding so promptly.
Unfortunately your suggestion doesn't apply to my app. My app never knows what text is in the doc. It could be anything, and the doc could contain just a few words or over 1 million words.
The purpose of my app is to take all of the raw text (i.e. stripped of any graphics, word-processing formatting etc) and then analyse what words are in the file and how they are used. It is somewhat of an "English literacy test" analyser.
-
- Posts: 56
- Joined: Mon Jul 03, 2006 2:34 am
Adolfo,
That sounds like the sort of approach I guess I need to take.
I am using FW7.01 and xHb.com (but compiled with Borland).
The MS Word doc files could contain anything, including graphics. I need the ability to extract all the text and only the text.
Could you be so kind as to either post some code or email it to me at
cwisbey@optusnet.com.au
Many thanks for your offer of assistance.
- Enrico Maria Giordano
- Posts: 8753
- Joined: Thu Oct 06, 2005 8:17 pm
- Location: Roma - Italia
- Has thanked: 1 time
- Been thanked: 4 times
- Contact:
Re: MS Word to .TXT
Colin Wisbey wrote:How to convert a MS Word file to a .TXT (straight text) file, using my own code?
Code:
    #define wdFormatDOSText 4   // WdSaveFormat value for DOS text

    FUNCTION MAIN()
       // Start Word through OLE automation and open the source document
       LOCAL oWord := CREATEOBJECT( "Word.Application" )
       LOCAL oDoc  := oWord:Documents:Open( "e:\xharbour\test.doc" )

       // Save a plain-text copy, then shut Word down
       oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )
       oWord:Quit()
    RETURN NIL
EMG
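The original question asked that the user never see a Word window and that the app handle everything itself. A minimal sketch of how the code above might be wrapped for that, assuming xHarbour (for TRY/CATCH) and the Word object model properties Visible and DisplayAlerts; the function name DocToTxt and its parameters are hypothetical, not part of FiveWin or xHarbour:

```harbour
#define wdFormatDOSText      4   // WdSaveFormat: DOS text
#define wdAlertsNone         0   // WdAlertLevel: suppress Word's dialogs
#define wdDoNotSaveChanges   0   // WdSaveOptions: close without saving

// Hypothetical helper: saves cDocFile as plain text in cTxtFile.
// Returns .T. on success, .F. if any automation call failed.
FUNCTION DocToTxt( cDocFile, cTxtFile )

   LOCAL oWord, oDoc
   LOCAL lOk := .F.

   TRY
      oWord := CreateObject( "Word.Application" )
      oWord:Visible       := .F.            // keep the Word window hidden
      oWord:DisplayAlerts := wdAlertsNone   // no prompts shown to the user

      oDoc := oWord:Documents:Open( cDocFile )
      oDoc:SaveAs( cTxtFile, wdFormatDOSText )
      oDoc:Close( wdDoNotSaveChanges )
      lOk := .T.
   CATCH
      lOk := .F.
   END

   // Try to shut Word down even if the conversion failed,
   // so that no hidden WINWORD.EXE process is left behind
   IF ValType( oWord ) == "O"
      TRY
         oWord:Quit()
      CATCH
      END
   ENDIF

RETURN lOk
```

It could then be called with the file name returned by cGetFile(), for example DocToTxt( cDoc, cTxt ). Passing full paths is probably safest, since Word resolves relative paths against its own current directory rather than the application's.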
- Antonio Linares
- Site Admin
- Posts: 42511
- Joined: Thu Oct 06, 2005 5:47 pm
- Location: Spain
- Has thanked: 31 times
- Been thanked: 73 times
- Contact:
Re: MS Word to .TXT
EnricoMaria wrote:
Colin Wisbey wrote: How to convert a MS Word file to a .TXT (straight text) file, using my own code?
Code:
    #define wdFormatDOSText 4
    FUNCTION MAIN()
       LOCAL oWord := CREATEOBJECT( "Word.Application" )
       LOCAL oDoc := oWord:Documents:Open( "e:\xharbour\test.doc" )
       oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )
       oWord:Quit()
    RETURN NIL
EMG
I think you should use oWord:Documents:Add( "..." ) instead of oWord:Documents:Open, because Open may cause "read only" problems. The Add method opens the file like a template, so you should not run into them.
regards,
A.S.K
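As a sketch of the two options, here is the same conversion with the Add variant shown in a comment and, as an alternative that was not proposed in the thread, Documents:Open called with its ReadOnly argument set. This assumes the Word object model's argument order for Open (FileName, ConfirmConversions, ReadOnly); the paths are the sample ones used earlier:

```harbour
#define wdFormatDOSText 4

FUNCTION MAIN()

   LOCAL oWord := CreateObject( "Word.Application" )
   LOCAL oDoc

   // Option 1 (as suggested above): create a new document
   // that uses the existing file as its template
   // oDoc := oWord:Documents:Add( "e:\xharbour\test.doc" )

   // Option 2 (assumption): open the file itself, but read-only.
   // Arguments: FileName, ConfirmConversions, ReadOnly
   oDoc := oWord:Documents:Open( "e:\xharbour\test.doc", .F., .T. )

   // Saving under a different name leaves the original untouched
   oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )
   oDoc:Close( 0 )   // 0 = wdDoNotSaveChanges
   oWord:Quit()

RETURN NIL
```

Either way the source DOC file is never written to, which is what matters when only the text is being extracted.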
-
- Posts: 1163
- Joined: Mon Oct 17, 2005 5:41 am
- Location: Belgium
- Contact:
MS Word to .HTM filtered
Enrico,
Do you know how to save the Word file as "Web Page, Filtered"?
Thanks,
Marc
- gkuhnert
- Posts: 274
- Joined: Fri Apr 04, 2008 1:25 pm
- Location: Aachen - Germany // Kerkrade - Netherlands
- Contact:
These Save-Options are available in Word:

    Constant                          Value  Description
    wdFormatDocument                  0      Microsoft Office Word format.
    wdFormatDOSText                   4      Microsoft DOS text format.
    wdFormatDOSTextLineBreaks         5      Microsoft DOS text with line breaks preserved.
    wdFormatEncodedText               7      Encoded text format.
    wdFormatFilteredHTML              10     Filtered HTML format.
    wdFormatHTML                      8      Standard HTML format.
    wdFormatRTF                       6      Rich text format (RTF).
    wdFormatTemplate                  1      Word template format.
    wdFormatText                      2      Microsoft Windows text format.
    wdFormatTextLineBreaks            3      Windows text format with line breaks preserved.
    wdFormatUnicodeText               7      Unicode text format.
    wdFormatWebArchive                9      Web archive format.
    wdFormatXML                       11     Extensible Markup Language (XML) format.
    wdFormatDocument97                0      Microsoft Word 97 document format.
    wdFormatDocumentDefault           16     Word default document file format. For Microsoft Office Word 2007, this is the DOCX format.
    wdFormatPDF                       17     PDF format.
    wdFormatTemplate97                1      Word 97 template format.
    wdFormatXMLDocument               12     XML document format.
    wdFormatXMLDocumentMacroEnabled   13     XML document format with macros enabled.
    wdFormatXMLTemplate               14     XML template format.
    wdFormatXMLTemplateMacroEnabled   15     XML template format with macros enabled.
    wdFormatXPS                       18     XPS format.

and can be used in the same way Enrico wrote:

    oDoc:SaveAs( "e:\xharbour\NewDocument.txt", wdFormatDOSText )

You can find the documentation here: http://msdn.microsoft.com/en-us/library/bb238158.aspx
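For the earlier question about saving as a filtered web page, a minimal sketch along the same lines, using the wdFormatFilteredHTML value (10) from the list. The output file name is only an example, and older Word versions may not support this format:

```harbour
#define wdFormatFilteredHTML 10   // WdSaveFormat: filtered HTML

FUNCTION MAIN()

   LOCAL oWord := CreateObject( "Word.Application" )
   LOCAL oDoc  := oWord:Documents:Open( "e:\xharbour\test.doc" )

   // Same call as for plain text, only the format constant changes
   oDoc:SaveAs( "e:\xharbour\NewDocument.htm", wdFormatFilteredHTML )

   oDoc:Close( 0 )   // 0 = wdDoNotSaveChanges
   oWord:Quit()

RETURN NIL
```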