Introduction
I originally wrote a C++ parser that read MS Word documents and put their content into a structure better suited to data processing. After I wrote the parser, I started working with .NET and C# to re-create it. In the process, I also wrote my first article for Code Project, Automating MS Word Using Visual Studio .NET. Several people have asked to see the C++ version of the application, and I finally got some time to put something together. I have written this article for readers who are looking for quick answers, and I hope the information here helps them get started faster.
Background
No special background is necessary. Just have some hands-on experience with C++.
Using the code
I think the best way to present the code is to first give you the critical sections you need to get an instance of MS Word, and then give you snapshots of code that perform specific functions. I believe this will help you get started faster in developing your own programs.
The following block is the header portion of the CPP file.
Note: The most important include files are the two listed under the OLE comment; these are used for COM and OLE.
// Vahe Karamian - 04-20-2004 - For Code Project
//---------------------------------------------------------------------------
#include
#pragma hdrstop
// We need this for the OLE object
#include
#include
#include "Unit1.h"
#include
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
The following block creates the MS Word COM object. This is the object which will be used to access MS Word application functions. To see which functions are available, you can look them up from within MS Word; refer to the first article, Automating MS Word Using Visual Studio .NET.
As before, you can make either a Windows Forms application or a command-line application; the process is the same. The code below is based on a Windows Forms application that has a button to start the process. When the user clicks the button, the Button1Click(TObject *Sender) event handler is called and the code is executed.
Note: To better understand the code, ignore everything in the code except the portions that are in bold.
TForm1 *Form1;
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner)
: TForm(Owner)
{
}
//---------------------------------------------------------------------------
void __fastcall TForm1::Button1Click(TObject *Sender)
{
.
.
.
// used for the file name
OleVariant fileName;
fileName = openDialog->FileName;
Variant my_word;
Variant my_docs;
// create word object
my_word = Variant::CreateObject( "word.application" );
// make word visible, to make invisible put false
my_word.OlePropertySet( "Visible", (Variant) true );
// get document object
my_docs = my_word.OlePropertyGet( "documents" );
Variant wordActiveDocument = my_docs.OleFunction( "open", fileName );
.
.
.
A brief explanation: we define an OleVariant called fileName and assign a file path to it. In the code above, this is done using an OpenDialog object. Of course, you can just assign a full path for testing if you like, e.g., c:\test.doc.
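If you do hard-code a path for testing, keep in mind that a backslash inside an ordinary C++ string literal starts an escape sequence, so it has to be doubled. The sketch below uses standard C++ only; testDocPath and testDocPathRaw are hypothetical helper names of my own, and the raw string form assumes a C++11 compiler.

```cpp
#include <string>

// Hypothetical helper: the test path with the backslash escaped.
// "\\" yields one backslash; a lone "\t" would have become a tab.
std::string testDocPath()
{
    return "c:\\test.doc";
}

// The same path written as a raw string literal (C++11), no escaping needed.
std::string testDocPathRaw()
{
    return R"(c:\test.doc)";
}
```

Both functions return the same 11-character string, so either spelling can be used wherever a fixed file path is needed.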
Next, we define two Variant variables called my_word and my_docs. my_word holds the word.application object, and my_docs holds its documents collection.
Next, we define another Variant called wordActiveDocument, which receives the Document object returned by the open call on my_docs. In this case, we are opening the given MS Word document; using this referenced object, we can now do what we want!
Notice that most of the variables are of type Variant.
At this point, we have a Word document that we can start performing functions on. At first, it might take a while to see how it works, but once you get the hang of it, anything in the MS Word domain is possible.
Let's take a look at the following code, which deals with tables within an MS Word document.
.
.
Variant wordTables = wordActiveDocument.OlePropertyGet( "Tables" );
long table_count = wordTables.OlePropertyGet( "count" );
.
.
As I mentioned before, nearly all your data types are going to be Variant. So we declare a Variant called wordTables to represent the Tables object in our Document object.
Variant wordTables = wordActiveDocument.OlePropertyGet( "Tables" );
The line above returns all Table objects within our active Document object. Since Tables is a property of a Document object, we have to use OlePropertyGet( "Tables" ); to get the value.
long table_count = wordTables.OlePropertyGet( "count" );
The line above returns the number of tables in our Tables object. This is done by calling OlePropertyGet( "count" ); to return the value.
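The count and Item pair is the pattern used for every collection in this article, and the loops later on run from 1 to count rather than from 0. Here is a minimal standard C++ sketch of that access pattern. OneBasedCollection and collectAll are hypothetical names that stand in for a Word collection; they are not part of any Word or C++Builder API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Word collection such as Tables or Rows.
class OneBasedCollection
{
public:
    explicit OneBasedCollection(std::vector<std::string> items)
        : items_(std::move(items)) {}

    long count() const { return static_cast<long>(items_.size()); }

    // Valid indexes run from 1 to count(), matching the article's loops.
    const std::string& item(long index) const
    {
        if (index < 1 || index > count())
            throw std::out_of_range("index must be in 1..count()");
        return items_[static_cast<std::size_t>(index - 1)];
    }

private:
    std::vector<std::string> items_;
};

// Visits every element with the same loop bounds as the article's code.
std::vector<std::string> collectAll(const OneBasedCollection& c)
{
    std::vector<std::string> out;
    for (long i = 1; i <= c.count(); ++i)
        out.push_back(c.item(i));
    return out;
}
```

Asking for item 0 throws in this sketch, which is a reminder to start the loops at 1.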
You might be wondering where I get this information from. The answer is in the first article: Automating MS Word Using Visual Studio .NET.
The next block of code will demonstrate how to extract content from the Tables object.
.
.
.
int t, r, c;
try
{
    for( t=1; t<=table_count; t++ )
    {
        Variant wordTable1 = wordTables.OleFunction( "Item", (Variant) t );
        Variant tableRows = wordTable1.OlePropertyGet( "Rows" );
        Variant tableCols = wordTable1.OlePropertyGet( "Columns" );
        long row_count, col_count;
        row_count = tableRows.OlePropertyGet( "count" );
        col_count = tableCols.OlePropertyGet( "count" );
        // LET'S GET THE CONTENT FROM THE TABLES
        // THIS IS GOING TO BE FUN!!!
        for( r=1; r<=row_count; r++ )
        {
            Variant tableRow = tableRows.OleFunction( "Item", (Variant) r );
            tableRow.OleProcedure( "Select" );
            Variant rowSelection = my_word.OlePropertyGet( "Selection" );
            Variant rowColumns = rowSelection.OlePropertyGet( "Columns" );
            Variant selectionRows = rowSelection.OlePropertyGet( "Rows" );
            long rowColumn = rowColumns.OlePropertyGet( "count" );
            for( c=1; c<=rowColumn; c++ ) //col_count; c++ )
            {
                Variant rowCells = tableRow.OlePropertyGet( "cells" );
                Variant wordCell = wordTable1.OleFunction( "Cell",
                                       (Variant) r, (Variant) c );
                Variant cellRange = wordCell.OlePropertyGet( "Range" );
                Variant rangeWords = cellRange.OlePropertyGet( "Words" );
                long words_count = rangeWords.OlePropertyGet( "count" );
                AnsiString test = '"';
                for( int v=1; v<=words_count; v++ )
                {
                    test = test + rangeWords.OleFunction( "Item",
                               (Variant) v ) + " ";
                }
                test = test + '"';
            }
        }
    }
    my_word.OleFunction( "Quit" );
}
catch( Exception &e )
{
    ShowMessage( e.Message + "\nType: " + __ThrowExceptionName() +
                 "\nFile: " + __ThrowFileName() +
                 "\nLine: " + AnsiString(__ThrowLineNumber()) );
}
.
.
.
Okay, so the code above goes through all of the tables in the Document object and extracts the content from them. Tables have rows and columns. To go through all of the Table objects in a document, we get a count of the tables within the document.
So we have three nested for loops. The first one iterates over the Table objects, and the second and third iterate over the rows and columns of the current Table object. In the outer loop, we create three new Variant variables called wordTable1, tableRows, and tableCols.
Note: wordTable1 comes from the wordTables object. We get our table by calling wordTables.OleFunction( "Item", (Variant) t );. This returns a single Table object from the Tables object.
Next, we get the Rows and Columns objects of the given Table object. This is done by calling OlePropertyGet( "Rows" ); and OlePropertyGet( "Columns" ); on the wordTable1 object.
Next, we get a count of the rows and columns in the Rows and Columns objects that belong to wordTable1. We are now ready to step through them and get the content.
In the row loop, we define four new Variant variables called tableRow, rowSelection, rowColumns, and selectionRows. We select the current row and ask the selection for its own column count, so that we can go from column to column in the selected row to get the content.
In the column loop, the innermost of the three, we again define four new Variant variables called rowCells, wordCell, cellRange, and rangeWords. Yes, it is tedious, but we have to do it.
Let's sum up what we did so far:
We got the collection of Table objects within the current Document object.
We got the collections of Rows and Columns in the current Table object.
We went through each row and got the number of columns it has.
We got the cells of each row and stepped through them to get to the content of the table.
Note: Yes, some steps are repeated, but the reason is that not all tables in a given document are uniform. That is, if row 1 has 3 columns, it does not necessarily mean that row 2 has 3 columns as well; it may well have a different number of columns. You can thank the document authors/owners.
The final step then goes through the words in each cell and concatenates them into a single quoted string.
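To make the two ideas above concrete (rows of different lengths, and words joined into one quoted string), here is a small standard C++ sketch with no Word involved. The Words, Row, and Table typedefs and the quoteCell and extractCells functions are hypothetical stand-ins of my own; they use zero-based std::vector indexes, unlike the one-based Word collections.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Words;  // the words of one cell
typedef std::vector<Words> Row;          // the cells of one row
typedef std::vector<Row> Table;          // rows may differ in length

// Mirrors the word loop: opening quote, each word followed by a space,
// then a closing quote.
std::string quoteCell(const Words& words)
{
    std::string text = "\"";
    for (std::size_t v = 0; v < words.size(); ++v)
        text += words[v] + " ";
    text += "\"";
    return text;
}

// Walks every row using that row's own cell count, never a table-wide one.
std::vector<std::string> extractCells(const Table& table)
{
    std::vector<std::string> out;
    for (std::size_t r = 0; r < table.size(); ++r)
        for (std::size_t c = 0; c < table[r].size(); ++c)
            out.push_back(quoteCell(table[r][c]));
    return out;
}
```

Because the inner bound is table[r].size(), a table whose first row has two cells and whose second row has one cell is handled without reading past the end of the shorter row.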
And finally, we want to quit Word and close all documents.
...
my_word.OleFunction( "Quit" );
...
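One thing to watch: in the table code, the Quit call sits inside the try block, so an exception thrown earlier skips it and Word may be left running. One possible way around this, assuming a C++11 compiler, is a small scope guard. ScopeExit and quitRunsOnError below are hypothetical names of my own, and the lambda only sets a flag where the real code would call Quit on the Word object.

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Runs the stored action when the guard goes out of scope, whether the
// scope is left normally or by an exception.
class ScopeExit
{
public:
    explicit ScopeExit(std::function<void()> action)
        : action_(std::move(action)) {}

    ~ScopeExit()
    {
        // Swallow errors from the action: a destructor must not throw.
        try { if (action_) action_(); } catch (...) {}
    }

    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> action_;
};

// Hypothetical demonstration: reports whether the action ran even though
// the guarded scope was left by an exception.
bool quitRunsOnError()
{
    bool quitCalled = false;
    try
    {
        // Stand-in for quitting Word.
        ScopeExit quitWord([&quitCalled] { quitCalled = true; });
        throw std::runtime_error("simulated failure while reading tables");
    }
    catch (const std::runtime_error&)
    {
        // The guard has already run by the time we get here.
    }
    return quitCalled;
}
```

The guard would be declared right after the Word object is created, so that the quit action runs on every exit path.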
That is pretty much it. The code does sometimes get pretty tedious and messy. The best way to approach automating Word is to first know exactly what you want to do. Once you know what you want to achieve, you need to find out which objects or properties to use. That's the tricky part; you will have to read the documentation: Automating MS Word Using Visual Studio .NET.
In the next code block, I will show you how to open an existing document, create a new document, select content from the existing document and paste it into the new document using the Paste Special function, and then clean up using the Find and Replace function.
Before you look at the block of code, the following list identifies which variable represents which object.
Variables and representations:
vk_filename: existing document name
vk_converted_filename: new document name
vk_this_doc: existing document object
vk_converted_document: new document object
vk_this_doc_select: existing document selected object
vk_this_doc_selection: existing document selection
vk_converted_document_select: new document selected object
vk_converted_document_selection: new document selection
wordSelectionFind: Find and Replace object
// Get the filename from the list of files in the OpenDialog
vk_filename = openDialog->Files->Strings[i];
vk_converted_filename = openDialog->Files->Strings[i] + "_c.doc";
// Open the given Word file
vk_this_doc = vk_word_doc.OleFunction( "Open", vk_filename );
statusBar->Panels->Items[2]->Text = "READING";
// -------------------------------------------------------------------
// Vahe Karamian - 10-10-2003
// This portion of the code will convert the word document into
// unformatted text, and do extensive clean up
statusBar->Panels->Items[0]->Text = "Converting to text...";
vk_timerTimer( Sender );
// Create a new document
Variant vk_converted_document = vk_word_doc.OleFunction( "Add" );
// Select text from the original document
Variant vk_this_doc_select = vk_this_doc.OleFunction( "Select" );
Variant vk_this_doc_selection = vk_word_app.OlePropertyGet( "Selection" );
// Copy the selected text
vk_this_doc_selection.OleFunction( "Copy" );
// Paste selected text into the new document
Variant vk_converted_document_select =
vk_converted_document.OleFunction( "Select" );
Variant vk_converted_document_selection =
vk_word_app.OlePropertyGet( "Selection" );
vk_converted_document_selection.OleFunction( "PasteSpecial",
0, false, 0, false, 2 );
// Re-Select the text in the new document
vk_converted_document_select =
vk_converted_document.OleFunction( "Select" );
vk_converted_document_selection =
vk_word_app.OlePropertyGet( "Selection" );
// Close the original document
vk_this_doc.OleProcedure( "Close" );
// Let's do our clean-up here ...
Variant wordSelectionFind =
vk_converted_document_selection.OlePropertyGet( "Find" );
statusBar->Panels->Items[0]->Text = "Find & Replace...";
vk_timerTimer( Sender );
wordSelectionFind.OleFunction( "Execute", "^l",
false, false, false, false, false, true, 1, false,
" ", 2, false, false, false, false );
wordSelectionFind.OleFunction( "Execute", "^p", false,
false, false, false, false, true, 1, false,
" ", 2, false, false, false, false );
// Save the new document
vk_converted_document.OleFunction( "SaveAs", vk_converted_filename );
// Close the new document
vk_converted_document.OleProcedure( "Close" );
// -------------------------------------------------------------------
So what are we doing in the code above? We open an existing document with vk_this_doc = vk_word_doc.OleFunction( "Open", vk_filename ); and add a new document with Variant vk_converted_document = vk_word_doc.OleFunction( "Add" );.
Then we want to select the content of the existing document and paste it into our new document. Variant vk_this_doc_select = vk_this_doc.OleFunction( "Select" ); selects the document, and Variant vk_this_doc_selection = vk_word_app.OlePropertyGet( "Selection" ); gets a reference to the actual selection, which we copy using vk_this_doc_selection.OleFunction( "Copy" );. We perform the same two calls for the new document with Variant vk_converted_document_select = vk_converted_document.OleFunction( "Select" ); and Variant vk_converted_document_selection = vk_word_app.OlePropertyGet( "Selection" );.
At this point, we have a selection object for the existing document and one for the new document. We do our special paste using vk_converted_document_selection.OleFunction( "PasteSpecial", 0, false, 0, false, 2 );, where the last argument, 2, is the data type that requests unformatted text. Now we have our original content pasted as plain text in the newly created document.
We have to do a new select call in the new document before we do our find and replace. To do so, we simply repeat the calls vk_converted_document_select = vk_converted_document.OleFunction( "Select" ); and vk_converted_document_selection = vk_word_app.OlePropertyGet( "Selection" );. Next, we create a Find object with Variant wordSelectionFind = vk_converted_document_selection.OlePropertyGet( "Find" ); and finally, we use it to perform our find and replace with wordSelectionFind.OleFunction( "Execute", "^l", false, false, false, false, false, true, 1, false, " ", 2, false, false, false, false );. This call replaces every manual line break (^l) with a space, and the second Execute call does the same for paragraph marks (^p). The new document is then saved with SaveAs and closed.
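To show what the two Execute calls do to the text, here is a minimal standard C++ sketch that performs the same replacement on an ordinary string. It assumes that a manual line break appears as character 11 ('\v') and a paragraph mark as character 13 ('\r') once the text is outside Word; flattenBreaks is a hypothetical helper of my own, not a Word function.

```cpp
#include <algorithm>
#include <string>

// Replaces manual line breaks ('\v', assumed to be what "^l" matches) and
// paragraph marks ('\r', assumed to be what "^p" matches) with spaces.
std::string flattenBreaks(std::string text)
{
    std::replace(text.begin(), text.end(), '\v', ' ');
    std::replace(text.begin(), text.end(), '\r', ' ');
    return text;
}
```

As with the Replace All calls in the article, every break becomes exactly one space, so the text ends up as a single line and consecutive breaks leave consecutive spaces.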
That's all there is to it!
Points of Interest
Putting structure on a Word document is a challenging task, given that people have many different ways of authoring documents. Nevertheless, it would help organizations to start modeling their documents. This would allow them to apply an XML schema to their documents and make extracting content from them much easier. This is a challenging task for most companies; usually, they lack either the expertise or the resources. Such projects are also huge in scale because they affect more than one functional business area. But in the long run, it will be beneficial to the organization as a whole. Having documents driven by structured data rather than by formatting and loose documents adds a lot of value to your business.