wvWare 2 Design Document

Design Goals

Directory Structure

Not a huge package, but in case you get lost:

Path               Contents
wv2                Holds some build system stuff and general build information.
wv2/doc            Information for developers and a Doxygen file to generate the API documentation.
wv2/src            Contains 99% of the sources. As we don't want a build-time dependency on Perl, the generated code is also checked into the CVS tree.
wv2/src/generator  Two Perl scripts, some template files, and the available file format specifications for Word 8 and Word 6. This stuff generates the scanner code. Once you've finished reading this document you might want to check out the file format spec in this directory.
wv2/src/tests      Mainly self-checking unit tests and function tests for the library. Use "make check" to build them.

Design Overview

Viewed from far, far away, the filter structure looks roughly like this:

Architecture

A Word document consists of a number of streams, embedded in one file. This file-system-in-a-file is called OLE structured storage. We're using libgsf to get hold of the real data. The filter itself consists of some central "intelligence" to perform the needed steps to parse the document and some utility classes to support that task. During the parsing process we send the gathered information to the consumer, the program loading the Word file (on the right). This program has to process the delivered information and either assemble a native file or stream the contents directly into the application.

OLE Handler

The interface to the documents is a C++ wrapper around the libgsf library. libgsf allows us -- among many other things -- to read and write OLE streams from and to the document file. Using it directly would be rather inconvenient, so we created a class representing the whole document (OLEStorage) and two classes for reading and writing a single stream (OLEStreamReader and OLEStreamWriter).
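To make that a bit more concrete, here is a minimal usage sketch. The class names are the ones mentioned above, but the method names (open(), createStreamReader(), readU16()) and the header names are assumptions for illustration; check the real headers for the actual signatures.

    // Sketch only: method and header names are assumed, not the real API.
    #include "olestorage.h"
    #include "olestream.h"
    #include <string>

    void dumpIdent(const std::string& fileName)
    {
        OLEStorage storage(fileName);          // represents the whole document
        if (!storage.open())                   // hypothetical open() method
            return;

        // The main text lives in the "WordDocument" stream.
        OLEStreamReader* document = storage.createStreamReader("WordDocument");
        if (!document)
            return;

        unsigned short wIdent = document->readU16(); // first field of the FIB
        (void)wIdent;
        delete document;
    }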

API

The external API for the users of the library should consist of at least two, but maybe more, layers, ranging from a low-level, fine-grained API, where lots of work is needed on the consumer side, to a very high-level API that basically returns enriched text at the cost of flexibility.

Another main task of that API is to hide differences between Word versions where that's feasible. In any case, even the low-level layer of the API shouldn't expose too much of the ugliness of Word documents. The parser logic will differ between format versions, and this has to be considered for all further design issues. Most likely we will choose some strategy-pattern approach within the parsing section of the code to replace the logic behind the scenes while keeping the same API.

Currently we have a Parser base class for all parsers (Is-A), and a Parser9x base class for the Word9x filters (Is-Implemented-In-Terms-Of).

This part of the code is surely the most demanding from a design point of view. I'd be very pleased to hear some of your ideas :-)

Parser

The core part of the whole filter. This part of the code ensures that the utility classes are used in the correct order and manages the communication between the various parts of the library. It's also quite challenging to design. Various versions contain similar or even identical chunks, but other parts differ a lot. The aim is to find a design which allows us to reuse much of the parser code across several versions.

Right now it seems that we have found a nice mixture of plain interfaces with virtual methods and fancy functor-like objects for more complex structures like footnote information. The advantage of this mixture is that common operations are reasonably fast (just a virtual method call), and yet we provide enough flexibility for the consumer to trigger the parsing of the more complex structures itself. This means that you can easily cope with different concepts in the file formats by delaying the parsing of, say, headers and footers till after you have read all the main body text.

This flexibility of course isn't free, but the functor concept is pretty lightweight, totally type-safe, and it allows us to hide parts of the API. I'd like to hear your opinions on that topic.
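To illustrate the concept, such a functor could look like the sketch below. All names and members here are invented for illustration; the real functors in the source may differ.

    class Parser9x;   // the class doing the real work

    // Sketch of the functor idea: an opaque, copyable object the consumer
    // can invoke whenever it wants the headers parsed.
    class HeaderFunctor
    {
    public:
        void operator()() const;   // defined in the parser's .cpp file

    private:
        friend class Parser9x;     // only the parser can create functors,
                                   // so the parsing internals stay hidden
        HeaderFunctor(Parser9x* parser, int fcStart, int fcLim)
            : m_parser(parser), m_fcStart(fcStart), m_fcLim(fcLim) {}

        Parser9x* m_parser;
        int m_fcStart, m_fcLim;    // file positions of the header text (assumed)
    };

The consumer just copies the functor somewhere and invokes it -- via operator() -- whenever it suits the destination format, e.g. after the main body text has been processed.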

String Classes

We agreed to use Harri Porten's UString class from kjs, a clean implementation of an implicitly shared UCS-2 string class (host-order Unicode values). The same file (ustring.h) also contains a CString class, but we'll use std::string for ASCII strings.

The iconv library is used to convert text stored as CP 1252 or similar to UCS-2. This is done by the Textconverter class, which wraps libiconv. Some systems ship a broken or incomplete version of libiconv (e.g. Darwin, older Solaris versions, ...), so we provide a configure option --with-iconv-dir to specify the path of an alternative iconv installation.

Utility Classes

To reduce the complexity of the code we try to write small entities designed to do one task; for example, all the code in styles.cpp is used to read (and later on probably write) the stylesheet information contained in every Word file, lists.cpp cares about lists, and so on. We use a naming scheme to distinguish code which works for all versions (at least Word 6 and newer) from code that is specific to one version: all the *97.(cpp|h) files are designed to work with Word 8 or newer, while files without such a number should work with all versions.

This part of the code also consists of a number of templates to handle the different ways arrays and more complex structures are stored in a Word file (e.g. the meta structures PLF, PLCF, and FKP). If that sounds like Greek to you, it's probably a good idea to read the Definitions section at the top of the file format specification in wv2/src/generator.
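As a taste of what these templates do, here is a simplified sketch of the PLCF idea (the real template in word97_helper.* is more involved; the reader method and T::sizeOf are assumptions): a PLCF stores n+1 ascending character positions followed by n fixed-size structures, where item i applies to the text between position i and position i+1.

    #include <vector>

    // Simplified PLCF reader; see the real code for the actual interfaces.
    template<class T>
    class PLCF
    {
    public:
        // 'length' is the total byte count of the PLCF as stored in the
        // file, so the item count n follows from
        // length = (n + 1) * 4 + n * T::sizeOf.
        PLCF(OLEStreamReader* reader, unsigned int length)
        {
            const unsigned int n = (length - 4) / (4 + T::sizeOf);
            for (unsigned int i = 0; i <= n; ++i)
                m_positions.push_back(reader->readU32());
            for (unsigned int i = 0; i < n; ++i)
                m_items.push_back(T(reader)); // each T reads itself field by field
        }

    private:
        std::vector<unsigned int> m_positions; // n + 1 character positions
        std::vector<T> m_items;                // n structures
    };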

Generated Scanner Code

It's a tedious job to implement the most basic part of the filter -- reading and writing the structures in Word documents. It is boring, repetitive, and error-prone, so we decided to generate this ultra-low-level code. We're using two Perl scripts and the available HTML specifications for Word 8 and Word 6. One script, generate.pl, scans the HTML file and outputs the reading/writing code and some test files. The other script, convert.pl, generates code to convert Word 6 structures to Word 8 structures. We need to do this because we want to present the files as Word 8 files to the outside world; the idea is to hide all the subtle differences between the formats from the user of this library. For Word 6 this seems to be possible; no idea whether that will work out for older formats.

Unit Tests

A vital part of the whole library are the self-checking unit and function tests, which help to avoid introducing hard-to-find bugs while implementing new features. The goal is to test the major components; it's close to impossible to test everything. Please run the unit tests before you commit bigger changes to see if something breaks. If you find that some test is broken on your platform, please send me the whole output, some platform information, and the document you used for testing.

It's a bit hard to test the proper parsing of a file, so there aren't many fully automatic tests for that part of the code yet, and I honestly don't see any easy way out. We probably have to think of some way to abuse the KWord wv2 consumer to perform the synthetic tests. Suggestions are highly welcome.

Existing Code and Pending Design Issues

At the moment most of the basic work for the filter is done or close to being done. We have a working build system, code to read and write OLE streams, code to scan the basic building blocks of Word documents, and some utility classes like the string class. The filter is able to read the text including its properties, and it handles fonts, lists, headers/footers, sections, fields (to some extent), and tables.

This section of the document lists all the existing items and the main idea behind them. We also briefly discuss the design and the reasons for choosing exactly that design. This section also contains a discussion of code we don't have yet, and some ideas about a possible design.

OLE Code

The OLE chunk of the code is basically done. It utilizes libgsf and provides a stream-based API to read and write from/to the file. OLEStorage is the class to handle the whole document and to travel through the "directories." OLEStream provides the common interface for readers and writers, like seeking in the stream and pushing and popping the current cursor position. OLEStreamReader and OLEStreamWriter inherit OLEStream and provide the real reading/writing methods.
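The push/pop mechanism is handy whenever a structure points to some other place in the stream. A sketch, with assumed method names:

    // Peek at a structure somewhere else without losing the current position.
    void peekAt(OLEStreamReader* reader, int offset)
    {
        reader->push();                 // save the current cursor position
        reader->seek(offset);           // jump to the referenced structure
        unsigned char firstByte = reader->readU8();
        (void)firstByte;                // ...read whatever is needed...
        reader->pop();                  // and jump back as if nothing happened
    }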

This part of the code, contained in the ole* files, is generally straightforward, but as libgsf is a lot stricter than libole2, some of the functionality is gone (e.g. you can't browse the contents of a directory in a file you write out, and you can't open an OLE storage for reading and writing at the same time).

API

The API is a mixture of a good old "Hollywood Principle" API (don't call us, we'll call you) and a fancy functor-based approach. The Hollywood part of the API can be found in the handler.h file; it's split across several smaller interfaces. We are incrementally adding/moving/removing functionality there, so please don't expect that API to be stable yet.

The main reason for choosing this approach is that very common callbacks like TextHandler::runOfText are as lightweight as possible. More complex callbacks like TextHandler::headersFound allow a good deal of flexibility in parsing, as the consumer decides when to parse, say, the header. This helps to avoid nasty hacks if the concepts of the destination file format differ from the MS Word ones.
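A consumer-side sketch of the two callback flavors; the signatures here are simplified and partly assumed, handler.h has the real ones:

    #include <vector>

    // The consumer subclasses the handler interfaces from handler.h and
    // overrides only the callbacks it cares about.
    class MyTextHandler : public TextHandler
    {
    public:
        // Lightweight and very frequent: just a virtual method call.
        virtual void runOfText(const UString& text)
        {
            // convert and emit the text right away
        }

        // Heavyweight and rare: just remember the functor here and invoke
        // it later, after the main body text has been processed.
        virtual void headersFound(const HeaderFunctor& parseHeaders)
        {
            m_pendingHeaders.push_back(parseHeaders);
        }

    private:
        std::vector<HeaderFunctor> m_pendingHeaders;
    };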

Parser

The main task in the parser section is to find a design which allows us to share the common code between different file format versions. Another important task is to keep the coupling of the code reasonably low. I see a lot of places in the specification where information from various blocks of our design is needed, and I really hate code where every object holds five pointers to other objects just because it needs to query some information from each of these objects once in its lifetime. Code like that is a pain to maintain.

For the code sharing topic my current idea is to have a small hierarchy of Parser* classes like this one:

Hierarchy

Parser is an abstract base class providing a few methods to start the parsing and so on. This is the interface the outside world sees and uses. Parser97 (and also Parser95, which would sit at the same position in the hierarchy as Parser97) are the real classes doing all the work. So far this is perfectly normal Is-A inheritance and nothing to talk about. The Parser9x class is a bit different from what you would expect, though: Parser97 inherits privately from Parser9x (Is-Implemented-In-Terms-Of inheritance). Parser9x will consist of a number of non-virtual helper methods which can be shared between Word 6 and Word 8, and it will additionally implement the template method pattern for more complex shared algorithms (due to the virtual template methods we need inheritance; plain delegation isn't sufficient).
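In code, the intended relationships could be sketched like this (all method names are invented for illustration):

    // Parser: the public Is-A interface.
    class Parser
    {
    public:
        virtual ~Parser() {}
        virtual bool parse() = 0;   // all the outside world gets to see
    };

    // Parser9x: shared implementation, never handed to the outside world.
    class Parser9x
    {
    protected:
        virtual ~Parser9x() {}

        // Template method: the shared algorithm calls version-specific hooks.
        void parseText()
        {
            readPieceTable();            // identical for Word 6 and Word 8
            readParagraphProperties();   // differs, see the hook below
        }

    private:
        void readPieceTable() { /* shared, non-virtual helper */ }
        virtual void readParagraphProperties() = 0;  // version-specific hook
    };

    // Parser97: Is-A Parser, Is-Implemented-In-Terms-Of Parser9x.
    class Parser97 : public Parser, private Parser9x
    {
    public:
        virtual bool parse() { parseText(); return true; }

    private:
        virtual void readParagraphProperties() { /* Word 8 specifics */ }
    };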

The whole parsing process is divided into different stages, and all this code is chopped into nice little pieces and put into various helper/template methods. We take care to separate the methods in such a way that as many of them as possible can be "bubbled up" the inheritance hierarchy right into Parser9x.

Right now Parser9x is empty, as the Word 6 parser hasn't been started yet. As soon as that gets done we can move around the code which lives in the parser97.* files for now.

To keep the coupling between the blocks of the design low, the parser has to implement the Mediator pattern or something similar. It is the only block in our design containing "intelligence", in the sense that it's the only block knowing about the sequence of parsing and the interaction of the encapsulated components like the OLE subsystem and the stylesheet-handling utility classes.

String Classes

The main classes UString and std::string are well tested and known to work well. One piece of advice: take a lot of care when using UString::ascii, as the buffer for the ASCII string is shared among all instances of UString (it's a static buffer)! As we need that method for debugging only, this is no problem. UString is implicitly shared, so copying strings is rather cheap as long as you don't modify them (copy-on-write semantics).
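The pitfall in a nutshell (a hypothetical snippet; header and namespace names assumed):

    #include "ustring.h"
    using namespace wvWare;

    void asciiPitfall()
    {
        UString a("first");
        UString b("second");
        const char* pa = a.ascii();   // points into the shared static buffer
        b.ascii();                    // ...which is reused right here...
        // pa now effectively reads "second" -- fine for one-shot debug
        // output, disastrous for anything else.
    }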

Older Word versions don't store the text as Unicode strings but encoded using some codepage like CP 1252. libiconv helps us to convert all these encodings to UCS-2 (sloppy: 16-bit Unicode). We don't use libiconv directly from within the library; instead we use a small wrapper class (Textconverter) for convenience.
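Stripped of all error handling and buffer management, the conversion inside Textconverter boils down to something like this sketch using the standard iconv calls (the real class does much more; note that the exact encoding name for host-order UCS-2 and the constness of iconv's second parameter vary across platforms):

    #include <iconv.h>
    #include <vector>

    std::vector<unsigned short> cp1252ToUcs2(const char* text, size_t inLeft)
    {
        std::vector<unsigned short> result(inLeft); // one UCS-2 unit per byte
        if (inLeft == 0)
            return result;

        iconv_t converter = iconv_open("UCS-2", "CP1252");
        char* in = const_cast<char*>(text);
        char* out = reinterpret_cast<char*>(&result[0]);
        size_t outLeft = result.size() * sizeof(unsigned short);

        iconv(converter, &in, &inLeft, &out, &outLeft); // error handling omitted
        iconv_close(converter);

        result.resize(result.size() - outLeft / sizeof(unsigned short));
        return result;
    }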

Utility Classes

Utility classes perform one specific, encapsulated task, like reading in the whole stylesheet information and providing convenient access to it. These classes are, IMHO, the key to clean code. Classes for the programming infrastructure, like the SharedPtr class, also belong to this category. If we manage to encapsulate many of the more complex structures in a Word document, the code inside the parser will get a lot simpler.

Currently we have code to read stylesheets (styles.*) and some code which helps us to read the meta structures in Word documents, like the PLCF template in word97_helper.*. This code is quite simple; the only thing to watch out for is that using the C++ sizeof() operator on these structures is dangerous. The structures in the Word file are "packed", i.e. there are no padding bytes between variables. In our generated code we can't achieve that in a portable manner, so we decided not to use packing at all. This means that reading a whole structure in one go doesn't work, not even on little-endian platforms, but we have the appropriate read() methods anyway. It also means that sizeof() will almost always return values that are too large; for structures where you need the real size we can add a sizeOf variable (please check the code generation script for more information).
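To make this concrete, here is roughly what such generated reading code looks like; the structure, its members, and the reader methods are invented for illustration:

    // The file stores this structure packed into 4 bytes; in memory the
    // compiler may lay it out differently, so sizeof(Example) is useless.
    struct Example
    {
        unsigned short istd;     // 2 bytes in the file
        unsigned char flags;     // 1 byte in the file
        unsigned char reserved;  // 1 byte in the file

        static const unsigned int sizeOf = 4;  // the *file* size

        void read(OLEStreamReader* stream)
        {
            // One read per member, in file order -- never something like
            // stream->read(this, sizeof(Example)).
            istd = stream->readU16();
            flags = stream->readU8();
            reserved = stream->readU8();
        }
    };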

Generated Scanner Code

As stated various times above, we generate a few thousand lines of code from the HTML specification. The design of this code is nonexistent; it's just a number of structures supporting reading, writing, copying, assignment, and so on. Some of the structures are only partly generated (like the apply() method of the main property structures such as PAP, CHP, and SEP). Some structures are commented out, as it would be too hard to generate them; these few structures have to be written manually if they are needed.

Generally we just parse the specification to get the information out, but sometimes we need a few hints from the programmer to know what to do. These hints are mostly given by adding special comments to the HTML specification. For further information on these hints, and on the available tricks, please have a look at the top of the Perl scripts. The comments there are quite detailed, and it should be easy to figure out what I intend to do with the hints.

Another way to influence the generated code is to manipulate certain parts of the script itself. You need to do that to change the order of the structures in the file, to disable a structure completely, and so on. You can also select structures to derive from the Shared class, to be able to use them with the SharedPtr class.

The whole file might need some minor tweaking -- a license, #includes, and maybe even some declarations or code. This is what the template files in wv2/src/generator are for: the code gets copied verbatim into the generated file. Never manipulate a generated file; all your changes will be lost when the code is regenerated!

If you think you have found a bug in the specification, you can try to correct the HTML file and regenerate the scanner code using the command "make generated". In case you aren't satisfied with the resulting C++ code, or if you have found a bug in the scripts, please contact me. If you aren't scared by a bit of Perl code, feel free to fix or extend it yourself.

Unit Tests

There's not much to say about the unit tests. If you add new code, please also add a test for it, or at least tell me to do so. The header test.h contains a trivial test method and a method to convert integers to strings (as std::string doesn't provide such functionality).

If you decide to create a unit test, please ensure that it's self-checking: if it runs to the end, everything is all right; if it stops somewhere in between, something unexpected happened. Oh, and let me repeat the warning that UString::ascii() might produce unexpected results due to the static buffer.
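A self-checking test therefore looks roughly like the sketch below. The test() helper here is a stand-in for whatever test.h really provides, and the UString calls assume the class keeps the kjs original's length(), operator+, and operator==:

    #include <iostream>
    #include <cstdlib>
    #include "ustring.h"

    using namespace wvWare;

    // Stand-in for the helper from test.h: abort loudly on failure.
    static void test(bool condition, const char* message)
    {
        if (!condition) {
            std::cerr << "FAILED: " << message << std::endl;
            std::exit(1);
        }
    }

    int main()
    {
        test(UString("wvWare").length() == 6, "UString::length() broken");
        test(UString("wv") + UString("2") == UString("wv2"), "operator+ broken");
        std::cout << "Done." << std::endl;  // reaching this line means success
        return 0;
    }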

Questions

Please send comments, corrections, condolences, patches, and suggestions to Werner Trobin. Thanks in advance. If you really read this document all the way to here, I owe you a beverage of your choice next time we meet :-)