Wednesday, August 1, 2007

The Word file format as a career

As I mentioned in an earlier post, I used to work on a technical document management system for NASA. A simple analysis showed that 95% of the hundreds of thousands of documents in our system were binary Word documents (.doc). So I started writing a Java library for reading and writing binary Word documents in my spare time, with the idea of eventually selling it as a commercial product. I worked on it on and off and eventually moved on to other things. Around this time the POI project was formed under Apache Jakarta by Andrew C. Oliver and Marc Johnson. I contacted Andy and donated what I had done so far to POI. This became the codebase for the Word piece of POI (HWPF).

Over the course of two years, the project never really took off at POI. I never had the time to make it into what it could be, and I was pretty much the only contributor. It barely reached a semi-working state. I launched a spin-off project at http://www.textmining.org/ so Lucene users would have a simple way to extract text from Word documents.

In 2004, the project I was working on at NASA was waning. I was getting bored, so I contacted SoftArtisans and applied for a job. The main selling point was my expertise with the Word file format, and SoftArtisans loved that because one of their products was an API for creating reports in the Word file format. I joined SoftArtisans and became the lead developer for their WordWriter product, which was part of the OfficeWriter suite of products. We released a complete API for reading and writing the Microsoft Word binary file format in late spring 2005. In November of that same year, the OfficeWriter intellectual property was purchased by Microsoft. I moved to Redmond to work on SQL Server Reporting Services as a Microsoft employee.

I left Microsoft in April of this year. What has surprised me over the last few years is that most interest in the Word file format has been for reporting applications. I originally learned the format because I saw the need to extract information from a proprietary format for collaboration. Not the other way around.

A side note about Microsoft. You may be asking yourself why Microsoft would buy an API that reads and writes their own binary format. We were bought by the SQL Server group, and my take on it was that the Microsoft Office team is very much against doing anything with the binary formats outside of their respective applications. I'm guessing that from an engineering point of view, the formats (especially Word) have become unmaintainable behemoths. I was told by every person that I came in contact with from the Office team, without exception: "Do not, I repeat, do not under any circumstances attempt to write the binary Word file format outside of Microsoft Word, use the new XML format". I chuckle when I see the stories on Slashdot about the file formats (and other MS topics) because it just isn't the way people think it is. They are true believers in the new XML formats ;-)

Ironically, for my own product, I wrote absolutely zero code to work with the Word file format. I'm using a third-party Java component called Aspose.Words. When I worked at SoftArtisans, they were our main competitor. It has an easy-to-understand API and, most importantly, they have excellent support. Most questions I had were already answered in their support forums, but when I did post a question, I always received a response from an employee. With all of the hubbub about Office 2.0 and Web 2.0, I'm surprised I haven't seen them mentioned more. They have a full line of Java products that work with Office files on the server. Go check them out.


2 comments:

Finnovator said...

Thanks for the great post, Ryan. It is enlightening to learn that people inside Microsoft are leaning towards using XML. Despite the evident shortcomings of OOXML, it is a great step forward.

Over the span of 10 years or so I've been involved in quite a few companies and projects where the ability to read MSOffice files has been key functionality. This need has come both from text/entity extraction for content search and from massive server-side publishing tasks using MSOffice templates.

It really has been a pain to notice that this key functionality, the ability to process proprietary files on the server side in multithreaded environments, has been overlooked by providers making most of their money from selling desktop authoring products.

To make the situation even worse, most of the projects have been built with Java technologies, seeking to leverage the power delivered by both open source and commercial application servers. There have not been many good alternatives for those seeking to get inside these proprietary files with a Java solution without having to hack together some sort of hybrid solution with OpenOffice and Microsoft products. As a consequence, the scalability and stability of these hybrid solutions have been somewhat questionable.

So far we have found just a handful of good alternatives to consider. If there are other good ones, we'd love to know about them.

* Antennahouse (www.antennahouse.com); very neat XSL-FO processor, can be used through a Java API.

* Davisor (www.davisor.com); pure Java components for MSOffice content extraction and conversion. Developer-friendly packaging for OEM/embedding purposes.

* Snowbound (www.snowbound.com); RasterMaster SDK and other products offering Java support for visual content conversions.

* CambridgeDocs (www.cambridgedocs.com); XML-based dynamic publishing solutions, allows using MSOffice files as templates.

In some projects we've started with open source solutions like Apache POI and a few smaller ones, but eventually switched to a commercial offering when dealing with MSWord content. POI, on the other hand, has been a good fit when dealing with MSExcel content. How unfair.

Lack of support, slow development speed and, at best, alpha-quality solutions just have not convinced teams to use open source alternatives for something they and their customers have to live with for years to come.

For an open source enthusiast like me this has been painful to notice; while in some areas open source is paving the way for the future of computing, in this key area it really is lagging behind.

Although people may consider accessing closed binary formats non-sexy, it really is a key requirement for building server-side content extraction, search and processing solutions.

Alex said...

I often use MS Word files, but yesterday I ran into a complicated issue: my doc files were deleted. Luckily for me, I accidentally found out about fix micrsoft word docx files. The application resolved my trouble within a minute and was ultimately free of cost.