Generating Documents with the Open XML SDK - Part 1
Recently I was asked to create a document generation engine for a loan application and quote system at work. Our customers needed to enter some basic information about a loan applicant, and at a later time receive a URL to the bundle of PDFs which represented the data-rich documents they needed to send to their client.
The problem outlined:
Generally speaking, the templates were already in .rtf, .doc or .docx form. Many included old mail merge and form fields, as well as embedded formulas and the like. Additionally, the following cases needed to be catered for:
- Repeating sections
- Data-bound tables
- Template composition
- Turning sections on and off based on data
- Image injection
- Data-bound fields
Several documents needed to be created at once, then zipped up to form a downloadable bundle for the end user to consume. The killer was that the end users wanted to make templates using familiar tools.
We had a variety of options at our disposal: XSL-FO, HTML via a view engine like Spark, InfoPath, or even Adobe Forms. The determining factor for us was that when end users want to design documents, 90% of the time they want to do it in MS Word. Generally speaking, our users were familiar with defining 'what they want' in Word. As a design tool, Word's typographic capabilities lie somewhere between XSL-FO and Adobe InDesign or QuarkXPress.
Unfortunately, using Word has its drawbacks. Given source documents that contain various conflicting methods of data binding, domain logic in the form of formulas and form fields, and file creation dates in the early '90s, the potential for mandelbugs is extremely high.
In the past, automating Word server-side with .NET meant hacking about with COM Interop or VBA or both. While it was a practical approach, it often meant a number of things:
- Server side installation and licensing of Office products
- Difficulties in managing instances of winword.exe, excel.exe etc.
- COM Interop libraries were designed with Visual Basic in mind, making development in C# difficult.
- Generally high levels of excruciating, eye-popping, pain.
A new solution
The Open XML SDK, now in its second version, is a suite of tools including a flexible API to generate documents, and a Reflector-esque tool that shows how an Office document is constructed. This suite is designed to cater for document creation; it will not automate user interactions with PowerPoint, but it will make awesome documents from scratch, and it will do it faster than you can say "No more PIAs!".
This comprehensive API gives you the flexibility to inject content as XML directly, or to create content using typed classes. Finally, LINQ to XML works brilliantly, and VB developers could even take full advantage of XML literals and IntelliSense for XML Schemas if desired.
Developer prerequisites:
- Open XML SDK v2.0 (get the Productivity Tool too)
- Content Control Toolkit
- Visual Studio 2008 or above
Benefits
- No need to have licensed products installed on servers to generate templates.
- Templating in a familiar editor
- Can convert various formats of documents into templates
- Super awesome fast
- Verifiable output - output can be schema verified
- Extensible
- Testable, maintainable code
Drawbacks
- No ability to automate Word itself, or to inspect paginated output.
- Requires a basic understanding of XML & XPath queries
- Code required.
A design emerges
User interaction
The process begins when a customer interacts with the user interface to enter relevant information about the documents to be created. Additional information about the request is sourced from any existing data available. The interface stores these requests for documents as jobs in a queue.
[Figure: Our customers interact with our software to request a bundle of generated, data-rich documents. These are stored as jobs in a queue.]
Job processing
A separate job service polls this collection of jobs for new work to perform, fetches any required data, flattens the data into presentation models, and delegates to relevant 'DocumentBuilders' to create the documents themselves.
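As a rough illustration of that flow, here is a minimal, language-neutral sketch in Python rather than the .NET code the real service uses. Everything in it is hypothetical: the `Job` shape, the `flatten` helper, the builder functions and the `BUILDERS` registry are stand-ins invented for this example, and the "documents" are just strings.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical queued request: which documents to build, from what data."""
    job_id: int
    document_types: list
    raw_data: dict


def flatten(raw, prefix=""):
    """Flatten nested dicts into a single-level presentation model."""
    flat = {}
    for key, value in raw.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical document builders: each one knows how to make one document type.
def build_quote(model):
    return f"Quote for {model['applicant_name']}: {model['loan_amount']}"


def build_cover_letter(model):
    return f"Dear {model['applicant_name']},"


BUILDERS = {"quote": build_quote, "cover_letter": build_cover_letter}


def process_next(queue):
    """Take one job off the queue and delegate to the relevant builders."""
    if not queue:
        return None  # nothing to do on this poll
    job = queue.popleft()
    model = flatten(job.raw_data)
    return {doc_type: BUILDERS[doc_type](model) for doc_type in job.document_types}


queue = deque([
    Job(1, ["quote", "cover_letter"],
        {"applicant": {"name": "Ada"}, "loan": {"amount": 25000}}),
])
results = process_next(queue)
```

The point of the registry is that the job processor never needs to know how a document is built, only which builder to hand the flattened model to.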
The last stage involves converting the documents to PDF, moving the resultant documents into a folder structure which is then zipped, moved and linked to.
[Figure: The job processor polls the job queue for docgen jobs, and chooses the required document builder(s) to execute the job.]
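The final bundling step can be sketched in a few lines. This is an assumption-laden Python illustration, not the production code: `bundle_documents` is a hypothetical helper, the PDF conversion is skipped entirely, and the file contents are placeholder bytes.

```python
import io
import zipfile


def bundle_documents(documents, folder="bundle"):
    """Zip a {filename: bytes} mapping into an in-memory archive under one folder."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort the names so the archive layout is predictable.
        for filename, content in sorted(documents.items()):
            archive.writestr(f"{folder}/{filename}", content)
    return buffer.getvalue()


# Placeholder bytes stand in for the converted PDFs.
bundle = bundle_documents(
    {"quote.pdf": b"%PDF-1.4 placeholder", "cover-letter.pdf": b"%PDF-1.4 placeholder"},
    folder="job-42",
)
names = zipfile.ZipFile(io.BytesIO(bundle)).namelist()
```

In a real service the resulting bytes would be written somewhere the customer's URL can reach.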
Document Building
The document builders create a Word document based on a template document and an XML representation of the data to be injected. They do this following an MVC pattern of sorts. The template is just a view: it has knowledge of data bindings and that's about it. The document builder is the controller: it initializes the process and passes data to the template, and it orchestrates post-data-binding manipulations of the template. The model comes in the form of a POCO, which is ultimately serialized to XML and injected into the view by the controller.
To clarify, each document builder is responsible for generating one type of document. They may have intimate knowledge of the view and the model; they are by no means generic. However, there are generic patterns we can apply to common design issues and I will get to those in a later post.
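To make the model half of that arrangement concrete, here is a small sketch of a plain data object being serialized to XML. It is written in Python purely for illustration (the real models are .NET POCOs), and the `LoanQuoteModel` class, its field names and the `to_xml` helper are all hypothetical.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class LoanQuoteModel:
    """Hypothetical flat presentation model for one document type."""
    applicant_name: str
    loan_amount: str
    term_months: str


def to_xml(model, root_name):
    """Serialize each field of the model to a child element of the root."""
    root = ET.Element(root_name)
    for f in fields(model):
        child = ET.SubElement(root, f.name)
        child.text = str(getattr(model, f.name))
    return ET.tostring(root, encoding="unicode")


xml_payload = to_xml(LoanQuoteModel("Ada Lovelace", "25000", "36"), "loanQuote")
```

Keeping the model flat like this means every value the template needs has one obvious path in the XML.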
Content Controls and CustomXmlParts
At the heart of the design is the concept of content controls: these are a feature of MS Word that allow us to use placeholders in a document and bind data to them. I also use them to allow manipulations to the document beyond simple data-binding.
CustomXmlParts are equally integral; these are the buckets into which we pour our view models. Once they are hydrated, the content controls in a Word document can data bind to nodes in the CustomXmlPart via XPath queries.
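To show what that binding looks like in the markup, here is a simplified sketch. In WordprocessingML a content control is a `w:sdt` element, and its `w:dataBinding` child carries the XPath into the custom XML. Word resolves these bindings itself when it opens the document; the Python below only mimics that lookup so the relationship is visible. The sample XML, the all-zero `w:storeItemID` and the `resolve` and `bind` helpers are made up for illustration, and the XPath handling assumes simple paths with no namespaces or predicates.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical custom XML payload (the "bucket" holding the view model).
CUSTOM_XML = (
    "<loanQuote><applicantName>Ada Lovelace</applicantName>"
    "<loanAmount>25000</loanAmount></loanQuote>"
)

# Trimmed-down document body with two data-bound content controls.
DOCUMENT_XML = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:sdt>
      <w:sdtPr>
        <w:dataBinding w:xpath="/loanQuote/applicantName"
                       w:storeItemID="{00000000-0000-0000-0000-000000000000}"/>
      </w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>name goes here</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr>
        <w:dataBinding w:xpath="/loanQuote/loanAmount"
                       w:storeItemID="{00000000-0000-0000-0000-000000000000}"/>
      </w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>amount goes here</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
  </w:body>
</w:document>
"""


def resolve(data_root, xpath):
    """Look up a simple absolute path such as /root/child in the custom XML.

    ElementTree has no absolute paths, so the first step is matched against
    the root tag and the remainder is searched relative to it.
    """
    steps = xpath.strip("/").split("/")
    if steps[0] != data_root.tag:
        return None
    if len(steps) == 1:
        return data_root.text
    return data_root.findtext("./" + "/".join(steps[1:]))


def bind(document_xml, custom_xml):
    """Copy bound values from the custom XML into each content control's text."""
    document = ET.fromstring(document_xml)
    data_root = ET.fromstring(custom_xml)
    for sdt in document.iter(f"{{{W}}}sdt"):
        binding = sdt.find("w:sdtPr/w:dataBinding", NS)
        content = sdt.find("w:sdtContent", NS)
        if binding is None or content is None:
            continue  # not a data-bound control
        value = resolve(data_root, binding.get(f"{{{W}}}xpath"))
        text_node = content.find(".//w:t", NS)
        if value is not None and text_node is not None:
            text_node.text = value
    return document


bound = bind(DOCUMENT_XML, CUSTOM_XML)
texts = [t.text for t in bound.iter(f"{{{W}}}t")]
```

The template author only ever sees the placeholder text; the path in `w:xpath` is the entire contract between the view and the model.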
Where to go from here
In my next few posts, I'll dive deeper into the preparation of templates, data binding them to XML, various tools I use, and the Document Builders themselves. Along the way I'll be solving some common issues like tables and composing templates. Finally, I'll broach the topic of automated testing and the potential for a TDD-like approach.
UPDATE: Part two - Databinding with ContentControls is now published.
In the meantime, I'd like to direct you to the sources of information I used to become familiar with Open XML:
While I'm here, I'll just make a quick shout out to my new colleagues on this project, George and Paul, whose hard work underpins a lot of the ideas you see here. A special thanks to Darren for encouraging me to get this stuff out in the form of a blog, and challenging my thinking every step of the way. Thanks guys, this is a direct result of your hard work and advice.