Previous implementations

Earlier I gave a brief tour of my historical attempts from a functional perspective. Here I'll highlight selected implementation details of interest, paying specific attention to design flaws in their respective architectures.

Non-XML things

Of the prehistoric non-XML implementations with which I have experimented, the only feature worth noting is the ability to index documents into a bibliography. This is more easily provided by XML.

Report

Report was an XML schema transformed by a variety of needlessly-complex XSLT files, used for writing up reports on coursework assignments. The intention was to provide an archive of assignments accessible online, in addition to the printouts submitted for examination. It offered several noteworthy features:

  • Typographically dexterous output. For example, newly-introduced terms were linked on their first occurrence only. A lot of care was taken to automate decisions to produce output as the author would if setting by hand.

    It seems that most software outputs items in one manner only, regardless of their contexts. It would be preferable to have this logic shared between multiple output formats.

  • The output to HTML and PDF appeared virtually identical, visually. Plain-text output was also provided, via roff.

  • Written at a time when TeX-generated PDFs were usually constructed of bitmapped fonts, Report made use of the highest-quality intermediate formats available for each output format. This included outlined fonts for PDF.

  • High-quality figures were a significant motivation. The document-building system was able to transform a variety of input data, including rules for generating plots using various programs.

    Each format was rendered from source (where the source data was understood) in order to produce the most appropriate output for that format: namely vector graphics for PDF, ASCII-art plots for illustrating data in plain-text, and so on.

    Importantly, given several routes to render an image, the author investigated the various ways available, and determined which transformations produced unacceptable output (usually because of poor implementations of intermediate languages). These were then avoided.

  • Output intended to be printed is physically different from output for the screen. Since most reports were printed, this was especially relevant for illustrating hyperlinks (for example as footnotes or bibliographic references) as opposed to output for the screen, where hyperlinks may be simply inlined.

  • Good use of the mechanisms provided by the underlying tools. For example, there was a convenient set of predefined entities for acronyms and other common terms. Those used would be appended as a glossary for larger documents, using the mechanism provided by ConTeXt.

  • Contextually-aware alternatives for terms. For example, shorter versions of titles were used in place of their longer counterparts where space was at a premium. This helped bring details into resolution only when appropriate.

and had the following design flaws:

  • A poor understanding of XML. Modularisation of features was provided by combining various DTDs, as opposed to making use of namespaces. While not a problem for Report itself, this prevented subsets of the work from being used in other projects. Most notably, a schema extending Report was created for authoring websites (does this sound familiar?).

  • Whilst extensibility was straightforward, the project was unmaintainable. The stylesheets were complex and difficult to understand. There was no way to customise behaviour without modifying the code base itself (a symptom of a single-user project), and the set of supported formats was hardcoded.

  • Although the XML schema was well-designed, minimal, and expressive, it was “yet another language”.

Doctools

Doctools started as a set of makefiles to aid production of documents from various input formats to XHTML. It has grown to include a system for user-defined themes, multiple output formats, sanitising of Docbook's idiosyncrasies, and extensions of its own motivation (such as a convenient interface for inter-document links). It is used successfully in several quite different projects, and expresses each of their requirements well. It is the current focus of my efforts in document processing.

The main body of code has four portions of interest: XSLT is used for almost all of the translations between formats. Makefiles provide a convenient (and customisable) build framework, portable to many systems. Definitions of themes give visual customisations. Lastly, a small body of TeX code acts as support routines to assist in generating sensible output.

Whilst it does provide the more difficult features in an acceptable way (such as high-quality TeX generation), it has grown into disarray. The current implementation is sound, though not easily extensible. In short, it needs refactoring, which is the focus of this document.

Generating TeX

Assuming XML input, generating sensible TeX output is non-trivial. Using XSLT to output TeX directly runs up against the flaws inherent in mapping between such different syntaxes using only the processing model of the first (for example, should whitespace be suppressed after a command? How about if it is nested?).

Post-processing the output from XSLT to “clean up” these problems does not address their cause. Rather, a redesign is required in order to avoid them in the first place.

The only satisfying solution so far has been to generate a direct mapping of TeX commands in XML, and to use a small non-XSLT program to render this XML out to TeX syntax. This allows XSLT to be used effectively, as well as producing sensible-looking TeX. One such implementation is the relatively obscure TeXML.
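
As a rough illustration of the approach, here is a minimal sketch in Python of such a renderer. The element names (cmd, opt, parm, env) are loosely borrowed from TeXML's vocabulary, but the code is hypothetical: it is not TeXML itself, nor the support code used by Doctools, and it makes no claim to match the behaviour of either. The point it shows is that escaping and whitespace after commands are decided once, in the serialiser, rather than scattered through the stylesheets.

```python
import xml.etree.ElementTree as ET

# Characters that are special in TeX, each mapped to a safe replacement.
# Escaping is done character by character, so the braces introduced by a
# replacement are never escaped a second time.
TEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "$": r"\$",
    "&": r"\&",
    "#": r"\#",
    "%": r"\%",
    "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
}


def escape(text):
    """Escape literal text for TeX; None (no text) becomes an empty string."""
    return "".join(TEX_ESCAPES.get(ch, ch) for ch in text or "")


def render_children(node):
    """Render the mixed content (text and child elements) of a node."""
    out = [escape(node.text)]
    for child in node:
        out.append(render(child))
        out.append(escape(child.tail))
    return "".join(out)


def render(node):
    """Render one element of the (hypothetical) TeX-in-XML vocabulary."""
    if node.tag == "cmd":
        # Only <opt> and <parm> children count as arguments; stray
        # whitespace between them in the XML is ignored.
        args = "".join(render(c) for c in node if c.tag in ("opt", "parm"))
        # A command without arguments is closed with an empty group, so the
        # whitespace that follows it is never swallowed by TeX.
        return "\\" + node.get("name") + (args or "{}")
    if node.tag == "opt":
        return "[" + render_children(node) + "]"
    if node.tag == "parm":
        return "{" + render_children(node) + "}"
    if node.tag == "env":
        name = node.get("name")
        return "\\begin{%s}%s\\end{%s}" % (name, render_children(node), name)
    # The root element, or any unknown wrapper: render its content only.
    return render_children(node)


if __name__ == "__main__":
    source = (
        '<TeXML>Profit &amp; loss: '
        '<cmd name="emph"><parm>50%</parm></cmd> up<cmd name="par"/></TeXML>'
    )
    print(render(ET.fromstring(source)))
    # prints: Profit \& loss: \emph{50\%} up\par{}
```

With this division of labour, the XSLT only needs to emit well-formed XML such as the source string above; questions of TeX syntax never arise in the stylesheets.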