But if you have to do stream-based processing, make sure to use robust, fairly scalable tools like sgmlspl. Of course it cannot be as pleasant as tree-based XML processing.
Do not use XML::DOM directly for stylesheets. Your “stylesheet” would become seriously unmanageable. At least take a look at some of the XPath modules out there. Ideally, use a real stylesheet language like XSLT. A C-based implementation of XSLT is faster than any Perl hack you can come up with.
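To illustrate why hand-written DOM traversal gets unmanageable next to path expressions, here is a small sketch. It is written in Python rather than Perl purely to stay self-contained. The sample document and the function names are invented for illustration, and `xml.etree.ElementTree` supports only a subset of XPath, so treat this as a rough comparison rather than a recommendation of any particular module.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Invented sample input, not a real DocBook document.
DOC = """<article>
  <section><title>Intro</title><para>Hello.</para></section>
  <section><title>Usage</title><para>Run it.</para></section>
</article>"""


def titles_by_dom_walk(xml_text):
    """Collect section titles by walking DOM nodes by hand."""
    dom = xml.dom.minidom.parseString(xml_text)
    titles = []
    for section in dom.documentElement.childNodes:
        # Skip whitespace text nodes and anything that is not a <section>.
        if section.nodeType != section.ELEMENT_NODE or section.tagName != "section":
            continue
        for child in section.childNodes:
            if child.nodeType == child.ELEMENT_NODE and child.tagName == "title":
                # Concatenate the text nodes directly under <title>.
                titles.append("".join(
                    n.data for n in child.childNodes
                    if n.nodeType == n.TEXT_NODE))
    return titles


def titles_by_path(xml_text):
    """Collect the same titles with a single path expression."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.findall("./section/title")]


if __name__ == "__main__":
    print(titles_by_dom_walk(DOC))
    print(titles_by_path(DOC))
```

Both functions return the same list for this input, but the first one already needs node-type checks and nested loops for a two-level query. A real stylesheet has hundreds of such queries, which is where XSLT pays off.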
You might think that we could instead write a separate class (in the Java sense) that hides all this complexity from the rest of the conversion program. In theory you would get the same result, but it would be harder. Firstly, it is far easier to write plain-text manipulation code in Perl than in Java, C, or XSLT, which is what you would otherwise be restricted to. Secondly, if the intermediate format is hidden in a Java class or C API, errors are harder to debug, whereas with the approach we have taken we can visually examine the textual output of the XSLT processor and fix the Perl script as we go along.
Finally, another advantage of using intermediate XML formats processed by a Perl script is that we can often eliminate the use of XSLT extensions. In particular, all the way back when XSLT stylesheets first went into docbook2X, the extensions related to Texinfo node handling could have been easily moved to the Perl script, but I didn't realize it! I feel stupid now.
Design the XML intermediate format to be easy to use from the standpoint of the conversion tool, and similar to how XML document types work in general. For example, represent the paragraphs of a document as units in their own right, rather than marking the breaks between them (the latter is typical of traditional markup languages, but not of XML).
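A minimal sketch of what "paragraphs rather than paragraph breaks" can look like for the consuming tool. It is in Python only to keep it short and runnable. The element names `doc` and `para` and the function `render` are hypothetical and are not the actual docbook2X intermediate format.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical intermediate format: each paragraph is its own element,
# so the input carries no explicit "break here" markers.
INTERMEDIATE = """<doc>
  <para>First paragraph, with   irregular
  whitespace.</para>
  <para>Second paragraph.</para>
</doc>"""


def render(xml_text, width=40):
    """Render each <para> as a filled text block separated by blank lines."""
    root = ET.fromstring(xml_text)
    blocks = []
    for para in root.iter("para"):
        # Collapse runs of whitespace, then re-wrap to the output width.
        text = " ".join("".join(para.itertext()).split())
        blocks.append(textwrap.fill(text, width=width))
    # Paragraph separation is decided here, by the conversion tool,
    # not encoded in the intermediate format.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render(INTERMEDIATE), end="")
```

Because the format says what a paragraph is instead of where a break goes, the script can pick whatever separator the output format needs, and the intermediate file remains easy to inspect by eye.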
If I had known this in the very beginning, it would have saved a lot of development time, and docbook2X would be much more advanced by now.
(db2x_texixml falls in the category of things that can be done in XSLT 1.0, but inelegantly.)
The same advice applies to the build system.
This number is probably inflated because of the many design mistakes made in the process.