The performance of docbook2X, and of most other DocBook tools,[2] can be summed up in a short phrase: they are slow.
On a modern computer, producing only a few man pages at a time with the right software (namely, libxslt as the XSLT processor), the DocBook tools are fast enough. But their slowness becomes a hindrance when generating hundreds or even thousands of man pages at a time.
The author of docbook2X encounters this problem whenever he tries to do automated tests of the docbook2X package. Presented below are some actual benchmarks, along with possible approaches to efficient DocBook-to-man-page conversion.
Table 1. docbook2X running times on 2157 refentry documents

| Step                    | Time for all pages | Avg. time per page |
|-------------------------|--------------------|--------------------|
| DocBook to Man-XML      | 519.61 s           | 0.24 s             |
| Man-XML to man pages    | 383.04 s           | 0.18 s             |
| roff character mapping  | 6.72 s             | 0.0031 s           |
| Total                   | 909.37 s           | 0.42 s             |
The above benchmark was run on 2157 documents produced by the doclifter man-page-to-DocBook conversion tool, from the section 1 man pages installed on the author's Linux system. The XML files total 44.484 MiB and are 20.6 KiB long on average.
The results were obtained using the test script in test/mass/test.pl, with the default man-page conversion options. The test script employs the obvious optimizations, such as loading the XSLT processor, the man-pages stylesheet, db2x_manxml and utf8trans only once.
Unfortunately, there do not seem to be any obvious ways to improve the performance, short of re-implementing the transformation program in a fast, low-level language such as C.
Some notes on possible bottlenecks:
Character mapping by utf8trans is very fast compared to the other stages of the transformation. Even loading utf8trans separately for each document only doubles the running time of the character mapping stage.
Even though the XSLT processor is written in C, XSLT processing is still comparatively slow. The DocBook-to-Man-XML stage takes about 1.4 times as long as the Perl script[3] db2x_manxml (519.61 s versus 383.04 s), even though the XSLT portion and the Perl portion process documents of around the same size[4] (DocBook refentry documents and Man-XML documents, respectively).
In fact, profiling the stylesheets shows that a significant amount of time is spent on the localization templates, in particular the complex XPath navigation used there. An obvious optimization is to use XSLT keys for the same functionality.
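To make this concrete, here is a minimal sketch of such a key-based rewrite. The l: namespace, the l:gentext elements, and the l10n.xml file are illustrative assumptions for this sketch, not docbook2X's actual localization markup:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:l="urn:example:l10n">

  <!-- Index every l:gentext element by its key attribute. -->
  <xsl:key name="gentext" match="l:gentext" use="@key"/>

  <!-- Navigation-based lookup: scans the entire localization
       document on every call. -->
  <xsl:template name="gentext-slow">
    <xsl:param name="name"/>
    <xsl:value-of
      select="document('l10n.xml')//l:gentext[@key = $name]/@text"/>
  </xsl:template>

  <!-- Key-based lookup: key() is a table lookup, but in XSLT 1.0 it
       searches only the current document, so xsl:for-each is used to
       switch the context to l10n.xml first. -->
  <xsl:template name="gentext-fast">
    <xsl:param name="name"/>
    <xsl:for-each select="document('l10n.xml')">
      <xsl:value-of select="key('gentext', $name)/@text"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The key table still has to be built the first time key() is consulted for a given document, and that setup cost is exactly where this optimization runs into trouble.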
However, when that is implemented, the author found that the time used for setting up keys dwarfs the time savings from avoiding the complex XPath navigation. It adds an extra 10 s to the processing time for the 2157 documents. Upon closer examination of the libxslt source code, XSLT keys are seen to be implemented rather inefficiently: each key pattern x causes the entire input document to be traversed once, by evaluating the XPath expression //x.
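In other words, a stylesheet declaring two keys (a hypothetical fragment; the names are made up for illustration) pays two full traversals of every document the keys are consulted on:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:l="urn:example:l10n">
  <!-- Under libxslt, as described above, initializing the key tables
       for a document means evaluating //l:gentext and //l:template
       over it: one complete traversal per key pattern. -->
  <xsl:key name="gentext"  match="l:gentext"  use="@key"/>
  <xsl:key name="template" match="l:template" use="@name"/>
</xsl:stylesheet>
```

For many small lookups per document, as in the localization templates, this setup cost easily exceeds the savings from the faster lookups.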
Perhaps a C-based XSLT processor written with the best performance in mind (libxslt is not the most efficiently coded) could achieve better conversion times without losing all the nice advantages of XSLT-based transformation. Failing that, one could look into efficient, stream-based transformations (STX).
[2] with the notable exception of the docbook-to-man tool based on the instant stream processor (but this tool has many correctness problems)
[3] From preliminary estimates, the pure-XSLT solution takes only slightly longer at this stage: 0.22 s per page
[4] Of course, conceptually, DocBook processing is more complicated. So these timings also give us an estimate of the cost of DocBook's complexity: roughly 1.4 times the cost of processing a simpler document type, which is actually not too bad.