Charset considerations

5 Charset considerations

XML uses Unicode as its character set, and so most XML tools use the UTF-8 encoding to cover all the possible characters. On the other hand, the non-XML world makes use of other charsets¹, and in fact neither man nor Texinfo supports UTF-8 very well. So db2x_manxml and db2x_texixml have to transcode their output.

`Transcoding' can be separated into three components:

1. UTF-8 is converted to another `native' charset, such as ISO-8859-1. Many tools exist for this purpose.
   docbook2X uses iconv(1) here.
2. Certain Unicode characters, such as dashes and directional quotes, are escaped with special Texinfo- or roff-specific markup.
   This part can be problematic, because there is no official mapping from Unicode to these markup-level escapes. Even if a certain character has a markup-level escape, that does not necessarily mean it should be escaped! Texinfo and roff implementations often do not have much native charset support, and would use an ASCII approximation for an escaped character even if that character exists in the native charset. And if the document is primarily in a non-English language, it becomes cumbersome to escape all the non-ASCII characters (for example, é in French text).
   utf8trans, a program included in docbook2X, converts some of these characters to markup-level escapes. `Character maps' for both roff and Texinfo are included in docbook2X under charmaps/. db2x_manxml and db2x_texixml apply these character mappings automatically.
3. Other Unicode characters are approximated using character sequences in the native charset. This part is clearly domain-specific: it depends on how the characters to be approximated are used in the document, the language, user preference, etc.
   You can make custom character maps for utf8trans to do this, if your approximations are on a character-by-character basis and not context-dependent.
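
The three components above can be illustrated with a minimal Python sketch. It is not how utf8trans or iconv(1) are implemented, and the two tables are hypothetical examples, not the character maps shipped under charmaps/ (nor their file format). The function name transcode and its native parameter are likewise invented for the illustration. The sketch applies the character-by-character maps first and the charset conversion last, so that characters absent from the native charset are still available to be mapped.

```python
# Hypothetical character maps, for illustration only. These are NOT the
# maps shipped with docbook2X, and this is not the utf8trans file format.

# Markup-level escapes (here: groff special-character escapes).
ROFF_ESCAPES = {
    "\u2014": "\\(em",  # em dash
    "\u2013": "\\(en",  # en dash
    "\u201c": "\\(lq",  # left double quotation mark
    "\u201d": "\\(rq",  # right double quotation mark
}

# Approximations built from characters of the native charset.
APPROXIMATIONS = {
    "\u2026": "...",    # horizontal ellipsis
    "\u2192": "->",     # rightwards arrow
}


def transcode(text: str, native: str = "iso-8859-1") -> bytes:
    """Map single characters to escapes or approximations, then encode.

    Characters that are neither mapped nor representable in the native
    charset are replaced by '?' (an assumption of this sketch).
    """
    # One translation table suffices because every mapping here is
    # character-by-character and not context-dependent.
    table = str.maketrans({**APPROXIMATIONS, **ROFF_ESCAPES})
    mapped = text.translate(table)
    # Convert to the native charset.
    return mapped.encode(native, errors="replace")


if __name__ == "__main__":
    print(transcode("caf\u00e9 \u2014 \u201cquoted\u201d\u2026"))
```

Note that é is deliberately left out of both tables: it exists in ISO-8859-1, so it passes through the charset conversion as a single byte instead of being escaped, which matches the French-text case described above.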

Footnotes

[1] `charset' is used very loosely here to mean any set of byte sequences used to represent characters. Other specifications typically do not make such fine distinctions between encoding and character set as the Unicode and XML standards do. Non-Unicode charsets are specifically referred to here.