XML uses Unicode as its character set, and so most XML tools use the UTF-8
encoding to cover all the possible characters. On the other hand,
the non-XML world makes use of some other charsets[1], and in fact neither
man nor Texinfo supports UTF-8 very well. So db2x_manxml and db2x_texixml
have to transcode their output.
`Transcoding' can be separated into three components:
docbook2X uses iconv(1) for the charset conversion itself.
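As a rough illustration of what the conversion step has to cope with, the sketch below uses Python's built-in codecs as a stand-in for iconv(1). It is not docbook2X code; the helper name `convert` and the sample strings are made up for the example. It only shows that a plain conversion fails as soon as the text contains a character the target charset lacks.

```python
# Stand-in for the charset conversion step; docbook2X itself calls iconv(1).
# The helper name and sample strings are illustrative assumptions.

def convert(text, charset):
    """Return text encoded in charset, or None if a character has no mapping."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None

# U+00E9 (e with acute) exists in ISO-8859-1 but not in ASCII.
latin1_ok = convert("caf\u00e9", "iso-8859-1")
ascii_fail = convert("caf\u00e9", "ascii")

# U+201C and U+201D (curly quotes) do not exist in ISO-8859-1.
latin1_fail = convert("\u201ccaf\u00e9\u201d", "iso-8859-1")

print(latin1_ok, ascii_fail, latin1_fail)
```

Characters that fail here are the ones that need a markup-level escape or an approximation before the conversion is attempted.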
Escaping characters at the markup level can be problematic, because there is no official mapping from Unicode to markup-level escapes. Even if a certain character has a markup-level escape, that does not necessarily mean it should be escaped! Texinfo and roff implementations often do not have much native charset support and would use ASCII approximations for the escaped character even if that character exists in the native charset. And if the document is primarily in a non-English language, it becomes cumbersome to escape all the non-ASCII characters (for example, é in French texts).
utf8trans, a program included in docbook2X, converts
some of these characters to markup-level escapes.
`Character maps' for both roff and Texinfo are included
in docbook2X under charmaps/.
db2x_manxml and db2x_texixml will apply these character mappings
automatically.
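The point that an available escape should not always be used can be sketched as follows. This is a minimal illustration under stated assumptions, not how utf8trans decides: the table `ROFF_ESCAPES` is a hypothetical four-entry excerpt and not a copy of the maps under charmaps/, and the function `escape_for` is invented for the example. The policy shown is "escape a character only if the native charset cannot represent it".

```python
# Hypothetical excerpt of a roff character map (assumption for illustration;
# the real maps shipped with docbook2X live under charmaps/).
ROFF_ESCAPES = {
    "\u201c": r"\(lq",   # left double quotation mark
    "\u201d": r"\(rq",   # right double quotation mark
    "\u2014": r"\(em",   # em dash
    "\u00e9": r"\('e",   # e with acute accent
}

def escape_for(text, native_charset, escapes):
    """Escape only the characters that native_charset cannot represent."""
    out = []
    for ch in text:
        try:
            ch.encode(native_charset)
            out.append(ch)                   # representable: leave it alone
        except UnicodeEncodeError:
            # Use the escape if one is known, otherwise keep the character.
            out.append(escapes.get(ch, ch))
    return "".join(out)

sample = "\u201cr\u00e9sum\u00e9\u201d"
as_latin1 = escape_for(sample, "iso-8859-1", ROFF_ESCAPES)
as_ascii = escape_for(sample, "ascii", ROFF_ESCAPES)
print(as_latin1)
print(as_ascii)
```

With ISO-8859-1 as the native charset only the quotation marks are escaped and é passes through untouched; with ASCII every non-ASCII character in the sample is escaped.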
You can make custom character maps for utf8trans to supply your own
approximations, provided they work on a character-by-character basis
and are not context-dependent.
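What "character-by-character" means in practice can be shown with a small sketch. The table below is a hypothetical set of ASCII approximations held in a Python dictionary; it does not reproduce the utf8trans character map file format, and the names `APPROXIMATIONS` and `approximate` are assumptions made for the example.

```python
# Hypothetical ASCII approximations keyed by Unicode code point.
# This is a Python dictionary for illustration, not a utf8trans map file.
APPROXIMATIONS = {
    0x2018: "'",     # left single quotation mark
    0x2019: "'",     # right single quotation mark
    0x2026: "...",   # horizontal ellipsis
    0x2192: "->",    # rightwards arrow
}

def approximate(text, table=APPROXIMATIONS):
    """Replace each mapped character independently of its neighbours."""
    return text.translate(table)

result = approximate("\u2018a \u2192 b\u2019\u2026")
print(result)
```

Each character is looked up on its own, so the same input character always yields the same output. An approximation that depends on the surrounding text cannot be expressed as a table of this kind.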
[1] `charset' is used very loosely here to mean any set of byte sequences used to represent characters. Other specifications typically do not make such fine distinctions between encoding and character set as the Unicode and XML standards do. Non-Unicode charsets are specifically referred to here.