Encoding problems in XML
More serious problems can arise from the use of encodings. Developers often overlook the fact that an encoding does not limit the set of characters an XML document can contain. Whatever encoding it declares, every XML document can hold any character from the Unicode repertoire that the XML specification permits.
Choosing an encoding can reduce the size of an XML document, but it does not restrict the document to a subset of Unicode, thanks to character references (often loosely called character entities). Through a reference such as &#238; for “î”, it is possible to insert any character from the Unicode table that XML allows, even if the document uses the most restrictive encoding (US-ASCII, which is only good for four languages: English, Hawaiian, Latin, and Swahili).
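To make the point concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The sample document and the element name "note" are invented for illustration. The document declares US-ASCII, yet the parser still hands back non-ASCII Unicode characters, and serializing back to US-ASCII simply escapes them again.

```python
import xml.etree.ElementTree as ET

# A document that declares the most restrictive encoding, US-ASCII.
# The bytes are pure ASCII, but two character references sneak in
# characters that ASCII cannot represent directly.
doc = (
    b'<?xml version="1.0" encoding="US-ASCII"?>'
    b'<note>Beno&#238;t owes 5 &#x20AC;</note>'
)

root = ET.fromstring(doc)
text = root.text  # the parser has already resolved the references

# Collect every character outside the ASCII range.
non_ascii = [ch for ch in text if ord(ch) > 127]

# Writing the tree back out as US-ASCII does not lose the characters;
# the serializer falls back to numeric character references.
serialized = ET.tostring(root, encoding="us-ascii")

print(text)
print(non_ascii)
print(serialized)
```

The declared encoding only governs how the bytes are read and written. The application on the receiving end of the parser still sees full Unicode.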
This is a problem because, while a Java application or a recent version of DB2® might support Unicode, few legacy applications do. So if the XML stream feeds a legacy application, you must deal with Unicode yourself. To avoid misunderstanding, let me state again that imposing an encoding is not a solution because, as explained above, it is always possible to escape special characters as character entities.
Because rewriting a legacy application is seldom an option, you need a conversion routine that will convert Unicode characters into a set that is acceptable to the application, for example converting “î” into a plain “i” (removing the circumflex). Most XML parsers provide routines for manipulating Unicode characters.
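Such a conversion routine might look like the following sketch, which assumes the legacy application accepts only US-ASCII. The function name to_legacy_ascii and the "?" placeholder are hypothetical choices, not part of any parser's API. The sketch relies on Python's standard unicodedata module: it decomposes each accented character into a base letter plus combining marks, drops the marks, and substitutes the placeholder for anything that still falls outside ASCII.

```python
import unicodedata


def to_legacy_ascii(text, replacement="?"):
    """Reduce Unicode text to plain ASCII for a legacy consumer.

    Hypothetical helper: accents are stripped where a decomposition
    exists, and any other non-ASCII character becomes `replacement`.
    """
    # NFKD splits a precomposed character such as "î" into the base
    # letter "i" followed by a combining circumflex (U+0302).
    decomposed = unicodedata.normalize("NFKD", text)

    result = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Drop the combining mark; the base letter is already kept.
            continue
        # Keep ASCII as-is; replace whatever has no ASCII equivalent.
        result.append(ch if ord(ch) < 128 else replacement)
    return "".join(result)


print(to_legacy_ascii("Benoît"))
print(to_legacy_ascii("5 €"))
```

Whether a silent placeholder is acceptable depends on the application. Rejecting the record or logging the lossy substitution may be safer when the data is later written back to a Unicode-aware system.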