Is XML About Text or Not? - Dare Obasanjo's weblog

November 14, 2003

@ 04:22 PM

Fumiaki Yoshimatsu writes

Why does someone still think that they have to write Unicode BOMs by themselves, digging deep inside XmlTextWriter.BaseStream and UnicodeEncoding.GetPreamble? Encoding hint in the XML declarations and Unicode BOMs are all about XML 1.0 thing, but WriteStartElement and WriteStartDocument are not. They are InfoSet thing, so they do not have anything to do with the serialization format. Think about XmlNodeWriter for example. Why does XmlNodeWriter NOT have any constructor that have a parameter of type Encoding? Why does it always call XmlDocument.CreateXmlDeclaration with null as the second argument?

This is a common point of confusion for users of XML in the CLR. XmlNodeWriter doesn't have a parameter of type Encoding because it writes to an XmlDocument, which is stored in memory, and all strings in the CLR are UTF-16. Setting the encoding only matters when saving the XmlDocument to a stream. As for having to dig into XmlTextWriter.BaseStream to set the encoding, I find this weird considering that the XmlTextWriter constructors offer a number of ways to specify the encoding when creating an instance of the class. Since XML 1.0 mandates that an XML document can only have one encoding, there is no reason for methods like WriteStartElement and WriteStartDocument to concern themselves with encoding issues.
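To make the in-memory versus on-the-wire distinction concrete, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree rather than the CLR classes discussed above, so treat it as an analogy and not as the .NET API; the element name, the sample text and the serialize helper are made up for illustration. The tree is built without any encoding, and an encoding is chosen only when the tree is turned into bytes.

```python
import xml.etree.ElementTree as ET

# Build the document in memory. No encoding is chosen at this stage;
# the text is just a Python string, much like a CLR string.
root = ET.Element("greeting")  # hypothetical element name
root.text = "caf\u00e9"


def serialize(element, encoding):
    """Serialize the same in-memory tree to bytes in the requested encoding."""
    return ET.tostring(element, encoding=encoding, xml_declaration=True)


# The encoding (and, for UTF-16, the byte order mark) is decided here,
# at serialization time, not by the calls that built the tree.
utf8_bytes = serialize(root, "utf-8")
utf16_bytes = serialize(root, "utf-16")

print(utf8_bytes)
print(utf16_bytes[:2])  # the byte order mark written by the serializer
```

The two byte sequences differ, yet both parse back to the same tree, which is why the tree-building calls never need to know about the encoding.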

If you really want to dive deep into the issues around specifying the encoding of XML documents in the CLR, take a look at this discussion in Robert McLaws's weblog.

PS: One of my pet peeves is the way people misuse the term XML infoset to mean "things in XML I don't care about" even though there is a precise definition (nay, an entire spec) that describes what it means. The document information item clearly has a [character encoding scheme] property, which means character encodings are an XML infoset thing.
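As a rough illustration of an encoding surviving into the parsed document model, here is a small sketch using Python's xml.dom.minidom. This is a DOM implementation and not an infoset API, so its encoding attribute is only an analogue of the [character encoding scheme] property; the sample document is made up.

```python
from xml.dom import minidom

# A made-up document whose bytes are ISO-8859-1, as its declaration says.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><name>Jos\xe9</name>'

doc = minidom.parseString(latin1_doc)

# The parsed document object records the declared encoding as a property...
declared_encoding = doc.encoding

# ...while the character data itself has been decoded into an ordinary string.
text = doc.documentElement.firstChild.data

print(declared_encoding)
print(text)
```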