A recent spate of discussions about well-formed XML in the context of the ATOM syndication format, kicked off by the post "There are no exceptions to Postel's Law", has reminded me that, besides using an implementation of the W3C DOM, most developers do not have a general means of generating well-formed, correct XML in their applications. In the .NET Framework we provide the XmlWriter class for generating XML in a streaming manner, but it is not without its issues. In a recent blog post entitled "Well-Formed XML in .NET, and Postel's Rebuttal", Kirk Allen Evans writes:

At any rate, Tim successfully convinced me that aggregators should not have the dubious task of “correcting” feeds or displaying feeds that are not well-formed.

Yet I still have a concern about Tim's post, concerning XmlWriter and well-formedness:

PostScript: I just did the first proof on the first draft of this article. It had a mismatched tag and wasn’t well-formed. The publication script runs an XML parser over the draft and it told me the problem and I fixed it. It took less time than writing this postscript.

PPS: Putting My Money Where My Mouth Is - If you’re programming in .NET, there’s a decent-looking XmlWriter class.

The problem is that it is quite possible to emit content using the XmlWriter that is not well-formed. From MSDN online's “Customized XML Writer Creation” topic:

  • The XmlTextWriter does not verify that element or attribute names are valid.
  • The XmlTextWriter writes Unicode characters in the range 0x0 to 0x20, and the characters 0xFFFE and 0xFFFF, which are not XML characters.
  • The XmlTextWriter does not detect duplicate attributes. It will write duplicate attributes without throwing an exception.

Even using the custom XmlWriter implementation that is mentioned in the MSDN article does not remove the possibility of a developer circumventing the writing process:

Kirk provides a code sample showing that, even with an XmlWriter implementation that performs the well-formedness checks missing from the XmlTextWriter provided in v1.0 and v1.1 of the .NET Framework, a developer could still inadvertently write out malformed XML by handing out the XML stream without first closing the XmlTextWriter, since closing the writer is what closes any tags that are still open.
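
To make the failure mode concrete, here is a minimal sketch in Python rather than .NET. It is not Kirk's sample and not the XmlWriter API: the `TinyWriter` class and the `is_well_formed` helper are hypothetical names invented for this illustration. It only shows the general shape of the problem, which is that a streaming writer tracks open elements and the underlying stream is incomplete until the writer is closed.

```python
import io
import xml.dom.minidom
from xml.parsers.expat import ExpatError


class TinyWriter:
    """Hypothetical streaming writer that tracks open elements on a stack."""

    def __init__(self, stream):
        self.stream = stream
        self.open_elements = []

    def start_element(self, name):
        self.stream.write(f"<{name}>")
        self.open_elements.append(name)

    def text(self, value):
        # Escape the characters that are markup-significant in text content.
        escaped = value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        self.stream.write(escaped)

    def end_element(self):
        self.stream.write(f"</{self.open_elements.pop()}>")

    def close(self):
        # Emit end tags for everything still open, innermost first.
        while self.open_elements:
            self.end_element()


def is_well_formed(xml_text):
    """Return True if a real XML parser accepts the text."""
    try:
        xml.dom.minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


buffer = io.StringIO()
writer = TinyWriter(buffer)
writer.start_element("feed")
writer.start_element("title")
writer.text("Postel & XML")

premature = buffer.getvalue()  # stream contents handed out before close()
writer.close()
complete = buffer.getvalue()   # stream contents after close()
```

In this sketch `premature` is missing both end tags and a parser rejects it, while `complete` parses cleanly. No amount of checking inside the writer helps if the caller reads the stream before the writer has finished.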

In the next version of the .NET Framework we plan to provide an XmlWriter implementation that performs all the conformance checks required by the W3C XML 1.0 recommendation when generating XML [except for duplicate attribute checking].
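
As a rough illustration of what such checks involve, the following Python sketch covers the three gaps in the MSDN list quoted above: name validation, character range checking, and duplicate attribute detection. It is not the planned .NET implementation. The function names are hypothetical, the name pattern is a deliberately simplified ASCII-only approximation of the XML 1.0 Name production, and the duplicate attribute check is included only to round out the list.

```python
import re

# Simplified, ASCII-only approximation of the XML 1.0 Name production
# (an assumption for brevity; the real production admits many more characters).
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")


def is_xml_char(code_point):
    """Char production from the XML 1.0 recommendation, section 2.2."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def check_name(name):
    if not NAME_RE.match(name):
        raise ValueError(f"invalid XML name: {name!r}")


def check_chars(text):
    for ch in text:
        if not is_xml_char(ord(ch)):
            raise ValueError(f"invalid XML character U+{ord(ch):04X}")


def start_tag(name, attributes):
    """Build a start tag from (name, value) pairs, rejecting non-conformant input."""
    check_name(name)
    seen = set()
    parts = [name]
    for attr_name, value in attributes:
        check_name(attr_name)
        check_chars(value)
        if attr_name in seen:
            raise ValueError(f"duplicate attribute: {attr_name!r}")
        seen.add(attr_name)
        escaped = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
        parts.append(f'{attr_name}="{escaped}"')
    return "<" + " ".join(parts) + ">"
```

With these definitions, `start_tag("entry", [("id", "1")])` returns a start tag, whereas a name such as `"my tag"`, a value containing U+0000, or a repeated attribute name raises `ValueError` instead of silently producing malformed output.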


Sam Ruby posted an RSS feed that was malformed XML, yet RSS Bandit subscribes to it without any complaints. I mentioned in a response to the post on Sam Ruby's blog that this is because RSS Bandit uses the XmlTextReader class in the .NET Framework, which by default doesn't perform character range checking for numeric entities to ensure that the XML document does not contain invalid XML characters. To get conformant behavior from the XmlTextReader, one needs to set its Normalization property to true. In retrospect this was an unfortunate design decision: we should have made conformant behavior the default and given users the option to switch to non-conformant behavior if it suited their needs, not the other way around.
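
For readers wondering what character range checking for numeric entities amounts to, here is a small Python sketch. It is an illustration of the rule in the XML 1.0 recommendation, not of how XmlTextReader is implemented. The `invalid_char_refs` function is a hypothetical helper, and scanning raw text with a regular expression is a simplifying assumption: a real parser would not treat text inside comments or CDATA sections as character references.

```python
import re

# Matches decimal (&#65;) and hexadecimal (&#x41;) numeric character references.
CHAR_REF_RE = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(code_point):
    """Char production from the XML 1.0 recommendation, section 2.2."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def invalid_char_refs(xml_text):
    """Return the numeric character references that resolve to non-XML characters."""
    bad = []
    for match in CHAR_REF_RE.finditer(xml_text):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml_char(code_point):
            bad.append(match.group(0))
    return bad
```

A feed containing `&#0;` or `&#xFFFE;` is flagged by this check, while `&#x41;` or `&#9;` passes. A reader that skips the check will hand such characters to the application as if the document were well-formed.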

In the next version of the .NET Framework we plan to provide an implementation of the XmlReader which is fully conformant to the W3C XML 1.0 recommendation by default.


 
