Hindsight is 20/20: Three Things XML Got Wrong

October 17, 2004

@ 04:56 PM

Derek Denny-Brown, the dev lead for both MSXML & System.Xml, who's been involved with XML since before it even had a name, has finally started a blog. Derek's first XML-related post is "Where XML goes astray...", which points out three features of XML that have caused significant problems for users and implementers of XML technologies. He writes:

First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then-existing common usage patterns. Most of its creators saw XML as evolving and expanding the role of SGML, namely text markup. XML was primarily intended to support taking a stream of text intended to be interpreted as a human readable document, and delineate portions according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process in defining XML was based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for what characters are allowed in XML content, what are valid characters in Names, and even in "</tagname>" being required rather than just "</>".
...
Allowed Characters
The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says are reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. All sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include characters which are control characters that the XML specification disallowed? XML did not provide any escaping mechanism, and if you ask many XML experts they will tell you to base64 encode your data if it may include invalid characters. It gets worse.

The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. Only 2 problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error prone.
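The control-character problem described above is easy to reproduce. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the column name "col", the "encoding" attribute, and the sample value are made up for illustration. It shows that a value containing a control character can be written out but not parsed back, that a numeric character reference does not help under XML 1.0, and that base64 is the usual (if unsatisfying) workaround.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical database value containing a control character (BEL, U+0007).
value = "bell\x07char"

# ElementTree writes the value out without complaint...
elem = ET.Element("col")
elem.text = value
serialized = ET.tostring(elem, encoding="unicode")


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# ...but the output is not well-formed XML 1.0, so reading it back fails.
raw_ok = is_well_formed(serialized)

# Escaping the character as a numeric reference is rejected as well.
ref_ok = is_well_formed("<col>&#7;</col>")

# The common workaround: base64-encode the value and flag it as encoded.
# The "encoding" attribute is an ad hoc convention, not part of XML.
safe = ET.Element("col", {"encoding": "base64"})
safe.text = base64.b64encode(value.encode("utf-8")).decode("ascii")
safe_xml = ET.tostring(safe, encoding="unicode")

# The reader has to know about the convention and decode explicitly.
decoded = base64.b64decode(ET.fromstring(safe_xml).text).decode("utf-8")
```

The cost of the workaround is visible here: the consumer has to know about the encoding convention out of band, and the value is no longer readable in the document.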
...
Whitespace
When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, just there so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems to our users. Consider this example:
<customer>
    <name>Joe Schmoe</name>
    <addr>123 Seattle Ave</addr>
</customer>
A customer coming to XML from a database background would normally expect that the first child of the <customer> element would be the <name> element. I can’t explain how many times I had to explain that it was actually a text node with the value newline+tab.
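The same surprise shows up in any DOM implementation that preserves whitespace. As a rough illustration (using Python's xml.dom.minidom rather than MSXML, and assuming the document is indented with a tab per line), the first child of the customer element is a text node, and the name element has to be found by filtering:

```python
from xml.dom import minidom, Node

# The customer document, indented with a newline and a tab before each child.
doc = minidom.parseString(
    "<customer>\n"
    "\t<name>Joe Schmoe</name>\n"
    "\t<addr>123 Seattle Ave</addr>\n"
    "</customer>"
)
customer = doc.documentElement

# The first child is the whitespace between <customer> and <name>.
first = customer.firstChild
first_is_text = first.nodeType == Node.TEXT_NODE

# To get at the elements, the whitespace text nodes must be skipped.
element_children = [
    node for node in customer.childNodes
    if node.nodeType == Node.ELEMENT_NODE
]
first_element_name = element_children[0].tagName
```

Two elements were written, but the DOM holds five child nodes: three whitespace text nodes and the two elements.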
...
XML Namespaces
Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces parsers to parse the entire start-tag before returning any text information. It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.

Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML.
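The default-namespace confusion can be sketched in a few lines. This example uses Python's xml.etree.ElementTree and its limited XPath support, not the .NET APIs; the namespace URI "urn:example:crm" and the prefix "c" are invented for illustration. An unprefixed path matches only elements in no namespace, so it finds nothing once the document declares a default namespace.

```python
import xml.etree.ElementTree as ET

# A document whose elements all live in a default namespace.
doc = '<customer xmlns="urn:example:crm"><name>Joe Schmoe</name></customer>'
root = ET.fromstring(doc)

# The "obvious" query matches elements in no namespace, so it finds nothing.
unqualified = root.find("name")

# The query has to bind a prefix to the namespace URI, even though the
# document itself never uses a prefix.
ns = {"c": "urn:example:crm"}
qualified = root.find("c:name", ns)
```

The prefix used in the query is arbitrary; only the URI it maps to has to match the one in the document.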

My experiences as the program manager for the majority of the XML programming model in the .NET Framework agree with this list. It hits the three areas where people most commonly have problems when working with XML in the .NET Framework. His blog post makes a nice companion piece to my MSDN article, "The XML Litmus Test: Understanding When and Why to Use XML".

Categories: XML


Sunday, 17 October 2004 21:50:59 (GMT Daylight Time, UTC+01:00)

"Proposing solutions is of no real use, since XML is a standard and isn’t changing significantly anytime soon. It is worth understanding where we made our worst mistakes to avoid making similar mistakes again." [quoting from the linked article, not Dare's post]

That's probably true, but that's kind of a depressing thought: The best we can do is avoid making these mistakes again, but we're not likely to be in a position to do so?

The industry is at an interesting junction: do we fix the problems with XML that cause real-world pain, do we learn to live with them, or do we move beyond it with something that tries to learn from its lessons but not maintain full compatibility with its syntax/semantics/datamodels/APIs? Sooner or later the latter option will become viable, even if it isn't now. What's to be done in the meantime? I really hope there's a better answer than documenting XML Antipatterns to warn the unsuspecting. [I started to write such a book but it was too depressing!]

mike champion

Monday, 18 October 2004 01:05:24 (GMT Daylight Time, UTC+01:00)

Yup...I've wrestled with all those things myself. Lately I've been wishing that there were some "ignore whitespace" flag I could set when parsing that would just throw it all away. I've got a schema where literally everything is either in a tag or an attribute yet yesterday I ran into validation problems because a tag that wasn't supposed to contain anything contained whitespace.

Not to mention the extra code needed to ignore all that useless text.
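For data-oriented documents, the "extra code" in question tends to look something like the sketch below. It uses Python's xml.dom.minidom; the helper name strip_whitespace_nodes and the sample document are hypothetical. It is only safe when no element is supposed to contain meaningful whitespace-only text, which is exactly the data-versus-text distinction at issue.

```python
from xml.dom import minidom, Node


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, since children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)


# An element meant to be empty that picked up whitespace from indentation.
doc = minidom.parseString("<order>\n  <item/>\n  <empty>\n  </empty>\n</order>")
strip_whitespace_nodes(doc.documentElement)
compact = doc.documentElement.toxml()
```

After stripping, the element that was supposed to be empty really is empty, and the first child of the root is an element again.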

What I would love would be to have something like XML, but where there wasn't any "text". The cool thing about that is that if it were defined like that, editors could be written to automatically tab and newline when displaying, leaving the files sans whitespace on the drive. Of course, that wouldn't work for many of the things that XML is used in, but who cares? Sometimes two small languages are better than one large one. I suspect that it would be easier to write two parsers, one to deal with an XML-like text markup language and another to deal with an XML-like data language, than it is to try to parse something that does both.

I don't know if "avoiding similar mistakes" is really possible as it seems to me that the core mistake was taking something designed for one domain and forcing it into another. It's the "one-size fits all" attitude that is the source of the problem. I think the core mistake was made when someone saw the text markup language and said "let's make that support descriptions of data too" when they should have said "let's make something like that to describe data."

ucblockhead

Monday, 18 October 2004 06:29:37 (GMT Daylight Time, UTC+01:00)

Mike,
XML is infrastructure, and most of us think that, although it has its issues, it makes more sense to build the next layers of abstraction and user benefit on top of it than to start again from scratch and throw away all the existing investment in it. ASCII sucks; does that mean that the software industry as a whole should try to replace it, as well as Unicode formats that aim to be backwards compatible with it, such as UTF-8?

This is one of the reasons I've always felt the Atom effort was so misguided. Instead of building the next set of interesting user applications on top of RSS (like Dave Winer is doing with podcasting) some perfectionists decide that they need to create a third XML syndication flavor that is slightly better than the rest. Of course, since it is built on the broken promise that is XML (even I haven't been immune from bugs in my code due to forgetting about whitespace issues in XML when processing Atom feeds) it's not like whatever spec they produce will be that much better than RSS 1.0 or RSS 2.0.

Similarly I don't see much benefit in speculating on what a backwards incompatible future version of XML would look like since the benefits of creating it don't outweigh the costs of hitting the reset button on all the gains we've gotten from XML or from fragmenting the interop story. That's one of the reasons I and others I work with dislike the calls for a standardized binary XML format.

kpako@yahoo.com (Dare Obasanjo)

Tuesday, 26 October 2004 22:27:11 (GMT Daylight Time, UTC+01:00)

There seem to be two errors above:

First, XML 1.0 did not require that "surrogate pairs should be checked" because there were no non-BMP characters at the time XML was created.

Second, XML's name rules have never been to allow just letters and digits, and it is completely untrue that there were many characters common in Asian languages left out: XML allowed all characters in common use in Chinese, Korean, Japanese, Thai, Bahasa (Indonesian/Malaysian, err, which just use Latin) that were available in Unicode at the time.

As for control characters, XML 1.1 addresses this.

Since XML was designed as a textual format for text, people who try to send arbitrary binary or strings with odd bytes thrown in are not users but abusers and, with all the sympathy in the world, should not be too surprised if they have "significant problems".

Rick Jelliffe


Dare Obasanjo's weblog