October 19, 2003
@ 07:56 PM
The people who got together to produce the XML 1.0 recommendation were motivated to do this because they saw a need for SGML on the Web. Specifically, their discussions focused on two general areas:
  • Classes of software applications for which HTML was an inadequate information format
  • Aspects of the SGML standard itself that impeded SGML's acceptance as a widespread information technology

The first discussion established the need for SGML on the web. By articulating worthwhile, even mission-critical work that could be done on the web if there were a suitable information format, the SGML experts hoped to justify SGML on the web with some compelling business cases.

The second discussion raised the thornier issue of how to "fix" SGML so that it was suitable for the web.

And thus XML was born. One might ask for which classes of documents HTML proved to be an inadequate format. For an answer, consider Lou Burnard's presentation "SGML on the Web: too little too soon, or too much too late?" from seven years ago, in which he wrote:

Lastly, what is wrong with HTML? Well, rather a lot, if we compare it with other general purpose document type definitions...Compare for example the following two declarations:

<!ELEMENT Book - - ((Title, TitleAbbrev?)?, BookInfo?, ToC?, LoT*, Preface*,
                (((%chapter.gp;)+, Reference*) | Part+ | Reference+ |
                Article+), (%appendix.gp;)*, Glossary?, Bibliography?,
                (%index.gp;)*, LoT*, ToC? ) +(%ubiq.gp;) >
<!ENTITY % html.content "HEAD, BODY">
<!ELEMENT HTML O O  (%html.content)>
<!ENTITY % body.content "(%heading | %text | %block | HR | ADDRESS)*">
<!ELEMENT BODY O O  %body.content>

The first, from the DocBook dtd, makes explicit that books potentially contain a number of subcomponents, each of which is distinguishable, and has a proper place. The second, from the HTML 2.0 dtd, states that the body of an HTML document contains just about anything in just about any order....

HTML's permissiveness makes it difficult or impossible to do many of the things for which we go to the trouble of making information digitally accessible. Specifically, it is hard to:

  • validate document data structures (for example where documents are to be managed by database software)
  • impose editorial control (for example in co-operatively authored projects)
  • generate navigational aids such as tables of contents directly from the document itself
  • generate or manage cross-document (or even intra-document) links in anything other than an ad hoc and manual manner
  • address or manage objects smaller or larger than a single document
  • efficiently re-use document components
  • search within semantically significant components of a document
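
As a rough illustration of the "navigational aids" point in the list above, here is a minimal Python sketch that derives a table of contents from a document whose structure is explicit. The `book`/`chapter`/`section` vocabulary and the `table_of_contents` helper are made up for this example (it is not DocBook), and only the standard library's ElementTree parser is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in a made-up vocabulary, used only for illustration.
DOC = """
<book>
  <title>Example Book</title>
  <chapter><title>Introduction</title><para>...</para></chapter>
  <chapter><title>Markup</title>
    <section><title>SGML</title><para>...</para></section>
    <section><title>XML</title><para>...</para></section>
  </chapter>
</book>
"""

def table_of_contents(root):
    """Return (depth, title) pairs for each chapter and section, in document order."""
    entries = []

    def walk(element, depth):
        for child in element:
            # Only structural containers contribute entries; paragraphs are skipped.
            if child.tag in ("chapter", "section"):
                entries.append((depth, child.findtext("title", default="")))
                walk(child, depth + 1)

    walk(root, 0)
    return entries

toc = table_of_contents(ET.fromstring(DOC))
for depth, title in toc:
    print("  " * depth + title)
```

With a content model like HTML 2.0's BODY, where headings, text and blocks may appear in any order, the same walk would have nothing reliable to key on.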

These are the problems that web authors supposedly had with HTML and that bringing SGML to the Web (i.e. inventing XML) was meant to fix. Seven years later, using XML and XML-related technologies to produce content for the Web does alleviate a number of the issues with producing content in HTML, but it is a far cry from actually having "SGML on the Web". What has happened instead is that XML and related technologies are used to produce content for the Web, but this content is placed "on the Web" as HTML.

The W3C's attempts to get people to author XML directly on the Web have mostly failed, as can be seen from the dismal adoption rate of XHTML. In fact many [including myself] have come to the conclusion that the benefits of adopting XHTML are too low, if not non-existent, compared to the costs. There was once an expectation that content producers would be able to place documents conformant to their own XML vocabularies on the Web and that display would be handled entirely by stylesheets, but this has yet to become widespread. In fact, at least one member of a W3C working group has called this a bad practice, since it means that User Agents that aren't sophisticated enough to understand style sheets are left out in the cold.

Interestingly enough, although XML has not been as successful as its originators initially expected as a markup language for authoring documents on the Web, it has found significant success as the successor to the Comma Separated Value (CSV) file format. XML's primary usage on the Web, and even within internal networks, is for exchanging machine-generated, structured data between applications. I would speculate that the largest usage of XML on the Web today is RSS, which conforms to this pattern.
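
To make the CSV comparison concrete, here is a small Python sketch that reads the same two records from a CSV payload and from an RSS-style XML payload. The sample data, the example.com URLs and the helper names `items_from_csv` and `items_from_xml` are all invented for illustration; only the standard library's `csv` and ElementTree modules are assumed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same two records, once as CSV and once as an RSS-style feed (sample data).
CSV_DATA = (
    "title,link\n"
    "First post,http://example.com/1\n"
    "Second post,http://example.com/2\n"
)

XML_DATA = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def items_from_csv(text):
    """Parse rows into dicts keyed by the header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def items_from_xml(text):
    """Pull the same fields out of each <item> element in the channel."""
    channel = ET.fromstring(text).find("channel")
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in channel.findall("item")
    ]

csv_items = items_from_csv(CSV_DATA)
xml_items = items_from_xml(XML_DATA)
print(csv_items == xml_items)
```

Both functions yield the same list of records; the XML form additionally carries feed-level data (the channel title) and named, nestable fields, which a flat CSV row cannot express without extra conventions.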

A lot of the idiosyncrasies of XML that developers tend to get hung up on are due to XML's legacy as a document authoring format. However, in much the same way that Oak, a programming language and environment designed for programming embedded systems, transformed into Java, a programming language and environment mostly used for building mid-tier applications, so too has XML outgrown its roots.

Unfortunately, a lot of people working on XML technologies today fail to understand its history. Even worse, a lot of those who do know its history fail to realize that its usage scenarios and users have changed from what they originally expected.