One of the biggest problems facing designers of XML vocabularies is how to make them extensible, designing them so that applications that process said vocabularies do not break when the vocabulary changes between versions. One of the primary benefits of using XML for building data interchange formats is that the APIs and technologies for processing XML are quite resistant to additions to vocabularies. If I write an application which loads RSS feeds looking for item elements and then processes their title elements using any one of the various technologies and APIs for processing XML such as SAX, the DOM or XSLT, it is quite straightforward to build an application that is resistant to changes in the RSS spec or extensions to the RSS spec, as the link and title elements always appear in a feed.
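A minimal sketch of that kind of resilient consumer, written in Python with the standard library's ElementTree. The feed content and the ext:rating extension element (and its namespace) are made up for illustration; they simply stand in for an addition the consumer has never seen.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: <ext:rating> and its namespace are invented to stand in
# for an extension that this consumer knows nothing about.
FEED = """<rss version="2.0" xmlns:ext="http://example.org/ext">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
      <ext:rating>5</ext:rating>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2</link>
    </item>
  </channel>
</rss>"""


def item_titles(feed_xml):
    """Return the title of every item, ignoring elements we do not know."""
    root = ET.fromstring(feed_xml)
    # iter("item") walks the whole tree; unknown siblings are never touched.
    return [item.findtext("title") for item in root.iter("item")]


print(item_titles(FEED))  # ['First post', 'Second post']
```

The consumer only ever asks for the names it understands, so the extra element is skipped without any special handling.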
On the other hand, actually describing such extensibility using the most popular XML schema language, W3C XML Schema, is difficult because of several limitations in its design which make it very difficult to describe extension points in a vocabulary in a way that is idiomatic to how XML vocabularies are typically processed by applications. Recently, David Orchard, a standards architect at BEA Systems, wrote an article entitled Versioning XML Vocabularies which does a good job of describing the types of extensibility XML vocabularies should allow and points out a number of the limitations of W3C XML Schema that make it difficult to express these constraints in an XML schema for a vocabulary. David Orchard has written a follow-up to this article entitled Providing Compatible Schema Evolution which contains a lot of assertions and suggestions for improving extensibility in W3C XML Schema that mostly jibe with my experiences working as the Program Manager responsible for W3C XML Schema technologies at Microsoft.
The scenario outlined in his post is:
We start with a simple use case of a name with a first and last name, and its schema. We will then evolve the language and instances to add a middle name. The base schema is:
<xs:element name="first" type="xs:string" />
<xs:element name="last" type="xs:string" minOccurs="0"/>
Which validates the following document:
And the scenario asks how to validate documents such as the following, where the new schema with the extension is available or not available to the receiver:
At this point I'd like to note that this is a versioning problem, which is a special instance of the extensibility problem. The extensibility problem is how one describes an XML vocabulary in a way that allows producers to add elements and attributes to the core vocabulary without causing problems for consumers that may not know about them. The versioning problem is specific to when the added elements and attributes actually are from a subsequent version of the vocabulary (i.e. a version 2.0 server talking to a version 1.0 client). The additional wrinkle in the specific scenario outlined by David Orchard is that elements from newer versions of the vocabulary have the same namespace as elements from the old version.
A strategy for simplifying the problem statement would be to put additions in subsequent versions of the vocabulary in a different namespace (i.e. a version 2.0 document would have elements from the version 1.0 namespace and the version 2.0 namespace), which would then make the versioning problem the same as the extensibility problem. However, most designers of XML vocabularies would balk at creating a vocabulary which used elements from multiple namespaces for its core [once past version 2.0] and often cite that this makes it more cumbersome for applications that process said vocabularies because they have to deal with multiple namespaces. This is a tradeoff which every XML vocabulary designer should consider during the design and schema authoring process.
David Orchard takes a look at various options for solving the extensibility problem outlined above using current XML Schema design practices.
Use type extension or substitution groups for extensibility. A sample schema is:
This requires that both sides simultaneously update their schemas and breaks backwards compatibility. It only allows the extension after the last element.
There is a [convoluted] way to ensure that both sides do not have to update their schemas. The producer can send a <name> element that contains an xsi:type attribute which has NameExtendedType as its value. The problem is then how the client knows about the definition for the NameExtendedType type, which is solved by the root element of the document containing an xsi:schemaLocation attribute which points to a schema for that namespace which includes the schema from the previous version. There are at least two caveats to this approach: (i) the client has to trust the server since it is using a schema defined by the server, not the client's, and (ii) since the xsi:schemaLocation attribute is only a hint, it is likely the validator may ignore it since the client would already have provided a schema for that namespace.
Change the namespace name or element name
The author simply updates the schema with the new type. A sample is:
This does not allow extension without changing the schema, and thus requires that both sides simultaneously update their schemas. If a receiver has only the old schema and receives an instance with middle, this will not be valid under the old schema.
Most people would state that this isn't really extensibility since [to XML namespace aware technologies and APIs] the names of all elements in the vocabulary have changed. However for applications that key off the local-name of the element or are unsavvy about XML namespaces this is a valid approach that doesn't cause breakage. Ignoring namespaces, this approach is simply adding more stuff in a later revision of the spec which is generally how XML vocabularies evolve in practice.
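To make the local-name point concrete, here is a small Python sketch of a consumer that keys off local names only. The two namespace URIs and the sample values are hypothetical; the only thing that matters is that the namespace differs between versions while the local names stay the same.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs and all values are hypothetical placeholders.
V1 = ('<name xmlns="http://example.org/name/v1">'
      '<first>Jane</first><last>Doe</last></name>')
V2 = ('<name xmlns="http://example.org/name/v2">'
      '<first>Jane</first><middle>Q</middle><last>Doe</last></name>')


def local_name(tag):
    """Strip ElementTree's '{namespace}' prefix from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_name(doc, known=("first", "last")):
    """Collect the children this consumer understands, by local name only."""
    root = ET.fromstring(doc)
    return {local_name(c.tag): c.text for c in root if local_name(c.tag) in known}


print(read_name(V1))  # {'first': 'Jane', 'last': 'Doe'}
print(read_name(V2))  # {'first': 'Jane', 'last': 'Doe'}
```

A namespace-aware consumer comparing full qualified names would see none of the version 2.0 elements as familiar, which is exactly why most people don't count this approach as extensibility.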
Use wildcard with ##other
This is a very common technique. A sample is:
The problems with this approach are summarized in Examining elements and wildcards as siblings. A summary of the problem is that the namespace author cannot extend their schema with extensions and correctly validate them because a wildcard cannot be constrained to exclude some extensions.
I'm not sure I agree with David Orchard's summary of the problem here. The problem described in the article he linked to is that a schema author cannot refine the schema in subsequent versions to contain optional elements and still preserve the wildcard. This is due to the Unique Particle Attribution constraint, which states that a validator MUST always have only one choice of which schema particle it validates an element against. Given an element declaration and a wildcard in sequence, the schema validator has a CHOICE of two particles it could validate an element against if its name matches that of the element declaration. There are a number of disambiguating rules the W3C XML Schema working group could have come up with to allow greater flexibility for this specific case, such as (i) using a first match rule or (ii) allowing exclusions in wildcards.
Use wildcard with ##any or ##targetNamespace
This is not possible with optional elements. This is not possible due to XML Schema's Unique Particle Attribution rule and the rationale is described in the Versioning XML Languages article. An invalid schema sample is:
The Unique Particle Attribution rule does not allow a wildcard adjacent to optional elements or before elements in the same namespace.
Agreed. This is invalid.
This is the solution proposed in the versioning article. A sample of the pre-extended schema is:
An extended instance is
This is the only solution that allows backwards and forwards compatibility, and correct validation using the original or the extended schema. This article shows a number of the difficulties remaining, particularly the cumbersome syntax and the potential for some documents to be inappropriately valid. This solution also has the problem that each subsequent version will increase the nesting by 1 level. Personally, I think that the difficulties, including potentially deep nesting levels, are not major compared to the ability to do backwards and forwards compatible evolution with validation.
The primary problem I have with this approach is that it is a very unidiomatic way to process XML especially when combined with the problem with nesting in concurrent versions. For example, take a look at
Imagine if this is the versioning strategy that had been used with HTML, RSS or DocBook. That gets real ugly, real fast. Unfortunately this is probably the best you can do if you want to use W3C XML Schema to strictly define an XML vocabulary with extensibility yet allow backwards & forwards compatibility.
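As a rough illustration of what the nesting costs a consumer, the Python sketch below assumes that each later version wraps its additions in one more wrapper element. The wrapper name extension, the suffix element and all values are my own placeholders, not taken from the article.

```python
import xml.etree.ElementTree as ET

# Assumed shape: every new version nests its additions one level deeper.
# The <extension> wrapper, <suffix> and the values are illustrative only.
V3_DOC = """<name>
  <first>Jane</first>
  <last>Doe</last>
  <extension>
    <middle>Q</middle>
    <extension>
      <suffix>Jr</suffix>
    </extension>
  </extension>
</name>"""


def flatten(element, wrapper="extension"):
    """Collect field values, descending through nested wrapper elements."""
    fields = {}
    for child in element:
        if child.tag == wrapper:
            fields.update(flatten(child, wrapper))
        else:
            fields[child.tag] = child.text
    return fields


root = ET.fromstring(V3_DOC)
# The direct path to a version 3 field grows with every version.
print(root.findtext("extension/extension/suffix"))  # Jr
print(flatten(root))
# {'first': 'Jane', 'last': 'Doe', 'middle': 'Q', 'suffix': 'Jr'}
```

Every consumer either hard-codes paths that get longer with each release or carries a helper like flatten just to get back to the flat structure the vocabulary started with.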
David Orchard goes on to suggest a number of potential additions to future versions of W3C XML Schema which would make it easier to use it in defining extensible XML vocabularies. However, given that my personal opinion is that adding features to W3C XML Schema is not only trying to put lipstick on a pig but also trying to build a castle on a foundation of sand, I won't go over each of his suggestions. My recent suggestion to some schema authors at Microsoft about solving this problem is that they should have two validation phases in their architecture. The first phase does validation according to W3C XML Schema rules while the other performs validation of “business rules” specific to their scenarios. Most non-trivial vocabularies end up having such an architecture anyway since there are a number of document validation capabilities missing from W3C XML Schema, so schema authors shouldn't be too focused on trying to force-fit their vocabulary into the various quirks of W3C XML Schema.
For example, one could solve the original problem with a type definition such as
<xsd:choice minOccurs="1" maxOccurs="unbounded">
<xsd:element name="first" type="xsd:string" />
<xsd:element name="last" type="xsd:string" minOccurs="0"/>
<xsd:any namespace="##other" processContents="lax" />
</xsd:choice>
where the validation layer above the W3C XML Schema layer ensures that an element doesn't occur twice (i.e. there can't be two <first> elements in a <name>). It adds more code to the clients & servers but it doesn't result in butchering the vocabulary either.
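A minimal sketch of what that second phase could look like in Python, assuming the document has already passed W3C XML Schema validation in the first phase. The function name duplicate_children and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_children(name_element, unique=("first", "last")):
    """Return the names from `unique` that appear more than once."""
    counts = Counter(child.tag for child in name_element)
    return sorted(tag for tag in unique if counts[tag] > 1)


# Both documents would be accepted by a repeating choice like the one above;
# only this second phase catches the repeated <first>.
ok = ET.fromstring("<name><first>Jane</first><last>Doe</last></name>")
bad = ET.fromstring(
    "<name><first>Jane</first><first>J.</first><last>Doe</last></name>"
)

print(duplicate_children(ok))   # []
print(duplicate_children(bad))  # ['first']
```

An empty list means the business rule holds; anything else can be reported alongside the schema validation errors.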