Daniel Cazzulino is writing about the mapping between the W3C XML Schema type system and the CLR type system and has an informal poll at the bottom of his article where he writes

We all agree that many concepts in WXS don't map to anything existing in OO languages, such as derivation by restriction, content-ordering (i.e. sequence vs choice), etc. However, in the light of the tools the .NET Framework makes available to map XML to objects, we usually have to analyze WXS (used to define the structure of that very XML instance to be mapped) and its relation with our classes
In this light, I'm conducting a survey about developer's view on the relation of the XSD type system and the .NET one. Ignoring some of the more advanced (I could add cumbersome and confusing) features of WXS, would you say that both type systems fit nicely with each other?

I find the question at the end of his post, which I highlighted, to be highly tautological. His question is basically, “If you ignore the parts where they don't fit well together, do the CLR and XSD type systems fit well together?”. Well, if you ignore the parts where they don't fit then the only answer is YES. In reality many developers don't have the freedom to ignore parts of XSD they don't want to support, especially when utilizing XML Web Services designed by others.

There are two primary ways one can utilize the XmlSerializer, which maps between XSD and CLR types:

  1. XML Serialization of Object State: In this case the developer is only interested in ensuring that the state of his classes can be converted to XML. This is a fairly simple problem because the expressiveness of the CLR is a subset of that of W3C XML Schema. Any object's state could be mapped to an element of complex type containing a sequence or choice of other nested elements that are either nested simple types or complex types.

    Even then there are limitations in the XmlSerializer which make this cumbersome, such as the fact that it only serializes public fields and public read/write properties but not private state. But that is just a design decision that can be revisited in future releases.

  2. Conversion of XML to Objects: This is the scenario where a developer converts an XML schema to CLR objects to make them easier to program against. This is particularly common in XML Web Services scenarios, which is the scenario the XmlSerializer was originally designed for. In this scenario the conversion tool has to contend with the breadth of features in the XML Schema: Structures and XML Schema: Datatypes recommendations.

    There are enough discrepancies between the W3C XML Schema type system and that of the CLR to fill a Ph.D. thesis. I touched on some of these in my article XML Serialization in the .NET Framework, such as

    Q: What aspects of W3C XML Schema are not supported by the XmlSerializer during conversion of schemas to classes?

    A: The XmlSerializer does not support the following:

    • Any of the simple type restriction facets besides enumeration.
    • Namespace based wildcards.
    • Identity constraints.
    • Substitution groups.
    • Blocked elements or types.

    After gaining more experience working with the XmlSerializer and talking to a number of customers, I wrote some more about the impedance mismatches in my article XML Schema Design Patterns: Is Complex Type Derivation Unnecessary?, specifically

    For usage scenarios where a schema is used to create strongly typed XML, derivation by restriction is problematic. The ability to restrict optional elements and attributes does not exist in the relational model or in traditional concepts of type derivation from OOP languages. The example from the previous section where the email element is optional in the base type, but cannot appear in the derived type, is incompatible with the notion of derivation in an object oriented sense, while also being similarly hard to model using tables in a relational database.

    Similarly, changing the nillability of a type through derivation is not a capability that maps to relational or OOP models. On the other hand, the example that doesn't use derivation by restriction can more straightforwardly be modeled as classes in an OOP language or as relational tables. This is important given that it reduces the impedance mismatch which occurs when attempting to map the contents of an XML document into a relational database or convert an XML document into an instance of an OOP class.
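The first scenario above, serializing object state, is the easy direction precisely because any object graph can be walked and written out as nested elements. A minimal sketch of that shape, in Python purely for illustration (the class names are made up; the .NET XmlSerializer does considerably more, but the mapping of fields to a sequence of nested simple and complex types is the same idea):

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields, is_dataclass

@dataclass
class Address:
    city: str
    zip: str

@dataclass
class Person:
    name: str
    address: Address

def to_xml(obj, tag):
    """Map an object's state to an element whose children form a
    sequence of nested simple or complex types."""
    elem = ET.Element(tag)
    for f in fields(obj):
        value = getattr(obj, f.name)
        if is_dataclass(value):
            elem.append(to_xml(value, f.name))   # nested complex type
        else:
            child = ET.SubElement(elem, f.name)  # nested simple type
            child.text = str(value)
    return elem

p = Person("Dare", Address("Redmond", "98052"))
print(ET.tostring(to_xml(p, "person"), encoding="unicode"))
# prints <person><name>Dare</name><address><city>Redmond</city><zip>98052</zip></address></person>
```

Going in this direction never runs into trouble because every construct used (named fields, nesting) has a direct XSD equivalent; it is the reverse direction that does not round-trip.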

I'm not the only one at Microsoft who's written about this impedance mismatch or tried to solve it. Gavin Bierman, Wolfram Schulte and Erik Meijer wrote in their paper Programming with Circles, Triangles and Rectangles an entire section about this mismatch. Below are links to descriptions of a couple of the mismatches they found most interesting

The mismatch between XML and object data-models
     Edge-labelled vs. Node-labelled
     Attributes versus elements
     Elements versus complex and simple types
     Multiple occurrences of the same child element
     Anonymous types
     Substitution groups vs derivation and the closed world assumption
     Namespaces, namespaces as values
     Occurrence constraints part of container instead of type
     Mixed content

There is a lot of discussion one could have about the impedance mismatch between the CLR type system and the XSD type system but one thing you can't say is that it doesn't exist or that it can be ignored if building schema-centric applications.

    In conclusion, the brief summary is that if one is mapping objects to XML for the purpose of serializing their state then there is a good match between the CLR & XSD type systems, since the XSD type system is more expressive than the CLR type system. On the other hand, if one is trying to go from XSD to the CLR type system there are significant impedance mismatches, some of which are limitations of the current tools (e.g. the XmlSerializer could code gen range checks for derivation by restriction of simple types or uniqueness tests for identity constraints) while others are fundamental differences between the XSD type system and object oriented programming, such as the difference between derivation by restriction in XSD and type derivation in OOP languages.
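To illustrate the kind of code generation mentioned above, a schema compiler could emit a range check for a simple type restricted with minInclusive/maxInclusive facets. A hypothetical sketch in Python (the Percentage type and its [0, 100] facets are invented for this example; this is what generated code could look like, not what any current tool emits):

```python
class Percentage:
    """Sketch of what a schema compiler could emit for a simple type
    restricting xs:int with minInclusive="0" and maxInclusive="100"."""

    def __init__(self, value):
        self.value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        # The generated setter enforces the schema's restriction facets.
        if not (0 <= v <= 100):
            raise ValueError("value violates restriction facets [0, 100]")
        self._value = v

p = Percentage(42)
print(p.value)  # prints 42
```

The point is that facet checks at least have a mechanical translation into property setters; something like derivation by restriction of complex content has no such translation, which is why it falls into the "fundamental differences" bucket.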


     

    Categories: XML

    Recently ZDNet ran an article entitled Google spurns RSS for rising blog format where it stated

    The search giant, which acquired Blogger.com last year, began allowing the service's million-plus members to syndicate their online diaries to other Web sites last month. To implement the feature, it chose the new Atom format instead of the widely used, older RSS.

    I've seen some discussion about the fact that Google only provides feeds for certain blogs in the ATOM 0.3 syndication format, which is an interim draft of the spec that is part of an effort being driven by Sam Ruby to replace RSS and related technologies. When I first read this I ignored it because I didn't have any Blogger.com feeds that were of interest to me. This changed today. This afternoon I found out that Steve Saxon, the author of the excellent article XPath Querying Over Objects with ObjectXPathNavigator, had a Blogger.com blog that only provided an ATOM feed. Given that I use RSS Bandit as my aggregator of choice I cannot subscribe to his feed, nor can I use a large percentage of the existing news aggregators to read Steve's feed.

    What I find particularly stunning about Google's decision is that they have removed support for an existing, widely supported format in favor of an interim draft of a format which, according to Sam Ruby's slides for the O'Reilly Emerging Technology Conference, is several months away from being completed. An appropriate analogy for what Google has done would be AOL abandoning support for HTML and changing all of its websites to use the May 6th 2003 draft of the XHTML 2.0 spec. It simply makes no sense.

    Some people, such as Dave Winer believe Google is engaging in such user unfriendly behavior for malicious reasons but given that Google doesn't currently ship a news aggregator there doesn't seem to be much of a motive there (Of course, this changes once they ship one).  I recently stumbled across an article entitled The Basic Laws of Human Stupidity which described the following 5 laws

    1. Always and inevitably everyone underestimates the number of stupid individuals in circulation.

    2. The probability that a certain person be stupid is independent of any other characteristic of that person.

    3. A stupid person is a person who causes losses to another person or to a group of persons while himself deriving no gain and even possibly incurring losses.

    4. Non-stupid people always underestimate the damaging power of stupid individuals. In particular non-stupid people constantly forget that at all times and places and under any circumstances to deal and/or associate with stupid people always turns out to be a costly mistake.

    5. A stupid person is the most dangerous type of person.

    The only question now is “Is Google crazy, or crazy like a fox?” and only time will tell the answer to that question.


     

    Categories: Technology

    Dave Winer recently wrote that at least one person has asked if it is safe to ignore Atom in his weblog. If you are a cautious person like Tim Bray's Mr. Safe, or you fit more on the right than the left side of the Technology Adoption Life Cycle, then you are probably wondering why you should want to support the Atom syndication format over one of the many flavors of RSS. There are two parts to this question, depending on whether you are a producer of syndication feeds or a consumer of them.

    The Safe Syndication Producer's Perspective
    An RSS feed is a regularly updated XML document that contains metadata about a news source and the content in it. Minimally an RSS feed consists of a channel that represents the news source, which has a title, link, and description that describe the news source. Additionally, an RSS feed typically contains one or more item elements that represent individual news items, each of which should have a title, link, or description
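For concreteness, a minimal feed along those lines might look like this (the titles and URLs are placeholders):

```xml
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>An example news source</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1</link>
      <description>Summary of the first story</description>
    </item>
  </channel>
</rss>
```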

    There are two primary flavors of RSS; Dave Winer's family of specifications (the most popular being RSS 0.91 & RSS 2.0) and the RDF-based RSS 1.0. The most popular are Dave Winer's family of specifications, which have been adopted by a number of well-known organizations such as Yahoo! News, the BBC, Rolling Stone magazine, the Microsoft Developer Network (MSDN), the Oracle Technology Network (OTN), the Sun Developer Network and Apple's iTunes Music Store. According to Syndic8, which tracks over 50,000 RSS feeds, RSS 0.91, RSS 1.0 & RSS 2.0 each have about 30% of the RSS marketshare.

    Most news aggregators support all 3 major versions of RSS, although few actually take advantage of the fact that RSS 1.0 is an RDF vocabulary. If all one wants is simple syndication of news items then RSS 0.91 should be satisfactory. If one plans to use extensions to the core RSS specification that expose application or domain specific functionality, such as the ability to post comments, one can use one of the many RSS modules in combination with RSS 2.0. The only advantage that RSS 1.0 gives over RSS 0.91/RSS 2.0 is that it is an RDF vocabulary and thus fits nicely into the dream of the Semantic Web.

    The Atom syndication format can be considered a more sophisticated implementation of the ideas in RSS 2.0. It adds richer syndication capabilities, such as the ability to put binary formats like Word documents and PowerPoint documents in feeds, and formalizes some of the best practices in the RSS world around putting [X]HTML in feeds.

    The average user of a news aggregator will not be able to tell the difference between an Atom or RSS feed from their aggregator if it supports both. However users of aggregators that don't support Atom will not be able to subscribe to feeds in that format. In a few years, the differences between RSS and Atom will most likely be like those between RSS 1.0 and RSS 0.91/RSS 2.0; of interest only to a handful of XML syndication geeks. Even then the simplest and safest bet would still be to use RSS as a syndication format. This is analogous to the fact that even though the W3C has published XHTML 1.0 & XHTML 1.1 and is working on XHTML 2.0, the safest bet to get the widest reach with the least problems is to publish a website in HTML 3.2 or HTML 4.01.

    The Safe Syndication Consumer's Perspective
    If you plan to consume feeds from a wide variety of sources then you should endeavor to support as many syndication formats as possible. The more formats a feed consumer supports, the more content is available for its users.

    Based on their current popularity, degree of support and ease of implementation one should consider supporting the major syndication formats in the following order of priority

    1. RSS 0.91/RSS 2.0
    2. RSS 1.0
    3. Atom

    RSS 0.91 support is the simplest to implement and the most widely supported by websites, while Atom is the most difficult to implement, being the most complex, and will be the least supported by websites in the coming years.
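Since the three formats use different document elements, a feed consumer can cheaply detect which one it is dealing with before committing to a parser. A rough sketch in Python (the Atom namespace shown is the one used by the 0.3 draft; this dispatch-on-root-element approach is an illustration, not how any particular aggregator is implemented):

```python
import xml.etree.ElementTree as ET

def detect_format(xml_text):
    """Classify a syndication feed by its document element."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "RSS 0.91/2.0"            # un-namespaced <rss> root
    if root.tag.endswith("}RDF"):
        return "RSS 1.0"                 # RDF-based vocabulary
    if root.tag == "{http://purl.org/atom/ns#}feed":
        return "Atom 0.3"                # interim draft namespace
    return "unknown"

print(detect_format('<rss version="2.0"><channel/></rss>'))  # prints RSS 0.91/2.0
```

A consumer following the priority list above would wire up the RSS 0.91/2.0 branch first and treat the others as progressive enhancements.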


     

    Categories: XML

    February 18, 2004
    @ 04:42 PM

    This is here mainly for me to be able to look back on in a few years and for any new readers of my blog who wonder what I actually do at Microsoft.

    I am a program manager for the WebData XML team. The WebData team is part of the SQL Server Product Unit and produces the major data access technologies that Microsoft produces including MDAC, MSXML, ADO.NET, System.Xml, ObjectSpaces and the WinFS API.

    As a technical program manager I am responsible for the nitty gritty of the design of the classes in the following namespaces in the .NET Framework

    Nitty gritty design details means stuff like triaging bug fixes, designing new features or new classes, writing specifications, and interacting with internal & external customers to discover their likes and dislikes about the APIs in question.

     I am also the community lead for the WebData XML team, which means I am responsible for things like the XML Most Valuable Professional (MVP) program and the upcoming MSDN XML Developer Center. For the MVP program I am the primary point of contact between my team and the Microsoft MVP program and with our MVPs. I am also one of the folks who approves or rejects nominees. As for the developer center, I am the equivalent of what MSDN likes to call a “content strategist”, which basically means I am responsible for the content on the site. For the most part I am also the primary point of contact between my team and MSDN.

    If you have any issues or questions related to the aforementioned aspects of my job at Microsoft (e.g. bug reports, feature requests or questions about writing for MSDN)  feel free to ping me on my work email address. If you don't know it you should be able to find it from a minute or two of Googling.  


     

    Categories: Life in the B0rg Cube

    February 18, 2004
    @ 06:28 AM

    Chris Sells writes

    On his quest to find "non-bad WinFS scenarios" (ironically, because he was called out by another Microsoft employee -- I love it when we fight in public : ), Jeremy Mazner, Longhorn Technical Evangelist, starts with his real life use of Windows Movie Maker and trying to find music to use as a soundtrack. Let Jeremy know what you think.

    I think the scenario is compelling. In fact, the only issue I have with the WinFS scenario that Jeremy outlines is that he implies that the metadata about music files Windows Media player exposes is tied to the application but in truth most of it is tied to the actual media files as regular file info [file location, date modified, etc] or as ID3 tags [album, genre, artist, etc]. This means that there doesn't even need to be explicit inter-application sharing of data.

    If the file system had a notion of a music item which exposed the kind of information one sees in ID3 tags, and which was also exposed by the shell in standard ways, then you could do lots of interesting things with music metadata without even trying hard. I also think it's quite compelling because metadata attached to music files is such low hanging fruit that one can get immediate value out of it, and it exists today on the average person's machine.


     

    Categories: Technology

    The folks at Slashdot have Indian Techies Answer About 'Onshore Insourcing'. Excellent stuff.


     

    I just saw an entry in Ted Leung's blog about SMS messages where he wrote

    [via Trevor's ETech notes]

    become rude to make a phone call without first checking via sms. [this is becoming more and more the case in europe also]

    I would love it if this became the etiquette here in the US as well. For all telephone calls, not just cell calls. People seem to believe that they have the right to call you simply because you have a telephone.

    The so-called SMS craze that has hit Europe and Asia seems totally absurd to me. I can understand teenagers and college students using SMS as a more sophisticated way of passing notes to each other in class, but I can't see any other reason why, if I have a device I could use to talk to someone, I'd instead send them a hastily written and poorly spelled text message. Well, maybe if text messages were free and voice calls were fairly expensive, but since that isn't the case in the US I guess that's why I don't get it.


     

    Categories: Ramblings

    Daniel Cazzulino has been writing about his work with XML Streaming Events which combines the ability to do XPath queries with the .NET Framework's forward-only, pull based XML parser. He shows the following code sample

    // Setup the namespaces
    XmlNamespaceManager mgr = new XmlNamespaceManager(temp.NameTable);
    mgr.AddNamespace("r", RssBanditNamespace);

    // Precompile the strategy used to match the expression
    IMatchStrategy st = new RootedPathFactory().Create(
        "/r:feeds/r:feed/r:stories-recently-viewed/r:story", mgr);

    int count = 0;

    // Create the reader.
    XseReader xr = new XseReader( new XmlTextReader( inputStream ) );

    // Add our handler, using the strategy compiled above.
    xr.AddHandler(st, delegate { count++; });

    while (xr.Read()) { }

    Console.WriteLine("Stories viewed: {0}", count);

    I have a couple of questions about his implementation the main one being how it deals with XPath queries such as /r:feeds/r:feed[count(r:stories-recently-viewed)>10]/r:title which can't be done in a forward only manner?

    Oleg Tkachenko also pipes in with some opinions about streaming XPath in his post Warriors of the Streaming XPath Order. He writes

    I've been playing with such beasts, making all kinds of mistakes and finally I came up with a solution, which I think is good, but I didn't publish it yet. Why? Because I'm tired to publish spoilers :) It's based on "ForwardOnlyXPathNavigator" aka XPathNavigator over XmlReader, Dare is going to write about in MSDN XML Dev Center and I wait till that's published.

    May be I'm mistaken, but anyway here is the idea - "ForwardOnlyXPathNavigator" is XPathNavigator implementation over XmlReader, which obviously supports forward-only XPath subset...

    And after I played enough with and implemented that stuff I discovered BizTalk 2004 Beta classes contain much better implementation of the same functionality in such gems as XPathReader, XmlTranslatorStream, XmlValidatingStream and XPathMutatorStream. They're amazing classes that enable streaming XML processing in much rich way than trivial XmlReader stack does. I only wonder why they are not in System.Xml v2 ? Is there are any reasons why they are still hidden deeply inside BizTalk 2004 ? Probably I have to evangelize them a bit as I really like this idea.

    Actually Oleg is both closer to and farther from the truth than he realizes. Although I wrote about a hypothetical ForwardOnlyXPathNavigator in my article entitled Can One Size Fit All? for XML Journal, my planned article, which should show up when the MSDN XML Developer Center launches in a month or so, won't be using it. Instead it will be based on an XPathReader that is very similar to the one used in BizTalk 2004; in fact it was written by the same guy. The XPathReader works similarly to Daniel Cazzulino's XseReader but uses the XPath subset described in Arpan Desai's Introduction to Sequential XPath paper instead of adding proprietary extensions to XPath as Daniel's does.
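The spirit of a forward-only XPath subset can be shown with a small sketch, in Python for brevity rather than the C# the actual XPathReader uses: only queries answerable in a single pass, such as rooted child paths, are supported, and they are matched against a stack of currently open elements as a pull parser advances. (The path syntax here is a toy; the real sequential XPath subset is richer.)

```python
import io
import xml.etree.ElementTree as ET

def count_matches(stream, path):
    """Count elements matching a simple rooted path like
    'feeds/feed/story' in a single forward pass over the document."""
    steps = path.split("/")
    stack, count = [], 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if stack == steps:       # current open-element path matches
                count += 1
        else:
            stack.pop()
            elem.clear()             # discard processed content: stay streaming

    return count

doc = "<feeds><feed><story/><story/></feed></feeds>"
print(count_matches(io.StringIO(doc), "feeds/feed/story"))  # prints 2
```

A query like the earlier `/r:feeds/r:feed[count(r:stories-recently-viewed)>10]/r:title` cannot fit this model because the predicate requires looking at a following sibling before deciding whether to emit an earlier element, which is exactly the class of queries a sequential subset has to exclude or buffer for.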

    When the article describing the XPathReader is done it will provide source, and if there is interest I'll create a GotDotNet Workspace for the project, although it is unlikely that either I or the dev who originally wrote the code will have time to maintain it.


     

    Categories: XML

    February 15, 2004
    @ 05:50 PM

    A few months ago I attended XML 2003 where I first learned about Semantic Integration which is the buzzword term for mapping data from one schema to another with a heavy focus on using Semantic Web technologies such as ontologies and the like. The problem that these technologies solve is enabling one to map XML data from external sources to a form that is compatible with how an application or business entity manipulates them internally.

    For example, in RSS Bandit we treat feeds in memory and on disk as if they are in the RSS 2.0 format even though it supports other flavors of RSS as well, such as RSS 1.0. Proponents of semantic integration technologies would suggest using a technology such as the W3C's OWL Web Ontology Language. If you are unfamiliar with ontologies and how they apply to XML, a good place to start is the OWL Web Ontology Language Use Cases and Requirements. The following quote from the OWL Use Cases document gives a glimpse into the goal of ontology languages

    In order to allow more intelligent syndication, web portals can define an ontology for the community. This ontology can provide a terminology for describing content and axioms that define terms using other terms from the ontology. For example, an ontology might include terminology such as "journal paper," "publication," "person," and "author." This ontology could include definitions that state things such as "all journal papers are publications" or "the authors of all publications are people." When combined with facts, these definitions allow other facts that are necessarily true to be inferred. These inferences can, in turn, allow users to obtain search results from the portal that are impossible to obtain from conventional retrieval systems

    Although the above example talks about search engines it is clear that one can also use this for data integration. In the example of RSS Bandit, one could create an ontology that maps the terms in RSS 1.0 to those in RSS 2.0 and make statements such as

    RSS 1.0's <title> element sameAs RSS 2.0's <title> element 

    Basically, one could imagine schemas for RSS 1.0 and RSS 2.0 represented as two trees, and an ontology as a way of drawing connections between the leaves and branches of the trees. In a previous post entitled More on RDF, The Semantic Web and Perpetual Motion Machines I questioned how useful this actually would be in the real world by pointing out the dc:date vs. pubDate problem in RSS. I wrote

    However there are further drawbacks to using the semantics based approach than using the XML-based syntactic approach. In certain cases, where the mapping isn't merely a case of showing equivalencies between the semantics of similarly structured elements (e.g. the equivalent of element renaming such as stating that a url and link element are equivalent) an ontology language is insufficient and a Turing complete transformation language like XSLT is not. A good example of this is another example from RSS Bandit. In various RSS 2.0 feeds there are two popular ways to specify the date an item was posted, the first is by using the pubDate element which is described as containing a string in the RFC 822 format while the other is using the dc:date element which is described as containing a string in the ISO 8601 format. Thus even though both elements are semantically equivalent, syntactically they are not. This means that there still needs to be a syntactic transformation applied after the semantic transformation has been applied if one wants an application to treat pubDate and dc:date as equivalent. This means that instead of making one pass with an XSLT stylesheet to perform the transformation in the XML-based solution, two transformation techniques will be needed in the RDF-based solution and it is quite likely that one of them would be XSLT.
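To make the pubDate vs. dc:date point concrete, here is a sketch in Python of the syntactic normalization step that still has to happen after any semantic mapping has declared the two elements equivalent:

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

def parse_item_date(pub_date=None, dc_date=None):
    """Normalize the two semantically equivalent but syntactically
    different RSS item dates: pubDate is RFC 822, dc:date is ISO 8601."""
    if pub_date is not None:
        return parsedate_to_datetime(pub_date)       # RFC 822 parser
    if dc_date is not None:
        # ISO 8601; map a trailing 'Z' to an explicit UTC offset
        return datetime.fromisoformat(dc_date.replace("Z", "+00:00"))
    return None

a = parse_item_date(pub_date="Wed, 18 Feb 2004 16:42:00 GMT")
b = parse_item_date(dc_date="2004-02-18T16:42:00Z")
print(a == b)  # prints True: same instant, two syntaxes
```

Knowing that the two elements mean the same thing is the part an ontology can state; the two different parsers are the part it cannot.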

    The above is a simple example; one could imagine more complex examples where the vocabularies to be mapped differ much more syntactically, such as

    <author>Dare Obasanjo (dareo@example.com)</author>

    <author>
     <fname>Dare</fname>
     <lname>Obasanjo</lname>
     <email>dareo@example.com</email>
    </author>

    The aforementioned examples point out technical issues with using ontology based techniques for mapping between XML vocabularies, but I failed to point out the human problems that tend to show up in the real world. A few months ago I was talking to Chris Lovett about semantic integration and he pointed out that in many cases, as applications evolve, semantics begin to be assigned to values in often orthogonal ways.

    An example of semantics being added to values again shows up in an example that uses RSS Bandit. A feature of RSS Bandit is that feeds are cached on disk, allowing a user to read items that have long since disappeared from the feed. At first we provided the ability for the user to specify how long items should be kept in the cached feed, ranging from a day up to a year. We used an element named maxItemAge embedded in the cached feed which contained a serialized instance of the System.TimeSpan structure. After a while we realized we needed ways to say that for a particular feed we should use the global default maxItemAge, never cache items for this feed, or never expire items for this feed, so we used the TimeSpan.MinValue, TimeSpan.Zero, or TimeSpan.MaxValue values of the TimeSpan class respectively.

    If another application wanted to consume this data and had a similar notion of 'how long to keep the items in a feed' it couldn't simply map maxItemAge to whatever internal property it used without taking into account the extra semantics embedded in when certain values occur in that element. Overloading the meaning of properties and fields in a database or class is actually fairly commonplace [after all, how many different APIs use the occurrence of -1 for a value that should typically return a positive number as an error condition?] and something that must also be considered when applying semantic integration technologies to XML.
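A sketch in Python of what a consumer of the cached feed format would have to encode (the sentinel constants stand in for TimeSpan.MinValue, TimeSpan.Zero and TimeSpan.MaxValue; the function name is invented for this illustration):

```python
from datetime import timedelta

# Stand-ins for the overloaded TimeSpan sentinel values in maxItemAge
USE_DEFAULT = timedelta.min      # TimeSpan.MinValue: use global default
NEVER_CACHE = timedelta(0)       # TimeSpan.Zero: never cache this feed
NEVER_EXPIRE = timedelta.max     # TimeSpan.MaxValue: never expire items

def effective_max_item_age(max_item_age, global_default):
    """A consumer must know the overloaded meanings of maxItemAge,
    not just its declared type, to interpret the value correctly."""
    if max_item_age == USE_DEFAULT:
        return global_default
    if max_item_age == NEVER_CACHE:
        return timedelta(0)          # drop items immediately
    if max_item_age == NEVER_EXPIRE:
        return timedelta.max         # keep items forever
    return max_item_age              # an ordinary duration

print(effective_max_item_age(USE_DEFAULT, timedelta(days=30)))
# prints 30 days, 0:00:00
```

None of this branching is visible in the schema; it lives in the application's conventions, which is exactly why a purely semantic mapping of maxItemAge to another system's retention property would silently mishandle the sentinel values.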

    In conclusion, it is clear that Semantic Web technologies can be used to map between XML vocabularies; however, in non-trivial situations the extra work that must be layered on top of such approaches tends to favor using XML-centric techniques such as XSLT to map between the vocabularies instead.


     

    Categories: XML

    February 15, 2004
    @ 03:07 AM

    Just as it looks like my buddy Erik Meijer is done with blogging (despite his short lived guest blogging stint at Lambda the Ultimate), it looks like a couple more of the folks who brought Xen to the world have started blogging. They are

    1. William Adams: Dev Manager for the WebData XML team.

    2. Matt Warren: Formerly a developer on the WebData XML team, now works on the C# team or on the CLR (I can never keep those straight).

    Both of them were also influential in the design and implementation of the System.Xml namespace in version 1.0 of the .NET Framework.