February 13, 2004
@ 03:30 PM

Mark Pilgrim has a post entitled Determining the character encoding of a feed where he does a good job of summarizing what the various specs say about determining the character encoding of an XML document retrieved on the World Wide Web via HTTP. The only problem with his post is that although it is a fairly accurate description of what the specs say, it definitely does not reflect reality. Specifically

According to RFC 3023..., if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the encoding is

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. us-ascii.

So for this to work correctly, if the MIME type of an XML document is text/xml then the web server should look inside the document before sending it over the wire and send the correct encoding in the charset parameter, or else the document will be interpreted incorrectly since it is highly likely that us-ascii is not the encoding of the XML document. In practice, most web servers do not do this. I have confirmed this by testing against both IIS and Apache.

Instead, what happens is that the user creates an XML document and drops it on the file system, the web server assumes it is text/xml (which it most likely is), and it sends the document as is without setting the charset parameter in the Content-Type header.

A simple way to test this is to go to Rex Swain's HTTP Viewer and download the following documents from the W3 Schools page on XML encodings

  1. XML document in windows-1252 encoding
  2. XML document in ISO-8859-1 encoding
  3. XML document in UTF-8 encoding
  4. XML document in UTF-16 encoding

All four files are sent with a content type of text/xml and no encoding specified in the charset parameter of the Content-Type HTTP header. According to RFC 3023, which Mark Pilgrim quoted in his article, clients should treat them as us-ascii. With the above examples this behavior would be wrong in all four cases.

The moral of this story is that if you are writing an application that consumes XML over HTTP you should use the following rule of thumb for the foreseeable future [slightly modified from Mark Pilgrim's post]; a code sketch of the rule follows below the list.

According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding is

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. the encoding given in the encoding attribute of the XML declaration within the document, or
  3. utf-8.
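In code, that rule of thumb might look something like the following minimal sketch. This is not production code: it assumes an ASCII-compatible encoding when peeking at the XML declaration, it ignores the byte order mark and UTF-16 detection rules from the XML 1.0 recommendation, and the class and method names are my own.

  // Minimal sketch of the rule of thumb above, not production code.
  // Assumes an ASCII-compatible encoding when reading the XML declaration
  // and ignores byte order mark / UTF-16 detection entirely.
  using System;
  using System.Text;
  using System.Text.RegularExpressions;

  class FeedEncodingSniffer
  {
      public static string GuessEncoding(string contentType, byte[] body)
      {
          // 1. the encoding given in the charset parameter of the Content-Type HTTP header
          if (contentType != null)
          {
              Match m = Regex.Match(contentType, @"charset\s*=\s*""?([^"";\s]+)", RegexOptions.IgnoreCase);
              if (m.Success) return m.Groups[1].Value;
          }

          // 2. the encoding given in the encoding attribute of the XML declaration
          string prolog = Encoding.ASCII.GetString(body, 0, Math.Min(body.Length, 256));
          Match decl = Regex.Match(prolog, @"<\?xml[^?]*encoding\s*=\s*['""]([^'""]+)['""]");
          if (decl.Success) return decl.Groups[1].Value;

          // 3. fall back to utf-8
          return "utf-8";
      }
  }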

Some may argue that this discussion isn't relevant for news aggregators because they'll only consume XML documents whose MIME type is application/atom+xml or application/rss+xml, but again this ignores practice. In practice most web servers send back RSS feeds as text/xml; if you don't believe me, test ten RSS feeds chosen at random using Rex Swain's HTTP Viewer and see what MIME type the server claims they are.
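If you'd rather check from code than use the online tool, a snippet along these lines will show you what the server sends in the Content-Type header; the feed URL below is just a placeholder.

  // Prints the MIME type (and charset parameter, if any) a server claims for a feed.
  // The URL is a placeholder; substitute any feed you want to check.
  using System;
  using System.Net;

  class ContentTypeCheck
  {
      static void Main(string[] args)
      {
          HttpWebRequest request = (HttpWebRequest) WebRequest.Create("http://www.example.com/rss.xml");
          HttpWebResponse response = (HttpWebResponse) request.GetResponse();
          Console.WriteLine(response.ContentType); // e.g. "text/xml" with no charset parameter
          response.Close();
      }
  }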


 

Categories: XML

February 12, 2004
@ 11:47 PM

According to Jeremy Zawodny, “My Yahoo's RSS module also groks Atom. It was added last night. It took about a half hour.” Seeing that he said it took only 30 minutes to implement this, and that there are a couple of things about Atom that require a little thinking even if all you are interested in is titles and dates as My Yahoo! is, I decided to give it a try and subscribed to Mark Pilgrim's Atom feed. This is what I ended up being shown in My Yahoo!

[The “dive into mark” module as displayed in My Yahoo!]

The first minor issue is that the posts aren't sorted chronologically, but that isn't particularly interesting. What is interesting is that if you go to the article entitled The myth of RSS compatibility, its publication date is given as “Wednesday, February 4, 2004”, which is about a week ago, and if you go to the post entitled Universal Feed Parser 3.0 beta, its publication date is given as “Wednesday, February 1, 2004”, which is almost 2 weeks ago, not a day ago like Yahoo! claims.

The simple answer to the confusion can be gleaned from Mark's Atom feed: that particular entry has a <modified> date of 2004-02-11T16:17:08Z, an <issued> date of 2004-02-01T18:38:15-05:00 and a <created> date of 2004-02-01T23:38:15Z. My Yahoo! is choosing to key the freshness of an article off its <modified> date even though when one gets to the actual content it seems much older.

It is quite interesting to see how just one concept [how old is this article?] can lead to some confusion between the end user of a news aggregator and the content publisher. I also suspect that My Yahoo! could be similarly confused by the various issues with escaping content in Atom when processing titles but since I don't have access to a web server I can't test some of my theories.

I tend to wonder whether the various content producers creating Atom feeds will ditch their Atom 0.3 feeds for Atom 0.4, Atom 0.5 and so on until it becomes a final IETF spec, or whether they'll keep parallel versions of these feeds so Atom 0.3 continues to live in perpetuity.

It's amazing how geeks can turn the simplest things into such a mess. I'm definitely going to sit this out until there is a final IETF Atom 1.0 syndication format spec before spending any time working on this for RSS Bandit.


 

Categories: Technology

February 11, 2004
@ 04:02 PM

One of the big problems with arguing about metadata is that one person's data is another person's metadata. I was reading Joshua Allen's blog post entitled Trolling EFNet, or Promiscuous Memories where he wrote

  • Some people deride "metacrap" and complain that "nobody will enter all of that metadata".  These people display a stunning lack of vision and imagination, and should be pitied.  Simply by living their lives, people produce immense amounts of metadata about themselves and their relationships to things, places, and others that can be harvested passively and in a relatively low-tech manner.
  • Being able to remember what we have experienced is very powerful.  Being able to "remember" what other people have experienced is also very powerful.  Language improved our ability to share experiences to others, and written language made it possible to communicate experiences beyond the wall of death, but that was just the beginning.  How will your life change when you can near-instantly "remember" the relevant experiences of millions of other people and in increasingly richer detail and varied modality?
From my perspective it seems Joshua is confusing data and metadata. If I had a video camera attached to my forehead recording everything I saw, then the actual audiovisual content of the files on my hard drive would be the data while the metadata would be information such as what date it was, where I was and who I saw. Basically, the metadata is the data about the data. The interesting thing about metadata is that if we have enough good quality metadata then we can do things like near-instantly "remember" the relevant experiences of ourselves and millions of other people. It won't matter that all my experiences are cataloged and stored on a hard drive if the retrieval process isn't automated (e.g. I can 'search' for experiences by who they were shared with, where they occurred or when they occurred) as opposed to me having to fast forward through gigabytes of video data. The metadata ideal would be that all this extra, descriptive information would be attached to my audiovisual experiences stored on disk so I could quickly search for “videos from conversations with my boss in October, 2003”.

This is where metacrap comes in. From Cory Doctorow's excellent article entitled Metacrap:

    A world of exhaustive, reliable metadata would be a utopia. It's also a pipe-dream, founded on self-delusion, nerd hubris and hysterically inflated market opportunities.

This applies to Joshua's vision as well. Data acquisition is easy; anyone can walk around with a camcorder or digital camera today recording everything they can. Effectively tagging the content so it can be categorized in a way that lets you do interesting things with it search-wise is unfeasible. Cory's article does a much better job than I can of explaining the many different ways this is unfeasible; cameras with datestamps and built-in GPS are just the tip of the iceberg. I can barely remember dates once the event didn't happen in the recent past and wasn't a special occasion. As for built-in GPS, until the software is smart enough to convert longitude and latitude coordinates to “that Chuck E Cheese in Redmond”, they only solve problems for geeks, not regular people. I'm sure technology will get better but metacrap is, and may always be, an insurmountable problem on a global network like the World Wide Web without lots of standardization.


     

    Categories: Technology

    February 11, 2004
    @ 03:36 PM

    Besides our releases, Torsten packages nightly builds of RSS Bandit for folks who want to try out bleeding edge features or test recent bug fixes without having to set up a CVS client. There is currently a bug, pointed out by James Clarke, that we think we have fixed but would like interested users to test:

    I'm hitting a problem once every day or so when the refresh thread seems to be exiting - The message refreshing Feeds never goes away and the green download icons for the set remain green forever. No feed errors are generated. Only way out is to quit.

    If you've encountered this problem on recent versions of RSS Bandit, try out the RSS Bandit build from February 9th 2004 and see if that fixes the problem. Once we figure out the root of the problem and fix it there'll be a refresh of the installer with the updated bits.

     


     

    Categories: RSS Bandit

    February 11, 2004
    @ 02:51 AM

    From Sam Ruby's slides for the O'Reilly Emerging Technology Conference

    Where are we going?

    • A draft charter will be prepared in time to be informally discussed at the IETF meeting in Seoul, Korea during the week of 29 February to 5 March 
    •  Hopefully, the Working Group itself will be approved in March 
    •  Most of the work will be done on mailing lists 
    •  Ideally, a face to face meeting of the Working Group will be scheduled to coincide with the August 1-6 meeting of the IETF in San Diego

    Interesting. Taking the spec to IETF implies that Sam thinks it's mostly done.  Well, I just hope the IETF's errata process is better than the W3C's.


     

    Categories: Technology

    February 10, 2004
    @ 05:30 PM

    Robert Scoble has a post entitled Metadata without filling in forms? It's coming where he writes

    Simon Fell read my interview about search trends and says "I still don't get it" about WinFS and metadata. He brings up a good point. If users are going to be forced to fill out metadata forms, like those currently in Office apps, they just won't do it. Fell is absolutely right. But, he assumed that metadata would need to be entered that way for every photo. Let's go a little deeper.... OK, I have 7400 photos. I have quite a few of my son. So, let's say there's a new kind of application. It recognizes the faces automatically and puts a square around them. Prompting you to enter just a name. When you do the square changes color from red to green, or just disappears completely.
    ...
    A roadblock to getting that done today is that no one in the industry can get along for enough time to make it possible to put metadata into files the way it needs to be done. Example: look at the social software guys. Friendster doesn't play well with Orkut which doesn't play well with MyWallop, which doesn't play well with Tribe, which doesn't play well with ICQ, which doesn't play well with Outlook. What's the solution? Fix the platform underneath so that developers can put these features in without working with other companies and/or other developers they don't usually work with.

    The way WinFS is being pitched by Microsoft folks reminds me a lot of Hailstorm [which is probably unsurprising since a number of Hailstorm folks work on it] in that there are a lot of interesting and useful technical ideas burdened by the bad scenarios being hung on them. Before going into the interesting and useful technical ideas around WinFS I'll start with why I consider the two scenarios mentioned by Scoble to be “bad scenarios”.

    The thought that making the file system a metadata store automatically makes search better is a dubious proposition when you realize that a number of the searches people can't do today wouldn't be helped much by more metadata. This isn't to say some searches wouldn't work better (e.g. searching for songs by title or artist), however there are search scenarios, such as picking a particular image or video out of a bunch of files with generic names or searching for a song by its lyrics, for which simply having the ability to tag media files with metadata doesn't seem like enough. Once your scenarios start having to involve “face recognition software” or “cameras with GPS coordinates” to work, it is hard for people not to scoff. It's like a variation of the popular Slashdot joke

    1. Add metadata search capabilities to file system
    2. ???
    3. You can now search for “all pictures taken on Tommy's 5th birthday party at the Chuck E Cheese in Redmond”.

    with the ??? in the middle implying a significant difficulty in going from step 1 to step 3.

    The other criticism is that Robert's post implies that the reasons applications can't talk to each other are technical. This is rarely the case. The main reason applications don't talk to each other isn't a lack of technology [especially now that we have a well-defined format for exchanging data called XML] but various social and business reasons. There are no technical reasons MSN Messenger can't talk to ICQ or which prevent Yahoo! Messenger from talking to AOL Instant Messenger. It isn't technical reasons that prevent my data in Orkut from being shared with Friendster or my book & music preferences in Amazon from being shared with other online stores I visit. All of these entities feel they have a competitive advantage in making it hard to migrate from their platforms.

    The two things Microsoft needs to do in this space are to (i) show how and why it is beneficial for different applications to share data locally and (ii) provide guidelines as well as best practices for applications to share their data in a secure manner.

    While talking to Joshua Allen, Dave Winer, Robert Scoble, Lili Cheng, and Curtis Wong yesterday it seemed clear to me that social software [or, if you are a business user, groupware that is more individual-focused and gives people more control over content and information sharing] would be a very powerful and useful tool for businesses and end users if built on a platform like Longhorn with a smart data store that knows how to create relationships between concepts as well as files (i.e. WinFS) and a flexible, cross-platform distributed computing framework (i.e. Indigo).

    The WinFS folks and Longhorn evangelists will probably keep focusing on what I have termed “bad scenarios” because they demo well, but I suspect there'll be difficulty getting traction with them in the real world. Of course, I may be wrong and the various people who've expressed incredulity at the current pitches may be a vocal minority who'll be proved wrong once others embrace the vision. Either way, I plan to experiment with these ideas once Longhorn starts to beta and see where the code takes me.


     

    Categories: Technology

    February 10, 2004
    @ 05:59 AM

    As Joshua wrote in his blog, we had lunch with Dave Winer this afternoon. We talked about the kind of stuff you'd have expected: RSS, Atom and “social software”. An interesting person at lunch was Lili Cheng, who's the Group Manager of the Social Computing Group in Microsoft Research*. She was very interested in the technologies around blogging and thought “social software” could become a big deal if handled correctly. Her group is behind Wallop and I asked if she'd be able to wrangle an invitation so I could check it out. Given my previous negative impressions of social software I'm curious to see what the folks at MSR have come up with. She seemed aware of the limitations of the current crop of “social software” that are hip with some members of the blogging crowd, so I'd like to see what she thinks they do differently. I think a fun little experiment would be seeing what it would be like to integrate some interaction with “social software” like Wallop into RSS Bandit. Too bad my free time is so limited.

    * So MSFT has a Social Computing Group and Google has Orkut? If I worked at Friendster it would seem the exit strategy is clear: try to get bought by Yahoo! before the VC funds dry up.


     

    Categories: Ramblings

    In his blog post entitled Namepaces in Xml - the battle to explain Steven Livingstone wrote

    It seems that Namespaces is quickly displacing Xml Schema as the thing people "like to hate" - well at least those that are contacting me now seem to accept Schema as "good".

    Now, the concept of namespaces is pretty simple, but because it happens to be used explicitly (and is a more manual process) in Xml people just don't seem to get it. There were two core worries put to me - one calling it "a mess" and the other "a failing". The whole thing centered around having to know what namespaces you were actually using (or were in scope) when selecting given nodes. So in the case of SelectNodes(), you need to have a namespace manager populated with the namespaces you intend to use. In the case of Schema, you generally need to know the targetNamespace of the Schema when working with the XmlValidatingReader. What the guys I spoke with seemed to dislike is that you actually have to know what these namespaces are. Why bother? Don't use namespaces and just do your selects or validation.

    Given that I am to some degree responsible for both classes mentioned in the above post, XmlNode (where SelectNodes() comes from) and XmlValidatingReader, I feel compelled to respond.

    The SelectNodes() problem is that people would like to evaluate XPath expressions over nodes without having to worry about namespaces. For example, given XML such as

    <root xmlns="http://www.example.com">

    <child />

    </root>

    to perform a SelectNodes() or SelectSingleNode() that returns the <child> element requires the following code

      XmlDocument doc = new XmlDocument(); 
      doc.LoadXml("<root xmlns='http://www.example.com'><child /></root>"); 
      XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable); 
      nsmgr.AddNamespace("foo", "http://www.example.com");  //this is the tricky bit 
      Console.WriteLine(doc.SelectSingleNode("/foo:root/foo:child", nsmgr).OuterXml);   

    whereas developers don't see why the code isn't something more along the lines of

      XmlDocument doc = new XmlDocument(); 
      doc.LoadXml("<root xmlns='http://www.example.com'><child /></root>"); 
      Console.WriteLine(doc.SelectSingleNode("/root/child").OuterXml);   

    which would be the case if there were no namespaces in the document.

    The reason the latter code sample does not work this way is that the select methods on the XmlDocument class conform to the W3C XPath 1.0 recommendation, which is namespace aware. In XPath, path expressions that match nodes based on their names are called node tests. A node test is a qualified name, or QName for short. A QName is syntactically an optional prefix and a local name separated by a colon. The prefix is supposed to be mapped to a namespace and is not to be used literally in matching the expression. Specifically, the spec states

    A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). It is an error if the QName has a prefix for which there is no namespace declaration in the expression context.

    There are a number of reasons why this is the case which are best illustrated with an example. Consider the following two XML documents

    <root xmlns="urn:made-up-example">

    <child xmlns="http://www.example.com"/>

    </root>

    <root>

    <child />

    </root>

    Should the query /root/child also match the <child> element for the above two documents as it does for the original document in this example? The 3 documents shown [including the first example] are completely different documents and there is no consistent, standards compliant way to match against them using QNames in path expressions without explicitly pairing prefixes with namespaces.

    The only way to give people what they want in this case would be to come up with a proprietary version of XPath that was namespace agnostic. We do not plan to do this. However, I do have a tip for developers showing how to reduce the amount of code it takes to write the examples. The following code does match the <child> element in all three documents and is fully conformant with the XPath 1.0 recommendation

    XmlDocument doc = new XmlDocument(); 
    doc.LoadXml("<root xmlns='http://www.example.com'><child /></root>"); 
    Console.WriteLine(doc.SelectSingleNode("/*[local-name()='root']/*[local-name()='child']").OuterXml);  

    Now on to the XmlValidatingReader issue. Assume we are given the following XML instance and schema

    <root xmlns="http://www.example.com">
     <child />
    </root>

    <xs:schema targetNamespace="http://www.example.com"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                elementFormDefault="qualified">
           
      <xs:element name="root">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="child" type="xs:string" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>

    </xs:schema>

    The instance document can be validated against the schema using the following code

    XmlTextReader tr = new XmlTextReader("example.xml");
    XmlValidatingReader vr = new XmlValidatingReader(tr);
    vr.Schemas.Add(null, "example.xsd");

    vr.ValidationType = ValidationType.Schema;
    vr.ValidationEventHandler += new ValidationEventHandler (ValidationHandler);

    while(vr.Read()){ /* do stuff or do nothing */ }

    As you can see, you do not need to know the target namespace of the schema to perform schema validation using the XmlValidatingReader. However, many code samples in our SDK specify the target namespace where I specified null above when adding schemas to the Schemas property of the XmlValidatingReader. When null is specified it indicates that the target namespace should be obtained from the schema. This would have been clearer if we'd had an overload of the Add() method which took only the schema, but we didn't. Hindsight is 20/20.
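    For the schema above, whose targetNamespace is http://www.example.com, the following two calls therefore end up doing the same thing; the first reads the target namespace from the schema itself while the second states it explicitly:

    vr.Schemas.Add(null, "example.xsd");                      // target namespace obtained from the schema
    vr.Schemas.Add("http://www.example.com", "example.xsd");  // target namespace specified explicitly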


     

    Categories: XML

    February 8, 2004
    @ 10:15 PM

    I noticed Gordon Weakliem reviewed ATOM.NET, an API for parsing and generating Atom feeds. I went to the ATOM.NET website and decided to take a look at the ATOM.NET documentation. The following comments come from two perspectives: the first is as a developer who'll most likely have to implement something akin to ATOM.NET for RSS Bandit's internal workings, and the other is as one of the folks at Microsoft whose job it is to design and critique XML-based APIs.

    • The AtomWriter class is superfluous. The class has only one method, Write(AtomFeed), which makes more sense as a method on the AtomFeed class since an object should know how to persist itself (see the sketch after this list). This is the model we followed with the XmlDocument class in the .NET Framework which has an overloaded Save() method. The AtomWriter class would be quite useful if it allowed you to perform schema driven generation of an AtomFeed, the same way the XmlWriter class in the .NET Framework is aimed at providing a convenient way to programmatically generate well-formed XML [although it comes close but doesn't fully do this in v1.0 & v1.1 of the .NET Framework]

    • I have the same feelings about the AtomReader class. This class also seems superfluous. The functionality it provides is akin to the overloaded Load() method we have on the XmlDocument class in the .NET Framework. I'd say it makes more sense and is more usable if this functionality were provided as a Load() method on the AtomFeed class rather than as a separate class, unless the AtomReader class actually gains some more functionality.

    • There's no easy way to serialize an AtomEntry class as XML, which means it'll be cumbersome using ATOM.NET for the Atom API since that requires sending entries as XML over the wire. I use this functionality all the time in RSS Bandit internally, from passing entries as XML to XSLT themes, to the CommentAPI, to IBlogExtension.

    • There is no consideration for how to expose extension elements and attributes in ATOM.NET. As far as I'm concerned this is a deal breaker that makes ATOM.NET useless for aggregator authors, since it means they can't handle extensions in Atom feeds even though they may exist and have already started popping up in various feeds.
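    To make the first two points concrete, here is a rough sketch of the shape I'm suggesting. The class and member names below are hypothetical and this is not the actual ATOM.NET API; it simply mirrors the XmlDocument Load()/Save() pattern mentioned above.

      // Hypothetical sketch of the suggested design, not the actual ATOM.NET API.
      // Modeled on the Load()/Save() pattern of XmlDocument in the .NET Framework.
      public class AtomFeed
      {
          // Load the feed from a URL or file instead of using a separate AtomReader class
          public void Load(string url) { /* parse the feed into this object */ }

          // Persist the feed instead of using a separate AtomWriter class
          public void Save(System.IO.TextWriter writer) { /* write this feed out as XML */ }
      }

      public class AtomEntry
      {
          // Serialize a single entry as XML, e.g. for the Atom API, the CommentAPI or IBlogExtension
          public string ToXml() { /* return this entry as an XML fragment */ return ""; }
      }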


     

    Categories: XML

    February 8, 2004
    @ 09:37 PM

    Lots of people seem to like the newest version of RSS Bandit. The most recent praise was the following post by Matt Griffith

    I've been a Bloglines user for almost a year. I needed a portable aggregator because I use several different computers. Then a few months ago I got a TabletPC. Now portability isn't as critical since I always have my Tablet with me. I stayed with Bloglines though because none of the client-side aggregators I tried before worked for me.

    I just downloaded the latest version RSS Bandit. I love it. It is much more polished than it was the last time I tried it. Combine that with the dasBlog integration and the upcoming SIAM support and I'm in hog heaven. Thanks Dare, Torsten, and everyone else that helped make RssBandit what it is.

    Also it seems that at least one user liked RSS Bandit so much that he [or she] was motivated to write an article on Getting Started with RSS Bandit. Definitely a good starting point and something I wouldn't mind seeing become part of the official documentation once it's been edited and more details fleshed out.

    Sweet.


     

    Categories: RSS Bandit