I recently received a number of bug reports that RSS Bandit would report a proxy authentication error when fetching certain RSS feeds through a proxy server. Most feeds worked fine, but a particular set of feeds resulted in the following message:

The remote server returned an error: (407) Proxy Authentication Required.

Examples of sites that had problems include the feeds for Today on Java.net, Martin Fowler's bliki and Wired News. It dawned on me that the one thing all these feeds had in common was that they referenced a DTD. The problem was that although I was using an instance of the System.Net.IWebProxy interface in combination with an HttpWebRequest when fetching the RSS feed, I had not told the XmlValidatingReader used to process the feed to use the proxy information when resolving DTDs.

This is where things got less intuitive. All XmlReaders have an XmlResolver property used to retrieve resources external to the file. However, the XmlResolver class does not provide a way to specify proxy information, only authentication information. To solve this problem I had to create a subclass of the XmlResolver class which used the proxy connection when retrieving external resources. It seems I'm not the only person who's come across this problem, and the solution was presented on the microsoft.public.dotnet.xml newsgroup a while ago in the thread entitled XmlValidatingReader, XmlResolver, Proxy Authentication, Credentials, Remote schema. This post shows how to create a custom XmlResolver which utilizes proxy information and how to use this class to prevent the errors I was seeing.
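The idea behind a proxy-aware resolver can be sketched outside of .NET as well. The following is a minimal Python analogue, not the C# code from the newsgroup thread or from RSS Bandit: the names build_proxy_opener and ProxyAwareResolver, the proxy address and the credentials are all hypothetical. It only illustrates the principle that whatever component fetches DTDs has to be handed the same proxy settings as the component that fetches the feed.

```python
import urllib.request


def build_proxy_opener(proxy_url, username=None, password=None):
    """Build an opener that routes HTTP and HTTPS requests through proxy_url."""
    handlers = [
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    ]
    if username is not None:
        # Answer 407 Proxy Authentication Required challenges with
        # basic credentials (hypothetical values supplied by the caller).
        password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, proxy_url, username, password)
        handlers.append(urllib.request.ProxyBasicAuthHandler(password_mgr))
    return urllib.request.build_opener(*handlers)


class ProxyAwareResolver:
    """Hypothetical resolver: fetches external resources such as DTDs
    through the same proxy-aware opener used for the feed itself."""

    def __init__(self, opener):
        self.opener = opener

    def get_entity(self, absolute_uri, timeout=30):
        # Every external resource goes through the proxy, so a DTD
        # reference no longer bypasses the proxy configuration.
        with self.opener.open(absolute_uri, timeout=timeout) as response:
            return response.read()


# Assumed proxy address and credentials, for illustration only.
opener = build_proxy_opener("http://proxy.example:8080", "user", "secret")
resolver = ProxyAwareResolver(opener)
```

The design point is the same as in the .NET fix: the resolver does not create its own connection settings, it reuses the ones the application already configured for the feed request.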

I checked in the fix to RSS Bandit this morning, so very soon a number of users of the most sophisticated news aggregator on the Windows platform will be very happy campers seeing this annoying bug fixed.  


Categories: XML

August 1, 2004
@ 02:45 AM

The August issue of Playboy magazine has an article entitled “Detroit, Death City” which cites some depressing statistics about this once-great city. Excerpts from the article include:

“Beyond the murder rate, there are three statistics that tell you a lot of what's happening in Detroit,” says Wayne State's Herron. “More than half the residents don't have high school diplomas, 47 percent of adults are functionally illiterate, and 44 percent of the people between the ages of 16 and 60 are either unemployed or not looking for work. Half the population is disqualified from participating in the official economy except at the lowest levels.”
Married couples head only 36.9 percent of Detroit families. Single fathers head 8.2 percent, single mothers 54.9 percent.

Detroit is 82.8 percent African American, second only to Gary, Indiana. Livonia, nine miles from the city is 96.5 percent white.

Detroit is the nation's number one city for auto arson. In 1999, more than 3300 cars were torched, costing insurers $22 million.

In 1950 Detroit's population was 1.9 million making it the fifth largest US city. By 2000 its population was 950,000.

Detroit's yearly pedestrian fatality rate is the nation's highest at 5.05 per 100,000 residents. New York City's rate is half that.

The author of the article tells the story of his father-in-law, a 1960s revolutionary who became a well-known figure in the fight to save Detroit, and his brother-in-law who became a drug dealer. I found the juxtaposition of the life of the father and the son presented an interesting contrast. The article was definitely one of the better things-are-really-screwed-up-in-America's-inner-cities style articles I've read in a while.

There's also an interview with Spike Lee in this month's issue. This subscription is definitely working out.


Categories: Ramblings

Since I wrote my post What is Google Building? I've seen lots of interesting responses to it in my referrer logs. As usual, Jon Udell's response gave me the most food for thought. In his post entitled Bloglines he wrote:

Finally, I'd love it if Bloglines cached everything in a local database, not only for offline reading but also to make the UI more responsive and to accelerate queries that reach back into the archive.

Like Gmail, Bloglines is the kind of Web application that surprises you with what it can do, and makes you crave more. Some argue that to satisfy that craving, you'll need to abandon the browser and switch to RIA (rich Internet application) technology -- Flash, Java, Avalon (someday), whatever. Others are concluding that perhaps the 80/20 solution that the browser is today can become a 90/10 or 95/5 solution tomorrow with some incremental changes.
It seems pretty clear to me. Web applications such as Gmail and Bloglines are already hard to beat. With a touch of alchemy they just might become unstoppable.

This does seem like the missing part of the puzzle. The big problem with web applications (aka thin client applications) is that they cannot store a lot of local state. I use my primary mail readers offline (Outlook & Outlook Express) and I use my primary news aggregator (RSS Bandit) offline on my laptop when I travel or in meetings when I can't get a wireless connection. There are also lots of dial-up users out there who don't have the luxury of an 'always on' broadband connection and who also rely on the offline capabilities of such applications.

I suspect this is one of the reasons Microsoft stopped trying to frame the argument as thin client vs. fat (or rich) client. That framing basically argues that an application with zero deployment and a minimalistic user interface is inferior to a desktop application that needs to be installed, updated and patched but has a fancier GUI. This is an argument that holds little water with most people, which is why the popularity of Web applications has grown both on the Internet and on corporate intranets.

Microsoft has attempted to tackle this problem in two ways. The first attempt is to make rich client applications as easy to develop and deploy as web applications by creating a rich client markup language, XAML, as well as the ClickOnce application deployment technology. The second is better positioning: emphasizing the offline capabilities of rich clients and coming up with a new monicker for them, smart clients.

Companies that depend on thin client applications, such as Google with GMail, do realize their failings. However, Google is in the unique position of being able to attract some very smart people who've been working on this problem for a while. For example, their recent hire Adam Bosworth wrote about technologies for solving this limitation in thin clients in a number of blog posts from last year: Web Services Browser, Much delayed continuation of the Web Services Browser and When connectivity isn't certain. The last post definitely has some interesting ideas such as:

the issue that that I want a great user experience even when not connected or when conected so slowly that waiting would be irritating. So this entry discusses what you do if you can't rely on Internet connectivity.

Well, if you cannot rely on the Internet under these circumstances, what do you do? The answer is fairly simple. You pre-fetch into a cache that which you'll need to do the work. What will you need? Well, you'll need a set of pages designed to work together. For example, if I'm looking at a project, I'll want an overview, details by task, breakout by employee, late tasks, add and alter task pages, and so on. But what happens when you actually try to do work such as add a task and you're not connected? And what does the user see.

To resolve this, I propose that we separate view from data. I propose that a "mobile page" consists both of a set of related 'pages' (like cards in WML), an associated set of cached information and a script/rules based "controller" which handles all user gestures. The controller gets all requests (clicks on Buttons/URL's), does anything it has to do using a combination of rules and script to decide what it should do, and then returns the 'page' within the "mobile page" to be displayed next. The script and rules in the "controller" can read, write, update, and query the associated cache of information. The cache of information is synchronized, in the background, with the Internet (when connected) and the mobile page describes the URL of the web service to use to synchronize this data with the Internet. The pages themselves are bound to the cache of information. In essence they are templates to be filled in with this information. The mobile page itself is actually considered part of the data meaing that changes to it on the Internet can also be synchronized out to the client. Throw the page out of the cache and you also throw the associated data out of the cache.

Can you imagine using something like GMail, Google Groups or Bloglines in this kind of environment? That definitely would put the squeeze on desktop applications.


Categories: Technology

About a week ago my article Designing Extensible, Versionable XML Formats appeared on XML.com. However due to a “pilot error” on my end I didn't send the final draft to XML.com. By the time I realized my mistake the article was already live and changing it would have been cumbersome since there were a few major changes in the article.

You can read the final version of the article Designing Extensible, Versionable XML Formats on MSDN. The main differences between the MSDN article and the XML.com one are:

  1. Added sections on Message Transfer Negotiation vs. Versioning Message Payloads and Version Numbers vs. Namespace Names

  2. Added more content to the section Using XML Schema to Design an Extensible XML Format especially around usage of substitution groups, xsi:type and xs:redefine.

  3. Amended all sample schemas to use blockDefault="#all".

  4. Added an Acknowledgements section

  5. The schema in the section New constructs in a new namespace approach uses a fixed value instead of a default value for the mustUnderstand attribute on the isbn element.


Categories: XML

July 25, 2004
@ 12:30 AM

Recently there have been some complaints about duplicate entries showing up in RSS Bandit. This is due to a change I made in the most recent version of RSS Bandit. In RSS 2.0 there is an optional guid element that can be used to uniquely identify an item in an RSS feed. Unfortunately this element is optional so most aggregators end up using the link element instead in feeds that don't use guids. 

For the most part this worked fine. However I stumbled across a feed that used the same link for each item from a given day; the Cafe con Leche RSS feed. This meant that RSS Bandit couldn't differentiate between items posted on the same day. This was particularly important when tracking what items a user has read or whether an item has already been downloaded or not. I should have pinged the owner of the feed to point this problem out but instead I decided to code around this issue by using the combination of the link and title elements for uniquely identifying items. This actually turned out to be worse.

Although this fixed the problems with the Cafe con Leche RSS feed, it caused other issues. Any time an item in a feed changes its title but keeps the same permalink (for example, if a typo is fixed in the title), RSS Bandit thinks it's a different post and a duplicate entry shows up in the list view. Since popular sites like Boing Boing and Slashdot tend to do this almost every other day, I turned a problem with a niche site that affects a few users into one that affects a number of popular websites and thus lots of users.
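To make the failure mode concrete, here is a small Python sketch of the two identity rules discussed above. It is an illustration only, not RSS Bandit's code: the dictionary layout, the function names and the example.org URLs are assumptions, and the guid-then-link rule is just one plausible choice, not necessarily the fix that will ship.

```python
def key_link_and_title(item):
    """Identity rule that caused the duplicates: link plus title."""
    return (item.get("link"), item.get("title"))


def key_guid_or_link(item):
    """Alternative rule: use the guid when present, else fall back to link."""
    return item.get("guid") or item.get("link")


def merge_items(known, fetched, key_func):
    """Return the known items plus any fetched item whose key is new."""
    seen = {key_func(item) for item in known}
    merged = list(known)
    for item in fetched:
        key = key_func(item)
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged


# The same post before and after the publisher fixes a typo in the title.
original = {"link": "http://example.org/post/1", "title": "Teh news"}
corrected = {"link": "http://example.org/post/1", "title": "The news"}

by_link_and_title = merge_items([original], [corrected], key_link_and_title)
by_guid_or_link = merge_items([original], [corrected], key_guid_or_link)

print(len(by_link_and_title))  # 2: the corrected post shows up as a duplicate
print(len(by_guid_or_link))    # 1: the post is recognized as already seen
```

Note that the guid-or-link rule brings back the original problem for feeds that reuse one link per day and provide no guid, so neither rule is right for every feed.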

This problem will be fixed in the next version of RSS Bandit.


Categories: RSS Bandit

July 24, 2004
@ 08:51 PM

In the past couple of months Google has hired four people who used to work on Internet Explorer in various capacities [especially its XML support] and who then moved to BEA: David Bau, Rod Chavez, Gary Burd and most recently Adam Bosworth. A number of my coworkers used to work with these guys since our team, the Microsoft XML team, was once part of the Internet Explorer team. It's been interesting chatting in the hallways with folks contemplating what Google would want to build that requires folks with a background in building XML data access technologies both on the client side (Internet Explorer) and on the server (BEA's WebLogic).

Another interesting recent Google hire is Joshua Bloch. He is probably the most visible guy working on the Java language at Sun behind James Gosling. Based on recent interviews with Joshua Bloch about Java his most recent endeavors involved adding new features to the language that mimic those in C#.

While chomping on some cheap sushi at Sushi Land yesterday some friends and I wondered what Google could be planning next. So far, the software industry, including my employer, has been playing catch-up with Google and reacting to their moves. According to news reports MSN is trying to catch up to Google search and Hotmail is upping its free storage limit to compete with GMail. However this is all reactive and we still haven't seen significant competition to Google News, Google Image Search, Google Groups or even, to a lesser extent, Orkut and Blogger. By the time the major online networks like AOL, MSN or Yahoo! can provide decent alternatives to this media empire Google will have produced their next major addition.

So far Google doesn't seem to have stitched all its pieces into a coherent media empire as competitors like Yahoo! have done but this seems like it will only be a matter of time. What is of more interest to the geek in me is what Google could build next that could tie it all together. As Rich Skrenta wrote in his post the Secret Source of Google's Power

Google is a company that has built a single very large, custom computer. It's running their own cluster operating system. They make their big computer even bigger and faster each month, while lowering the cost of CPU cycles. It's looking more like a general purpose platform than a cluster optimized for a single application.

While competitors are targeting the individual applications Google has deployed, Google is building a massive, general purpose computing platform for web-scale programming.

A friend of mine, Justin, had an interesting idea at dinner yesterday. What if Google ends up building the network computer? They can give users the storage space and reliability to place all their data online. They can mimic the major desktop applications users interact with daily by using Web technologies. This sounds far-fetched but then again, I'd have never imagined I'd see a free email service that gave away 1GB of storage.

Although I think Justin's idea is outlandish, I suspect the truth isn't much further from that.

Update: It seems Google also picked up another Java language guy from Sun: Neal Gafter, who worked on various Java compiler tools including javac, javadoc and javap. Curiouser and curiouser.


Categories: Technology

During the most recent Download.Ject Internet Explorer incident [which was significant enough that I saw newspaper headlines and TV news reports advising people to switch browsers] I got some requests from RSS Bandit users to switch the browser used by RSS Bandit since they'd switched from using Internet Explorer due to security concerns.

Torsten and I looked around to see how feasible this would be and found the Mozilla ActiveX control which enables one to embed the Mozilla browser engine (Gecko) into any ActiveX application. The control implements the same APIs as the Internet Explorer control so it may be straightforward to make this change. 

I have some concerns about doing this.

  1. We've had weird interactions with COM interop between RSS Bandit and IE which result in weird bugs like dozens of IE windows being spawned and most recently memory corruption errors. I am wary of moving to an unknown quantity like Gecko and facing similar issues without the benefit of having a background of working with the component.

  2. There's a question of whether we replace our dependency on IE, ship an option to use Gecko instead of IE, or just ship a Gecko version and an IE version. The installer for the Mozilla ActiveX control is currently larger than the RSS Bandit download so we'd more than double the size of our download if we tied ourselves to Gecko.

I'm curious as to what RSS Bandit users think. Currently I don't think I'm going to add making such a switch to our plans but I am always interested in feedback from our users on what they think the right thing to do should be.


Categories: RSS Bandit

July 21, 2004
@ 07:15 AM

It seems every 3 months some prominent online publication complains about the amount of traffic RSS news readers cause to websites that provide RSS feeds. This time it is Slashdot with their post When RSS Traffic Looks Like a DDoS which references a post by Chad Dickerson, the CTO of InfoWorld, entitled RSS growing pains. Chad writes:

Several months ago, I spoke to a Web architect at a large media site and asked why his site didn’t support RSS. He raised the concern that thousands (or even millions) of dumb clients could wreak havoc on a popular Web site. Back when I was at CNN.com, I recall that our servers got needlessly pounded by a dumb client (IE4) requesting RSS-like CDF files at frequent intervals regardless of whether they had changed. As the popularity of RSS feeds at InfoWorld started to surge, I began to notice that most of the RSS clients out there requested and downloaded our feeds regardless of whether the feeds themselves had changed. At the time, we hadn’t quite reached the RSS tipping point, so I filed these thoughts away for later -- but “later” came sooner than I thought.

At this point I'd like to note that HTTP provides two mechanisms for web servers to tell clients whether a network resource has changed: the Last-Modified header, which clients echo back in If-Modified-Since, and the ETag header, which clients echo back in If-None-Match. The basics of these mechanisms are explained in the blog post HTTP Conditional Get for RSS Hackers, which describes how to prevent clients such as news readers from repeatedly downloading a Web document that hasn't been updated. I'd also like to point out that, at the current time, the InfoWorld RSS feed supports neither.
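As a rough illustration of what a well-behaved client does, here is a minimal Python sketch of a conditional GET using only the standard library. The function names and the shape of the cached dictionary are assumptions made for this example; the header names and the 304 Not Modified status code come from HTTP itself.

```python
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build validator headers from what the last response gave us."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def fetch_if_changed(url, cached):
    """Download the feed only if the server says it has changed.

    `cached` is a dict with optional "etag", "last_modified" and "body"
    keys from a previous fetch (a hypothetical layout for this sketch).
    """
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return {
                "body": response.read(),
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not Modified: no body was sent, keep using the cached copy.
            return cached
        raise
```

A client would store the returned dictionary after each poll and pass it back in on the next one, so an unchanged feed costs only a header exchange.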

Another technique for reducing bandwidth consumption by HTTP clients is to use HTTP compression, which greatly reduces the amount of data that has to be sent to a client when the feed has to be downloaded. For example, the current InfoWorld feed is 7427 bytes, which shrinks to 2551 bytes when zipped using GZip on my home machine. This is a reduction by a factor of roughly 3; on larger files the ratio of the reduced size to the original size is even better. Again, InfoWorld doesn't support this technique for reducing bandwidth consumption.

It is unsurprising that they are seeing significant bandwidth consumption from news aggregators. An RSS reader polling the InfoWorld site once an hour over an 8 hour period would download about 60 kilobytes of XML. If the site supported HTTP conditional GET requests and HTTP compression via GZip encoding, this number would be under 3 kilobytes.
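The arithmetic behind those two figures can be spelled out in a few lines of Python. This is a back-of-the-envelope sketch under two stated assumptions: the feed does not change during the 8 hours, so a conditional GET client downloads it once and gets bodiless 304 responses afterwards, and HTTP header overhead is ignored. The function name is made up for the example; the byte counts are the ones quoted above.

```python
FEED_BYTES = 7427      # uncompressed InfoWorld feed
GZIPPED_BYTES = 2551   # same feed after GZip
POLLS = 8              # once an hour for 8 hours


def total_body_bytes(polls, body_bytes, full_downloads=None):
    """Body bytes transferred; by default every poll is a full download."""
    downloads = polls if full_downloads is None else full_downloads
    return downloads * body_bytes


# No conditional GET, no compression: every poll fetches the whole feed.
naive = total_body_bytes(POLLS, FEED_BYTES)

# Conditional GET plus GZip, assuming the feed never changes:
# one compressed download, then seven 304 responses with no body.
optimized = total_body_bytes(POLLS, GZIPPED_BYTES, full_downloads=1)

print(naive)      # 59416 bytes, about 60 kilobytes
print(optimized)  # 2551 bytes, under 3 kilobytes
```

If the feed did change a few times during the day, the optimized figure would grow by one compressed download per change, which is still far below the naive total.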

The one thing that HTTP doesn't provide is a way for clients to deal with numerous connections being made to the site at once. However this problem isn't much different from the traditional scaling problem that web sites have to deal with today when they get a lot of traffic from regular readers.  


Today Arpan (the PM for XML query technologies in the .NET Framework) and I were talking about features we'd like to see on our 'nice to have' list for the Orcas release of the .NET Framework. One of the things we thought would be really nice to see in the System.Xml namespace was XPath 2.0. Then Derek, being the universal pessimist, pointed out that we already have APIs that support XPath 1.0 and only take a string as an argument (e.g. XmlNode.SelectNodes), so we'd have difficulty adding support for another version of XPath without contorting the API.

Not to be dissuaded, I pointed out that XPath 2.0 has a backwards compatibility mode which makes it compatible with XPath 1.0. Thus we wouldn't have to change our Select methods or introduce new methods for XPath 2.0 support since all queries that used to work in the past against our Select methods would still work if we upgraded our XPath implementation to version 2.0. This is where Arpan hit me with the one-two punch. He introduced me to a section of the XPath 2.0 spec called Incompatibilities when Compatibility Mode is true which reads

The list below contains all known areas, within the scope of this specification, where an XPath 2.0 processor running with compatibility mode set to true will produce different results from an XPath 1.0 processor evaluating the same expression, assuming that the expression was valid in XPath 1.0, and that the nodes in the source document have no type annotations other than xdt:untypedAny and xdt:untypedAtomic.

I was stunned by what I read and I am still stunned now. The W3C created XPath 2.0, which is currently backwards incompatible with XPath 1.0, and added a compatibility mode option to make it backwards compatible with XPath 1.0, yet it still isn't backwards compatible even when in this mode? This seems completely illogical to me. What is the point of having a backwards compatibility mode if it isn't backwards compatible? Well, I guess now I know that if we do decide to ship XPath 2.0 in the future we can't just add support for it transparently to our existing classes without causing some API churn. Unfortunate.

Caveat: The fact that a technology is mentioned as being on our 'nice to have' list or is suggested in a comment to this post is not an indication that it will be implemented in future versions of the .NET Framework.


Categories: XML

July 17, 2004
@ 02:40 AM

Dave Winer writes

Russ Beattie says we should be careful not to give the Republicans ammo to kill Kerry. I am sorry Russ, I'm not worried about that. I'm more worried that the Dems are too flustered by the hardball tacticts of the Reps to fight back.

The only time I tend to watch regular TV that isn't TiVo is while working out in the morning at the health club. I've noticed that while John Kerry's ads tend to be about the qualities that make him a good candidate for president, George Bush's ads have mostly been negative ads attacking John Kerry. Personally I would love it if Kerry's campaign continues to take the high ground and shows the Republican party up for the rabid attack dogs that they are. The problem with this is that negative ads work and some people tend to look at not hitting back as a sign of weakness, which is what it seems Dave Winer is doing.

Whatever happened to trying to change the tone in Washington and elevate the discourse? Just another case of "Do what I say, not what I do" I guess.



Categories: Ramblings