Adam Bosworth's "Learning From the Web"

November 9, 2005

@ 09:12 PM

Last week Andrew Conrad told me to check out a recent article by Adam Bosworth in the ACM Queue because he wondered what I thought about. I was rather embarassed to note that althought I'd seen some mention of it online, I hadn't read it. I read it today and as usual, Adam Bosworth is on point.

The article is entitled Learning from THE WEB and it begins by listing eight "unintuitive lessons" we have learned from the Web. The lessons are listed below

Simple, relaxed, sloppily extensible text formats and protocols often work better than complex and efficient binary ones.
It is worth making things simple enough that one can harness Moore’s law in parallel.
It is acceptable to be stale much of the time.
The wisdom of crowds works amazingly well.
People understand a graph composed of tree-like documents (HTML) related by links (URLs).
Pay attention to physics.
Be as loosely coupled as possible.
KISS. Keep it (the design) simple and stupid.

Where the paper gets interesting is that then tries to apply these lessons to XML. Remember that Adam was one of the founder of the XML team at Microsoft and knows a thing or two about it. So he writes

In my humble opinion, however, we ignored or forgot lessons 3, 4, and 5. Lesson 3 tells us that elements in XML with values that are unlikely to change for some known period of time (or where it is acceptable that they are stale for that period of time, such as the title of a book) should be marked to say this. XML has no such model.
...
Lesson 4 says that we shouldn’t over-invest in making schemas universally understood.
...
Lessons 1 and 5 tell us that XML should be easy to understand without schemas

I totally agree with his assessment of the lessons learned from lessons 4 & 5. However the issue of being able to mark an element in an XML file as 'relatively unchanging' in a generic way seems to be lost on me. He then goes on to point out more of the problems with XML [and the Semantic Web/RDF]

There are some interesting implications in all of this.

One is that the Semantic Web is in for a lot of heartbreak. It has been trying for five years to convince the world to use it. It actually has a point. XML is supposed to be self-describing so that loosely coupled works. If you require a shared secret on both sides, then I’d argue the system isn’t loosely coupled, even if the only shared secret is a schema. What’s more, XML itself has three serious weaknesses in this regard:

It doesn’t handle binary data well.

It doesn’t handle links.

XML documents tend to be monolithic.

Now it's gotten pretty interesting and at this point, Adam throws the curve ball.

Recently, an opportunity has arisen to transcend these limitations. RSS 2.0 has become an extremely popular format on the Web. RSS 2.0 and Atom (which is essentially isomorphic) both support a base schema that provides a model for sets. Atom’s general model is a container (a <feed>) of <entry> elements in which each <entry> may contain any namespace scoped elements it chooses (thus any XML), must contain a small number of required elements (<id>, <updated>, and <title>), and may contain some other well-known ones in the Atom namespace such as <link>s. Even better, Atom clearly says that the order doesn’t matter.This immediately gives a simple model for sets missing in XML.
...
Atom also supports links of other sorts, such as comments, so clearly an Atom entry can contain links to related feeds (e.g., Reviews for a Restaurant or Complaints for a Customer) or links to specific posts. This gives us the network and graph model that is missing in XML. Atom contains a simple HTTP-based way to INSERT, DELETE, and REPLACE s within a . There is a killer app for all these documents because the browsers already can view RSS 2.0 and Atom and, hopefully, will soon natively support the Atom protocol as well, which would mean read and write capabilities.

Now that's deep. Why not move up one level of abstraction from exchanging XML documents to exchanging Web Feeds (RSS/Atom documents)? Adam ends his article by throwing a challenge out to database vendors who he believes have failed to learn the lessons of the Web by writing

All of this has profound implications for databases. Today databases violate essentially every lesson we have learned from the Web.

Are simple relaxed text formats and protocols supported? No.

Have databases enabled people to harness Moore’s law in parallel? This would mean that databases could scale more or less linearly to handle both the volume of the requests coming in and even the complexity. The answer is no.

Do databases optimize caching when it is OK to be stale? No.

Do databases let schemas evolve for a set of items using a bottom-up consensus/tipping point? Obviously not.

Do databases handle flexible graphs (or trees) well? No, they do not.

Have the databases learned from the Web and made their queries simple and flexible? No, just ask a database if it has anyone who, if they have an age, are older than 40; and if they have a city, live in New York; and if they have an income, earn more than $100,000. This is a nightmare because of all the tests for NULL.

The article ends by arguing that database vendors should add native support for the Atom Protocol and wire format. I find this interesting since based on conversations on the atom-protocol list, it is clear that Google is very interested in the Atom API. Perhaps they have already built this Atom store that Adam is arguing for and will expose the Atom API as a way to interact with it. Perhaps this Atom store accessible via Atom feeds and the Atom API is Google Base? Speculation is fun.

As for me, I tend to agree with Adam that moving up layers of abstraction is a good idea. We've all agreed on XML, the next thing to do is to agree on applications of XML. We've all agreed on RSS, the next thing to do is figure out what scenarios are enabled by the subscribe model. This is one of the reasons why I disliked the unnecessary fragmentation caused by the RSS vs. Atom battles. As for whether we need to start seeing databases with native RSS/Atom support, I think it's too early in the game to jump there. Heck, RDF has been around for a while but we are just know seeing some decent things happening with SPARQL and various RDF stores. Similarly with XML and XQuery. I don't think enough lessons have been learned from either to start thinking about what it would mean to have a native RSS/Atom store. It is an interesting idea though.