February 23, 2006
@ 10:37 PM

In his post More SOAP vs. REST arguments, Stefan Tilkov asks

I just noticed an interesting thing in the most recent iteration of the SOAP-vs.-REST debate: this time, nobody seems to have mentioned the benefit — if you believe that’s what it is — of protocol independence. Why is that?

For the record, I personally believe it’s one of the weakest arguments.

When I was on the XML team, we used to talk about XML infosets and data format independence; the idea was that people could use transfer formats optimized for their use cases but still get the benefit of the XML machinery such as XML APIs (DOM, SAX, etc.), XPath querying, XSLT transformations, and so on. This philosophy is what has underpinned the arguments for protocol independence at the SOAP level.

Now that I'm actually a customer of web services toolkits as opposed to a builder of the framework they depend on, my perspective has changed. Protocol independence isn't really that important at the SOAP level. Protocol independence is important at the programming model/toolkit level. I should be able to write my business logic once and then choose to expose it as SOAP, XML-RPC, RSS, or a proprietary binary protocol without rewriting a bunch of code. That's what is important to my business needs.

It took a while but I eventually convinced some of the key Indigo folks that this was the right direction to go in with the Windows Communication Foundation. I damn near clapped when Doug demoed a WS-Transfer RSS service exposing an HTTP/POX endpoint, an HTTP/SOAP endpoint, and a TCP/Binary SOAP endpoint at an internal summit a couple of months ago.
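To make that concrete, here is a minimal sketch against the public WCF programming model: one service class, multiple endpoints. The contract, addresses and choice of bindings below are my own invention rather than the ones from Doug's demo, and the HTTP/POX endpoint uses the web binding that shipped in a later WCF release.

    using System;
    using System.ServiceModel;
    using System.ServiceModel.Description;

    // Hypothetical contract standing in for the business logic.
    [ServiceContract]
    public interface IFeedService
    {
        [OperationContract]
        string GetFeed(string feedId);
    }

    public class FeedService : IFeedService
    {
        public string GetFeed(string feedId)
        {
            return "<rss version=\"2.0\"/>"; // business logic written once
        }
    }

    class Program
    {
        static void Main()
        {
            ServiceHost host = new ServiceHost(typeof(FeedService),
                new Uri("http://localhost:8000/feeds"),
                new Uri("net.tcp://localhost:8001/feeds"));

            // The same implementation exposed over different transports and wire formats.
            host.AddServiceEndpoint(typeof(IFeedService), new BasicHttpBinding(), "soap"); // HTTP + SOAP
            host.AddServiceEndpoint(typeof(IFeedService), new NetTcpBinding(), "");        // TCP + binary SOAP

            // HTTP + plain old XML.
            ServiceEndpoint pox = host.AddServiceEndpoint(typeof(IFeedService), new WebHttpBinding(), "pox");
            pox.Behaviors.Add(new WebHttpBehavior());

            host.Open();
            Console.WriteLine("Press ENTER to stop the service.");
            Console.ReadLine();
            host.Close();
        }
    }

The business logic in FeedService never changes; only the endpoint configuration does, which is exactly the kind of protocol independence that matters to me.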

Bottom Line: Protocol independence is important to providers of Web services. However, it isn't required at the SOAP level.


 

Categories: XML Web Services

From Matt Cutts's blog post about the Google Page Creator, we learn

Oh, and by the way, it looks like Google has released a tool to make mini-websites. The Google Page Creator at http://pages.google.com/ lets you throw up a quick set of pages without a ton of hassle. Looks like a bunch of different look ‘n’ feel choices:

I feel like I'm in a time warp. Did Google just ship their own version of GeoCities? Isn't this space dead? End users have graduated from personal home pages to blogs and social networking tools, which is why sites like MySpace, MSN Spaces and Xanga have tens of millions of users. Business users are likely to want an entire package like Office Live rather than just a web page creation tool.

Who exactly is the target audience for this offering?

Update: I just noticed that username@gmail.com corresponds to username.googlepages.com. How do you ship a product with such an obvious privacy bug? I guess if you are creating a 20% project you don't need to have privacy reviews. Doh!


 

Categories: Web Development

When you build web applications that have to scale up to millions of users, you sometimes end up questioning almost every aspect of your design as you hit scalability problems. One thing I hadn't expected was to notice that a number of people in our shoes had begun to point out the limitations of SQL databases when it comes to building modern Web applications. Below is a sampling of such comments, gathered primarily so I have easy access to them the next time I want an example of what I mean by the limitations of SQL databases when building large-scale Web applications.

  1. From Google's Adam Bosworth we have the post Where Have all the Good databases Gone

    The products that the database vendors were building had less and less to do with what the customers wanted...Google itself (and I'd bet a lot Yahoo too) have similar needs to the ones Federal Express or Morgan Stanley or Ford or others described, quite eloquently to me. So, what is this growing disconnect?

    It is this. Users of databases tend to ask for three very simple things:

    1) Dynamic schema so that as the business model/description of goods or services changes and evolves, this evolution can be handled seamlessly in a system running 24 by 7, 365 days a year. This means that Amazon can track new things about new goods without changing the running system. It means that Federal Express can add Federal Express Ground seamlessly to their running tracking system and so on. In short, the database should handle unlimited change.

    2) Dynamic partitioning of data across large dynamic numbers of machines. A lot people people track a lot of data these days. It is common to talk to customers tracking 100,000,000 items a day and having to maintain the information online for at least 180 days with 4K or more a pop and that adds (or multiplies) up to a 100 TB or so. Customers tell me that this is best served up to the 1MM users who may want it at any time by partioning the data because, in general, most of this data is highly partionable by customer or product or something. The only issue is that it needs to be dynamic so that as items are added or get "busy" the system dynamically load balances their data across the machines. In short, the database should handle unlimited scale with very low latency. It can do this because the vast majority of queries will be local to a product or a customer or something over which you can partion...

    3) Modern indexing. Google has spoiled the world. Everyone has learned that just typing in a few words should show the relevant results in a couple of hundred milliseconds. Everyone (whether an Amazon user or a customer looking up a check they wrote a month ago or a customer service rep looking up the history for someone calling in to complain) expects this. This indexing, of course, often has to include indexing through the "blobs" stored in the items such as PDF's and Spreadsheets and Powerpoints. This is actually hard to do across all data, but much of the need is within a partioned data set (e.g. I want to and should only see my checks, not yours or my airbill status not yours) and then it should be trivial.
    ...
    Users of databases don't believe that they are getting any of these three. Salesforce, for example, has a lot of clever technology just to hack around the dynamic schema problem so that 13,000 customers can have 13,000 different views of what a prospect is.

    If the database vendors ARE solving these problems, then they aren't doing a good job of telling the rest of us.

  2. Joshua Schachter of del.icio.us is quoted as saying the following in a recent talk

    Scaling: avoid early optimization. SQL doesn't map well to these problems - think about how to split up data over multiple machines. Understand indexing strategies, profile every SQL statement. Nagios or similar for monitoring.

    Tags don't map well to SQL. Sometimes you can prune based on usage - only index the first few pages for example. This keeps indexes small and fast.

  3. Mark Fletcher of Bloglines wrote the following in his post Behind the Scenes of the Bloglines Datacenter Move (Part 2)

    The Bloglines back-end consists of a number of logical databases. There's a database for user information, including what each user is subscribed to, what their password is, etc. There's also a database for feed information, containing things like the name of each feed, the description for each feed, etc. There are also several databases which track link and guid information. And finally, there's the system that stores all the blog articles and related data. We have almost a trillion blog articles in the system, dating back to when we first went on-line in June, 2003. Even compressed, the blog articles consist of the largest chunk of data in the Bloglines system, by a large margin. By our calculations, if we could transfer the blog article data ahead of time, the other databases could be copied over in a reasonable amount of time, limiting our downtime to just a few hours.

    We don't use a traditional database to store blog articles. Instead we use a custom replication system based on flat files and smaller databases. It works well and scales using cheap hardware.

Interesting things happen when you question everything. 
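Looking back at Bosworth's partitioning point in particular, the core idea is simply routing each customer's data to one of N database machines so that queries stay local to a shard. Here is a deliberately naive C# sketch of that routing step; the class and its members are my own invention, and a real system would rebalance shards dynamically rather than hash into a fixed list.

    using System;
    using System.Collections.Generic;

    public class ShardRouter
    {
        private readonly List<string> shards; // e.g. one connection string per database machine

        public ShardRouter(List<string> shards)
        {
            this.shards = shards;
        }

        // Every query for a given customer goes to the same shard, so lookups stay local.
        public string GetShardFor(string customerId)
        {
            int hash = customerId.GetHashCode() & 0x7FFFFFFF; // force a non-negative hash
            return shards[hash % shards.Count];
        }
    }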


 

Categories: Web Development

February 21, 2006
@ 06:59 PM

Yesterday, Mark Baker asked Why all the WS Interop problems?

If you'd have asked me six or seven years ago - when this whole Web services things was kicking off - how things were likely to go with them, I would have said - and indeed, have said many times since - that they would fail to see widespread use on the Internet, as their architecture is only suitable for use under a single adminstrator, i.e. behind a firewall. But if you'd asked me if I would have thought that there'd be this much trouble with basic interoperability of foundational specifications, I would have said, no, I wouldn't expect that. I mean, despite the architectural shortcomings, the job of developing interoperable specifications, while obviously difficult, wouldn't be any more difficult because of these shortcomings... would it?

In my opinion, the answer to his question is obvious. A few months ago I wrote in my post The Perils of Premature Standardization: Attention Data and OPML that

I used to be the program manager responsible for a number of XML technologies in the .NET Framework while I was on the XML team at Microsoft. The technology I spent the most time working with was the XML Schema Definition Language (XSD). After working with XSD for about three years, I came to the conclusion that XSD has held back the proliferation and advancement of XML technologies by about two or three years. The lack of adoption of web services technologies like SOAP and WSDL on the world wide web is primarily due to the complexity of XSD.

If you read the three posts that Mark Baker links to about SOAP interop problems, you'll notice that two of them are about the issues with mapping xsi:nil and minOccurs="0" to concepts in traditional object-oriented programming languages [specifically C#]. If a value is null, does that map to xsi:nil or to minOccurs="0"? How do I differentiate the two in a programming language that doesn't have these concepts? How do I represent xsi:nil in a programming language where primitive types such as integers can't be null?
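Here is a concrete example of the mismatch, using the conventions the .NET XmlSerializer actually generates for these schema constructs; the LineItem type and its elements are invented for illustration.

    using System;
    using System.Xml.Serialization;

    public class LineItem
    {
        // minOccurs="0": the element may be missing entirely. The serializer models
        // "missing" with a parallel *Specified flag because an int can't be null.
        [XmlElement("Quantity")]
        public int Quantity;

        [XmlIgnore]
        public bool QuantitySpecified; // false => don't emit <Quantity> at all

        // nillable="true": the element is present but carries xsi:nil="true". Only
        // with .NET 2.0's Nullable<T> does this get a halfway natural mapping.
        [XmlElement("Price", IsNullable = true)]
        public int? Price;             // null => <Price xsi:nil="true" />
    }

Two different states on the wire, and neither corresponds cleanly to the single notion of "null" that a C# programmer reaches for.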

The main problem with WS-* interop is that vendors decided to treat it as a distributed object programming technology but based it on a data typing language (i.e. XSD) that does not map well onto traditional object-oriented programming languages. On the other hand, if you look at other XML-Web-Services-as-distributed-objects technologies like XML-RPC, you don't see as many issues. This is because XML-RPC was meant to map cleanly to traditional object-oriented programming languages.

Unfortunately I don't see the situation improving anytime soon unless something drastic is done.


 

Categories: XML Web Services

February 21, 2006
@ 06:05 PM

Mike Gunderloy hits a number of my pet peeves in his Daily Grind 823 post where he writes

The Problem with Single Sign-In Systems - Dare Obesanjo attempts to explain why Passport is so annoying. But hey, you know what - your customers don't care about the technical issues. They just want it to work. Explaining how hard the problem is just makes you look like a whiner. This principle applies far beyond Passport.

Pet peeve #1, my name is spelled incorrectly. I find this really irritating, especially since anyone writing about something I blogged about can just cut and paste my name from their RSS reader or my webpage. I can't understand why so many people misspell my name as Dare Obesanjo or Dare Obsanjo. Is it really that much of a hassle to use cut & paste?

Pet peeve #2, projecting silly motives as to why I wrote a blog post. My blog is a personal weblog where I talk about [mostly technical] stuff that affects me in my professional and personal life. It isn't a Microsoft PR outlet aimed at end users. 

Pet peeve #3, not being able to tell the difference between what I wrote and what someone else did. Trevin is the person who wrote a blog post trying to explain the user experience issues around Passport sign-in; I just linked to it.


 

Categories: Ramblings

February 20, 2006
@ 09:14 PM

Patrick Logan has a post on the recently re-ignited discussion on REST vs. SOAP entitled REST and SOAP where he writes

Update: Mike Champion makes an analogy between messaging technologies (SOAP/WSDL and HTTP) and road vehicle types (trucks and cars). Unfortunately this is an arbitrary analogy. That is, saying that SOAP/WSDL is best "to haul a lot of heavy stuff securely and reliably, use a truck" does not make it so. The question is how to make an objective determination.

Mike is fond of implying that you need to use WS-* if you want security and reliability while REST/POX is only good for simple scenarios. I agree with Patrick Logan that this seems to be an arbitrary determination not backed by empirical evidence. As an end user, the fact that my bank allows me to make financial transactions using REST (i.e. making withdrawals and transfers from their website) is one counterexample to the argument that REST isn't good enough for secure and reliable transactions. If it is good enough for banks, why isn't it good enough for us?

Of course, the bank's website is only the externally facing aspect of the service, and they probably do use systems internally that ensure reliability and security beyond the capabilities of the Web's family of protocols and formats. However, as someone who builds services that enable tens of millions of end users to communicate with each other on a daily basis, I find it hard to imagine how WS-* technologies would significantly improve the situation for folks in my position.

For example, take the post by Clemens Vasters entitled The case of the missing "durable messaging" feature where he writes

I just got a comment from Oran about the lack of durable messaging in WCF and the need for a respective extensibility point. Well... the thing is: Durable messaging is there; use the MSMQ bindings. One of the obvious "problems" with durable messaging that's only based on WS-ReliableMessaging is that that spec (intentionally) does not make any assertions about the behavior of the respective endpoints.

There is no rule saying: "the received message MUST be written to disk". WS-ReliableMessaging is as reliable (and unreliable in case of very long-lasting network failures or an endpoint outright crashing) and plays the same role as TCP. The mapping is actually pretty straightforward like this: WS-Addressing = IP, WS-ReliableMessaging = TCP.

So if you do durable messaging on one end and the other end doesn't do it, the sum of the gained reliability doesn't add up to anything more than it was before.

The funny thing about Clemens's post is that scenarios like the hard drive of a server crashing are exactly the kind of reliability issues that concern us in the services we build at MSN Windows Live. It's cool that specs like WS-ReliableMessaging allow me to specify semantics like AtMostOnce (messages must be delivered at most once or result in an error) and InOrder (messages must be delivered in the order they were sent), but this only scratches the surface of what it takes to build a reliable world-class service. At best, WS-* means you don't have to reinvent the building blocks when building a service that makes some claims around reliability and security. However, the specifications and tooling aren't mature yet. In the meantime, many of us have services to build.
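For what it's worth, here is roughly what those knobs look like in the WCF object model; this is a sketch rather than production configuration, and it also shows the MSMQ binding Clemens points to when you actually need durability.

    using System;
    using System.ServiceModel;

    class ReliabilityKnobs
    {
        static void Main()
        {
            // WS-ReliableMessaging style: retries and in-order delivery within a session,
            // but the state lives in memory, so a crashed endpoint loses in-flight messages.
            WSHttpBinding ws = new WSHttpBinding();
            ws.ReliableSession.Enabled = true;
            ws.ReliableSession.Ordered = true;   // the "InOrder" assurance
            ws.ReliableSession.InactivityTimeout = TimeSpan.FromMinutes(10);

            // Durable queued messaging: messages are written to MSMQ and survive a crash.
            NetMsmqBinding msmq = new NetMsmqBinding();
            msmq.Durable = true;
            msmq.ExactlyOnce = true;

            Console.WriteLine("Ordered={0}, Durable={1}", ws.ReliableSession.Ordered, msmq.Durable);
        }
    }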

I tend to agree with Don's original point in his Pragmatics post. REST vs. SOAP is mainly about the reach of services and not much else. If you know the target platform of the consumers of your service is going to be .NET or some other platform with rich WS-* support, then you should use SOAP/WSDL/WS-*. On the other hand, if you can't guarantee the target platform of your customers, then you should build a Plain Old XML over HTTP (POX/HTTP) or REST web service.


 

Categories: XML Web Services

For the past few releases, we've had work items in the RSS Bandit roadmap around helping users deal with information overload. We've added features like newspaper views and search folders to make it easier for users to manage the information in the feeds they consume. Every release, I've tried to make sure we add a feature that I know will make it easier for me to get the information I want from the feeds I'm subscribed to without being overwhelmed.

For the Jubilee release I had planned that the new feature we'd add in the "dealing with information overload" bucket would be the ability to rate posts and to filter based on those ratings. After thinking about this for a few weeks, I'm not sure that's the right route any more. There are tough technical problems to surmount to make the feature work well, but I think the bigger problem is the change in user behavior it expects. Based on my experiences with rating systems and communities, I suspect that a large percentage of our user base will not be motivated to start rating the feeds they are subscribed to or the new items that show up in their aggregator.

On a related note, I've recently been using meme trackers like Memeorandum and TailRank, which try to show the interesting topics among a certain set of technology blogs. I think this is a very powerful concept and the next natural evolution of information aggregators such as RSS Bandit. The big problem with these sites is that they only show the current topics of interest among a small sliver of blogs, which in many cases do not overlap with the blogs one is actually interested in. For example, today's headline topic on Tech.Memeorandum is that a bunch of bloggers attended a house party, which I personally am not particularly interested in. On the other hand, I'd find it useful if RSS Bandit offered another way to view my subscriptions, pivoted around the current hot topics amongst the blogs I read. This isn't meant to replace the existing interface but would instead be another tool for users to customize their feed reading experience, the same way that newspaper views and search folders do today.
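To give a feel for what I have in mind, here is a rough sketch of the core of such a view: count how often each external URL is linked to by recently posted items across all subscribed feeds, then surface the most linked-to ones. The FeedItem type and its members below are placeholders rather than RSS Bandit's actual object model.

    using System;
    using System.Collections.Generic;

    public class FeedItem
    {
        public DateTime Published;
        public List<string> Links = new List<string>(); // outbound links found in the post body
    }

    public class MemeRanker
    {
        // Returns (link, count) pairs for items posted in the last day, most linked-to first.
        public static List<KeyValuePair<string, int>> RankLinks(IEnumerable<FeedItem> items)
        {
            Dictionary<string, int> counts = new Dictionary<string, int>();
            DateTime cutoff = DateTime.UtcNow.AddDays(-1);

            foreach (FeedItem item in items)
            {
                if (item.Published < cutoff) continue; // only the current conversation
                foreach (string link in item.Links)
                {
                    int n;
                    counts.TryGetValue(link, out n);
                    counts[link] = n + 1;              // one vote per linking post
                }
            }

            List<KeyValuePair<string, int>> ranked = new List<KeyValuePair<string, int>>(counts);
            ranked.Sort(delegate(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
            {
                return b.Value.CompareTo(a.Value);
            });
            return ranked;
        }
    }

The hard parts are everything this sketch glosses over: normalizing URLs, grouping links that point at the same conversation, and deciding what "recent" means, which is what the questions below are about.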

If you are an RSS Bandit user and this sounds like a useful feature I'd like to hear your thoughts on what functionality you'd like to see here. A couple of opening questions that I'd like to get opinions on include

  • Would you like to see the most popular links in new posts? For an example of what this looks like, see the screenshot in Nick Lothian's post on Personalized meme tracking
  • How would you like "new posts" to be classified: unread posts, or items posted within the last day? The reason I ask is that you may already have read a few of the posts that link to a very popular topic; in that case, should it still be ranked higher than another link that hasn't been linked to as much but for which you haven't read any of the related posts?
  • Would you like a 'mark discussion as read' feature? Would it be nice to be able to mark all posts that link to a particular item as read?

I have a bunch of other questions but these should do for now.


 

Categories: RSS Bandit

February 19, 2006
@ 06:33 PM

This is basically a "me too" post. Dave Winer has a blog post entitled Blogging is part of life where he writes

I agree with the author of the Slate piece that’s getting so much play in the blogosphere, up to a point. The things that called themselves blogs that came from Denton and Calacanis are professional publications written by paid journalists that use blogging software for content management. That’s fine and I suppose you can call them blogs, but don’t get confused and think that their supposed death (which itself is arguable) has anything to do with the amateur medium that is blogging. They’re separate things, on separate paths with different futures.

To say blogging is dead is as ridiculous as saying email or IM or the telephone are dead. The blog never belonged on the cover of magazines, any more than email was a cover story (it never was) but that doesn’t mean the tool isn’t useful inside organizations as a way to communicate, and as a way for businesses to learn how the public views them and their competitors.

Whenever Dave Winer writes about blogging I tend to agree with him completely. This time is no exception. Blogs are social software; they facilitate communication and self-expression between individuals. Just as with email and IM, there are millions of people interacting using blogs today. There are more people reading and writing blogs on places like MySpace and MSN Spaces than there are people in the majority of the countries on this planet. Blogs are here to stay.

Debating whether companies that build businesses around blogs will survive is orthogonal to discussing the survival of blogging as a medium. It's not as if debating whether companies that send out email newsletters or make mailing list software will survive is equivalent to discussing the survival of email as a communication medium. Duh.


 

Categories: Social Software

Recently a question was asked on the RSS Bandit forums by a user who was Unable to Import RSSBandit-Exported OPML into IE7. The question goes

I exported my feeds from RSSBandit 1.3.0.42 to an OPML file in hopes of trying the feed support in IE7. IE7 seems to try to import, but ultimately tells me no feeds were imported. The exported file must have over a 100 feeds, so it's not that. Has anyone else been able to import feeds from RSSBandit into IE7?

I got an answer for why this is the case from the Internet Explorer RSS team. The reason is provided in the RSS Bandit bug report, Support type="rss" for export of feeds, where I found out that somewhere along the line someone came up with the convention of adding a type="rss" attribute to indicate which entries in an OPML file are RSS feeds. The Internet Explorer RSS team has decided to enforce this convention and will ignore entries that don't have this annotation.

Since RSS Bandit supports both RSS/Atom feeds and USENET newsgroups, I can see the need to be able to tell which entries in an OPML file are feeds without having applications probe each URL. However, I do think that type="rss" is a misnomer since it should also apply to Atom feeds. Perhaps type="feed" instead?
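For the curious, the fix on the RSS Bandit side amounts to stamping each exported outline element with that attribute. A stripped-down sketch follows; the feed title and URL are made up, and the real exporter also emits the OPML head element, folder hierarchy, and so on.

    using System.Xml;

    class OpmlExport
    {
        static void Main()
        {
            using (XmlWriter w = XmlWriter.Create("feeds.opml"))
            {
                w.WriteStartElement("opml");
                w.WriteAttributeString("version", "1.1");
                w.WriteStartElement("body");

                w.WriteStartElement("outline");
                w.WriteAttributeString("type", "rss");           // the annotation IE7 looks for
                w.WriteAttributeString("title", "Example Feed");
                w.WriteAttributeString("xmlUrl", "http://example.org/rss.xml");
                w.WriteEndElement();

                w.WriteEndElement();
                w.WriteEndElement();
            }
        }
    }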


 

One of the more thankless jobs at MSN Windows Live is working on the Passport team. Many of the product teams that are customers of the service tend to view it as a burden, myself included. One of the primary reasons for this is that instead of simply being the username/password service for MSN Windows Live, it is actually a single sign-in system that encompasses a large number of sites besides those owned by Microsoft. For example, you can use the same username and password to access your email, travel plans or medical information.

Trevin Chow of the Passport team has written a blog post entitled Why does Passport sign-in suck? where he addresses one of the pain points its customers face due to its legacy as a single sign-in system. He writes

Q: Why do you keep asking me to sign in over and over again even though I've checked "automatically sign me in"?  What don't you understand about "automatic"?!
 
One of the biggest problems we see in the network of MSN, Windows Live and Microsoft sites is that Passport sign-in is seen way too often by users.  It appears as if we are disregarding your choice of "automatically sign me in" and randomly asking you to sign in when we want with no rhyme or reason...
 
Passport sign-in 101
Passport sign in is based on cookies. Because HTTP is stateless, we have only 2 ways of persisting information across requests -- the first being to carry it on the query string, and second via HTTP cookies.  The first method (query string) isn't useful across browser sessions (open IE, close it, and re-open), which leaves us only option 2 (cookies).  Cookies are the mainstay of modern web sites, and allows very powerful personalization and state management.  Passport leverages this to provide the world's largest web authentication (aka sign-in) system in the world.
 
Passport first validates your identity by validating your "credentials" (email address and password combination) that you typed in on our sign-in UI.  Once validated, Passport uses cookies in the passport.com and the partner's domain (eg. www.live.com, MSN Money, MSDN) to vouch for your identity.  The cookies in our partner's domain act as assertions that you are who you say you are.    Because each partner site trusts Passport, the sign-in authority, assertions about a user's identity from Passport are also trusted by the partner.
...
After you sign into one partner site in the "passport network", users can freely go to subsequent partner sites and sign in. This is where the magic of Passport comes into play and single sign-on is achieved.  When you visit another partner site, and click "sign in" you are redirected to Passport servers. Because you already authenticated once to Passport (represented through your passport.com cookies), we don't need to validate your credentials again and can issue a service ticket for this new partner website.
 
But Trevin, you just said that "because you already authenticated once to Passport <snip>, we don't need to validate you credentials again...".  That clearly isn't the case since I seem to keep getting asked for my password!
 
In the last section, especially the last paragraph, I purposely left out some detail for simplicity. We can dive into more detail now that you have a better high-level understanding of the flow of passport sign-in.
 
In order to have a secure single sign-on system, you simply cannot have one prompt for a login then be able to access any site.  It sounds counter-intuitive, since that's what "single sign-on" seems to imply.  This would only be possible if every single website you accessed had the same level of security and data sensitivity.  We all know that this is not the case, and instead, sites vary in the level of security needed to protect it. 
 
On the lower end of the spectrum (least sensitive), we have sites like www.live.com, which is merely personalization.  In the middle, we have sites like Live Mail, which has personal information such as email from your friends.  On the extreme end of the scale (most sensitive) we have sites like Microsoft Billing which contains your credit card information.  Because of these varying levels of data sensitivity, each site in the Passport network configures what we'll call their "security policy" which tells passport parameters to enforce during sign in which is supposed to be directly related to their data sensitivity -- the more sensitive the information therein, the "tighter" the security policy.
...
All our partner websites currently have a mis-matched set of security policies, each set at the discretion of their team's security champ.  It's because of these inconsistent security policies that you keep getting asked for your password over and over.
 
Wow, so this sounds like a tough problem to solve.  How are you going to fix this? 
 
Our team is absolutely committed to make the sign in experience the best on the internet.  To fix this specific problem, our team is moving to a centralized definition of security policies.  What does this mean? Instead of each partner website telling us the specific parameters of the security policy (such as time window), they instead will tell us an ID of a security policy to enforce, whose definition will be on the Passport sign-in servers.  This means, that by offering a limited set of security policies we limit the mistakes partner websites can make, and we will inherently have more consistency across the entire network for sign in.  Additionally, it gives us more agility to tweak both the user experience and security of the network since Passport is in total control of the parameters.
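To make the "centralized security policy" idea concrete, here is a toy sketch of what such a table might look like on the sign-in side; the policy names and time windows are invented for illustration and are not Passport's actual values.

    using System;
    using System.Collections.Generic;

    enum Policy { Personalization, Mail, Billing }

    class SignInAuthority
    {
        // One central table of policy definitions instead of per-site parameters.
        static readonly Dictionary<Policy, TimeSpan> MaxTicketAge = new Dictionary<Policy, TimeSpan>
        {
            { Policy.Personalization, TimeSpan.FromDays(14) },   // e.g. www.live.com
            { Policy.Mail,            TimeSpan.FromHours(24) },  // e.g. Live Mail
            { Policy.Billing,         TimeSpan.FromMinutes(5) }  // e.g. billing / credit card info
        };

        // Partner sites reference a policy ID; the sign-in service decides whether the
        // existing ticket is fresh enough or the user must re-enter a password.
        public static bool RequiresFreshSignIn(DateTime lastSignInUtc, Policy policy)
        {
            return DateTime.UtcNow - lastSignInUtc > MaxTicketAge[policy];
        }
    }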

This is just one consequence of Passport's legacy as a single sign-in system causing issues for MSN Windows Live sites. Another example of an issue we've faced came up when deciding to provide APIs for MSN Spaces. If you read the Getting Started with the MetaWeblog API for MSN Spaces document you'll notice that instead of using the user's Passport credentials for the MetaWeblog API, we use a different set of credentials. This is because a user's Passport credentials were deemed too valuable to be entered into random blog editing tools, which may or may not safeguard them properly.

I now consider identity systems to be one big headache based on my experiences with Passport. This is probably why I've steadfastly avoided learning anything about InfoCard. I know there are folks at Microsoft trying to make this stuff easier, but it seems like every time I think about identity systems it just makes my teeth hurt. :(


 

Categories: Windows Live