These are my notes on the Scaling Fast and Cheap - How We Built Flickr session by Cal Henderson.

This was an 8 hour tutorial session which I didn't attend. However I did get the summary of the slide deck in my swag bag. Below are summaries of the slide deck Cal presented at the tutorial.

Overview and Environments
Flickr is a photo sharing application that started off as a massively multiplayer online game called Game Never Ending (GNE). It has 2 million users and over 100 million photos. The site was acquired by Yahoo! in May 2005.

A key lesson they learned is that premature optimization is the root of all evil. Some general rules they've stuck to are

  1. buy commodity hardware
  2. use off-the-shelf software instead of building custom code

When trying to scale the site there were a number of factors that needed to be considered. When buying hardware these factors included availability, lead times, reliability of vendors, and shipping times. Other factors that affected purchase decisions included rack space, power usage, bandwidth, and available network ports in the data center.

Load balancing adds another decision point to the mix. One can purchase an expensive dedicated device such as a Cisco 'Director' or a Netscale device or go with a cheap software solution such as Zebra. However the software solution will still require hardware to run on. One can apply several load balancing strategies, both at layer 4, such as round robin, least connections and least load, and at layer 7, by using URL hashes. Sites also have to investigate using GSLB, AkaDNS and LB Trees for dealing with load balancing at a large scale. Finally, there are non-Web related load balancing issues that need to be managed as well, such as database or mail server load balancing.
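As a rough illustration of the strategies named above, here is a minimal Python sketch of round robin, least connections and URL hashing. The backend names, class names and function names are made up for the example and are not from Cal's slides; a real load balancer would also track health checks and load.

```python
import itertools
import zlib


class RoundRobin:
    """Hand requests to each backend in turn."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the backend with the fewest open connections."""

    def __init__(self, backends):
        self.open = {backend: 0 for backend in backends}

    def pick(self):
        backend = min(self.open, key=self.open.get)
        self.open[backend] += 1
        return backend

    def release(self, backend):
        # Call when the connection to this backend is closed.
        self.open[backend] -= 1


def pick_by_url_hash(url, backends):
    """Layer 7 style: the same URL always maps to the same backend."""
    return backends[zlib.crc32(url.encode("utf-8")) % len(backends)]
```

URL hashing keeps requests for the same resource on the same machine, which helps per-server caches, while the other two strategies only look at connection counts or order.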

The Flickr team made the following software choices

  • PHP 4 (specifically PHP 4.3.10)
  • Linux (2.4 kernel on x86_64 and 2.6 kernel on i386)
  • MySQL 4/4.1 with InnoDB and Character sets
  • Apache 2 with mod_php and prefork MPM

There were 3 rules in their software process

  1. Use source control
  2. Have a one step build process
  3. Use a bug tracker

Everything goes into source control from code and documentation to configuration files and build tools.

For development platforms they chose an approach that supports rapid iteration but enforces some rigor. They suggest having a minimum of 3 platforms

  • Development: Working copy of the site which is currently being worked on
  • Staging: Almost live version of the site where changes to the live site are tested before deployment
  • Production: The customer facing site

Release management consists of staging the application, testing the app on the staging site then deploying the application to the production servers after successful test passes.

Everything should be tracked using the bug tracker including bugs, feature requests, support cases and ops related work items. The main metadata for the bug should be title, notes, status, owner and assigning party. Bug tracking software ranges from simple and non-free applications like FogBugz to complex, open source applications like Bugzilla.

Consistent coding standards are more valuable than choosing the right coding standards. Set standards for file names, DB table names, function names, variable names, comments, indentation, etc. Consistency is good.

Testing web applications is hard. They use unit testing for discrete/complex functions and automate as much as they can, such as tests of the public APIs. The WWW::Mechanize library has been useful in testing Flickr.

Data and Protocols
Unicode is important for internationalization of a site. UTF-8 is an encoding [not a character set] which is compatible with ASCII. Making a UTF-8 web application is tricky due to inconsistent support in various layers of a web application; HTML, XML, JavaScript, PHP, MySQL, and Email all have to be made to support Unicode. For the most part this was straightforward except for PHP which needed custom functions added and filtering out characters below 0x20 from XML files [except for normalizing carriage returns]. A data integrity policy is needed as well as processes for filtering out garbage input from the various layers of the system.
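To make the filtering step concrete, here is a small Python sketch (Flickr itself used PHP, so this is only an illustration). It rejects byte strings that are not valid UTF-8, normalizes carriage returns, and drops the remaining control characters below 0x20. Keeping tab and newline is my assumption, based on XML 1.0 allowing them; the function names are hypothetical.

```python
import re

# Control characters below 0x20, excluding tab (0x09) and newline (0x0A).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f]")


def decode_utf8_or_none(raw):
    """Return the decoded string, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def clean_for_xml(text):
    """Normalize carriage returns, then drop remaining control characters."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return _CONTROL_CHARS.sub("", text)
```

Running every piece of incoming text through one pair of functions like this is one way to enforce a data integrity policy at the edge of the system.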

Filtering bad input doesn't just refer to unicode. One also has to filter user input to prevent SQL injection and Cross Site Scripting (XSS) attacks.
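A minimal Python sketch of both defenses follows. It uses sqlite3 as a stand-in for MySQL, and the users table and function names are invented for the example. The query passes user input as a bound parameter instead of pasting it into the SQL string, and the HTML helper escapes user text before output.

```python
import html
import sqlite3


def find_user(conn, username):
    """Parameterized query: the value is bound, never spliced into the SQL."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchone()


def render_comment(comment):
    """Escape user text before placing it inside HTML."""
    return "<p>" + html.escape(comment) + "</p>"
```

With this approach an input such as `x' OR '1'='1` is simply treated as an odd user name that matches nothing.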

The ability to receive email has been very useful to Flickr in a number of scenarios such as enabling mobile 'blogging' and support tracking. Their advice for supporting email is to leverage existing technology and not write an SMTP server from scratch. However you may need to handle parsing MIME yourself because support is weak in some platforms. For Flickr, PEAR's Mail::mimeDecode was satisfactory although deficient. You will also have to worry about uuencoded text and Transport Neutral Encapsulation Format (TNEF) which is only used by Microsoft Outlook. Finally, you may also have to special case mail sent from mobile phones due to idiosyncrasies of wireless carriers.
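For a sense of what the MIME handling involves, here is a sketch using Python's standard email package rather than the PEAR library Flickr used. It pulls image attachments out of a raw message, as a photo-by-email feature would need to. The function names, addresses and message contents are made up, and the sketch does not handle uuencoded text or TNEF.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default


def extract_images(raw_bytes):
    """Return (filename, bytes) for every image part of a raw message."""
    msg = message_from_bytes(raw_bytes, policy=default)
    images = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            images.append((part.get_filename(), part.get_payload(decode=True)))
    return images


def sample_message():
    """Build a small test message with one image attachment (made-up data)."""
    msg = EmailMessage()
    msg["From"] = "phone@example.com"
    msg["To"] = "upload@example.com"
    msg["Subject"] = "Sunset"
    msg.set_content("Taken on my phone")
    msg.add_attachment(b"fake image bytes", maintype="image",
                       subtype="jpeg", filename="sunset.jpg")
    return msg.as_bytes()
```

Calling `extract_images(sample_message())` walks the multipart message and returns the single attachment with its filename.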

When communicating with other services, XML is a good format to use to ensure interoperability. It is fairly simple unless namespaces are involved. The Flickr team had to hack on PEAR's XML::Parser to make it meet their needs. In situations when XML is not performant enough they use UNIX sockets.

When building services one should always assume the service call will fail. Defensive programming is key. As a consequence, one should endeavor to make service calls asynchronous since they may take a long time to process and doing so makes callers more resilient to failure.
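One way to read this advice, sketched in Python: run the call in a worker thread, give it a deadline, and fall back to a default if it fails or is too slow. The helper name, the timeout and the fallback value are assumptions for illustration, not something from the slides.

```python
import concurrent.futures


def call_with_fallback(service, args=(), timeout=2.0, fallback=None):
    """Run service(*args) in a worker thread; errors and timeouts yield fallback."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(service, *args)
    try:
        return future.result(timeout=timeout)
    except Exception:
        # Covers both a timeout and any error raised by the service itself.
        return fallback
    finally:
        # Do not block the caller waiting for a slow call to finish.
        pool.shutdown(wait=False)
```

The caller always gets an answer within roughly the timeout, so a dead or slow service degrades one feature instead of hanging the whole page.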

Developing and Fixing
Discovering bottlenecks is an important aspect of development for web applications. Approaches include

  • CPU usage - rarely happens unless processing images/video, usually fixed by adding RAM
  • Code profiling - rarely causes problems unless doing crypto
  • Query profiling - usually fixed by denormalizing DB tables, adding indexes and DB caching
  • Disk IO - usually fixed by adding more spindles
  • Memory/Swap - usually fixed by adding RAM

Scaling
Scalability is about handling platform growth, dataset growth and maintainability. There are two broad approaches to scaling: vertical scaling and horizontal scaling. Vertical scaling is about buying a big server; to scale, one buys an even bigger server. Horizontal scaling is about buying one server; to scale, one buys more of the same kind of server. In today's world, Web applications have embraced horizontal scaling. The issues facing services that adopt horizontal scaling are

  • increased setup/admin cost
  • complexity
  • datacenter issues - power / space / available network ports
  • underutilized hardware - CPU/Disks/Mem may not be used to full capacity

Services need to scale once they hit performance issues. When scaling MySQL one has to worry about

  • Choosing the right backend - MyISAM, BDB, InnoDB, etc
  • Replication
  • Partitioning/Clustering
  • Federation

One big lesson learned about database scalability is that 3rd normal form tends to cause performance problems in large databases. Denormalizing data can give huge performance wins.
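Here is a toy Python and sqlite3 sketch of the trade-off. The tables and column names are invented for the example, not Flickr's schema. The photos table carries a copy of the owner's name so the common read needs no join, at the price of updating two tables whenever a user is renamed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE photos (id INTEGER PRIMARY KEY, owner_id INTEGER, title TEXT,
                         owner_name TEXT);  -- denormalized copy of users.name
    INSERT INTO users  VALUES (1, 'cal');
    INSERT INTO photos VALUES (10, 1, 'sunset', 'cal');
""")

# Normalized read: needs a join to get the owner's name.
joined = conn.execute(
    "SELECT p.title, u.name FROM photos p JOIN users u ON u.id = p.owner_id"
).fetchall()

# Denormalized read: one table, no join.
flat = conn.execute("SELECT title, owner_name FROM photos").fetchall()


def rename_user(conn, user_id, new_name):
    """The cost of denormalizing: every copy must be updated together."""
    with conn:  # one transaction for both updates
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
        conn.execute("UPDATE photos SET owner_name = ? WHERE owner_id = ?",
                     (new_name, user_id))
```

Both reads return the same rows; the denormalized one simply does less work per query, which matters when reads vastly outnumber writes.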

The rest of the slides go on about case studies specific to Flickr which are interesting but I don't feel like summarising here. :)
 

Categories: Trip Report

Sometimes reading blogs makes you feel like you are in high school. People sometimes chase popularity and wear their immaturity on their sleeve in ways you haven't seen since you were struggling with puberty. One such example is the hubbub around MashupCamp vs. BarCamp started by Ryan King in his post MashupCamp jumped the shark. Basically some folks who got upset because they weren't invited to Tim O'Reilly's FooCamp came up with a knockoff conference called BarCamp, then got upset when another conference called MashupCamp started getting press for knocking off the same concept. Amazed? I can barely believe it either.

Although the debate [if one can call it that] has been pretty pointless, I did find one quote from Ryan King that I thought was worth highlighting. In his post Live by the Snark, die by the Snark Ryan writes

Wait, no it’s not just me.

David Berlind busted his ass to put together something not vendor-controlled that unavoidably involved vendors because vendor APIs were what was being mashed up.

I see no reason why vendors have to be involved because their services are being mash-up’ed. Aren’t the people writing the mash-ups actually more important here?

The above argument seems weird to me. If I was attending a conference about building Java applications, I'd expect to see Java vendors like Sun and IBM there. If I was attending a conference on building Windows applications, I'd want to see Microsoft developers there. So, if there is a conference about building applications based on data and services from various service providers, why wouldn't you expect to see the data/service providers at this conference? I think sometimes people take the big companies are bad meme to ridiculous extremes. This is one of those examples.

Personally, I think one of the problems with the discussions around Mashups I've seen at conferences and in blogs is the lack of high level discussion between providers of services/data and developers who use these APIs and data. This is one of the things I miss from when I worked on XML APIs for the .NET Framework. Back then it was clear which developers were interested in our APIs and what they wanted us to do next. The uncertainties were around prioritizing various developer requests and when we could deliver solutions to their problems. On the other hand, now that I've been in discussions around various aspects of the Windows Live platform I've found it hard to figure out who our customers are and what APIs they'd like us to provide. What we need is more dialog between developers building mashups and vendors that provide APIs & data, not an us vs. them mentality that builds unnecessary animosity.


 

Categories: Web Development

A few days ago, someone at work asked about tips on getting traffic to their newly launched team blog. One of the responses suggested that the team should blog at blogs.msdn.com instead of on MSN Spaces because the MSDN blogs get more traffic. This turns out to be an example of the myopic view folks have of blogging when they view it through the lens of 'the blogs I read'. I'm sure there are lots of bloggers who have never read a blog on MSN Spaces but read dozens on blogs.msdn.com; similarly there are millions of people who read blogs on MSN Spaces that don't even know blogs.msdn.com exists. The main advice the team got was to provide good content and traffic would follow.

However the discussion got me interested in the relative popularity of blogs on MSN Spaces compared to other blogging services. Below are the blogs hosted on MSN Spaces that are on the list of top 100 most linked blogs according to Technorati.

6.   spaces.msn.com/after1s - 27,380 links from 8,916 sites

7.    spaces.msn.com/lin28379801400 - 30,841 links from 8,506 sites

18.  spaces.msn.com/l1n - 16,733 links from 5,960 sites

21.  spaces.msn.com/lovingyo - 16,922 links from 5,215 sites

22.  Herramientas para Blogs - 14,396 links from 5,207 sites

25.  spaces.msn.com/members/MSN SA - 14,021 links from 4,802 sites

37.  The Space Craft - 11,235 links from 4,119 sites (the MSN Spaces team blog)

There are two surprises here for me. The first is that two blogs hosted on MSN Spaces are in the top 10 most linked blogs tracked by Technorati and the second is that the MSN Spaces team blog is in the top 50. I'll be quite interested in seeing how these statistics change once Technorati figures out how to add MySpace blogs to their index.

Update: Added Herramientas para Blogs which I missed the first time around.


 

Categories: Social Software

March 4, 2006
@ 02:16 AM

In the past I've mentioned that I don't like the phrase Web 2.0 because it is vacuous and ill-defined, making people who use it poor communicators. However, some of the neologisms used by computer geeks are much worse because they are just plain dumb. One such neologism is blogosphere which started off as a joke but is now taken seriously by various pundits.

The blog post that has gotten my goat is Steve Rubel's post The Center of Gravity is Shifting where he writes

One of the themes I kept hitting over and over is that the blogosphere is not where all the action is going to be in the months ahead. Yes, you read that right. Don't adjust your set.

For sure the b'sphere will continue to remain the largest galaxy in the social media universe in the short term. It's a major center of gravity that pulls people toward it. However, over the last few months a number other social media galaxies have rapidly risen to prominence. Take YouTube, digg and MySpace. These are just three examples, but they are drawing huge audiences. Richard Edelman is gushing over a fourth - StupidVideos.com.

The first thing that confuses me about this post is that it implies MySpace isn't part of the blogosphere. Why not? Is it because Technorati's coverage of MySpace is sorely lacking as Steve Rubel claims? Do the media talking heads really think that Technorati covers all the blogs in the world? After all services like MySpace, MSN Spaces and Xanga each have more than the 30 million blogs that Technorati claims to cover. This isn't the first time I've seen someone assume Technorati's numbers actually measure the total number of blogs out there.

The thought that one can lump all the blogs in the world into a single category called the blogosphere and generalize about them seems pretty silly to me. We don't make similar generalizations about people who use other social applications like email (the mailosphere), instant messaging (the IM-osphere) or photo sharing sites (the photosphere). So what is it about blogs that makes such a ridiculous word continue to be widely used?


 

From the Techdirt post Google Moves Chinese Search Records So They Can Be Subpoenaed By The US we learn

With Yahoo getting slammed for giving up info to the Chinese government, leading to the arrest of some political dissidents, it would appear that Google has begun to rethink where they should keep their Chinese search engine data. It looks like Google has gone with a compromise route, and is moving all of its Chinese search data out of China and into the US -- which still raises some questions. After all, it was just this week that the US Department of Justice claimed that no one should worry when it subpoenas search terms from Google here in the US -- something Google has fought vehemently. Perhaps the next suggestion would be for them to move the US data into China. Then everyone can subpoena whoever they want, and Google can claim the data is out of the country and they can't do anything about it.

Wow.


 

After procrastinating for what seems like half a year, I finished my article Seattle Movie Finder: An AJAX and REST-Powered Virtual Earth Mashup which has now been published on O'Reilly XML.com. The article is a walkthrough of how I built my Seattle Movie Finder application with a few tips on building mapping mashups.

I think the most useful tip from the article is letting people know about the geocoder.us API which provides REST, SOAP and XML-RPC services for converting addresses to latitudes and longitudes. That discovery helped a great deal. The Virtual Earth folks currently advise people who want geocoding to register for the MapPoint SOAP Web services, which was too much of a hassle for me. On the other hand, the free and zero hassle geocoder.us got the job done.

I'm thinking of turning this into a series with the next article explaining how I built my MSN Spaces Photo Album Browser gadget for Live.com. Let me know what you think of the article. 


 

Categories: Web Development | Windows Live

A great thing about blogs is that they let you join the conversation when the conversation is about you. Today there were a bunch of rumors about Passport and InfoCard. Trevin Chow of the Passport team addresses them in his post Official word on Infocard and Passport where he writes

Ever since RSA, rumours have been flying aroung the web and blogosphere about Passport's supposed demise at the hands of Infocard:
 
As much as I hate to disappoint folks like CNet, ZDNet, The Boston Herald, IT Business Edge, etc. but this is absolutely false...Here it is in as easy to understand language as possible, and feel free to quote :)
 
Today, Passport supports different types of credentials.  A more verbose definition of a "credential" from Wikipedia is:
"A credential is a proof of qualification, competence, or clearance that is attached to a person, and often considered an attribute of that person."
Today, Passport supports email address with either passwords or mobile PINs as credential types.  Infocard will simply be another credential that will be supported by Passport. In other words, Infocard will not replace Passport, but rather Infocard will supplement Passport.  So in a nutshell:
 
1. Infocard will not be replacing Passport, contrary to the popular belief, rumour and conjecture.
2. Inforcard will be another accepted credential type for the Passport network.  You will be able to link an Infocard to your Passport and use it to access Microsoft, MSN and Windows Live services.
 
This is not to say that Infocard is not a valuable and worthwhile technology.  I'm extremely excited about the possbility of the proliferation of infocards in the future and putting the control of sharing user information in the hands of the user.  The point being made here is that Passport will not be wholesale replaced by Infocard.

The Infocard hype keeps getting louder and louder each day. One of these days,  I may have to get off my butt and actually find out what exactly it is.  :)


 

Categories: Windows Live

Steve Kafka of the Windows Live mail team has a post on their team blog entitled M5 is alive! where he talks about some of the new features. Some of my favorites include

Hotmail Classic View
OK, I know it's a contradiction to name anything with the name “classic” as NEW.  But it is.  We know our customers roam….and that they don't always log in to Windows Live Mail on computers with IE (and many times they aren’t even logging in from a computer at all).  We want to help make sure you guys can get your mail any time you need.  Now for people not using Internet Explorer 6.0 and higher, we have a new view of WL Mail, what we're calling the Hotmail Classic View.
...
Offline Mail and other good stuff
Announcing Windows Live Mail Desktop Beta!  The next generation of desktop mail is coming.  Check out the team blog for all the details.
...
Configurable reading pane
Did you know that you can turn the reading pane off?  That's ok, no one else did either.  Now you can configure the reading pane while reading your mail.  No hunting through options, you can change it on the fly.  Even better, we've added an option: the bottom reading pane!  Now you can chose between having the reading pane on the right, bottom, or off.  You pick.  Change it whenever you like.
...
Outlook-like shortcuts
You can now use the shortcuts menu in the left hand navigation to switch between Mail, Contacts, Calendar and the Today page.  Need more space? Minimize the shortcuts to give yourself the maximum amount of room to view your mail.
...
Contact picker
I know auto-complete is awesome.  Start typing a name and we complete the address for you. It's perfect for writing a mail to one or two people.  But admit it, sometimes you just want to browse.  You want to peruse your contact list and choose your contacts and groups.  With contact picker you can browse your contacts and groups while composing a mail and choose which addresses you want to add.
...
Find in contacts
If your contact list is bigger than 20 people, you're probably tired of scrolling through the list looking for the right contact.  Well, hunt no longer. Find in Contacts will actually word wheel through your contact list as you type.  For those of you lucky enough to be on the Windows Live Messenger beta, this will probably look familiar.  Select the contact you are looking for and you'll jump right to that contact. Your contact management just got a whole lot easier.
...
Spaces integration
Those cool contact cards aren't just for Messenger anymore.  The "contact control" now pulls in the profile picture for your contacts. You can view their contact card, jump to their Space or profile and more. This feature won’t be ready immediately when M5 is released but we were too excited to keep it a secret.
...
Custom filters
While Windows Live Mail continues to use the custom filters you set up in Hotmail, there hasn't been any way to edit the old rules or create new one.  And boy did our beta testers miss it. Until now.  Custom filters are back, allowing you to have mail sent directly to a folder of your choosing based on the criteria you select.

This release is hot. I finally had to get a Windows Live Mail beta invite for my girlfriend. She likes it, I do too.


 

Categories: Windows Live

The Windows Live Expo team has a blog post entitled Hello world... which begins

The Expo team is very proud to unveil the Windows Live Expo service today. Our public beta will cater to all users in the every location across the US.
 
To get started on the service, we've produced a nice Flash product tour (thanks Becky!) that outlines all of our cool features. Head over to the homepage here and you'll find a link for the tour near the bottom of the page.
 
I wanted to call out a few known issues with the Beta that we've acknowledges and are working on:
  • Spaces integration: The new Spaces module will be activated very soon - hold tight..
  • Firefox niggles: Yes, we know you can't drag and drop the windows and that the rich-text editing has problems. We already have fixes in the pipeline for this so expect to see it patched shortly.

As I've mentioned before, I've been working closely with the Expo team to get their service off the ground and it's been a fun journey. Try it out and tell them what you think.


 

Categories: Windows Live

From the blog post Virtual Earth Team Launches Street-Side Drive-by we learn

The Virtual Earth team is pleased to launch a preview of a new feature we have been working on – interactive Street-side browsing. You can try it out at http://preview.local.live.com Street-side imagery allows you to drive around a city looking at the world around you as if you were in a car. But unlike the real world, you can stop your car anywhere you like and rotate your view around 360degrees. Currently we have street-side imagery for San Francisco and Seattle online, and we are planning to have many more cities added soon.

One of the most interesting features is to put yourself in ‘Street’ view map style. In this mode, all of the street-side images are pasted flat on the map to give you a very unique overview of an area. It takes some getting used to, but once you adjust to it you’ll find it provides a very compelling companion view for our Hybrid maps. Street view helps you orient yourself quickly in an area, while the street side views then show more detail presented as you would see it in the real-world.

This technology preview is just that – a means for us to get a feature we are working on in your hands to play with and provide feedback on, before it is ready for prime time integration into the Windows Live Local site.

Sweet.


 

Categories: Windows Live