I just read the post on the Skype weblog entitled What happened on August 16, which explains the cause of their outage. It states:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short time frame as they re-booted after receiving a routine set of patches through Windows Update.

The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.

This problem affects all networks that handle massive numbers of concurrent user connections, whether they are peer-to-peer or centralized. When you deal with tens of millions of users logged in concurrently and something causes a huge chunk of them to log in at once (e.g. after an outage or a synchronized computer reboot due to operating system patches), your system will be flooded with log-in requests. All the major IM networks (including Windows Live) have all sorts of safeguards in place to prevent this from taking down their networks, although how many short outages are due to this specific issue is anybody’s guess.
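
To make the idea of a safeguard concrete, here is a minimal sketch of one widely used technique: having clients wait a randomized, exponentially growing delay before retrying a log-in, so that reconnects spread out over time instead of arriving in one spike. I don't know which safeguards Skype or any of the IM networks actually use, so treat this purely as an illustration. The function name and the base and cap values are my own assumptions.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return how many seconds a client should wait before log-in retry
    number `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative values, not
    taken from any real IM client.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    exponent = min(attempt, 30)
    # The upper bound doubles with every failed attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** exponent))
    # Pick a random point below the bound so that clients which failed
    # at the same moment do not all retry at the same moment.
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

The randomization is the important part. Without it, every client that rebooted at the same time would also retry at the same time, and the flood would simply repeat on a schedule.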

However, Skype has an additional problem when such events happen due to its peer-to-peer model, which is described in the blog post All Peer-to-Peer Models Are NOT Created Equal -- Skype's Outage Does Not Impugn All Peer-to-Peer Models:

According to Aron, like its predecessor Kazaa, Skype uses a different type of Peer-To-Peer network than most companies. Skype uses a system called SuperNodes. A SuperNode Peer-to-Peer system is one in which you rely on your customers rather than your own servers to handle the majority of your traffic. SuperNodes are just normal computers which get promoted by the Skype software to serve as the traffic cops for their entire network. In theory this is a good idea, but the problem happens if your network starts to destabilize. Skype, as a company, has no physical or programmatic control over the most vital piece of its product. Skype instead is at the mercy of and vulnerable to the people who unknowingly run the SuperNodes.

This of course exposes vulnerabilities to any business based on such a system -- systems that, in effect, are not within the company's control.

According to Aron, another flaw with SuperNode models concerns system recovery after a crash. Because Skype lost its SuperNodes in the initial crash, its network can only recover as fast as new SuperNodes can be identified.

This design leads to a vicious cycle when it comes to recovering from an outage. With most of the computers on the network rebooting, Skype lost a bunch of SuperNodes, so when the computers came back online they flooded the remaining SuperNodes, which in turn went down, and so on…
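
The feedback loop described above can be shown with a toy model. This is only a sketch under my own simplifying assumptions, not a description of how Skype's network works: every SuperNode is given the same hypothetical capacity, clients are spread evenly across the SuperNodes that are still up, and overloaded SuperNodes drop out one at a time.

```python
def surviving_supernodes(clients, supernodes, capacity):
    """Toy cascade model (all parameters are hypothetical).

    clients    -- number of clients trying to log in at once
    supernodes -- number of SuperNodes available at the start
    capacity   -- clients a single SuperNode can serve before it fails
    Returns how many SuperNodes are still up once the network settles.
    """
    alive = supernodes
    # While the average load per SuperNode exceeds its capacity, one more
    # SuperNode fails, which raises the load on all the others.
    while alive > 0 and clients / alive > capacity:
        alive -= 1
    return alive


if __name__ == "__main__":
    print(surviving_supernodes(1000, 10, 100))  # just enough capacity
    print(surviving_supernodes(1000, 9, 100))   # one SuperNode short
```

In this model, a network with just enough SuperNodes stays up, but being even one SuperNode short takes the whole thing down, because each failure pushes more load onto the nodes that remain.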

All of this is pretty understandable. What I don’t understand is why this problem is only surfacing now. After all, this isn’t the first Patch Tuesday. Was the bug in their network resource allocation algorithm introduced in a recent version of Skype? Has the service been straining for months and last week was just the tipping point? Is this only half the story and there is more they aren’t telling us?

