I just read the post on the Skype weblog entitled "What happened on August 16" about the cause of their outage, which states:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short time frame as they re-booted after receiving a routine set of patches through Windows Update.

The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly.

This problem affects all networks that handle massive numbers of concurrent user connections, whether they are peer-to-peer or centralized. When you deal with tens of millions of users logged in concurrently and something causes a huge chunk of them to log in at once (e.g. after an outage or a synchronized computer reboot due to operating system patches), your system will be flooded with log-in requests. All the major IM networks (including Windows Live) have all sorts of safeguards in place to prevent this from taking down their networks, although how many short outages are due to this specific issue is anybody’s guess.
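One common safeguard of this kind is to have clients wait a randomized, exponentially growing delay before retrying a failed log-in, so that reconnects spread out instead of arriving in one spike. The sketch below is only an illustration of that general technique; the function name and the base and cap values are assumptions of mine, not anything Skype or Windows Live has described.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Seconds to wait before log-in retry number `attempt` (0-based).

    Hypothetical helper: exponential backoff with full jitter. The
    ceiling doubles on every failed attempt up to `cap`, and the actual
    wait is drawn uniformly from [0, ceiling] so that clients which
    failed at the same moment do not all retry at the same moment.
    """
    exponent = min(attempt, 30)  # keep 2 ** exponent small
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Show how the wait window widens for one client across retries.
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

The randomness is the important part: a fixed retry interval would simply move the flood of log-in requests a few seconds later.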

However, Skype has an additional problem when such events happen due to its peer-to-peer model, which is described in the blog post "All Peer-to-Peer Models Are NOT Created Equal -- Skype's Outage Does Not Impugn All Peer-to-Peer Models":

According to Aron, like its predecessor Kazaa, Skype uses a different type of Peer-To-Peer network than most companies. Skype uses a system called SuperNodes. A SuperNode Peer-to-Peer system is one in which you rely on your customers rather than your own servers to handle the majority of your traffic. SuperNodes are just normal computers which get promoted by the Skype software to serve as the traffic cops for their entire network. In theory this is a good idea, but the problem happens if your network starts to destabilize. Skype, as a company, has no physical or programmatic control over the most vital piece of its product. Skype instead is at the mercy of and vulnerable to the people who unknowingly run the SuperNodes.

This of course exposes vulnerabilities to any business based on such a system -- systems that, in effect, are not within the company's control.

According to Aron, another flaw with SuperNode models concerns system recovery after a crash. Because Skype lost its SuperNodes in the initial crash, its network can only recover as fast as new SuperNodes can be identified.

This design leads to a vicious cycle when it comes to recovering from an outage. With most of the computers on the network being rebooted, the network lost a bunch of SuperNodes, so when the computers came back online they flooded the remaining SuperNodes, which in turn went down, and so on…
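To make that feedback loop concrete, here is a toy simulation. It is not a model of Skype's actual protocol: the even load sharing, the fixed per-node capacity, and the rule that overloaded nodes drop out in proportion to the overload are all simplifying assumptions of mine.

```python
def simulate_cascade(clients, supernodes, capacity, rounds=10):
    """Return the number of live supernodes after each round.

    Toy assumptions (not from Skype): every client must attach to a
    supernode, load is shared evenly, each supernode can serve
    `capacity` clients, and when nodes are overloaded only the
    fraction capacity / load_per_node of them survives the round.
    """
    live = supernodes
    history = [live]
    for _ in range(rounds):
        if live == 0 or clients <= live * capacity:
            break  # nothing left to fail, or the network is stable
        # survivors = live * capacity / (clients / live), in integer math
        live = live * live * capacity // clients
        history.append(live)
    return history


if __name__ == "__main__":
    # 10,000 clients need 100 supernodes of capacity 100 to be stable.
    print(simulate_cascade(10_000, 100, 100))  # enough supernodes
    print(simulate_cascade(10_000, 80, 100))   # 20% lost in the reboot
```

Under these assumptions, a network that starts just below the capacity it needs does not settle at a degraded level; each round of failures raises the load on the survivors, so the count of live supernodes keeps shrinking until none are left.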

All of this is pretty understandable. What I don’t understand is why this problem is only surfacing now. After all, this isn’t the first Patch Tuesday. Was the bug in their network resource allocation process introduced in a recent version of Skype? Has the service been straining for months and last week was just the tipping point? Is this only half the story and there is more they aren’t telling us?




Monday, 20 August 2007 18:57:21 (GMT Daylight Time, UTC+01:00)
Interestingly enough, I got an email this morning from Skype alerting me that a new version was available for download. I've never gotten an email alert before.
Monday, 20 August 2007 19:50:31 (GMT Daylight Time, UTC+01:00)
This sounds fishy to me, partly because (as you note) this is hardly the first Patch Tuesday, and also because during the outage there was a statement (from somewhere, I don't remember where) that the Skype network was fine, but that the authentication system was not working correctly.
Monday, 20 August 2007 21:28:10 (GMT Daylight Time, UTC+01:00)
Sounds like chaotic behaviour. One day it tips, one day it doesn't. Who knows, Skype is growing at a rate, Windows Update users maybe too, and lately I didn't need to restart on patch day. This was the first time in a while.
Tuesday, 21 August 2007 00:37:44 (GMT Daylight Time, UTC+01:00)
"virtuous cycle" should be "vicious cycle"
Tuesday, 21 August 2007 01:43:45 (GMT Daylight Time, UTC+01:00)
Hi All. Yeah, the Skype outage was annoying. I do hope they've got the problem sorted. I don't know which is more confusing Dare: The real reasons behind Skype's outage or the reasons you've stated for your engagement "postponement"! Keep up the good tech blogging!
Tuesday, 21 August 2007 01:51:16 (GMT Daylight Time, UTC+01:00)
That comment was from someone who regularly posts trolls in my comment section. It seems he's now grown bored with posting irrelevant comments using various WWE wrestlers as his name and has decided to start impersonating me.

It must be boring wherever he works since he always seems to post his nuisance comments during work hours.
Tuesday, 21 August 2007 15:30:22 (GMT Daylight Time, UTC+01:00)
I agree that this may not be the first Patch Tuesday, but maybe over a period of time more customers bought computers loaded with Windows, downloaded Skype promptly, and hence increased the total simultaneous login attempts beyond a certain watermark?