Whenever you read stories about how Web companies like Facebook have 10,000 servers including 1800 database servers or that Google has one million servers, do you ever wonder how the system administrators that manage these services deal with deployment, patching, failure detection and system repair without going crazy? This post is the first in a series of posts that examines some of the technologies that successful Web companies use to manage large Web server farms.

Last year, Michael Isard of Microsoft Research wrote a paper entitled Autopilot: Automatic Data Center Management which describes the technology that Windows Live and Live Search services have used to manage their server farms. The abstract of his paper is as follows

Microsoft is rapidly increasing the number of large-scale web services that it operates. Services such as Windows Live Search and Windows Live Mail operate from data centers that contain tens or hundreds of thousands of computers, and it is essential that these data centers function reliably with minimal human intervention. This paper describes the first version of Autopilot, the automatic data center management infrastructure developed within Microsoft over the last few years. Autopilot is responsible for automating software provisioning and deployment; system monitoring; and carrying out repair actions to deal with faulty software and hardware. A key assumption underlying Autopilot is that the services built on it must be designed to be manageable. We also therefore outline the best practices adopted by applications that run on Autopilot.

The paper provides a high level overview of the system, it's design principles and the requirements for applications/services that can be managed by the system. It gives a lot of insight into what it takes to manage a large server farm while keeping management and personnel costs low.

The purpose of AutoPilot is to automate and simplify the basic tasks that system administrators typically perform in a data center. This includes installation and deployment of software (including operating systems and patches), monitoring the health of the system, taking basic repair actions and marking systems as needing physical repair or replacement.

However applications that will be managed by AutoPilot also have their responsibilities. The primary responsibility of these applications include being extremely fault tolerant (i.e. applications must be able to handle processes being killed without warning) and being capable of running in the case of large outages in the cloud (i.e. up to 50% of the servers being out of service). In addition, these applications need to be easy to install and configure which means that they need to be xcopy deployable. Finally, the application developers are responsible for describing which application specific error detection heuristics AutoPilot should use when monitoring their service.

Typical AutoPilot Architecture

 

The above drawing is taken from the research paper. According to the paper the tasks of the various components is as follows

The Device Manager is the central system-wide authority for configuration and coordination. The Provisioning Service and Deployment Service ensure that each computer is running the correct operating system image and set of application processes. The Watchdog Service and Repair Service cooperate with the application and the Device Manager to detect and recover from software and hardware failures. The Collection Service and Cockpit passively gather information about the running components and make it available in real-time for monitoring the health of the service, as well as recording statistics for off-line analysis. (These monitoring components are ―Autopiloted like any other application, and therefore communicate with the Device Manager and Watchdog Service which provide fault recovery, deployment assistance, etc., but this communication is not shown in the figure for simplicity.)

The typical functioning of the system is described in the following section.

What Does AutoPilot Do?

The set of machine types used by the application (e.g. Web crawler, front end Web server, etc) needs to be defined in a database stored by on the Device Manager. A server's machine type dictates what configuration files and application binaries need to be installed on the server. This list is manually defined by the system administrators for the application. The Device Manager also keeps track of the current state of the cluster including what various machine types are online and their health status.

The Provisioning Service continually scans the network looking for new servers that have come online. When a new member of the server cluster is detected, the Provisioning Service asks the Device Manager what operating system image it should be running and then images the machine with a new operating system before performing burn-in tests. If the tests are successful, the Provisioning Service informs the Device Manager that the server is healthy. In addition to operating system components, some AutoPilot specific services are also installed on the new server. There is a dedicated filesync service that ensures that the correct files are present on the computer and an application manager that ensures that the expected application binaries are running.

Both services determine what the right state of the machine should be by querying the Device Manager. If it is determined that the required application binaries and files are not present on the machine then they are retrieved from the Deployment Service. The Deployment Service is a host to the various application manifests which map to the various application folders, binaries and data files. These manifests are populated from the application's build system which is outside the AutoPilot system.

The Deployment Service also comes into play when a new version of the application is ready to be deployed. During this process a new manifest is loaded into the Deployment Service and the Device Manager informs the various machine types of the availability of the new application bits. Each machine type has a notion of an active manifest which allows application bits for a new version of the application to be deployed as an inactive manifest while the old version of the application is considered to be "active". The new version of the application is rolled out in chunks called "scale units". A scale unit is a group of multiple machine types which can number up to 500 machines. Partitioning the cluster into scale units allows code roll outs to be staged. For example, if you have a cluster of 10,000 machines with scale units of 500 machines, then AutoPilot could be configured keep roll outs to under 5 scale units at a time so that never more than 25% of the cloud is being upgraded at a time.

Besides operating system installation and deployment of application components, AutoPilot is also capable of monitoring the health of the service and taking certain repair actions. The Watchdog Service is responsible for detecting failures in the system. It does so by probing each of the servers in the cluster and testing various properties of the machines and the application(s) running on them based on various predetermined criteria. Each watchdog can report one of three results for a test; OK, Warning or Error. A Warning does not initiate any repair action and simply indicates a non-fatal error has occurred. When a watchdog reports an error back to the Device Manager, the machine is placed in the Failure state and one of the following repair actions is taken; DoNothing, Reboot, ReImage or Replace. The choice of repair action depends on the failure history of the machine. If this is the first error that has been reported on the machine in multiple days or weeks then it is assumed to be a transient error and the appropriate action is DoNothing. If not, the machine is rebooted and if after numerous reboots the system is still detected to be in error by the watchdogs it is re-imaged (a process which includes reformatting the hard drive and reinstalling the operating system as well redeploying application bits). If none of these solve the problem then the machine is marked for replacement and it is picked up by a data center technician during weekly or biweekly sweeps to remove dead servers.

System administrators can also directly monitor the system using data aggregated by the Collection Service which collects information from various performance counters is written to a large-scale distributed file store for offline data mining and to a SQL database where the data can be visualized as graphs and reports in a visualization tool known as the Cockpit

Now Playing: Nirvana - Jesus Doesn't Want Me For A Sunbeam