Availability

Computer designers apply multiple components to one task for the same reason that airplanes are equipped with multiple engines: redundancy increases availability (in the airplane case one would say safety). If one engine fails, the pilots still have enough thrust left to land the aircraft at the next airport.

Redundancy improves availability under two assumptions: the components have independent fault zones (a failure of one does not fail the other), and the remaining components have enough capacity to carry the workload.

If you are not scared by math formulas, a quick excursion into probability theory shows why this is the case. Otherwise you just have to believe that

Clustering works:
under the assumption that one node is able to take over the workload of a failed one, you can drastically improve your application availability (see the sketch after this list).
Clustering is required:
assembling lots of silicon and iron in one place without using cluster technologies to improve availability typically is not a good idea.
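
As a quick illustration of the probability argument behind the first point: with independent fault zones, the service is down only when every node is down at the same time, so the combined unavailability is the product of the per-node unavailabilities. A minimal sketch (the 99% per-node figure is just an assumed example):

    # Availability of N redundant nodes with independent failures:
    # the service is unavailable only if every node is unavailable,
    # so the combined unavailability is the product of the per-node values.

    def combined_availability(node_availability: float, nodes: int) -> float:
        unavailability = 1.0 - node_availability
        return 1.0 - unavailability ** nodes

    single = 0.99   # assumed per-node availability
    print(f"1 node : {single:.4%}")                            # 99.0000%
    print(f"2 nodes: {combined_availability(single, 2):.4%}")  # 99.9900%
    print(f"3 nodes: {combined_availability(single, 3):.4%}")  # 99.9999%

Two ordinary 99% nodes already land in class 4 territory of the table below, which is exactly why clustering works and why the independence assumption matters so much.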

How high is high enough?

Availability is expressed as a percentage and often classified according to the number of nines. Which class of availability is required depends mainly on the application type, the cost of outages (see below) and the patience of your users.

Telephone equipment is usually class 6. Stock trading and credit card authorization are examples of class 5. Manufacturing applications that control assembly lines with hundreds or thousands of workers need to be at least class 4. Machines running electronic mail or file and print services should be class 3 but often fail to provide this level of service.

When confronted with availability claims by a vendor, keep in mind that these numbers are typically generated in the marketing department.

Availability classification

Class  Availability  Outage time (min per year)  Application examples
1      90-98%        52560-10512                 Home PC, Office PC
2      99.0%         5256                        Standalone Server
3      99.9%         526                         File and Print Server, Electronic Mail
4      99.99%        53                          Assembly line controller
5      99.999%       5                           Stock trading, Credit card authorization
6      99.9999%      0.5                         Telephone equipment
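
The outage-time column is simple arithmetic: a year has 525,600 minutes, and the downtime is the unavailable fraction of that. A small sketch that reproduces the column (rounded):

    # Minutes of downtime per year for a given availability percentage.
    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

    def outage_minutes(availability_percent: float) -> float:
        return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)

    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        print(f"{pct:>8}% -> {outage_minutes(pct):8.1f} min/year")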

Why do computers stop?

Computer failures are typically classified by their cause: hardware, software, human (operations) and environmental faults.

Errors are either permanent (hard) or transient (soft, intermittent). Permanent errors are rock solid and show up with high probability whenever a particular module or part is used; following Niels Bohr's atom model, they are often called Bohrbugs. In contrast, transient errors only materialize when certain conditions are met: e.g. voltage and temperature for hardware, or load, sequence, timing and parameter values for software modules. Following Werner Heisenberg's uncertainty principle, they are typically called Heisenbugs. While Bohrbugs are usually detected early on during quality assurance (QA) tests, Heisenbugs are nasty because they may not be uncovered by the vendor's test suite and, in addition, have a tendency to disappear when you look at them.
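
To make the Heisenbug idea concrete, here is a hypothetical example: an unsynchronized counter shared by several threads. Whether updates are lost depends entirely on timing and load, and the effect often vanishes as soon as you slow the program down to observe it.

    # A classic Heisenbug: an unsynchronized read-modify-write.
    # Whether updates are lost depends on thread timing and load,
    # so the bug may not show up in a light QA run at all.
    import threading

    counter = 0

    def worker(iterations: int) -> None:
        global counter
        for _ in range(iterations):
            counter += 1   # not atomic: load, add, store

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("expected 400000, got", counter)   # sometimes less, sometimes not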

Most high availability designs cover only a single point of failure (SPOF). The failure of one or more nodes may lead to service degradation; the complete loss of service is called an outage and, of course, should not happen.

Outages are classified as either planned or unplanned. Planned outages are scheduled downtime, caused e.g. by hardware upgrades, software updates or other maintenance and reconfiguration work.

Unplanned outages are unscheduled downtime caused by the faults mentioned above. Very often they follow Murphy's law.

Depending on the industry and the application, the cost of outages can be substantial; saving just a few hours of downtime can justify the spending for additional equipment, as the rough calculation after the table below shows.

Source: 1998 Gartner Group Figures for Downtime Cost per Hour (in US$)

Business                          Industry        Average Cost (US$ per hour)
Brokerage Operations              Finance         6,450,000
Credit Card/Sales Authorization   Finance         2,600,000
Pay-per-View                      Media             150,000
Home Shopping (TV)                Retail            113,000
Catalog Sales                     Retail             90,000
Airline Reservations              Transportation     89,000
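
A back-of-the-envelope calculation with the credit card figure from the table (the equipment price is an assumed number, purely for illustration):

    # Rough break-even: how many hours of avoided downtime pay for the extra gear?
    downtime_cost_per_hour = 2_600_000   # credit card authorization (table above)
    extra_equipment_cost   = 5_000_000   # assumed price of the redundant setup

    hours_to_break_even = extra_equipment_cost / downtime_cost_per_hour
    print(f"break-even after {hours_to_break_even:.1f} hours of avoided downtime")
    # -> roughly 1.9 hours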

The exact reasons for downtime are typically a well hidden secret. Few empirical studies ([Gray 86], [Gray 90], [Mourad 87], [Oppenheimer 02], [Xu 99]) are available, because neither vendors nor customers are eager to publish their failure data.

In addition, these studies are hard to compare because they use different taxonomies and report on very different types of computer systems (e.g. fault tolerant vs. non fault tolerant designs, mainframes vs. PCs, etc.).

The conventional wisdom on the street says that hardware causes only a minority of outages, while software faults and human error account for most of them.

What can be done about it?

Fault avoidance

In a perfect world we would be able to design and implement fault-free hardware and software. Since this is not the case, we have to cope with errors in design and implementation.

The computer as a design tool is no help either. Formal verification methods and tools to prove the correctness of a design have limited usability in hardware and some software areas (e.g. communication protocols), but they cannot master the complexity of today's applications or even operating systems.

The following recommendations for avoiding faults are not formal methods you can simply apply. They are at best proven guidelines or best-of-breed recipes that have worked well in the past.

KISS
Keep it Simple, Stupid! Simplicity in design allows you to master the increasing complexity in requirements, software and new technologies. In the end, simple designs work, complex ones don't.
Design and programming
Requirements and design documentation, defensive programming, code reviews
Test
separate QA teams co-located with development, extensive alpha and beta tests, test coverage tools, fault injection
Administration and operation
limit or at least master change in hardware, software and configuration

Fault tolerance

Even strict application of the above guidelines will not lead to fault-free hardware and software. It is estimated that even well engineered software contains on the order of one to three bugs per thousand lines of code.

Hardware fault tolerant machines use hardware to mask faults. Computers like the Stratus Continuum or Tandem Integrity use three (TMR) or four (pair & spare) CPUs for the job of one (and deliver only the performance of one!), and, since memory is mirrored, only 50% of the installed memory is available to the operating system and applications. If a CPU or a memory module fails, the other(s) take over without interruption. Almost all hardware failures are transparent and invisible to applications and clients, but software and human faults still lead to system outages.
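
The masking idea behind TMR can be sketched in a few lines: three redundant units compute the same result and a voter forwards the majority value, so a single faulty unit is hidden. This is a toy model, not a description of the actual Stratus or Tandem hardware:

    # Toy model of triple modular redundancy (TMR): run the same computation
    # on three "units" and let a voter pick the majority, masking one fault.
    from collections import Counter

    def tmr(replicas, *args):
        results = [replica(*args) for replica in replicas]
        value, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority - more than one unit failed")
        return value

    good = lambda x: x * x
    bad  = lambda x: x * x + 1        # simulated faulty unit

    print(tmr([good, good, bad], 7))  # 49: the single fault is masked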

In contrast, clusters use software to mask faults. If one node fails, its services fail over to surviving nodes. After reconfiguration and a short recovery (several seconds to minutes) the services continue, but the interruption is typically visible to applications and clients. The type and duration of the interruption depend on the application and the other software involved.

The fact that clusters use software to mask faults should not be confused with software fault tolerance, although your chances of surviving Heisenbugs in the operating system kernel are good. Since each node runs its own copy of the operating system and shows a different run-time behaviour (timing, number of CPUs and processes, CPU load, amount of memory and memory demand, ... just to name a few), there is a high probability that you won't hit the same bug on a different node. Unfortunately this is not true for operating system components above the kernel, since their execution environment is less dynamic.

In order to address the sticky problem of software fault tolerance in applications, you'll have to dive into the intricacies of

N-version programming
You develop multiple versions of your software (N >= 3) which receive the same input and execute in parallel. A voting algorithm checks all results and continues only if it finds a majority. The problems with this approach are the voting algorithm itself and making sure that the different versions have enough diversity.
Recovery blocks
Another multi-version approach (N >= 2) where a primary block of code is followed by an application-dependent acceptance test. If the test fails, an alternative (recovery) block is scheduled with the same input, again followed by the acceptance test. Another test failure either schedules a third recovery block or fails the whole block (see the sketch after this list).
Process pairs
A single-version technique that re-executes the same code, but at different times. Process pairs build on the assumption that most bugs in production software are Heisenbugs and therefore transient in nature: a simple re-execution will make them go away.
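
A minimal sketch of the recovery-block pattern described above; the square-root routines and the acceptance test are made up for illustration:

    # Recovery blocks: try the primary implementation, check the result with
    # an acceptance test, and fall back to alternates on failure.
    def recovery_block(alternates, acceptance_test, *args):
        for block in alternates:
            try:
                result = block(*args)
            except Exception:
                continue                  # treat an exception like a failed test
            if acceptance_test(result, *args):
                return result
        raise RuntimeError("all alternates failed the acceptance test")

    # Illustration: two independently written square-root routines.
    def primary_sqrt(x):
        return x ** 0.5

    def alternate_sqrt(x):                # simple Newton iteration
        guess = x or 1.0
        for _ in range(50):
            guess = 0.5 * (guess + x / guess)
        return guess

    accept = lambda result, x: abs(result * result - x) < 1e-6

    print(recovery_block([primary_sqrt, alternate_sqrt], accept, 2.0))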

In summary both N-version programming and recovery blocks have the potential to tolerate design bugs but come at a high price for the additional versions and the necessary voter or acceptance test. Since we're still in the software crisis where we cannot create software fast enough, both techniques are typically limited to application areas where a failure may endanger human life, e.g. in airplane flight control systems.

Process pairs, however, have found their way into today's computing environments in several variants. The original implementation [Bartlett 81] uses an active primary and a dormant backup process on a different node. In order to minimize recovery time in case of a failure, the primary keeps the backup up to date by frequently sending checkpoint messages. Should the primary fail, the backup starts executing from the last checkpoint. It may sound simple, but checkpointing process pairs are notoriously difficult to program.
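
The checkpointing idea can be sketched as follows, simulated in a single program with a file standing in for the checkpoint messages; in a real pair the backup process would of course run on a different node:

    # Sketch of a checkpointing process pair: the primary records its progress
    # after every step; after a failure the backup resumes from the last
    # checkpoint instead of starting over.
    import json, os

    CHECKPOINT = "pair.ckpt"     # stands in for checkpoint messages to the backup
    work_items = list(range(10))

    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)    # start the demo with a clean slate

    def load_checkpoint() -> int:
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["next"]
        return 0

    def run(crash_at=None):
        start = load_checkpoint()
        for i in range(start, len(work_items)):
            if i == crash_at:
                raise RuntimeError("primary failed")
            print("processing item", work_items[i])
            with open(CHECKPOINT, "w") as f:      # checkpoint after each step
                json.dump({"next": i + 1}, f)

    try:
        run(crash_at=4)          # primary dies before item 4
    except RuntimeError:
        run()                    # backup takes over at item 4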

That's why there have been several attempts to push checkpointing back into the operating system, known as transparent or automatic checkpointing. Here the operating system provides the abstraction of fault-free execution by automatically creating backup processes and checkpointing enough state under the covers so that the backup can recover without programmer intervention. So far nobody has found the holy grail of checkpointing, partly because of high resource consumption, partly because of a negative impact on performance. Some recent research even indicates that there may be no holy grail [Lowell 00].

Checkpointing process pairs are difficult to program and automatic checkpointing is costly to run, but there is a third form: persistent process pairs do not use checkpointing at all. The backup either exists as a dormant process (with no communication between the two) or does not exist at all; in the latter case some high availability service knows how to re-create the process should the primary fail. The basic requirement your program has to meet is to be stateless between invocations. In combination with transactions (which nicely clean up stale data in case of a failure), persistent process pairs have proven to be a powerful yet simple approach to dealing with failures.
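
A persistent process pair then boils down to a simple supervision loop: a stateless worker plus an HA service that re-creates it on failure. A minimal sketch under that assumption (a real cluster framework would restart the process on a surviving node rather than in a local loop):

    # Sketch of a persistent process pair: no checkpoints, just a stateless
    # worker and a supervisor that re-creates it when it fails. Combined with
    # transactions, a failed invocation simply rolls back and is retried.
    import random

    def stateless_worker(request):
        if random.random() < 0.3:            # simulated Heisenbug
            raise RuntimeError("transient failure")
        return f"handled {request}"

    def supervise(request, max_restarts=5):
        for attempt in range(max_restarts):
            try:
                return stateless_worker(request)
            except RuntimeError:
                print(f"worker failed, restart {attempt + 1}")
        raise RuntimeError("giving up")

    print(supervise("payment #42"))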

Disaster Tolerance

Many businesses that operate globally demand tolerance of disasters like earthquakes, floods, fires, storms, or terrorist and hacker attacks. These catastrophes have in common that they affect all cluster nodes in a site, i.e. all nodes are in the same fault zone. In addition, recovery from a disaster is not quick: even if backup tapes are available, new computer equipment can be leased or purchased quickly and personnel is not affected, the recovery will take weeks if not months. That is too long for most companies that operate globally and depend on being accessible.

In order to continue running the business after a disaster, one needs to invest in a backup site at a remote location. Normally the backup is a complete mirror of the primary site, with applications installed and monitored by a separate group of operators. The data is copied from the primary to the backup site either in batches of file or table snapshots or continuously via log file shipping or transaction mirroring.
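
The continuous variant can be pictured as log shipping: every committed change is appended to a log that is shipped to the backup site and replayed there. A toy model (real implementations ship database redo logs or mirror whole transactions):

    # Toy model of log shipping: the primary appends each committed change to a
    # log; the backup site replays the log to stay (nearly) in sync.
    primary_db, backup_db, shipped_log = {}, {}, []

    def commit(key, value):
        primary_db[key] = value
        shipped_log.append((key, value))     # shipped to the remote site

    def replay_at_backup():
        while shipped_log:
            key, value = shipped_log.pop(0)
            backup_db[key] = value

    commit("account:1", 100)
    commit("account:2", 250)
    replay_at_backup()
    print(backup_db)   # {'account:1': 100, 'account:2': 250}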