Availability Basics

The probability of an event E is a number between zero and one, written P(E). Zero means the event never happens, one means it always happens. If you consider two independent events E1 and E2, then the probability that both happen at the same time is

P(E1 and E2) = P(E1) x P(E2)

while the probability that either E1 or E2 or both happen is

P(E1 or E2) = P(E1) + P(E2)

(The exact formula is P(E1 or E2) = P(E1) + P(E2) - P(E1) x P(E2), but since failure probabilities are typically very small numbers you can safely ignore the product P(E1) x P(E2).)

Assuming a failure probability of 0.01 per day for a typical computer (which equals 99% availability) and a cluster of two nodes, the probability that both nodes fail is

P(E1 and E2) = 0.01 x 0.01 = 0.0001

an improvement by a factor of 100!
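
If you want to check these numbers yourself, here is a minimal sketch in plain Python (the language is my choice; the per-day failure probability of 0.01 is the value assumed above). It also shows why dropping the product term from the or-formula is harmless for small probabilities.

    # Combining two independent per-day failure probabilities.
    p1 = p2 = 0.01                      # per-node failure probability

    p_both = p1 * p2                    # P(E1 and E2): both nodes fail
    p_either_exact = p1 + p2 - p1 * p2  # P(E1 or E2), exact formula
    p_either_approx = p1 + p2           # approximation for small p

    print(p_both)            # ~0.0001 -> a factor of 100 better than one node
    print(p_either_exact)    # ~0.0199
    print(p_either_approx)   # 0.02    -> the difference is negligible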

The mean time to an event E is the reciprocal of its probability

MT(E) = 1 / P(E)

With a failure probability of 0.01 per day, the Mean Time To Failure (MTTF) of the single computer is 1/0.01 or 100 days, while the two-node cluster has an MTTF of 10000 days or about 27 years.
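
Treating 0.01 as a per-day probability is what puts these mean times into days. The same arithmetic as a short Python sketch:

    # Mean time to an event as the reciprocal of its per-day probability;
    # all results are therefore in days.
    def mean_time(p):
        return 1.0 / p

    p_node = 0.01              # a single node fails on a given day
    p_cluster = p_node ** 2    # both nodes fail on the same day

    print(mean_time(p_node))             # 100 days for the single node
    print(mean_time(p_cluster))          # 10000 days for the cluster
    print(mean_time(p_cluster) / 365)    # ~27 years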

Don't get carried away: in reality cluster nodes are not as independent as one would like and as the above formulas assume. Think of identical power sources, other electrical connections between the nodes (e.g. shared SCSI), a software bug in the operating system, or the erroneous command of a tired operator, just to mention a few obstacles on your way to high availability.

On the down side, you'll see more failures with the two-node cluster, because the probability that one of the two nodes fails is

P(E1 or E2) = 0.01 + 0.01 = 0.02

which translates into an MTTF of 50 days or half that of a single node.
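
To make that concrete, a rough sketch of how often somebody has to walk up to the machines, using the same assumed per-day failure probability:

    # Expected node failures (and thus repairs) per year.
    p_node = 0.01                    # per-day failure probability per node

    single_node = 365 * p_node       # ~3.7 repairs per year
    two_nodes = 365 * 2 * p_node     # ~7.3 repairs per year

    print(single_node, two_nodes)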

A more realistic availability calculation needs to take repair into account. Without repair you would experience a complete loss of service from the two-node cluster after MTTF/2 + MTTF days on average: the first of the two nodes fails after MTTF/2 days, and the surviving node then runs for another MTTF days. Using the above values this yields an MTTF of 150 days (an improvement by a factor of 1.5) for a two-node cluster that costs 2 + X times as much as the single computer (X being the cost of the cluster infrastructure). Not a very efficient use of your money.
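
As a sanity check of the 150 days, the no-repair case in the same sketch style:

    # Two-node cluster without repair: the first node failure is expected
    # after MTTF/2 days, the surviving node then lasts another MTTF days.
    mttf = 100.0                          # single-node MTTF in days

    loss_of_service = mttf / 2 + mttf     # 150 days
    print(loss_of_service)
    print(loss_of_service / mttf)         # improvement factor of only 1.5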

While MTTF measures the average time a module is in service, Mean Time To Repair (MTTR) quantifies the average duration of a module's service interruption. Module availability can then be expressed as the ratio of service accomplishment to elapsed time

MTTF / (MTTF + MTTR)
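
As a small helper function (a sketch; the name availability is mine, not from any library):

    # Steady-state availability from mean time to failure and mean time
    # to repair, both in the same unit (days here).
    def availability(mttf, mttr):
        return mttf / (mttf + mttr)

    print(availability(100.0, 1.0))   # single node, one-day repair: ~0.99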

Using a few mathematical transformations explained in [Gray 93], we can express the mean time to failure of a two-node cluster with repair as

MTTF2N = MTTF^2 / (2 x MTTR)

Assuming the same 99% availability for the single node (an MTTF of 100 days) and an MTTR of 24 hours gives us 100^2 / (2 x 1) = 5000 days, or roughly 13.7 years, between cluster outages. A much better return on investment. To see the importance of a short repair time, assume a one-week MTTR instead: the cluster MTTF drops to about two years.
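
The same calculation as a sketch, with both repair times from the text:

    # MTTF of a two-node cluster with repair, using the formula above:
    # MTTF2N = MTTF^2 / (2 x MTTR). All times in days.
    def cluster_mttf(mttf, mttr):
        return mttf ** 2 / (2 * mttr)

    print(cluster_mttf(100.0, mttr=1.0) / 365)   # 24-hour MTTR:  ~13.7 years
    print(cluster_mttf(100.0, mttr=7.0) / 365)   # one-week MTTR: ~2 years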