# Availability Basics

The probability of an event **E** is a number between zero and one, written **P(E)**. Zero means it never happens, one means it always happens. If you consider two independent events E1 and E2, then the probability that both happen at the same time is

P(E1 and E2) = P(E1) x P(E2)

while the probability that either E1 or E2 or both happen is approximately

P(E1 or E2) ≈ P(E1) + P(E2)

(The exact formula is P(E1 or E2) = P(E1) + P(E2) - P(E1) x P(E2), but since failure probabilities are typically very small numbers you can safely ignore the product P(E1) x P(E2).)
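To make the arithmetic concrete, here is a minimal Python sketch of the two rules (the function names are ours, chosen for illustration, not from any library):

```python
# A minimal sketch combining independent failure probabilities.
# p1 and p2 are hypothetical per-day failure probabilities.

def p_both(p1, p2):
    """Probability that two independent events both happen: P(E1) x P(E2)."""
    return p1 * p2

def p_either(p1, p2):
    """Exact probability that at least one of two independent events happens."""
    return p1 + p2 - p1 * p2

p1 = p2 = 0.01
print(p_both(p1, p2))    # roughly 0.0001: both nodes fail
print(p_either(p1, p2))  # roughly 0.0199, close to the 0.02 approximation
```

Note how small the neglected product term is: for probabilities around 0.01 it changes the result only in the fourth decimal place.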

Assuming a failure probability of 0.01 per day for a typical computer (equal to 99% availability) and a cluster of two nodes, the probability that both nodes fail is

P(both fail) = 0.01 x 0.01 = 0.0001

an improvement by a factor of 100!

The mean time to event E is the reciprocal of its probability

MTTE = 1 / P(E)

With a failure probability of 0.01 the **Mean Time To Failure (MTTF)** for the computer would be
1/0.01 or 100 days while the two-node cluster has an MTTF of 10000 days or 27 years.
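The reciprocal relationship is easy to check in code (a sketch, assuming the probabilities are per-day values as above):

```python
def mttf_days(p_fail_per_day):
    """Mean time to failure in days: the reciprocal of the daily failure probability."""
    return 1.0 / p_fail_per_day

print(mttf_days(0.01))    # ~100 days for a single node
print(mttf_days(0.0001))  # ~10000 days (roughly 27 years) for both nodes failing
```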

Don't get carried away: in reality, cluster nodes are not as independent as one would like and as the above formulas assume. Think of identical power sources, other electrical connections between nodes (e.g. shared SCSI), a software bug in the operating system, or the erroneous command of a tired operator, to mention just a few obstacles on your way to high availability.

On the down side, you'll see more failures with the two-node cluster, because the probability that one of the nodes fails is

P(one fails) ≈ 0.01 + 0.01 = 0.02

which translates into an MTTF of 50 days, or half that of a single node.

A more realistic availability calculation needs to take repair into account.
Without repair, you would experience a complete loss of service from your two-node cluster
on average after **MTTF/2 + MTTF** days: the first failure hits either of the two nodes, so it
arrives after MTTF/2, and the surviving node then runs for another full MTTF. Using the above
values yields a cluster MTTF of 150 days (an improvement of factor 1.5) at a cost factor of
2 + X over the single computer (X being the cost of the cluster infrastructure). Not a very
efficient use of your money.
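The 150-day figure follows directly from the MTTF/2 + MTTF rule; as a quick sketch:

```python
mttf_node = 100.0  # days, from the 1% daily failure probability

# First failure hits either node, so it arrives after MTTF/2 on average;
# the surviving node then runs for another full MTTF before total loss.
mttf_cluster_no_repair = mttf_node / 2 + mttf_node
print(mttf_cluster_no_repair)  # 150.0 days
```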

While MTTF measures the average time a module is in service, **Mean Time To Repair (MTTR)**
quantifies a module's service interruption. Module availability can then be expressed as
the ratio of service accomplishment to elapsed time

Availability = MTTF / (MTTF + MTTR)
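As a sketch (assuming MTTF and MTTR are given in the same unit, here days), the availability ratio can be computed as:

```python
def availability(mttf, mttr):
    """Fraction of elapsed time the module is in service: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

print(availability(100, 1))  # ~0.99: an MTTF of 100 days and an MTTR of one day
```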

Using a few mathematical transformations explained in [Gray 93] lets us express the mean time to failure of a two-node cluster as

MTTF_2N = MTTF^2 / (2 x MTTR)

Assuming the same 99% availability for the single node and a 24-hour MTTR gives us 100^2 / (2 x 1) = 5000 days, or about 13 years, of cluster MTTF. A much better return on investment. To see the importance of a short repair time, assume a one-week MTTR instead: it drops the cluster MTTF to about two years.
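Both numbers fall out of the [Gray 93] formula; a short sketch (the function name is ours, and MTTF and MTTR must use the same unit, here days):

```python
def cluster_mttf(mttf_node, mttr):
    """MTTF of a two-node cluster with repair: MTTF^2 / (2 x MTTR), per [Gray 93]."""
    return mttf_node ** 2 / (2 * mttr)

print(cluster_mttf(100, 1))  # 5000.0 days (about 13 years) with a one-day MTTR
print(cluster_mttf(100, 7))  # ~714 days (about two years) with a one-week MTTR
```

The quadratic dependence on the node MTTF, divided by the repair time, is what makes fast repair so valuable: halving MTTR doubles the cluster MTTF.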