Cluster Technology

Cluster configurations cover a wide range. At one end are pure scalability clusters: a collection of ordinary PCs with local disks connected via 100 or 1000 Mbit Ethernet, plus some cluster management software to ease installation and maintenance, a parallel execution environment (e.g. a cluster-aware scheduler, load balancing, MPI, PVM) and parallel applications.
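
As a taste of such a parallel execution environment, here is a minimal MPI program; this is only a sketch and assumes an MPI implementation such as MPICH or Open MPI together with its mpirun launcher. Each process simply reports its rank:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI program: started once per node (or CPU) by mpirun,
     * each process learns its rank and the total number of processes. */
    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("process %d of %d reporting\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and started with e.g. mpirun -np 8, the same binary runs across the cluster nodes, which is exactly the programming model these scalability clusters are built for.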

At the other, more complex end are availability or hybrid clusters, built e.g. from rack-mounted blades connected via special high-bandwidth, low-latency networks and a full-featured SAN. The operating system is fully cluster aware and creates the illusion of a single computing environment: it offers a cluster file system, a cluster virtual IP, horizontal scalability, and hardware and software upgrades or repairs on the fly (i.e. without downtime). Commercially available distributed, scalable databases and transaction monitors complete the picture.

Some aspects of the technology are discussed in the following pages. The text is meant as a starter, not as a complete reference or online guide to clusters. For much more complete coverage, get yourself a copy of Greg F. Pfister's excellent textbook [Pfister 98]. Do not hesitate to contact us or visit the Services section.

Terminology

Terminological confusion is huge throughout IT, not just in cluster computing. Fault tolerance, high availability, resilience and reliability, for example, are all facets of the desire to make computing less error prone. Each has been defined more than once, and the meaning often depends on the author. I have therefore resisted the temptation to define my own terminology and will instead use the one outlined in the article "Improve your scalability vocabulary" by Bill Devlin, Jim Gray, Bill Laing and George Spix of Microsoft Research. The paper not only lays out a simple, consistent set of terms around scalability and availability but also nicely summarizes what has been achieved over the last 30 years. The remainder of this page uses excerpts from the article as far as they are useful in the context of this site. The original paper introduces additional terms and basic design issues and can be found at Microsoft.com and Clustercomputing.org.

Shared nothing clones

Clusters may grow in two ways: cloning and partitioning. A service (e.g. an Apache web server) can be cloned (replicated) on many nodes of a cluster, each node having the same software and (access to) the same data. If demand approaches the current capacity, an operator can duplicate software and data on another node and use a load balancing system to distribute the workload among the clones. Cloning increases CPU power, memory bandwidth, network capacity and disk read throughput.
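
A minimal sketch of the load balancing idea, with hypothetical clone addresses and no real networking: each incoming request is simply handed to the next clone in turn.

    #include <stdio.h>

    #define NUM_CLONES 3

    /* Hypothetical addresses of the cloned servers; a real load balancer
     * would read them from its configuration. */
    static const char *clones[NUM_CLONES] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };

    /* Round-robin selection: each request goes to the next clone in turn,
     * spreading CPU, memory and disk-read load across the nodes. */
    static const char *pick_clone(void)
    {
        static int next = 0;
        const char *node = clones[next];
        next = (next + 1) % NUM_CLONES;
        return node;
    }

    int main(void)
    {
        for (int req = 0; req < 6; req++)
            printf("request %d -> clone %s\n", req, pick_clone());
        return 0;
    }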

Clones offer both scalability (as outlined above) and availability: if one clone fails, the other nodes continue to offer service. If the load balancer is notified of the failure, it can be masked completely, i.e. transparently to clients.
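
Extending the round-robin sketch above, masking a node failure only requires the load balancer to skip clones that its health checks have marked down (again a sketch with hypothetical addresses):

    #include <stdio.h>

    #define NUM_CLONES 3

    static const char *clones[NUM_CLONES] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
    static int alive[NUM_CLONES]          = { 1, 1, 1 };   /* updated by health checks */

    /* As before, but clones reported down are skipped, so a node failure
     * stays invisible to clients as long as at least one clone is up. */
    static const char *pick_clone(void)
    {
        static int next = 0;
        for (int tried = 0; tried < NUM_CLONES; tried++) {
            int i = next;
            next = (next + 1) % NUM_CLONES;
            if (alive[i])
                return clones[i];
        }
        return NULL;                                       /* all clones down */
    }

    int main(void)
    {
        alive[1] = 0;                                      /* clone 1 fails */
        for (int req = 0; req < 4; req++)
            printf("request %d -> clone %s\n", req, pick_clone());
        return 0;
    }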

Shared disk clones

Diskless clones are useful mainly in HPC environments where CPU horsepower and memory bandwidth matter most. Shared nothing clones typically use only local disks, which are not accessible by other nodes. They are best suited for read-only applications, since every write must be performed on every node: write throughput is limited, and guaranteeing data consistency becomes a challenge in larger clusters. Shared disk clones use a common HA storage manager that gives concurrent read/write disk access to all nodes or to a subset of them. In large clusters diskless, shared nothing and shared disk clones may co-exist.
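
The write problem can be illustrated schematically: with shared nothing clones an update has to be applied by every clone, while shared disk clones write the single common copy once. The example below only prints what would happen; it is not a real storage layer.

    #include <stdio.h>

    #define NUM_CLONES 4

    /* Shared nothing clones: each node holds its own copy of the data,
     * so an update must be sent to and written by every clone. */
    static void update_shared_nothing(const char *record)
    {
        for (int c = 0; c < NUM_CLONES; c++)
            printf("clone %d writes '%s' to its local disk\n", c, record);
    }

    /* Shared disk clones: all nodes access one common copy through the
     * storage manager, so a single write suffices (at the price of
     * distributed locking to keep concurrent writers consistent). */
    static void update_shared_disk(const char *record)
    {
        printf("one clone writes '%s' to the shared disk\n", record);
    }

    int main(void)
    {
        update_shared_nothing("price=42");
        update_shared_disk("price=42");
        return 0;
    }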

Partitions

Partitions grow a service by cloning the software and dividing the data among the nodes, following the Roman maxim divide et impera, divide and rule. In addition to the advantages of cloning, partitioning adds disk read and write throughput each time a node is added. To be transparent to applications, partitions need additional software that routes each request to the node holding the relevant data.
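
A common way to divide the data is to hash each key onto one of the partitions; the routing layer then sends every request to the node that owns the key. A minimal sketch with hypothetical node names:

    #include <stdio.h>

    #define NUM_PARTITIONS 4

    /* Hypothetical node names; a real cluster keeps a partition map. */
    static const char *nodes[NUM_PARTITIONS] = { "nodeA", "nodeB", "nodeC", "nodeD" };

    /* Simple string hash (djb2); it decides which partition owns a key. */
    static unsigned long hash(const char *key)
    {
        unsigned long h = 5381;
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return h;
    }

    /* The routing layer sends each request to the node owning the key. */
    static const char *owner(const char *key)
    {
        return nodes[hash(key) % NUM_PARTITIONS];
    }

    int main(void)
    {
        const char *keys[] = { "customer-1001", "customer-1002", "order-77", "invoice-3" };
        for (int i = 0; i < 4; i++)
            printf("%s lives on %s\n", keys[i], owner(keys[i]));
        return 0;
    }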

Packs

Partitions do not improve availability, since the data is stored in only one place. Duplexed disks (RAID level 1) and parity protection (RAID level 5) can mask storage failures, but to protect against hardware, software and other failures, partitions are implemented as packs of two or more nodes that each provide access to the storage. Strictly speaking, a shared nothing partition cannot be packed, because packing requires some kind of disk sharing; in the literature this fact is typically ignored. Tandem's NonStop system, for example, is always described as a shared nothing design although each duplexed disk is accessible by two nodes. With storage managers giving shared access to a large number of nodes, the difference between a shared nothing pack, a shared disk clone and a shared disk pack boils down to a configuration option.
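
A pack can be pictured as a partition served by a primary node with a backup that can reach the same duplexed or shared disks; when the primary fails, the backup takes over. A minimal sketch with hypothetical node names (real packs rely on a cluster manager for membership, fencing and disk takeover):

    #include <stdio.h>

    /* A pack: two nodes that can both reach the same duplexed or shared
     * disks. Hypothetical names; a real pack is driven by the cluster
     * manager, not by a simple flag. */
    struct pack {
        const char *primary;
        const char *backup;
        int primary_up;
    };

    /* Requests for the partition go to the primary while it is up,
     * and to the backup after a failure. */
    static const char *serving_node(const struct pack *p)
    {
        return p->primary_up ? p->primary : p->backup;
    }

    int main(void)
    {
        struct pack orders = { "node1", "node2", 1 };

        printf("orders partition served by %s\n", serving_node(&orders));
        orders.primary_up = 0;   /* node1 fails; node2 takes over its disks */
        printf("orders partition served by %s\n", serving_node(&orders));
        return 0;
    }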