Management

The installation, operation, administration and maintenance (OAM) of large clusters can be the source of major headaches if appropriate concepts are not implemented and suitable cluster management software is not available. It is no coincidence that the early clusters supported only a very limited number of nodes, and even today you will find clusters that support only two or four nodes. If every node has to be installed, configured, operated and administered as a separate machine, the management problem gets out of hand very quickly.

Installation

From an installation point of view clusters fall into two camps: clusters that use a separate system disk per node (SDPN) and those that don't. This decision has a strong influence on all aspects of cluster management. The former is easier to achieve but requires that a copy of the operating system be installed on every node. An advanced cluster using SDPN will ship with a software manager that replicates the operating system, and later updates to it, across all nodes automatically.
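
What such a manager automates can be pictured with a short sketch in Python; the node names, image path and use of scp are illustrative assumptions, not a description of any particular product:

    import subprocess

    NODES = ["node001", "node002", "node003"]   # hypothetical cluster members
    OS_IMAGE = "/images/os-release.tar"         # hypothetical golden image

    def push_image(node):
        """Copy the golden operating system image to one node's staging area."""
        result = subprocess.run(["scp", OS_IMAGE, node + ":/var/staging/"],
                                capture_output=True)
        return result.returncode == 0

    failed = [n for n in NODES if not push_image(n)]
    print("pushed to %d nodes, %d failures" % (len(NODES) - len(failed), len(failed)))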

The alternative to SDPN is a system disk per cluster (SDPC). Here the operating system is installed only once, on the first node. All other nodes boot via the cluster interconnect and discover the system disk (and any other file systems that already exist) when they join the cluster. SDPC requires a substantial amount of cluster software but results in faster installations.
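
The join-time discovery can be imagined as a small request/response exchange over the interconnect. The following Python sketch assumes an invented discovery port and message format; real SDPC implementations will differ:

    import json, socket

    CLUSTER_PORT = 7007  # hypothetical discovery port on the interconnect

    def discover_filesystems(seed_node):
        """Ask an existing member where the system disk and other file systems live."""
        with socket.create_connection((seed_node, CLUSTER_PORT), timeout=5) as s:
            s.sendall(b'{"op": "discover"}')
            reply = s.recv(4096)
        # expected reply shape: {"system_disk": "...", "filesystems": [...]}
        return json.loads(reply)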

In summary, clusters with a system disk per node are easier to build but must install and maintain an operating system copy on every node, while clusters with a system disk per cluster demand substantially more cluster software but install far more quickly.

Configuration

If you consider the number of configuration parameters that you need to specify for a single machine and multiply this by 512 or 1024, you get an idea of the potential size of the problem. Hardware auto-detection, self-configuration, useful defaults and intelligent cluster management software are proven ways out of this dilemma.
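
The combination of defaults, auto-detection and explicit settings boils down to layered parameter resolution, as this Python sketch illustrates; the parameter names are invented:

    def resolve_config(defaults, detected, overrides):
        """Later layers win: defaults, then hardware auto-detection, then overrides."""
        merged = dict(defaults)
        merged.update(detected)    # values found by probing the hardware
        merged.update(overrides)   # values an administrator set for this node
        return merged

    defaults  = {"io_timeout_s": 30, "scheduler": "fair"}
    detected  = {"cpus": 16, "memory_gb": 64}
    overrides = {"io_timeout_s": 10}
    print(resolve_config(defaults, detected, overrides))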

In addition, since the values of configuration parameters are often tied to particular software versions, a configuration database for the cluster is highly desirable.
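
A minimal sketch of such a database, keyed by software version, can be built with sqlite3 from the Python standard library; the table layout and parameter names are assumptions for illustration:

    import sqlite3

    db = sqlite3.connect("cluster_config.db")
    db.execute("""CREATE TABLE IF NOT EXISTS config
                  (sw_version TEXT, parameter TEXT, value TEXT,
                   PRIMARY KEY (sw_version, parameter))""")
    db.execute("INSERT OR REPLACE INTO config VALUES (?, ?, ?)",
               ("2.4", "io_timeout_s", "10"))
    db.commit()

    # Look up the value that matches the software version actually running.
    row = db.execute("SELECT value FROM config WHERE sw_version = ? AND parameter = ?",
                     ("2.4", "io_timeout_s")).fetchone()
    print(row[0])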

Operating, Administration and Maintenance

Various reports on outage distribution indicate that system management is a major cause: between 15 and 60 percent of all system outages are attributed to OAM. As a consequence, the management of a cluster should be automated as much as possible.

The cluster membership service maintains the state of the cluster and is the first candidate that needs to function without operator involvement. It lets nodes join and leave the cluster, validates node availability and triggers recovery in case a node fails. That may sound simple, but avoiding split-brain problems, guaranteeing that nodes which were declared down can do no harm (via I/O fencing, switching off power or poison packets) and re-establishing a reliable cluster state after a power failure are among the hard issues.
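
The core of the split-brain defence is a quorum rule: only a partition that can reach a strict majority of the configured nodes may keep running, and every unreachable member is fenced before recovery starts. The Python sketch below shows the rule itself; the fencing action is a stand-in for I/O fencing, power switching or poison packets:

    def has_quorum(reachable, cluster_size):
        """Only a partition holding a strict majority may continue operating."""
        return reachable > cluster_size // 2

    def fence(node):
        """Ensure a node declared down can no longer touch shared state."""
        print("fencing %s (cut its access to the shared disks)" % node)

    cluster_size = 8
    reachable = ["node1", "node2", "node3", "node4", "node5"]
    unreachable = ["node6", "node7", "node8"]

    if has_quorum(len(reachable), cluster_size):
        for node in unreachable:
            fence(node)            # must happen before any recovery begins
    else:
        raise SystemExit("no quorum: this partition must stop, not recover")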

Recovery after a failure needs to be initiated without human intervention as far as possible. The failure of a node, disk or communication link is not business as usual, and operators have a tendency to go into panic mode when it happens.
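
Operator-less recovery can be reduced to a monitor that notices a missed heartbeat and walks through scripted steps instead of paging a human first. The following Python sketch uses an invented timeout and invented step names:

    import time

    HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before a node is suspect

    def check_and_recover(last_seen):
        """last_seen maps node name to the timestamp of its last heartbeat."""
        now = time.time()
        for node, ts in last_seen.items():
            if now - ts > HEARTBEAT_TIMEOUT:
                for step in ("fence node", "fail over services", "rebalance load"):
                    print("%s: %s" % (node, step))   # each step is scripted, not manual

    check_and_recover({"node3": time.time() - 60, "node4": time.time()})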

Since clusters are (ideally) single entities, they need to be managed from a single console. The sheer number of devices and services makes it necessary to filter exception events through an automated operator that summarizes and archives events and guides the operations staff through repair, reintegration and fallback.
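
A toy version of such an automated operator is easy to picture: raw exception events are collapsed into one summary line per device and event kind, so the console shows a digest instead of thousands of raw messages. The event format and severity threshold below are illustrative assumptions:

    from collections import Counter

    raw_events = [
        ("disk17", "retry"), ("disk17", "retry"), ("disk17", "retry"),
        ("node42", "link-down"), ("disk17", "retry"),
    ]

    # Collapse the raw stream into one line per (source, kind) pair.
    summary = Counter(raw_events)
    for (source, kind), count in summary.items():
        severity = "ATTENTION" if count >= 3 else "info"
        print("[%s] %s: %dx %s" % (severity, source, count, kind))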