The storage architecture for clusters in the 70s and 80s was simple: proprietary controllers and dual-ported disks were the expensive first choice. Shared SCSI (two HBAs in different nodes controlling a single SCSI channel with multiple disks) lowered the price point but was the source of numerous reliability problems. Both technologies allowed you to implement packed partitions: disks are attached to two nodes, where the primary node actively drives the disk (and also the file system or database accessing it) while the other node serves as a backup. When the primary failed, the backup was able to take over the disks (plus the file systems and/or databases) from the primary. This is called dual exclusive access.
In contrast, disk controllers like Digital's StarCoupler enabled shared disk clones by allowing multiple nodes to access all disks. Software acting as an arbiter was necessary to determine which node could access a disk. In a shared exclusive setup (only one node controls a disk, as before) this decision is made after boot or when the cluster configuration changes. But in shared concurrent (non-exclusive) mode the arbiter needs to decide which node gets access before each I/O operation. This is typically done by a distributed lock manager with an instance running on every participating node.
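The arbitration step can be illustrated with a minimal sketch. A real distributed lock manager runs an instance on every node and reaches agreement over the cluster interconnect; the single-process `LockManager` class below is purely illustrative (all names are made up) and only shows the grant/release protocol a node would follow before each I/O in shared concurrent mode.

```python
import threading

class LockManager:
    """Toy arbiter granting exclusive access to a disk resource.

    Illustrative only: a real DLM is distributed across all nodes and
    survives node failures; this sketch runs in a single process.
    """
    def __init__(self):
        self._locks = {}                 # resource -> owning node
        self._mutex = threading.Lock()

    def acquire(self, node, resource):
        """Called before each I/O operation in shared concurrent mode."""
        with self._mutex:
            owner = self._locks.get(resource)
            if owner is None:
                self._locks[resource] = node
                return True              # access granted
            return owner == node         # re-entrant for the current owner

    def release(self, node, resource):
        """Called after the I/O completes."""
        with self._mutex:
            if self._locks.get(resource) == node:
                del self._locks[resource]

lm = LockManager()
assert lm.acquire("node-a", "disk0/block42")      # granted
assert not lm.acquire("node-b", "disk0/block42")  # denied while held
lm.release("node-a", "disk0/block42")
assert lm.acquire("node-b", "disk0/block42")      # now granted
```

In shared exclusive mode the same structure applies, but `acquire` would be called only at boot or on a cluster configuration change rather than per I/O.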
The 90s saw dramatic changes in the storage area. RAID (redundant array of inexpensive disks) subsystems brought smarter and cheaper ways to protect against disk failures and to improve disk throughput. Thanks to new high-end interconnect technologies like Fibre Channel, InfiniBand, iSCSI, HyperSCSI and others, shared concurrent access is today the rule rather than the exception, giving higher throughput and reliability. Disk attachment over FireWire (IEEE 1394) and USB enables shared concurrent access even on low budgets, although the number of configurable disks is limited and care has to be taken to avoid single points of failure.
There are many ways to configure file systems on a cluster. Which one fits is mainly determined by the usage type and by the availability and scalability requirements.
Local file systems are visible only to applications running on the same node and show excellent scalability as no synchronization among nodes is required. There is no availability advantage unless the application implements data replication.
Network file systems export local file systems via protocols like NFS, SMB and others, enabling file access for applications on remote nodes. From an implementation point of view they can be seen as a network layer stacked upon a physical file system. Depending on the configuration, network utilization and product used, applications on different nodes may see files under different path names and experience different levels of performance. Network file systems with failover may be a good choice for simple scalability clusters. In case the node serving the file system fails, a new node is selected to access the disk, recover the file system and restart the network service. To minimize the recovery time the physical file system should support journaling.
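The failover sequence just described can be sketched as an ordered recovery plan. All function and node names below are hypothetical; a real cluster manager would also fence the failed node before handing the shared disk to its successor, which is included here as the first step.

```python
def plan_failover(nodes, failed, service="nfs"):
    """Return the ordered recovery steps after the serving node fails.

    Illustrative sketch: names and steps mirror the text, not any
    particular cluster manager's API.
    """
    survivors = [n for n in nodes if n != failed]
    if not survivors:
        raise RuntimeError("no surviving node to take over")
    new_server = survivors[0]   # e.g. a simple lowest-node-id election
    return [
        f"fence {failed}",                    # keep it off the shared disk
        f"{new_server}: take over disk",
        f"{new_server}: replay journal",      # fast only with journaling
        f"{new_server}: restart {service} export",
    ]

steps = plan_failover(["node-a", "node-b", "node-c"], failed="node-a")
```

The journal-replay step is exactly why the text recommends a journaling physical file system: without it, recovery would require a full file system check, stretching the outage from seconds to potentially hours.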
Cluster file systems (CFS) remove the node boundary by creating the illusion of a single, cluster-wide file system. Among their advantages are
- a unified name space: path names are identical on all participating nodes giving benefits for both management and load balancing
- single site semantics: they often support all caching options incl. read-ahead and write-behind, cache both data and file attributes, and provide full, built-in support for file locking and atomic file updates
- improved data integrity: protection against loss of data by replicating modified buffers to a second node
Cluster file systems come in two flavours:
- distributed CFS uses a client/server architecture where one node acts as the file server with dual or shared exclusive access to the disks. The other nodes are clients which use inter-node RPC to function-ship file requests to the server. In case of failure a new node is selected to run the file service. After recovery the file system state needs to be reconstructed from stable storage or from the file system clients. Examples of this type are Sun Cluster 3.0, HP (aka Compaq aka Digital) TruCluster, SCO (aka Tandem) NonStop Cluster and its open source follow-on OpenSSI
- parallel CFS implements a peer-to-peer relationship between the participating nodes, which all require shared concurrent disk access. Nodes communicate to implement file locking and to synchronize disk I/O. In case one node fails, a short recovery undoes any ongoing modifications and frees the locks held by the failed node. Examples of parallel CFS are IBM GPFS, Sistina GFS and its open source cousin OpenGFS, and PolyServe
Relational databases are very popular among application software developers. Quite often they represent the central building block around which applications are designed. As a consequence databases on clusters face stringent scalability and availability requirements.
Without a cluster-aware database, application designers and administrators were limited to vertical scalability, i.e. the option of using large SMP or ccNUMA machines as dedicated database nodes to increase transaction throughput.
Originating in the 80s, parallel databases are today available from several vendors and show excellent scalability on clusters. As usual they come in two flavours, and there has been an almost religious debate over the question of which architecture is best. Not surprisingly, both have pros and cons.
- Parallel databases implemented as shared disk clones require read/write disk access from every participating node. The buffer cache is distributed over all nodes and a distributed lock manager takes care of row, page and table locking and I/O synchronization. Oracle 9i RAC and IBM DB2 for z/OS and OS/390 are both shared disk clones.
- When implemented as packed partitions, each node is fully responsible for only one part of the data. The responsibility includes locking and disk I/O and needs no synchronization with instances on other nodes. Requests need to be routed to the right node, and an access spanning multiple partitions requires a coordinator (but can be executed in parallel). Examples of this type are Teradata, Tandem (now HP) NonStop SQL, IBM DB2 for Unix, Linux and Windows, and Microsoft SQL Server.
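The routing requirement of the packed-partition flavour can be sketched with hash partitioning, one common scheme (range partitioning is another). Node names, keys and the `lookup` coordinator below are all invented for illustration; a stable checksum is used rather than Python's built-in `hash()`, which is randomized between runs.

```python
import zlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def owner(key):
    """Route a request to the node owning the row's partition.

    Hash partitioning over a stable checksum; real products also
    rebalance partitions when the cluster membership changes.
    """
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

def lookup(keys):
    """An access spanning multiple partitions: group the work by
    owning node, as a coordinator would before fanning the batches
    out for parallel execution and merging the results."""
    plan = {}
    for k in keys:
        plan.setdefault(owner(k), []).append(k)
    return plan

plan = lookup(["cust-17", "cust-42", "cust-99"])
```

A single-partition request needs only `owner()` and touches one node, which is the source of the architecture's scalability; only multi-partition accesses pay the coordination cost.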
HA for parallel databases almost comes for free since multiple database instances already exist. In case of a node failure, parallel databases use a rollback recovery scheme based on the transaction log: for the instance running on the failed node, all in-flight transactions are undone and committed transactions are redone. There are ways to shorten the recovery time, but it is typically in the minutes range. Only Tandem (now HP) NonStop SQL recovers in just a few seconds; its database is architected into the kernel and implemented as checkpointing process pairs.
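The undo/redo classification at the heart of this recovery scheme can be sketched as follows. The log format is heavily simplified and invented for illustration: real transaction logs carry before- and after-images of the modified pages, whereas this sketch only decides which records to redo and which to undo.

```python
def recover(log):
    """Rollback recovery from a (simplified) transaction log.

    `log` is a list of (txn_id, op) records in log order, where op is
    either a write description or "COMMIT". Writes of committed
    transactions are redone; writes of in-flight transactions are
    undone, in reverse log order.
    """
    committed = {t for t, op in log if op == "COMMIT"}
    redo, undo = [], []
    for t, op in log:
        if op == "COMMIT":
            continue
        (redo if t in committed else undo).append((t, op))
    return redo, list(reversed(undo))   # undo scans the log backwards

log = [(1, "w(A)"), (2, "w(B)"), (1, "COMMIT"), (2, "w(C)")]
redo, undo = recover(log)
# redo == [(1, "w(A)")]; undo == [(2, "w(C)"), (2, "w(B)")]
```

The minutes-range recovery time mentioned above comes from scanning and replaying this log; checkpointing shortens it by bounding how far back the redo scan must reach.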
Images ©2002 IEEE Task Force on Cluster Computing