A Modern Taxonomy of High Availability

Ron I. Resnick
resnick@interlog.com
http://www.interlog.com/~resnick/ron.html

(c) 1996. All Rights Reserved. This document may be freely distributed and copied provided this copyright notice is preserved.


Table of Contents


Introduction

This document defines and discusses the basic principles, assumptions and techniques employed in providing software based Fault Tolerance and High Availability in modern systems. It also surveys some of the leading commercial technologies and assesses their relative strengths and characteristics.

This document was originally written as an internal assessment of High Availability design approaches and technologies for Nortel. Such proprietary references and considerations have been omitted to make the material suitable for a more public audience. Any remaining idiosyncrasies in the presentation style or content can be attributed to this editing. In particular, a basic familiarity with the terminology of modern distributed object systems is assumed, since this was the background of the internal audience originally addressed.

The survey begins with some definitions, which formally lay out terms such as system, faults, failures, Basic, High and Continuous Availability, and replication. The usage of these terms in this document is generally consistent with industry practice, but where there are discrepancies, these are pointed out.

The document then goes on to consider the policies and considerations which the builders of a software system should go through when contemplating what level of Availability is appropriate for their project, and which class of technologies is most suitable.

Finally, it concludes with a survey of some commercially available products which address some aspect of the general High Availability problem.

Acronyms and Abbreviations

BA - Basic Availability: A system engineered to offer its functionality, but which makes no provision to mask failures.
CA - Continuous Availability: Complete masking of planned and unplanned outages. CA is a superset of High Availability, and assumes a strictly Hot Standby (Active Replication) model.
HA - High Availability: Automatic masking of unplanned outages; can be Cold, Warm or Hot Standby.
MM - Manual Masking: A "pseudo" (and weak) form of HA, which has replication built into the system but requires manual detection and recovery action, and so does not satisfy the HA condition of automatic masking.
CS - Cold Standby: Weakest form of High Availability; replicas are entirely initialised subsequent to failures.
WS - Warm Standby: Moderate form of High Availability; replicas are partially initialised, and may hold some processing state in preparation for failures.
HS - Hot Standby: Strongest form of High Availability; "replicas" share complete state with "primaries". There is in fact no distinction between primaries and standbys, making the term a misnomer.
AR - Active Replication: Technique used to achieve "Hot Standby" availability.
LR - Passive (Lazy) Replication: Set of techniques used to achieve Warm Standby.
MFP - Make Forward Progress: A term applied to Active Replication/Hot Standby systems, emphasising that clients of the system are not interrupted in the event of failures, and continue to progress their work.
RR - Rollback and Recover: A term applied to Warm and Cold Standby systems, emphasising that clients of the system are interrupted after a failure, and must roll back to a consistent (committed) state of processing before beginning the recovery process.
MTTF - Mean Time To Failure: Average interval of time that a component will operate before failing.
MTTR - Mean Time To Restore: Average amount of time needed to repair a component, recover a system, or otherwise restore service after a failure.
MTBF - Mean Time Between Failures: MTBF = MTTF + MTTR.
Availability: A = MTTF / (MTTR + MTTF).
HVAC - Heating, Ventilation and Air Conditioning equipment.
UPS - Uninterruptible Power Supply.
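
To make the Availability formula concrete, the following small Python sketch (the component figures are hypothetical, chosen only for illustration) computes availability and the corresponding expected downtime from MTTF and MTTR:

    # Availability as defined above: A = MTTF / (MTTF + MTTR).
    # The component figures below are hypothetical, for illustration only.

    def availability(mttf_hours, mttr_hours):
        """Fraction of time a component is able to deliver service."""
        return mttf_hours / (mttf_hours + mttr_hours)

    mttf = 2000.0   # mean time to failure, in hours (hypothetical)
    mttr = 4.0      # mean time to restore, in hours (hypothetical)
    a = availability(mttf, mttr)
    downtime_per_year = (1.0 - a) * 365 * 24
    print("Availability: %.5f" % a)                    # ~0.99800
    print("MTBF: %.1f hours" % (mttf + mttr))          # MTBF = MTTF + MTTR
    print("Expected downtime: %.1f hours/year" % downtime_per_year)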

Definitions

The terms relevant to a discussion of High Availability (HA) and Continuous Availability (CA) are now defined, to clarify the usage in later sections.

System

The system under discussion includes all those components and aspects of the execution environment to which the design for Availability applies. This does not mean that all of these components are necessarily made Highly Available, but only that they lie in the scope of the "system", and that the design for HA is aware of them.

Included Components and Aspects

The system includes the following components and aspects:

  1. The server hardware used in a distributed deployment.
  2. The disks and file systems associated with the server hardware.
  3. The application software deployed as part of the server environment.
  4. The system software and middleware deployed as part of the server environment (e.g. ORB, Naming Services, Database, management and administration utilities, etc.). This category includes both third party software as well as software constructed as part of the application development effort.
  5. The network connections (LAN and WAN) spanning the distributed environment.
  6. The desktop client hardware used in a distributed environment.
  7. The application software deployed on the desktop client hardware.
  8. The system software and middleware deployed as part of the client environment (e.g. ORB, GUI tools).
  9. The user communities (people) who manipulate Presentation GUI clients in order to operate on applications in the server environment.
  10. The administrative communities (people) who administer and maintain the deployed system and its user community.
  11. The administrative procedures and policies used in the operational environment (e.g. data backup and archival, user profile management, etc.)

Non-Included Components and Aspects

The following items are not part of the runtime system to which an Availability solution typically applies.

  1. The Build Environments, including definition and creation environments used to configure the behaviour of the system, but which are not themselves part of the runtime system. These are considered "off-line" and not in the scope of a discussion of HA.
  2. Legacy systems which the system interfaces to and interacts with, but which are not part of the system per se. They may have HA designs of their own, but these are not coordinated to the system's design.
  3. User communities "once removed" from the computing environment. This includes, for example, customers who interact with users of the system but who do not interact directly with it. It does not include customers who have onsite Customer Premise Equipment which they use to interact directly with the system; such users fall into the category discussed in item 9 above. This category also includes other personnel (management, planning, etc.) in the organization in which the system is deployed who do not interact directly with the system.

System Diagram

Based upon the definitions above, the following diagram illustrates the system with which the HA design is concerned:

Figure 1 : "System" for purposes of HA Design (Shaded gray). Components excluded from the system are in unshaded areas.

Faults & Outages

Faults

A fault is a deviation from the expected behaviour of the system. In other words, if the system is specified to exhibit a certain functionality, and in the process of execution the system produces a discernibly different functionality, a fault has occurred. Functionality is typically delivered from the system by enacting a procedure to execute the logic contained in software that runs in a hardware environment (containing client and server machines, networks, data storage, and other peripherals). Faults can occur in any of the procedures, software, or hardware. Faults can be classified as reproducible (a prescribed set of actions leads to the observance of the fault in a predictable manner) or non-reproducible (the appearance of the fault is random, or is linked to a root cause outside of the environment for which the system was engineered).

Note that the literature makes a distinction between faults and failures: Faults are defined as non-compliances within the system which may or may not be externally visible to the end user, while the set of Failures contains those faults which are externally visible. This distinction is not observed in this document, and the terms Fault and Failure are used synonymously.

There are different types of faults:

  1. An error in design, in which the system is specified to produce behaviour A, but is incorrectly designed to produce behaviour B. Such faults are typically reproducible (they occur repeatedly with identical results). Such faults can be in hardware, software or procedures. Example: the system is specified to calculate a+b but is incorrectly designed or implemented to calculate a-b. Example: the system is specified to authenticate a user before granting access, but the design and/or implementation fails to do so.
  2. A failure in a hardware component. Such faults are typically non-reproducible (they occur on a one-time basis). Example: a hard drive crashes. Example: a network connection is broken. Example: a server panics.
  3. An overloading of engineered system resources. Such faults can be reproducible or non-reproducible (i.e. the fault reappears whenever the overload condition occurs, but the condition may or may not repeat). They can occur in hardware, software or procedures. Example: 100 clients attempt to connect to a system engineered to support 50. Example: A daemon attempts to spawn 1000 processes on a machine configured to support 500.
  4. An error in executing procedures. Such faults can be reproducible or non-reproducible (i.e. the fault reappears whenever the procedure is not executed correctly, but the incorrect execution may or may not be repeated). Example: the procedure calls for performing nightly backups, but the procedure is ignored. Example: the procedure states that executing "kill -9" operations on certain server processes is unsupported, but the operation is performed regardless.

There are other fault types not necessarily covered in the above categories (e.g. race conditions between software processes, data garbled in transmission over a network, etc.). There are further distinctions between benign faults and malicious (or Byzantine) faults, but a discussion of this is beyond the present scope. A basic distinction can be drawn, however, between reproducible and non-reproducible faults. A further distinction can be made between "hard" reproducible faults (ones which occur identically on every run with the same inputs) and "soft" reproducible faults (ones which may occur, with a certain probability, on identical runs). These distinctions are important in a discussion of High Availability, since HA is quite useful in dealing with "soft" reproducible and non-reproducible faults, but less effective against "hard" reproducible faults.

Outages

An outage is a deviation from the specified behaviour of the system, whether expected or unexpected. In other words, outages include all the fault categories discussed above (these are known as unplanned outages) as well as planned outages. In the latter case, the system (or parts of it) is intentionally prevented from delivering its specified functionality in order to perform maintenance operations, software upgrades, etc.

Basic Availability

A system which is designed, implemented and deployed with sufficient components (hardware, software and procedures) to satisfy the system's functional requirements, but no more, has Basic Availability (BA). Such a system will deliver the correct functionality so long as no faults occur and no maintenance operations, etc. are performed. Whenever a fault occurs or a maintenance operation is performed, however, an outage may be observed. Typically, Basic Availability systems are deployed as simplex (non-replicated) systems.

High Availability

A system which is designed, implemented and deployed with sufficient components to satisfy the system's functional requirements but which also has sufficient redundancy in components (hardware, software and procedures) to mask certain defined faults, has High Availability (HA). This definition is ambiguous: the terms "sufficient", "mask" and "certain" require further clarification. Before doing so, however, it is emphasized that, due to this ambiguity, there is a continuum of configurations which can be classified as "High Availability" under this definition.

The ambiguous terms are now further defined:

Continuous Availability

Continuous Availability (CA) extends the definition of High Availability, as given in the previous section, and applies it to planned outages as well. Recall that faults were identified as unplanned outages, but that the total set of outages to a system consists both of unplanned outages as well as planned outages. High Availability, as defined, compensates for unplanned outages. Continuously Available systems must have a masking strategy that deals with planned outages as well.

Implicit in the more strenuous definition is that, whereas the definition of HA permitted a variety of masking strategies (cold, warm and hot standby), a CA system is confined to being exclusively Hot Standby/Active Replication: transparent masking must be complete, not just for faults, but for planned outages as well.

To recognize the difficulties inherent in this, consider the implications for a software upgrade. Initially, two identical processes are replicas of each other (see Figure 2, step 1.). To perform an upgrade, one replica is taken out of service, and replaced with the upgraded Version B (steps 2. and 3.). When it comes on-line, Replica 2 must absorb the processing state from its (Version A) peer. That peer is then shut down, at which time clients are left with only the Version B replica (step 4.). Finally, the fully replicated system is recreated with both replicas upgraded to Version B. Note that to achieve this, server versions A and B must be sufficiently compatible to share processing state between them. Also, clients (which typically have an independent lifecycle, represented as Version X), must be able to transparently interoperate across server versions A and B. Similar issues arise when the clients are upgraded from X to Y. All of these issues of compatibility depend upon the application state and behaviour; there are no "easy" middleware solutions to these requirements. In fact, if the upgrade from Version A to B is significant enough (e.g. a large amount of new functionality added in Version B), it may not be possible at all.

Figure 2 : Lifecycle of a software upgrade in a "Continuously Available" system.
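
The sequence just described can be made explicit with a minimal sketch. The cluster operations below (transfer_state, the compatibility check, and so on) are invented names, not any product's API; the point is simply the ordering of steps and the version-compatibility constraint noted above:

    # Hypothetical sketch of the upgrade lifecycle of Figure 2.
    # The Replica class and helper functions are illustrative, not a real product API.

    class Replica:
        def __init__(self, name, version):
            self.name, self.version, self.active = name, version, False

    def versions_compatible(old_version, new_version):
        # In practice this depends entirely on application state and behaviour;
        # here we simply assume adjacent versions can share processing state.
        return True

    def transfer_state(src, dst):
        pass    # placeholder for the application-specific state transfer

    def rolling_upgrade(r1, r2, new_version):
        r2.active = False                   # steps 2/3: take one replica out of service,
        r2.version = new_version            # reinstall it at Version B
        if not versions_compatible(r1.version, r2.version):
            raise RuntimeError("versions cannot share processing state; upgrade blocked")
        transfer_state(src=r1, dst=r2)      # Replica 2 absorbs state from its Version A peer
        r2.active = True
        r1.active = False                   # step 4: shut down the Version A peer;
                                            # clients now see only the Version B replica
        r1.version = new_version
        transfer_state(src=r2, dst=r1)      # finally, restore full replication at Version B
        r1.active = True

    r1, r2 = Replica("replica-1", "A"), Replica("replica-2", "A")
    r1.active = r2.active = True
    rolling_upgrade(r1, r2, new_version="B")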

Domains of Availability

The definition of the System identified the components with which the HA design is concerned. It did not imply that all of these components are, in fact, replicated to provide High or Continuous Availability. Rather, the System must be further decomposed to identify those components which are Basic, High, or Continuous Availability, and for those that are High Availability, identify whether they are Manually Masked, Cold, Warm or Hot Standby/Active Replication. The system is thus divided into Domains of Availability. This is done for any of the following reasons:

The task of a High Availability design, then, is to:

  1. Identify the middleware technology to be used for each of CA, or MM, CS, WS, HS/AR variants of HA (BA is assumed not to require any technology support).
  2. Identify, based upon the criteria above, which components of the system should be BA, MM, CS, WS, HS/AR or CA.
  3. Organize these components into their "domains", and provide the middleware support to these domains.

Figure 3 demonstrates this logical division graphically. The challenges confronting a High Availability design are to determine the appropriate partitioning, and to ensure that the system as a whole is "knit together" in a meaningful fashion. That is, there is little value in making a database server Continuously Available if the server software which accesses it has only Basic Availability.

Figure 3 : The "System", as shown in Figure 1, divided into Availability domains. Increasingly more strenuous forms of availability are shown right to left.
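
As a concrete, if simplified, illustration of this partitioning exercise, the sketch below records a hypothetical assignment of components to Availability domains and the class of supporting technology each implies. The component names and assignments are invented and do not correspond to any particular deployment:

    # Hypothetical Availability domain assignments (component names invented).
    # BA = Basic, MM = Manual Masking, CS/WS = Cold/Warm Standby, HS/AR = Hot Standby.

    availability_domains = {
        "core_database_server":  "HS/AR",   # actively replicated
        "naming_service":        "HS/AR",
        "application_servers":   "WS",      # warm standby, lazy state sharing
        "reporting_batch_jobs":  "BA",      # simplex; outages tolerated
        "desktop_audit_client":  "BA",
        "backup_procedures":     "MM",      # manual detection and recovery
    }

    def middleware_for(domain):
        """Class of supporting technology each domain implies (illustrative)."""
        return {
            "HS/AR": "group communication / active replication middleware",
            "WS":    "lazy replication (checkpoint or log shipping)",
            "CS":    "cluster failover scripts",
            "MM":    "operational procedures only",
            "BA":    "none",
        }.get(domain, "unknown")

    for component, domain in sorted(availability_domains.items()):
        print("%-22s %-6s -> %s" % (component, domain, middleware_for(domain)))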

Active Replication

Previously, the term Active Replication was introduced as a synonym for Hot Standby (one that better reflected the symmetric relationship between replicas). Active Replication (AR) is also used to describe the set of techniques used to achieve Hot Standby, by actively and completely sharing processing state between replicas. Since "state" is a dynamic entity in a distributed system, and changes constantly, sharing state implies not only transferring information between replicas, but also coordinating and synchronizing the information. This is needed so that replicas present their state externally, and process it internally, in a consistent manner.

In general, changes in state are non-commutative: applying two state changes in the order {A,B} gives the transformation S -> S_A -> S_AB, whereas applying them in the order {B,A} gives S -> S_B -> S_BA, and in general S_AB <> S_BA. Hence, "synchronizing" a system's state really means ensuring that operations are applied in the same order to all participant replicas (i.e. that all replicas arrive at S_AB, or all at S_BA, but not a combination of both).

Active replication techniques are thus largely concerned with ordering operations. The best known Active Replication technique is the Virtual Synchrony model found in Isis, whose name is an attempt to convey this sense of enforced synchronization through ordering. The various AR techniques available (causal ordering, total ordering, etc.) are attempts to trade off more or less rigorous ordering guarantees with their respective performance overheads (in general, the more rigorous the order, the more degraded is the performance).
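
One simple way to see how ordering yields replica consistency is a sequencer-based total order: every update passes through a sequencer which stamps it with a consecutive sequence number, and each replica applies updates strictly in stamp order, buffering any that arrive early. The sketch below is illustrative only; production AR toolkits such as Isis use considerably more sophisticated, fault-tolerant ordering protocols:

    # Illustrative sequencer-based total ordering (not the Isis algorithm).
    import itertools

    class Sequencer:
        """Stamps every update with a global sequence number."""
        def __init__(self):
            self._counter = itertools.count(1)
        def stamp(self, update):
            return (next(self._counter), update)

    class Replica:
        """Applies updates strictly in sequence-number order, buffering any gaps."""
        def __init__(self):
            self.state = []
            self.next_seq = 1
            self.pending = {}
        def deliver(self, stamped):
            seq, update = stamped
            self.pending[seq] = update
            while self.next_seq in self.pending:        # apply in order, with no gaps
                self.state.append(self.pending.pop(self.next_seq))
                self.next_seq += 1

    sequencer = Sequencer()
    r1, r2 = Replica(), Replica()
    a = sequencer.stamp("A")
    b = sequencer.stamp("B")
    r1.deliver(a); r1.deliver(b)    # r1 happens to receive A then B
    r2.deliver(b); r2.deliver(a)    # r2 receives them out of order from the network
    assert r1.state == r2.state == ["A", "B"]   # both replicas end in state S_AB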

In addition to providing an ordered communications vehicle, AR systems need mechanisms to support the notions of "groups" of replicas, and have membership services which identify, in a distributed sense, which processes are members of a group at any time. Mechanisms are needed to allow members to join a group and "catch up" to the processing state of the group, and to be removed from a group either voluntarily or in response to a detected failure in a member. Hence, AR systems invariably include failure detection and heartbeat mechanisms as part of their group membership services.
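
The following sketch conveys the flavour of such a heartbeat-driven membership service. It is a deliberately simplified, single-node view (a real membership service must itself agree, in an ordered fashion, on every membership change across the group), and the timeout value is hypothetical:

    # Simplified heartbeat-based failure detection for group membership.
    # This is one node's local view only; real membership services agree on changes group-wide.
    import time

    HEARTBEAT_TIMEOUT = 5.0   # seconds of silence before a member is suspected (hypothetical)

    class MembershipView:
        def __init__(self):
            self.last_heartbeat = {}    # member name -> time of last heartbeat

        def join(self, member):
            # A real service would also transfer current group state to the newcomer ("catch up").
            self.last_heartbeat[member] = time.time()

        def heartbeat(self, member):
            self.last_heartbeat[member] = time.time()

        def leave(self, member):
            self.last_heartbeat.pop(member, None)    # voluntary departure

        def members(self):
            """Current view: members heard from recently; silent members are excluded as failed."""
            now = time.time()
            failed = [m for m, t in self.last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
            for m in failed:
                self.leave(m)                        # exclusion in response to a detected failure
            return sorted(self.last_heartbeat)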

AR is frequently associated with the notion of "Make Forward Progress" (MFP) systems, since by maintaining consistent server replicas, clients are not interrupted in the event of a failure, and are always able to advance the state of their work.

The terms Hot Standby, Active Replication, Virtual Synchrony, and Make Forward Progress have been identified as being largely synonymous, merely putting different emphasis on different aspects of the same behaviour. This observation is summarized in the following table:

HS - Hot Standby: Emphasizes the relationship of this model to its less stringent relatives in the HA family (Cold & Warm Standby). A misleading term, since "Standby" conveys an asymmetry, whereas the model is symmetric.
AR - Active Replication: Emphasizes the server view, by identifying the "replication" that must occur in server state to achieve the aim. This term is descriptive of the mechanism, not the function delivered by that mechanism.
VS - Virtual Synchrony: A term (trademarked by Isis?) which emphasizes the general computing paradigm attained when groups of processes have guaranteed ordering of messages between themselves. I.e., Virtual Synchrony stresses that ordering and replication are general techniques, with High Availability being only one (very important) usage of those techniques.
MFP - Make Forward Progress: Emphasizes the client view. This term is descriptive of the desired function of a High Availability system, not its implementation mechanisms.

Passive (Lazy) Replication

Passive, or lazy, replication (LR) describes a set of techniques used to achieve Warm Standby. Recall that Warm Standby was defined as a replicated system in which some amount of initialization and state sharing occurs between replicas. In practice, Warm Standby solutions are often used as a relaxed form of "Hot Standby" High Availability, because Active Replication can be quite costly in terms of performance and application design complexity. The amount of "relaxation" from the stringent HS/AR varies, however, so that there are many approaches which can legitimately call themselves "Warm Standby", but which have widely differing amounts of initialization and state sharing (see Figure 4).

Figure 4 : A "Spectrum" of High Availability, showing that Cold & Hot Standby solutions occupy relative extremes, whereas Warm Standby covers a broad middle ground, with many approaches that can legitimately be considered "Warm" Standby.

Examples of Warm Standby approaches include:

N.B. - the above are intended as examples only - there are many other replication approaches which could qualify as "Warm Standby".
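
To make the idea of partial, periodic state sharing concrete, the sketch below shows a minimal checkpoint-shipping arrangement between a primary and a warm standby. It is illustrative only; the checkpoint interval, and what is captured in each checkpoint, are precisely the "degree of warmth" knobs that distinguish one Warm Standby approach from another:

    # Illustrative lazy (passive) replication: the primary periodically ships a
    # checkpoint of its committed state to a warm standby. Work committed after the
    # last shipped checkpoint is lost on failover, and clients must roll back and
    # re-enter it (the RR model described below).
    import copy

    class Primary:
        def __init__(self):
            self.committed_state = {}
        def commit(self, key, value):
            self.committed_state[key] = value
        def checkpoint(self):
            return copy.deepcopy(self.committed_state)

    class WarmStandby:
        def __init__(self):
            self.state = {}
        def apply_checkpoint(self, snapshot):
            self.state = snapshot

    primary, standby = Primary(), WarmStandby()
    primary.commit("order-1", "accepted")
    standby.apply_checkpoint(primary.checkpoint())   # periodic shipment, e.g. every N minutes
    primary.commit("order-2", "accepted")            # committed after the last checkpoint:
                                                     # not on the standby if the primary fails now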

LR is frequently associated with the notion of "Rollback & Recover" (RR) systems, since even with some state maintenance, clients are interrupted in the event of a failure, and must explicitly rebind to the standby. The following table summarizes the terminology used in this section:

WS - Warm Standby: Emphasizes the "middle ground" of the HA spectrum, with a variable degree of initialization occurring on the standby. The term connotes the compromise Warm Standby represents between the "Hot" and "Cold" alternatives.
LR - Lazy Replication: Emphasizes the server view, by identifying the "replication" that must occur in server state to achieve the aim. This term is descriptive of the mechanism, not the function delivered by that mechanism.
RR - Rollback & Recover: Emphasizes the client view. This term is descriptive of the behaviour of a client in a Warm Standby system (i.e. the disruptive and visible effects of a failure), not its implementation mechanisms.

Load Balancing

Load Balancing is a mechanism used to achieve scalability, by spreading the work load across a pool of servers. Hence, it technically has nothing to do with High Availability and Replication. In practice, however, the concepts are frequently linked because the investment made in purchasing redundant systems for High Availability cannot be justified if the additional equipment is idle, or is merely duplicating the work performed in the primary servers.

Rather, a frequent customer requirement is to have a pool of replicated servers which are prepared to substitute for each other in the event of a failure (their High Availability role), but which divide the client workload under normal (non-failure) conditions (their Load Balancing role). Such a requirement, while common, can significantly complicate the design of both aspects (HA and Load Balancing), since the computing configuration cannot be optimized for either role.

In addition, the functional implications of such compromise designs must be considered. If a system is tuned to support a normal client load with a load balanced pool of N servers, and subsequently suffers a failure on one server, the load must now be spread across N-1 servers. This will clearly impact overall system performance, and hence not properly "mask" the failure (since the performance degradation will be observed). Of course, the system can be over-provisioned with more than N servers, allowing a prescribed number of failures to occur before performance is affected. However, this returns us to the initial objection: it amounts to purchasing spare capacity that sits idle. In most cases, N=2, and the Load Balancing requirement is to spread the load across both. Adding a third server for over-provisioning is not an option.
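
The capacity arithmetic behind this trade-off is simple and worth making explicit: if a pool of N load-balanced servers must still carry the full engineered load after k failures, then each server can be run at no more than (N-k)/N of its capacity in normal operation. The short sketch below, with invented figures, shows how large the "insurance premium" is for small N:

    # Spare capacity needed so that a load-balanced pool still masks k failures.
    # Figures are illustrative only.

    def max_normal_utilisation(n_servers, k_failures_to_mask):
        """Highest per-server utilisation allowed in normal (non-failure) operation."""
        return (n_servers - k_failures_to_mask) / float(n_servers)

    for n in (2, 3, 4, 8):
        u = max_normal_utilisation(n, k_failures_to_mask=1)
        print("N=%d: run each server at no more than %2.0f%% so one failure is masked" % (n, u * 100))
    # N=2: 50%, N=3: 67%, N=4: 75%, N=8: 88%.
    # With the common N=2 mated pair, half of the purchased capacity is "insurance".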

An analogy may be useful here: consider a whole life insurance policy, with an accumulated positive balance. The purpose of the policy is to safeguard the holder's family in the event of death or similar catastrophe. However, the balance in the policy can be used, at its owner's discretion, to fund normal living expenses. The danger, of course, is that by using the policy's resources during life, its benefits may not be there when they are truly needed, at death.

High Availability, and the supporting technology, systems and software design, should be thought of as an insurance policy that has a net positive balance (i.e. the spare processing capacity of the redundant servers). By "borrowing" from this balance in normal non-failure operations, the benefit of the HA insurance is reduced in the event of system failure.

This trade-off must be considered by the system planners and customers.

Groups

We earlier introduced the concepts of Active Replication of state information between processes collected in "groups", which in turn possess membership services to enable processes to join or leave groups. This notion of tightly synchronized groups of processes, which exchange well-ordered messages between them, can be applied to problem domains other than High Availability. That is, group-based computing can be considered as a generic distribution paradigm, and its usage for Active Replication and High Availability is only one (important) example of this paradigm.

A great deal of research has been done on distributed models that involve groups (see [Doug Lea], [Transis work], [Isis coordinator/cohort]). One application is in the construction of compromise Load Balancing/High Availability systems, as described in the previous section. Other, more general applications include models of cooperative computing, in which the specialized roles of "clients" and "servers" are dropped, and all distributed application components (objects) cooperate by using "divide and conquer" techniques to solve a problem. The ordered messaging of the group assists these objects in sharing their partial solutions with each other as they construct a total solution. Such a concept could potentially be applied to agents as well. Agents are distributed software entities which are able to "roam" a network, and perform work at various locations in it. Again, the integration of the results of these agents into a coherent whole could be achieved by a totally ordered communication model.

There are many issues to consider in such models, which are still the subject of active research. For example, the tight ordering imposed by group computing also enforces a determinism by removing the natural asynchrony (and hence non-determinism) in the original non-ordered distributed system. Such deterministic systems are provably weaker in their problem-solving ability than non-deterministic ones. Hence, one of the main benefits which can be obtained by moving to a distributed model, i.e. a more powerful computational model, can be lost by imposing too much order.

A further discussion of these subjects is beyond the scope of this document.

Costs of Availability

This section is primarily intended for Product Management and Technology Planning groups, and perhaps even for customers, to convey the cost/benefit trade-off of the different Availability options. I.e., Continuous Availability and Active Replication are nice in concept, but they carry a monetary, performance and complexity cost that must be understood in the product. This is needed in order to understand the core Availability requirements that must be supported in the software system, and what the costs are of supporting these requirements.

The Definitions given in the previous section attempted to give a rigorous and consistent taxonomy to terms like 'High' and 'Continuous' Availability, and to note the distinctions and relationships between the various 'temperatures' (Cold, Warm and Hot) used to modify the notion of a mated-pair configuration.

However, typically when designing a system with High Availability requirements, those requirements are expressed by customers with their own lexicon and meaning applied to the common terms. It is therefore extremely important to be able to translate from a more 'canonical' terminology, such as was offered above, to the customer's needs. Frequently, those needs are expressed simply as 'the system needs to be 24X7', as if that says it all! When pressed, this may be amended to 'No data should ever be lost, and the system should remain functional and operational at all times with no perceived loss of service. Maintenance and upgrade activities should not interfere with operational service.' A tall order, no doubt.

Without being properly informed of the total costs and consequences of getting a Rolls Royce model, it is natural to want precisely that. It is only after realizing that in addition to the higher up front sticker cost, there is an increased cost for insurance, gas mileage, spare parts and qualified servicing that a consumer is sufficiently educated to weigh the options between a deluxe and economy model more appropriately. Caveat emptor!

Just as when shopping for cars and other consumer items, investing in complex system software requires some education and critical thought. And, just as the marketing and advertising for consumer items frequently attempt to cloud the issues and present products as suitable for tasks they are not, so the High Availability marketplace is full of glossy brochures trumpeting '24X7 capability'. No wonder, then, that customers believe this is a straightforward requirement to impose on their systems.

In short, both the users of computing systems and the creators of those systems tend to be considerably more interested in specifying the functional requirements of a system: what usable capabilities does it possess. Discussions of the system's performance, scalability, reliability, etc. tend to be decidedly less glamorous. Nonetheless, these 'non-functional' requirements are no less important for the overall usefulness of the system, and imply no less (and frequently more) complexity in design than do the functional requirements.

It is ultimately the responsibility of the users of the system and its builders to work together in communicating: (i) what is actually required by the users, as opposed to what they might like to have; (ii) the technological alternatives that could be used to meet those needs; and (iii) all the costs: monetary, but also performance degradation and system complexity due to replication algorithms, group membership protocols, etc.

Service Class Matrix

Service Class Matrix - A Customer Example

The following "Service Class Matrix" encapsulates the HA requirements of one such customer. As customer requirements go, this presentation shows a reasonable amount of sophistication in the subject matter on the part of its author. It is presented as an example to identify the types of considerations that the builders of a new system need to go through.

The matrix identifies 4 'quadrants' of system behaviour: Basic Availability, High Availability, Continuous Operations and Continuous Availability. The intent of this customer on accepting a new system would be to identify the independent MTTF and MTTR values for each component (hardware, software, network connection, manual procedure, etc.) and combine them in series and parallel, as appropriate, to identify the total MTTF/MTTR and availability for the system.
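
The series and parallel combination rules in question are the standard ones: components in series are all required, so their availabilities multiply, while replicated components in parallel lose service only if every replica is down. A small worked sketch, using invented component figures, illustrates the calculation:

    # Standard series/parallel availability combination (component figures invented).

    def series(*availabilities):
        """All components are required: availabilities multiply."""
        a = 1.0
        for x in availabilities:
            a *= x
        return a

    def parallel(*availabilities):
        """Replicated components: service is lost only if every replica is down."""
        unavailability = 1.0
        for x in availabilities:
            unavailability *= (1.0 - x)
        return 1.0 - unavailability

    server, disk, network = 0.995, 0.999, 0.998     # hypothetical component availabilities

    simplex = series(server, disk, network)                         # ~0.9920
    mated_pair = series(parallel(server, server), disk, network)    # ~0.9970
    print("Simplex system availability:    %.5f" % simplex)
    print("Mated-pair system availability: %.5f" % mated_pair)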

Critique of Assumptions in SC Matrix

The goal of this particular company was aggressive: to move all of their operational business systems initially into the High Availability quadrant, and subsequently into the Continuous Availability quadrant. This was a blanket statement, with no further analysis of why CA should be desired across the board. Again, with no analysis of the costs of CA, it is easy to ask for it.

Furthermore, note that the matrix defines 'Basic Availability' as having an MTTR in the range of several hours! This essentially implies that all systems are already replicated, with some form of automatic detection and recovery strategy. That is, BA on the Service Class (SC) Matrix corresponds approximately to what we defined earlier as a Cold Standby, or perhaps Manual Masking, variant of High Availability. Unlike our earlier taxonomy, the SC Matrix makes no allowance for a simple, run of the mill, unreplicated system. Since the vast majority of business systems are simplex, this is a very curious omission.

Finally, the SC Matrix makes no allowance for Domains of Availability. The entire system is expected to have one categorization. This is exceedingly inflexible in a large distributed system. For example, it is quite unreasonable to expect a desktop client used to perform a non-critical audit function to share the same availability criteria with the core data servers.

Applying SC Matrix to System Design

Considering that the SC Matrix is the product of a fairly enlightened approach to the subject of Availability, and that customer requirements are typically far less well communicated, what then is a design community to do when searching for a High Availability technical architecture?

Here are a number of suggestions:

In the remainder of this document, a number of commercial technologies are described, and for each, their fit to both the neutral model given earlier in this document, as well as the SC Matrix, is assessed.

Technologies

This section summarizes the middleware technologies which have capabilities which may be of use in implementing a design for High Availability. Some of these products are specifically designed for High Availability. Others are tools with some other primary purpose (database, TP monitor, etc.), but which make some provision for replication and availability. The list is by no means exhaustive; rather it reflects only the technologies examined by one particular design group in the course of a specific project.

HP Technologies

HP ServiceGuard

Description

ServiceGuard is HP's primary High Availability middleware offering, available on HPUX. It allows application software to be organized into packages which are installed on server clusters (typically pairs). A package is assigned disk resources (HP Logical Volumes) and network resources (an IP address). These resources remain attached to the package, even across faults. I.e., ServiceGuard performs a heartbeat between the machines in its cluster, and in the event of a perceived failure, it can shut down a package and restart it on a healthy machine with the same network address and disk resources it had previously. It achieves the network address transparency by mapping the "floating" network addresses to the "fixed" IP addresses which remain attached to the machines.

It achieves the disk resource transparency by connecting multi-ported disks (e.g. F/W SCSI) to the participating machines in the cluster. A Distributed Lock Manager (DLM) ensures that only a single host has write access, to prevent data corruption. The activities of starting and stopping application packages, and of performing any package-specific startup and shutdown actions required, are handled by shell scripts written by the application developers.

Figure 5: HP ServiceGuard architecture
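
The sketch below captures, in schematic form, the kind of package relocation a cluster manager of this style performs when the heartbeat declares a node failed. It is not ServiceGuard code or its API; all names are invented, and in a real deployment the equivalent steps are carried out by ServiceGuard itself together with the developer-written package start and stop scripts:

    # Schematic of package failover in a ServiceGuard-style cluster.
    # All names are invented; this is not the ServiceGuard API.

    class Package:
        def __init__(self, name, floating_ip, volume_group, start_script, stop_script):
            self.name = name
            self.floating_ip = floating_ip      # address clients rebind to after failover
            self.volume_group = volume_group    # multi-ported disk resources
            self.start_script = start_script    # developer-written shell scripts
            self.stop_script = stop_script

    # Placeholders for operations performed by the cluster software and operating system.
    def run_on(node, script, best_effort=False): pass
    def release_volume_group(node, volume_group): pass
    def acquire_volume_group(node, volume_group): pass
    def assign_ip(node, ip): pass

    def failover(package, failed_node, standby_node):
        # 1. Ensure the package is down on the failed node (it may already be dead).
        run_on(failed_node, package.stop_script, best_effort=True)
        # 2. Move the disk resources; only one host may have write access (lock manager).
        release_volume_group(failed_node, package.volume_group)
        acquire_volume_group(standby_node, package.volume_group)
        # 3. Move the floating IP address onto the standby's interface.
        assign_ip(standby_node, package.floating_ip)
        # 4. Restart the application via its start script; clients rebind to the same address.
        run_on(standby_node, package.start_script)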

Domain of Applicability

ServiceGuard is HP's offering in a field which has very similar offerings from its competitors, such as IBM HACMP, and the Openvision*HA product for Sun. All of these products are targeted at systems which want a straightforward replication mechanism on off-the-shelf UNIX equipment (unlike expensive Tandem or Stratus Fault Tolerant systems), with little or no application development. ServiceGuard and its competitors are designed to slide underneath existing applications, such as databases, transaction monitors, applications servers, etc. with no code modifications. The only development required is to create the shell scripts used to manage the start-up and shut-down operations. Clients attempting to access their servers can also remain unaware of the underlying HA middleware, and hence not require code modifications, since on failures they attempt to rebind to the same network address, which migrates to the new host of their servers. This is very useful for third party software systems for which there is no luxury of tailoring them to a "smart" HA system, and which must be installed and configured on their binary forms alone. Finally, by supporting arbitrary package definitions, and allowing packages to be configured onto all nodes in the ServiceGuard cluster, a limited form of Load Balancing can be achieved.

There are several disadvantages to the technology as a consequence of ServiceGuard's convenience and simplicity:

Classification

As has been indicated, ServiceGuard is a RR (Rollback & Recover) system. It corresponds most closely to a Cold Standby, or perhaps a primitive Warm Standby, form of High Availability, since there is only limited opportunity for state sharing occurring between replicas, and even this must be managed by the application itself, through the shell scripts and common disk storage, rather than by the middleware.

The system MTTR it would offer in a full application setting would place it squarely in the "Basic Availability" category in the Service Class Matrix.

HP MirrorDisk

Description

MirrorDisk is a complementary (but independent) technology to ServiceGuard. MirrorDisk provides a device level mirroring of two physical storage devices, so that both contain identical replicas of the same logical volumes. Figure 5 identified that ServiceGuard can mask single points of failure in the processors and network connections. The disk, however, is a conspicuous single point of failure. MirrorDisk addresses this by converting the single dual ported disk into two active replicas (see Figure 6).

Figure 6 : HP MirrorDisk architecture, in combination with ServiceGuard and disk multiporting.

Domain of Applicability

MirrorDisk can be used independently of ServiceGuard, as an online disk replication solution. Its true usefulness, however, is in combination with ServiceGuard, in order to obtain a system that has eliminated all significant single points of failure.

As with ServiceGuard, since MirrorDisk relies on hardware configurations, it is a proprietary solution available only with HP equipment. Further, it can only be used for disks in close physical proximity (same system room), and hence cannot be used for a geographical diversity system.

Classification

MirrorDisk offers an online and active form of mass-storage replication. For this specific system component, then, it can be classified as an Active Replication/Hot Standby system. For example, should a disk device fail while under the control of MirrorDisk, the applications using that disk are able to remain running uninterrupted with the remaining disk mirror, hence MTTR=0 for this type of failure.

It must be stressed, however, that AR exists only for the storage device itself. File systems, database systems, etc. which reside on top of the physical device are not actively replicated, and come under the jurisdiction of ServiceGuard, if it is being used.

HP "Somersault"

Description

Somersault is the project name for an HP Labs Bristol activity to produce a group-based, Active Replication mechanism. It is strongly influenced by Isis and related industry efforts for Active Replication systems. However, where Isis presents a very general group computing mechanism (Virtual Synchrony), whose groups can be used for purposes other than High Availability, Somersault is targeted specifically at HA. This reduces its generality, but also its complexity, and presumably improves its performance. Somersault is not a product offering from HP at this time.

Domain of Applicability

Somersault could be applied to distributed systems which have Hot Standby/Active Replication requirements. It supports total state replication between application replicas, if so desired, and also provides automatic client reconnection to a standby following a failure. Hence, MTTR can be reduced to zero. Furthermore, since it is network based, it does not have the proprietary dependencies on hardware that are found in ServiceGuard, and it can support replicas separated over a WAN (geographical diversity). Being communications-based, it, like Isis, has a CORBA strategy: Somersault will support an "ORB Plus + Somersault" model. It is well suited to the telecommunications management environment, offering management abilities and resource definitions that map well to the operational environment of telecommunications. Somersault is not applicable to a general cooperative group-computing paradigm. It is also not well suited to mass active replication of secondary storage (disks) (i.e. it has no equivalent to IDB, Isis for Databases).

Classification

Somersault is a middleware technology that supports HS/AR applications.

Applications built with this technology should be able to be positioned in the "High Availability" quadrant of the Service Class Matrix.

Isis Technologies

Isis, a subsidiary of Stratus Computers, offers a family of products, all devoted to supporting applications frameworks which have a requirement for HS/AR. The base product, the Isis Software Development Kit (SDK), is common to all the other product offerings (see Figure 7). The SDK itself, however, offers an extremely complex and proprietary set of APIs. Hence, it is generally not considered a viable middleware option on its own. Rather, some of the vertical Isis offerings shown in Figure 7 are typically used to ease application development.

Figure 7 : Isis Product Line

Orbix + Isis

Description

Orbix + Isis (O+I) offers a CORBA-compliant distribution model for Actively Replicated process groups. Applications can be developed using the traditional Iona Orbix C++ bindings, and then converted to being Orbix + Isis servers. O+I has group membership services, including heartbeat detection, join and exclusion primitives. It supports total message ordering, as well as less stringent (but more efficient) causal ordering, leaving the choice up to the developer. It supports a general cooperative group computing paradigm, and while it is primarily intended for High Availability, it can be used by any applications that need a totally ordered ("Virtually synchronous") computing model. Being network based, it has no proprietary hardware dependencies. It is supported on multiple common platforms (HPUX, Solaris, IBM AIX, Windows NT), and groups can be composed of server processes running on heterogeneous combinations of these platforms. Isis groups can be distributed over Wide Area Networks, allowing for geographic diversity.

In addition to its standard Active Replication model, Orbix + Isis also supports a Coordinator/Cohort model. While this model can still fully mask failures from clients as in the Active Replica model, it does so in a "lazy" replication style, which can be considerably more efficient, depending upon the size of state to be transferred relative to the cost of computing that state from the client inputs. Coordinator/Cohort can also be a useful load balancing tool, since a group can nominate different members to be coordinators for different client operations simultaneously.
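
The coordinator/cohort idea can be pictured as follows: for each request, one nominated member (the coordinator) actually performs the computation, and only the resulting state change is propagated to the cohorts, which are prepared to take over should the coordinator fail. The sketch below is a schematic of that idea as described above, not of the Orbix + Isis API:

    # Schematic of the coordinator/cohort style of replication (not the Orbix + Isis API).

    class Member:
        def __init__(self, name):
            self.name = name
            self.state = {}
        def apply_delta(self, delta):
            self.state.update(delta)    # cohorts apply the result, not the computation

    def handle_request(group, request, compute):
        # Different members can be nominated coordinator for different requests,
        # which also provides a simple form of load balancing.
        coordinator = group[hash(request) % len(group)]
        delta = compute(coordinator.state, request)    # only the coordinator does the work
        for member in group:
            member.apply_delta(delta)                  # every member ends in the same state
        return delta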

Finally, Orbix + Isis has an Event Stream model which decouples the client and server in an interaction and funnels their operations through an Event Queue. This Event Queue can optionally store messages persistently, and is itself implemented as an Actively Replicated (hence Highly Available) Isis group. This yields a CORBA based publish/subscribe mechanism, with fully ordered, guaranteed message delivery to intended recipients, whether they are executing at the time of transmission or not.

Domain of Applicability

Orbix + Isis is specifically intended for CORBA-based applications which are widely distributed and mission critical. In other words, applications that, due to the sheer number of components involved, will have small system MTTF values (something is failing almost all the time throughout the distributed environment), yet where the system as a whole is expected to have very high availability. Hence, all important system components need to be actively replicated to mask failures from the users of the system. Unfortunately, there are several known difficulties with the technology:

Classification

Orbix + Isis is a middleware technology that supports HS/AR applications. Applications built with this technology should be able to be positioned in the "High Availability" quadrant of the Service Class Matrix.

Isis for Database (IDB)

Description

Isis for Database (IDB) extends the Isis Active Replication paradigm by using the Isis SDK to mirror large pools of secondary storage. In this sense, its goal is similar to that of MirrorDisk. However, since it does not rely on hardware configurations and is constructed from the networked software architecture of the Isis SDK, it is portable across hardware platforms (HPUX, AIX, Solaris, etc.) and it can be extended over Wide Area Networks to provide disk replicas at geographically large distances from each other (although for this to be practical, LAN-type bandwidths would be required, e.g. 10 Mbps WAN. This would probably require a dedicated SONET ring between the geographically remote locations).

IDB supports the on-line (active) streaming of data to the mirrored databases. It does this by taking control of the "commit" verb of application transactions. This means that IDB must be specifically tailored for every target database. Versions of IDB have been provided for Oracle and Sybase at this time. A database failure, or the failure of the underlying storage mechanism, is masked completely, meaning that applications using IDB can continue transactions in progress, unimpeded, against the remaining replica(s). When the failed volume is returned to service, state (the delta of all missed transactions) is spooled from an active member of the Isis group to the newcomer in the background, until it is resynchronized. Application transactions continue unimpeded while this Data Synchronization occurs.
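
The two behaviours just described, commit interception and background resynchronization of a returning replica, are sketched schematically below. This is purely an illustration of the mechanism as described; it is not IDB's actual interface or algorithm:

    # Schematic only: commit interception and background resynchronization of the
    # kind described above. This is not IDB's actual interface or algorithm.

    class DatabaseReplica:
        def __init__(self, name):
            self.name, self.rows = name, []
        def apply(self, transaction):
            self.rows.append(transaction)

    class ReplicationLayer:
        def __init__(self, replicas):
            self.active = set(replicas)     # replicas currently in the group
            self.missed = {}                # failed replica -> transactions it has missed

        def commit(self, transaction):
            # The replication layer takes control of the commit verb and applies the
            # transaction to every active replica before acknowledging the client.
            for replica in self.active:
                replica.apply(transaction)
            for delta in self.missed.values():
                delta.append(transaction)   # remember the delta for failed replicas

        def fail(self, replica):
            self.active.discard(replica)
            self.missed[replica] = []       # start accumulating its missed transactions

        def resynchronize(self, replica):
            # Background spooling of the delta of missed transactions to the newcomer.
            for transaction in self.missed.pop(replica):
                replica.apply(transaction)
            self.active.add(replica)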

Domain of Applicability

IDB is useful in systems that require Active Replication of large, persistent databases, especially when this requirement is to be met in a portable (multiple database types, multiple hardware platforms) setting, and potentially where the replication can span geographically diverse sites.

Shortcomings with IDB include:

IDB/O+I Integration. Since IDB is a vertical Isis application directly employing the Isis SDK, it is not integrated with O+I. In fact, its APIs are purely procedural, and support no object notions at all.

Database Versions. A version of IDB would be required for every database type that is to be replicated, since IDB actively takes control of the transactional sequence. (In this sense, IDB acts very much like a transaction monitor).

Third Party Data Access Support. IDB assumes that applications accessing the underlying replicated databases through it make direct embedded SQL or CLI (Call Level Interface) invocations, so that it can take control of the commit verbs. In cases where a third party tool is used to generate the low level data access (e.g. Persistence, Odapter, etc.), the tool would have to be modified by its vendor to support IDB's requirements that it give up control over the commit.

Classification

IDB is a middleware technology that supports HS/AR for persistent storage components (i.e. databases and their underlying disk storage). Applications built with this technology should be able to be positioned in the "High Availability" quadrant of the Service Class Matrix.

Other Technologies

Oracle Parallel Server (OPS)

Description

Oracle Parallel Server (OPS) is a technology created by Oracle in partnership with each of the major hardware platform vendors on which Oracle databases run (e.g. Pyramid, IBM, HP, Sun, etc.). OPS takes advantage of the Distributed Lock Manager and multiported disk configurations discussed previously in the context of ServiceGuard, and of similar hardware and OS extensions on the non-HP platforms. OPS uses these tools to create an Actively Replicated shared state between two Oracle database managers running on clustered machines. Hence, in the event of a failure, applications which transfer to the standby instance of Oracle are able to connect almost immediately. While they are still explicitly disconnected from their original database, and any uncommitted transactions are lost and must be re-entered, committed transactions are visible almost immediately.

Hence, OPS in combination with ServiceGuard and MirrorDisk can yield a much faster database recovery, and consequently a much lower overall system MTTR.

Domain of Applicability

OPS is applicable in environments that are using Oracle as the primary database server underlying their applications, and which need an improved recovery time over the pure Cold Standby model of ServiceGuard. The model is still RR (Rollback and Recover), but the recovery time is dramatically improved, and there is presumably less transactional work to be rolled back (and hence fewer dissatisfied customers).

The solution is very hardware specific, and takes advantage of DLM and multiported disk technology. It is not extensible to geographic diversity requirements.

Classification

OPS is positioned at the high end of the Warm Standby scale, approaching Hot Standby. It has nearly complete state replication between the Oracle replicas, but it still requires applications to follow a Rollback & Recover path on failure.

Applications built using OPS would lie near the boundary between "Basic Availability" and "High Availability" quadrants in the Service Class Matrix. The determination of whether an overall system was BA or HA would depend upon other application recovery criteria.

Oracle Replication

Description

Oracle, like most database vendors today, offers a "lazy" replication model in which transactions are logged to a file, as well as being applied directly to the database. Periodically (e.g. hourly, nightly, etc.) these transaction logs are spooled to a standby server for application to a redundant database. This technique is often used for functional replication (e.g. in order to have multiple data sources physically close to their points of consumption), as well as for failure masking.

As a High Availability strategy, it offers relative simplicity, no hardware dependence, and geographical diversity (since the replication is network-based, and not dependent on multiported disks). There is a window (sized according to the frequency of the spooling of the transaction logs) in which transactions committed on the primary will not be seen on the standby, and will have to be "rolled back" in the client's (and user's) context. Furthermore, when the original primary is reintroduced back into the system, those "missing" transactions can suddenly reappear. Normalizing the divergent database views can be difficult, and invariably requires DBA (human) intervention. Hence, relying on lazy database replication is not a satisfactory "fully-automated" version of HA.

Domain of Applicability

"Lazy" database replication is suitable for systems that require a relaxed form of High Availability for their persistent storage, but which can tolerate substantial MTTRs and can afford to lose the portion of their committed transactions that had not been propagated to the standby system prior to failure. While this might be thought not to apply to most mission critical business systems, because of performance, cost and technology reasons, "Lazy" replication is actually quite common in such systems today.

Classification

"Lazy" database replication is a Warm Standby approach for obtaining High Availability persistent storage replicas. The degree of "Warmth" in the Warm Standby approach is a function of the frequency at which the transaction logs are spooled, since they reflect the state sharing which is occurring. That is, replication hourly is "warmer" than replication nightly.

A system built using Lazy database replication would be located in the Basic quadrant of the Service Class Matrix.

Transarc Encina

Description

Encina is a DCE based Transaction Processing monitor. TP Monitors are used to coordinate and synchronize ACID transactions enqueued by multiple clients against multiple resource managers (databases) in a distributed environment. Encina is not intended to be "High Availability" middleware; it is transactional middleware. However, the reliability concepts behind ACID (Atomicity, Consistency, Isolation, Durability) transactions and the reliability implied by HA are closely related.

Encina (or any TP monitor) can be used to ensure that application transactions, once entered into the system, and once committed, will reliably and durably be conveyed to a replicated set of databases, via two phase commits (2PC). However, 2PCs are computationally expensive operations, and for performance reasons, this is rarely done. In practice, Encina uses transaction logging techniques similar to those discussed in the previous section, and implements a kind of Lazy Replication. The advantage is that the log is not stored with the primary database, and so all committed transactions are seen at the replica. I.e., using the TP monitor avoids the "window" of temporarily lost transactions discussed above.
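
For reference, the sketch below shows the shape of the two-phase commit exchange that makes this approach expensive: the coordinator cannot commit until every resource manager has voted, and in a real implementation each participant must also force log records to stable storage at each step. It is a schematic of 2PC in general, not of Encina's implementation:

    # Schematic of two-phase commit across replicated resource managers.
    # Not Encina's implementation; log forcing and failure recovery are elided.

    class ResourceManager:
        def __init__(self, name):
            self.name = name
            self.data = {}
            self.prepared = {}

        def prepare(self, txid, writes):
            # Phase 1: validate the transaction, durably record the intent to commit, and vote.
            self.prepared[txid] = writes
            return "VOTE_COMMIT"

        def commit(self, txid):
            # Phase 2: make the prepared writes visible.
            self.data.update(self.prepared.pop(txid))

        def abort(self, txid):
            self.prepared.pop(txid, None)

    def two_phase_commit(txid, writes, resource_managers):
        votes = [rm.prepare(txid, writes) for rm in resource_managers]     # phase 1
        if all(v == "VOTE_COMMIT" for v in votes):
            for rm in resource_managers:                                   # phase 2
                rm.commit(txid)
            return "COMMITTED"
        for rm in resource_managers:
            rm.abort(txid)
        return "ABORTED"

    replicas = [ResourceManager("db_primary"), ResourceManager("db_replica")]
    print(two_phase_commit("tx-1", {"account-42": 100}, replicas))    # COMMITTED on both replicas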

Domain of Applicability

Using a TP monitor to enforce Active Replication (using 2PCs) or Lazy Replication (using transaction logs) is a reasonable approach only if the application domain is using a TP monitor already for its functional properties (i.e. distributing ACID transactions across multiple resource managers). TP Monitors (Encina, Tuxedo, CICS) are "heavyweight" pieces of middleware - they are very expensive, complex, and have large learning curves. They are rarely, if ever, introduced solely to provide HA.

Classification

If Encina (or similar TP Monitor) is used to replicate all transactions using 2PCs to identical database replicas, it can achieve a Hot Standby/Active Replication environment for its applications. (This is very similar to what IDB does, only IDB's "rendezvous" algorithm has much better performance than 2PCs, making it a practical alternative for large database replication). In this usage, a system could be positioned in the High Availability quadrant of the Service Class Matrix.

If the TP monitor is used to perform Lazy Replication, it presents a form of Warm Standby, with a degree of state sharing across replicas. In this configuration, a system would be positioned in the Basic Availability quadrant of the Service Class Matrix.

Objectstore Replication

Description

ODI (Object Design Inc.) Objectstore is an Object Oriented Database (OODB). Objectstore has a passive, lazy replication mechanism similar to the one discussed for Oracle. ODI has had discussions with Isis, and the two companies have concluded that it is possible to create an IDB version that would control Objectstore transactions. Hence, it would be feasible to have Actively Replicated Objectstore images using IDB. The companies are waiting for customer funding before proceeding with such an integration effort.

The other leading OODBs (such as Versant and Objectivity) have similar replication mechanisms.

Domain of Applicability

See the Oracle Replication Domain of Applicability above.

Classification

See the Oracle Replication Classification above.

References