Optimizing Failover Time in a Serviceguard Environment, June 2007
Executive summary
One of the most important measures of an effective high-availability/mission-critical environment
is how much delay the end user notices in the event of a failure. In these environments, several
steps must take place: detecting the failure, finding a way to restart the work, ensuring data
integrity, and restarting the applications so they are available to users again.
Different business needs require different environments. Environments vary widely in their tolerance
for unplanned downtime, their hardware configuration, specialized software, and system and data
management. These factors require careful consideration when configuring a high-availability
environment. Thorough testing in a production or near-production environment should be done to
make sure that the configured cluster meets the requirements. Testing and fine-tuning can help
optimize failover time and increase application availability to end users.
This paper explains the HP Serviceguard failover process, discusses how you can optimize your
cluster failover time, and then describes Serviceguard Extension for Faster Failover, a separately
purchased Serviceguard auxiliary product that reduces the time taken by the Serviceguard component
of the failover process.
The HP Serviceguard failover process
The process when failover is caused by a node failure
Serviceguard nodes monitor each other to be sure they can all communicate and cooperate. Every
node in a Serviceguard cluster sends heartbeat messages over the network and listens for heartbeat
messages from other nodes. Heartbeat messages are sent at a regular interval, defined in the cluster
configuration file by the HEARTBEAT_INTERVAL parameter.
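For reference, the heartbeat timing is one of the parameters set in the cluster ASCII configuration file. The following is a minimal sketch of the relevant lines for a hypothetical two-node cluster; the node names and addresses are placeholders, the timing values are shown in microseconds, and the exact defaults and limits depend on your Serviceguard release, so check the release documentation before changing them:

    # Excerpt from a cluster configuration file (for example, /etc/cmcluster/cluster.ascii)
    NODE_NAME               node1
      NETWORK_INTERFACE     lan0
        HEARTBEAT_IP        192.168.1.1

    NODE_NAME               node2
      NETWORK_INTERFACE     lan0
        HEARTBEAT_IP        192.168.1.2

    # Heartbeat timing (microseconds): send a heartbeat every 1 second and
    # treat a node as unreachable after 2 seconds without one
    HEARTBEAT_INTERVAL      1000000
    NODE_TIMEOUT            2000000

Edits to these values are typically validated with cmcheckconf and distributed with cmapplyconf. Lowering NODE_TIMEOUT shortens node failure detection, but setting it too low can trigger unnecessary cluster re-formations on a busy network, so any change should be tested under load.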
If a node does not receive a heartbeat from another node, it begins the process of re-forming the
cluster and removing the unreachable node from cluster membership. Figure 1 shows the steps in a
failover caused by a failed node.
Figure 1. Steps in a failover caused by a failed node—standard Serviceguard implementation
[Figure: timeline of the failover steps in order: node failure detection; cluster re-formation (election, lock acquisition, quiescence); cluster component recovery; resource recovery (VG, FS, IP); application recovery. The steps through cluster component recovery make up the Serviceguard component of failover time; resource recovery and application recovery make up the application-dependent failover time. Note: diagram is not to scale.]
When failover is caused by a node failure (not a package failure), the Serviceguard component of the
total failover time is composed of node failure detection, election, lock acquisition, quiescence, and
cluster component recovery.
• Node failure detection—The surviving nodes notice that a cluster node is no longer in communication
with the rest of the cluster (no heartbeat has been received within the timeout). Serviceguard begins to re-form the cluster.
• Election—The cluster nodes decide which nodes will be in the re-formed cluster.
• Lock acquisition—If more than one group of nodes wants to re-form the cluster and no group has
a clear majority of members, the first group to acquire the cluster lock re-forms the cluster (see the configuration sketch after this list).
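To make the cluster lock tie-breaker concrete, the following is a minimal sketch of how a cluster lock disk is commonly declared in the cluster configuration file. The volume group name and device files are hypothetical, and the exact syntax varies between Serviceguard releases, so treat this as illustrative rather than definitive:

    # Cluster-level lock volume group, used as a tie-breaker when no
    # surviving group of nodes has a clear majority
    FIRST_CLUSTER_LOCK_VG   /dev/vglock

    NODE_NAME               node1
      # (network interface and heartbeat definitions omitted)
      # Physical volume through which this node races for the cluster lock
      FIRST_CLUSTER_LOCK_PV /dev/dsk/c4t0d0

    NODE_NAME               node2
      FIRST_CLUSTER_LOCK_PV /dev/dsk/c5t0d0

    # The lock volume group is also listed as a cluster-aware volume group
    VOLUME_GROUP            /dev/vglock

Whichever tie-breaking mechanism is configured, it is consulted only when the surviving nodes cannot form a clear majority on their own.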