Optimizing Failover Time in a Serviceguard Environment, June 2007

Executive summary
One of the most important measures of an effective high-availability/mission-critical environment
is how much delay the end user notices in the event of a failure. In these environments, several steps
must take place: detecting the failure, finding a way to restart the work, ensuring data integrity, and
restarting applications so they are available to users again.
Different business needs require different environments. Environments vary widely in their tolerance
for unplanned downtime, their hardware configurations, their specialized software, and their system
and data management practices. These factors require careful consideration when configuring a
high-availability environment. Thorough testing in a production or near-production environment
should be performed to verify that the configured cluster meets its requirements. Testing and
fine-tuning can help optimize failover time and increase application availability to end users.
This paper explains the HP Serviceguard failover process. It then discusses how you can optimize
your cluster failover time, and it introduces Serviceguard Extension for Faster Failover, a Serviceguard
auxiliary product you can purchase that reduces the time for the Serviceguard component of the
failover process.
The HP Serviceguard failover process
The process when failover is caused by a node failure
Serviceguard nodes monitor each other to be sure they can all communicate and cooperate. Every
node in a Serviceguard cluster sends heartbeat messages over the network and listens for heartbeat
messages from other nodes. Heartbeat messages are sent at regular intervals, defined in the cluster
configuration file as the HEARTBEAT_INTERVAL.
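As one illustration, the heartbeat timing is set with cluster parameters in the cluster configuration
ASCII file. The excerpt below is a minimal sketch; HEARTBEAT_INTERVAL and NODE_TIMEOUT are standard
Serviceguard cluster parameters, but the cluster name and the values shown (commonly expressed in
microseconds in the ASCII file) are purely illustrative:

    # Excerpt from a cluster configuration ASCII file -- illustrative values only
    CLUSTER_NAME            cluster1
    HEARTBEAT_INTERVAL      1000000    # send a heartbeat every 1,000,000 microseconds (1 second)
    NODE_TIMEOUT            2000000    # begin cluster re-formation after 2 seconds without a heartbeat

Shorter values detect failures sooner but make the cluster more sensitive to transient network delays
and system load.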
If a node does not receive a heartbeat from another node, it begins the process of re-forming the
cluster and removing the unreachable node from cluster membership. Figure 1 shows the steps in a
failover caused by a failed node.
Figure 1. Steps in a failover caused by a failed node—standard Serviceguard implementation
[Figure: a timeline showing node failure detection, followed by cluster re-formation (election, lock
acquisition, quiescence) and cluster component recovery, which together make up the Serviceguard
component of failover time; then resource recovery (VG, FS, IP) and application recovery, which make
up the application-dependent failover time. Note: diagram is not to scale.]
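To make Figure 1 concrete, consider a purely hypothetical timeline; the durations below are
illustrative only and are not measurements of any particular configuration:

    Node failure detection            ~2 s
    Election                          ~1 s
    Lock acquisition                  ~1 s
    Quiescence                        ~1 s
    Cluster component recovery        ~1 s
    Serviceguard component, subtotal  ~6 s
    Resource recovery (VG, FS, IP)   ~15 s
    Application recovery             ~60 s
    Total failover time              ~81 s

In an example like this, the application-dependent portion dominates the delay the end user
experiences; the actual proportions vary widely from one environment to another.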
When failover is caused by a node failure (not a package failure), the Serviceguard component of the
total failover time is composed of node failure detection, election, lock acquisition, quiescence, and
cluster component recovery.
Node failure detection—The system notices that a cluster node is not in communication with the
other cluster nodes. Serviceguard begins to re-form the cluster.
Election—The cluster nodes decide which nodes will be in the re-formed cluster.
Lock acquisition—If more than one group of nodes wants to re-form the cluster and no group has
a clear majority of members, the first group to reach the cluster lock re-forms the cluster.