Managing HP Serviceguard A.11.20.
Legal Notices The information in this document is subject to change without notice. Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be held liable for errors contained herein or for direct, indirect, special, incidental, or consequential damages in connection with the furnishing, performance, or use of this material.
Contents
Printing History    15
Preface    17
1 Serviceguard for Linux at a Glance    19
1.1 What is Serviceguard for Linux?    19
1.1.1 Failover
3.1.2.3 WBEM Query    35
3.1.2.4 WBEM Indications    36
3.2 How the Cluster Manager Works    36
3.2.1 Configuration of the Cluster    36
3.5.11 VLAN Configurations    67
3.5.11.1 What is VLAN?    67
3.5.11.2 Support for Linux VLAN    67
3.5.11.3 Configuration Restrictions    68
4.7.3.3.1 Rules and Restrictions for Mixed Mode    85
4.7.4 Cluster Configuration Parameters    86
4.7.5 Cluster Configuration: Next Step    99
4.8 Package Configuration Planning    100
5 Building an HA Cluster Configuration    129
5.1 Preparing Your Systems    129
5.1.1 Installing and Updating Serviceguard    129
5.1.2 Understanding the Location of Serviceguard Files    129
5.1.3 Enabling Serviceguard Command Access
5.2.9 Verifying the Cluster Configuration    157
5.2.10 Cluster Lock Configuration Messages    157
5.2.11 Distributing the Binary Configuration File    158
5.3 Managing the Running Cluster    158
6.1.4.35 vgchange_cmd    180
6.1.4.36 vg    180
6.1.4.37 File system parameters    180
6.1.4.38 concurrent_fsck_operations    181
7.2.2 Adding Previously Configured Nodes to a Running Cluster    204
7.2.3 Removing Nodes from Participation in a Running Cluster    204
7.2.3.1 Using Serviceguard Commands to Remove a Node from Participation in a Running Cluster    204
7.2.4 Halting the Entire Cluster
7.7.1.2.2 Editing the Package Configuration File    226
7.7.2 Creating the Package Control Script    227
7.7.2.1 Customizing the Package Control Script    228
7.7.2.2 Adding Customer Defined Functions to the Package Control Script    228
8.8.2 Halting a Detached Package    249
8.8.3 Cluster Re-formations Caused by Temporary Conditions    250
8.8.4 Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low    250
8.8.5 System Administration Errors    251
8.8.5.1 Package Control Script Hangs or Failures
A.6.3 Documenting Maintenance Operations    267
B Integrating HA Applications with Serviceguard    269
B.1 Checklist for Integrating HA Applications    269
B.1.1 Defining Baseline Application Behavior on a Single System    270
B.1.2 Integrating HA Applications in Multiple Systems
H HP Serviceguard Toolkit for Linux    295
Index
Printing History

Table 1 Printing History

Printing Date    Part Number    Edition
November 2001    B9903-90005    First
November 2002    B9903-90012    First
December 2002    B9903-90012    Second
November 2003    B9903-90033    Third
February 2005    B9903-90043    Fourth
June 2005        B9903-90046    Fifth
August 2006      B9903-90050    Sixth
July 2007        B9903-90054    Seventh
March 2008       B9903-90060    Eighth
April 2009       B9903-90068    Ninth
July 2009        B9903-90073    Tenth
June 2012        701460-001     NA
December 2012    701460-002     NA

The last printing date and part number indicate the current edition.
Preface This guide describes how to configure and manage Serviceguard for Linux on HP ProLiant servers under the Linux operating system. It is intended for experienced Linux system administrators. (For Linux system administration tasks that are not specific to Serviceguard, use the system administration documentation and manpages for your distribution of Linux.) The contents are as follows: • Chapter 1 (page 19) describes a Serviceguard cluster and provides a roadmap for using this guide.
Information about supported configurations is in the HP Serviceguard for Linux Configuration Guide. For updated information on supported hardware and Linux distributions refer to the HP Serviceguard for Linux Certification Matrix. Both documents are available at: http://www.hp.com/info/sglx Problem Reporting If you have any problems with the software or documentation, please contact your local Hewlett-Packard Sales Office or Customer Service Center.
1 Serviceguard for Linux at a Glance This chapter introduces Serviceguard for Linux and shows where to find different kinds of information in this book. It includes the following topics: • What is Serviceguard for Linux? (page 19) • Using Serviceguard for Configuring in an Extended Distance Cluster Environment (page 21) • Using Serviceguard Manager (page 22) • Configuration Roadmap (page 22) If you are ready to start setting up Serviceguard clusters, skip ahead to Chapter 4 (page 75).
Figure 1 Typical Cluster Configuration In the figure, node 1 (one of two SPUs) is running package A, and node 2 is running package B. Each package has a separate group of disks associated with it, containing data needed by the package's applications, and a copy of the data. Note that both nodes are physically connected to disk arrays. However, only one node at a time may access the data for a given group of disks.
Figure 2 Typical Cluster After Failover After this transfer, the package typically remains on the adoptive node as long as the adoptive node continues running. If you wish, however, you can configure the package to return to its primary node as soon as the primary node comes back online. Alternatively, you may manually transfer control of the package back to the primary node at the appropriate time. Figure 2 (page 21) does not show the power connections to the cluster, but these are important as well.
1.3 Using Serviceguard Manager NOTE: For more information, see Appendix E (page 283), and the section on Serviceguard Manager in the latest version of the Serviceguard Release Notes. For more information about Serviceguard Manager compatibility, see Serviceguard/Serviceguard Manager Plug-in Compatibility and Feature Matrix and the latest Release Notes at http://www.hp.com/go/hpux-serviceguard-docs (Select HP Serviceguard). Serviceguard Manager is the graphical user interface for Serviceguard.
Figure 3 Tasks in Configuring a Serviceguard Cluster HP recommends that you gather all the data that is needed for configuration before you start. See Chapter 4 (page 75) for tips on gathering data.
2 Understanding Hardware Configurations for Serviceguard for Linux This chapter gives a broad overview of how the server hardware components operate with Serviceguard for Linux. The following topics are presented: • Redundant Cluster Components • Redundant Network Components (page 25) • Redundant Disk Storage (page 29) • Redundant Power Supplies (page 30) Refer to the next chapter for information about Serviceguard software components.
2.2.1 Rules and Restrictions • A single subnet cannot be configured on different network interfaces (NICs) on the same node. • In the case of subnets that can be used for communication between cluster nodes, the same network interface must not be used to route more than one subnet configured on the same node. • For IPv4 subnets, Serviceguard does not support different subnets on the same LAN interface. ◦ For IPv6, Serviceguard supports up to two subnets per LAN interface (site-local and global).
Figure 4 Redundant LANs In Linux configurations, the use of symmetrical LAN configurations is strongly recommended, with the use of redundant hubs or switches to connect Ethernet segments. The software bonding configuration should be identical on each node, with the active interfaces connected to the same hub or switch. 2.2.3 Cross-Subnet Configurations As of Serviceguard A.11.
• You should not use the wildcard (*) for node_name in the package configuration file, as this could allow the package to fail over across subnets when a node on the same subnet is eligible; failing over across subnets can take longer than failing over on the same subnet. List the nodes in order of preference instead of using the wildcard. • You should configure IP monitoring for each subnet; see “Monitoring LAN Interfaces and Detecting Failure: IP Level” (page 63). 2.2.3.
IMPORTANT: Although cross-subnet topology can be implemented on a single site, it is most commonly used by extended-distance clusters and Metrocluster. For more information about such clusters, see the following documents at http://www.hp.
Figure 5 Mirrored Disks Connected for High Availability 2.4 Redundant Power Supplies You can extend the availability of your hardware by providing battery backup to your nodes and disks. HP-supported uninterruptible power supplies (UPS) can provide this protection from momentary power loss. Disks should be attached to power circuits in such a way that disk array copies are attached to different power sources. The boot disk should be powered from the same circuit as its corresponding node.
3 Understanding Serviceguard Software Components This chapter gives a broad overview of how the Serviceguard software components work.
• cmlogd—cluster system log daemon • cmdisklockd—cluster lock LUN daemon • cmresourced—Serviceguard Generic Resource Assistant Daemon • cmserviced—Service Assistant daemon • qs—Quorum Server daemon • cmlockd—utility daemon • cmsnmpd—cluster SNMP subagent (optionally running) • cmwbemd—WBEM daemon • cmproxyd—proxy daemon Each of these daemons logs to the Linux system logging files. The quorum server daemon logs to a user-specified log file, such as /usr/local/qs/log/qs.
3.1.1.3 Log Daemon: cmlogd cmlogd is used by cmcld to write messages to the system log file. Any message written to the system log by cmcld is written through cmlogd. This prevents delays in writing to syslog from impacting the timing of cmcld. The path for this daemon is $SGLBIN/cmlogd. 3.1.1.4 Network Manager Daemon: cmnetd This daemon monitors the health of cluster networks. It also handles the addition and deletion of relocatable package IPs, for both IPv4 and IPv6 addresses.
Serviceguard Quorum Server release notes at http://www.hp.com/go/hpux-serviceguard-docs (Select HP Serviceguard Quorum Server Software). See also “Use of the Quorum Server as a Cluster Lock” (page 39). The path for this daemon is: • For SUSE: /opt/qs/bin/qs • For Red Hat: /usr/local/qs/bin/qs 3.1.1.9 Utility Daemon: cmlockd Runs on every node on which cmcld is running. It maintains the active and pending cluster resource locks.
For more information, see the following: Common Information Model (CIM) Web-Based Enterprise Management (WBEM) 3.1.2.2 Support for Serviceguard WBEM Provider The Serviceguard WBEM provider allows you to obtain basic Serviceguard cluster information via the Common Information Model Object Manager (CIMOM) technology.
• HP_SGQuorumServer • HP_SGLockLun • HP_SGLockDisk For more information about WBEM provider classes, see Managed Object Format (MOF) files for properties. When SGProviders is installed, the MOF files are copied to the /opt/sgproviders/ mof/ directory on SUSE Linux Enterprise Server and /usr/local/sgproviders/mof/ directory on Red Hat Enterprise Linux server.
If heartbeat and data are sent over the same LAN subnet, data congestion may cause Serviceguard to miss heartbeats and initiate a cluster re-formation that would not otherwise have been needed. For this reason, HP recommends that you dedicate a LAN for the heartbeat as well as configuring heartbeat over the data network.
• A node halts because of a package failure. • A node halts because of a service failure. • Heavy network traffic prohibited the heartbeat signal from being received by the cluster. • The heartbeat network failed, and another network is not configured to carry heartbeat. Typically, re-formation results in a cluster with a different composition. The new cluster may contain fewer or more nodes than in the previous incarnation of the cluster.
Figure 7 Lock LUN Operation Serviceguard periodically checks the health of the lock LUN and writes messages to the syslog file if the disk fails the health check. This file should be monitored for early detection of lock disk problems. 3.2.9 Use of the Quorum Server as a Cluster Lock The cluster lock in Linux can also be implemented by means of a quorum server. A quorum server can be used in clusters of any size.
Figure 9 Quorum Server to Cluster Distribution IMPORTANT: For more information about the quorum server, see the latest version of the HP Serviceguard Quorum Server release notes at http://www.hp.com/go/hpux-serviceguard-docs (Select HP Serviceguard Quorum Server Software). 3.2.10 No Cluster Lock Normally, you should not configure a cluster of three or fewer nodes without a cluster lock. In two-node clusters, a cluster lock is required.
IMPORTANT: During Step 1, while the nodes are using a strict majority quorum, node failures can cause the cluster to go down unexpectedly if the cluster has been using a quorum device before the configuration change. For example, suppose you change the quorum server polling interval while a two-node cluster is running. If a node fails during Step 1, the cluster will lose quorum and go down, because a strict majority of prior cluster members (two out of two in this case) is required.
Figure 10 Package Moving During Failover 3.3.1.2.1 Configuring Failover Packages You configure each package separately. You create a failover package by generating and editing a package configuration file template, then adding the package to the cluster configuration database; details are in Chapter 6: “Configuring Packages and Their Services ” (page 163). For legacy packages (packages created by the method used on versions of Serviceguard earlier than A.11.
package can be temporarily set with the cmmodpkg command; at reboot, the configured value will be restored. The auto_run parameter is set in the package configuration file. A package switch normally involves moving failover packages and their associated IP addresses to a new system. The new system must already have the same subnet configured and working properly, otherwise the packages will not be started.
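For example, to toggle switching for a package named pkg1 (an illustrative name) from the command line:

cmmodpkg -d pkg1          # temporarily disable global switching for pkg1
cmmodpkg -e pkg1          # re-enable global switching for pkg1
cmmodpkg -e -n node2 pkg1 # allow pkg1 to run on node2

Settings made this way last only until the node reboots, at which point the value in the package configuration file is restored.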
In Figure 12, node1 has failed and pkg1 has been transferred to node2. pkg1's IP address was transferred to node2 along with the package. pkg1 continues to be available and is now running on node2. Also note that node2 now has access both to pkg1's disk and pkg2's disk. NOTE: For design and configuration information about clusters that span subnets, see the documents listed under “Cross-Subnet Configurations” (page 27). Figure 12 After Package Switching
Table 2 Package Configuration Data

Package Name    NODE_NAME List                  FAILOVER_POLICY
pkgA            node1, node2, node3, node4      min_package_node
pkgB            node2, node3, node4, node1      min_package_node
pkgC            node3, node4, node1, node2      min_package_node

When the cluster starts, each package starts as shown in Figure 13.

Figure 13 Rotating Standby Configuration before Failover

If a failure occurs, the failing package would fail over to the node containing the fewest running packages:
Figure 14 Rotating Standby Configuration after Failover NOTE: Under the min_package_node policy, when node2 is repaired and brought back into the cluster, it will then be running the fewest packages, and thus will become the new standby node. If these packages had been set up using the configured_node failover policy, they would start initially as in Figure 13, but the failure of node2 would cause the package to start on node3, as shown in Figure 15.
Figure 15 configured_node Policy Packages after Failover If you use configured_node as the failover policy, the package will start up on the highest-priority eligible node in its node list. When a failover occurs, the package will move to the next eligible node in the list, in the configured order of priority.
Figure 16 Automatic Failback Configuration before Failover

Table 3 Node Lists in Sample Cluster

Package Name    NODE_NAME List    FAILOVER POLICY    FAILBACK POLICY
pkgA            node1, node4      configured_node    automatic
pkgB            node2, node4      configured_node    automatic
pkgC            node3, node4      configured_node    automatic

node1 panics, and after the cluster reforms, pkgA starts running on node4:
Figure 17 Automatic Failback Configuration After Failover After rebooting, node1 rejoins the cluster. At that point, pkgA will be automatically stopped on node4 and restarted on node1. Figure 18 Automatic Failback Configuration After Restart of node1
NOTE: Setting the failback_policy to automatic can result in a package failback and application outage during a critical production period. If you are using automatic failback, you may want to wait to add the package’s primary node back into the cluster until you can allow the package to be taken out of service temporarily while it switches back to the primary node. Serviceguard automatically chooses a primary node for a package when the NODE_NAME is set to '*'.
If there is a common generic resource that needs to be monitored as a part of multiple packages, then the monitoring script for that resource can be launched as part of one package and all other packages can use the same monitoring script. There is no need to launch multiple monitors for a common resource. If the package that has started the monitoring script fails or is halted, then all the other packages that are using this common resource also fail.
3.4.1 What Makes a Package Run? There are three types of packages: • The failover package is the most common type of package. It runs on one node at a time. If a failure occurs, it can switch to another node listed in its configuration file. If switching is enabled for several nodes, the package manager will use the failover policy to determine where to start the package. • A system multi-node package runs on all the active cluster nodes at the same time.
Figure 19 Legacy Package Time Line Showing Important Events The following are the most important moments in a package’s life: 1. Before the control script starts. (For modular packages, this is the master control script.) 2. During run script execution. (For modular packages, during control script execution to start the package.) 3. While services are running 4. If there is a generic resource configured and it fails, then the package will be halted. 5.
3.4.3 During Run Script Execution Once the package manager has determined that the package can start on a particular node, it launches the script that starts the package (that is, a package’s control script or master control script is executed with the start parameter). This script carries out the following steps:
1. Executes any external_pre_scripts (modular packages only; see “About External Scripts” (page 122)).
2. Activates volume groups or disk groups.
3. Mounts file systems.
Normal starts are recorded in the log, together with error messages or warnings related to starting the package. NOTE: After the package run script has finished its work, it exits, which means that the script is no longer executing once the package is running normally. After the script exits, the PIDs of the services started by the script are monitored by the package manager directly.
• Process IDs of the services • Subnets configured for monitoring in the package configuration file • Generic resources configured for monitoring in the package configuration file If a service fails but the restart parameter for that service is set to a value greater than 0, the service will restart, up to the configured number of restarts, without halting the package.
1. Halts all package services.
2. Executes any customer-defined halt commands (legacy packages only) or external_scripts (modular packages only; see “external_script” (page 184)).
3. Removes package IP addresses from the LAN card on the node.
4. Unmounts file systems.
5. Deactivates volume groups.
6. Revokes Persistent registrations and reservations, if any.
7. Exits with an exit code of zero (0).
8. Executes any external_pre_scripts (modular packages only; see “external_pre_script” (page 183)).
• 0—normal exit. The package halted normally, so all services are down on this node. • 1—abnormal exit, also known as no_restart exit. The package did not halt normally. Services are killed, and the package is disabled globally. It is not disabled on the current node, however. • 2 — abnormal exit, also known as restart exit. The package did not halt normally. Services are killed, and the package is disabled globally. It is not disabled on the current node, however.
Table 4 Error Conditions and Package Movement for Failover Packages (continued)

Package Error Condition: Loss of Network
    Node Failfast Enabled: No. Service Failfast Enabled: Either Setting.
    Linux Status on Primary after Error: Running. Halt script runs after Error or Exit: Yes.
    Package Allowed to Run on Primary Node after Error: Yes. Package Allowed to Run on Alternate Node: Yes.

Package Error Condition: Package depended on failed
    Node Failfast Enabled: Either Setting. Service Failfast Enabled: Either Setting.
    Linux Status on Primary after Error: Running. Halt script runs after Error or Exit: Yes.
    Package Allowed to Run on Primary Node after Error: Yes. Package Allowed to Run on Alternate Node: Yes, when dependency is again
Because system multi-node and multi-node packages do not fail over, they do not have relocatable IP addresses. A relocatable IP address is like a virtual host IP address that is assigned to a package. HP recommends that you configure names for each package through DNS (Domain Name System). A program can then use the package’s name like a host name as the input to gethostbyname(3), which will return the package’s relocatable IP address.
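For example, if the relocatable address of pkg1 has been given the DNS name pkg1.example.com (an illustrative name), a client can resolve it exactly as it would a host name:

getent hosts pkg1.example.com
# returns the relocatable address, for example: 192.10.25.12 pkg1.example.com

getent consults the same resolver that gethostbyname(3) uses, so this mirrors what an application sees.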
the others are available as backups. If one interface fails, another interface in the bonded group takes over. HP strongly recommends you use channel bonding in each critical IP subnet to achieve highly available network services. Host Bus Adapters (HBAs) do not have to be identical. Ethernet LANs must be the same type, but can be of different bandwidth (for example, 1 Gb and 100 Mb). Serviceguard for Linux supports the use of bonding of LAN interfaces at the driver level.
Figure 23 Bonded NICs (two nodes, each with eth0 and eth1 bonded as bond0, connected through redundant hubs with a crossover cable) In the bonding model, individual Ethernet interfaces are slaves, and the bond is the master. In the basic high availability configuration (mode 1), one slave in a bond assumes an active role, while the others remain inactive until a failure is detected. (In Figure 23, both eth0 slave interfaces are active.)
on-board LAN interfaces) must be used in any combination of channel bonds to avoid a single point of failure for heartbeat connections. 3.5.5 Bonding for Load Balancing It is also possible to configure bonds in load balancing mode, which allows all slaves to transmit data in parallel, in an active/active arrangement. In this case, high availability is provided by the fact that the bond still continues to function (with less throughput) if one of the component LANs should fail.
• Detects when a network interface fails to send or receive IP messages, even though it is still up at the link level. • Handles the failure, failover, recovery, and failback.
16.89.120.0 … Possible IP Monitor Subnets: IPv4: 16.89.112.0 Polling Target 16.89.112.1 IPv6: 3ffe:1000:0:a801:: Polling Target 3ffe:1000:0:a801::254 … The IP Monitor section of the cluster configuration file will look similar to the following for a subnet on which IP monitoring is configured with target polling. NOTE: This is the default if cmquerycl detects a gateway for the subnet in question; see SUBNET under “Cluster Configuration Parameters ” (page 86) for more information.
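For example, using the subnet and polling target reported by cmquerycl above, the entry would look something like this:

SUBNET 16.89.112.0
  IP_MONITOR ON
  POLLING_TARGET 16.89.112.1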
IMPORTANT: HP strongly recommends that you do not change the default NETWORK_POLLING_INTERVAL value of 2 seconds. See also “Reporting Link-Level and IP-Level Failures” (page 66). 3.5.7.3 Constraints and Limitations • A subnet must be configured into the cluster in order to be monitored. • Polling targets are not detected beyond the first-level router. • Polling targets must accept and respond to ICMP (or ICMPv6) ECHO messages.
3.5.9 Package Switching and Relocatable IP Addresses A package switch involves moving the package to a new system. In the most common configuration, in which all nodes are on the same subnet(s), the package IP (relocatable IP; see “Stationary and Relocatable IP Addresses and Monitored Subnets” (page 59)) moves as well, and the new system must already have the subnet configured and working properly, otherwise the packages will not be started.
3.5.11.3 Configuration Restrictions Linux allows up to 1024 VLANs to be created from a physical NIC port. A large pool of system resources is required to accommodate such a configuration; Serviceguard could suffer performance degradation if many network interfaces are configured in each cluster node. To prevent this and other problems, Serviceguard imposes the following restrictions: • A maximum of 30 network interfaces per node is supported.
Figure 26 Physical Disks Combined into LUNs NOTE: LUN definition is normally done using utility programs provided by the disk array manufacturer. Since arrays vary considerably, you should refer to the documentation that accompanies your storage unit. For information about configuring multipathing, see “Multipath for Storage ” (page 78). 3.6.2 Monitoring Disks Each package configuration includes information about the disks that are to be activated by the package at startup.
Unlike exclusive activation for volume groups, which does not prevent unauthorized access to the underlying LUNs, PR controls access at the LUN level. Registration and reservation information is stored on the device and enforced by its firmware; this information persists across device resets and system reboots. NOTE: Persistent Reservations coexist with, and are independent of, activation protection of volume groups.
For more information on using Serviceguard with VMware Virtual Machines, see the white paper Using Serviceguard for Linux with VMware Virtual Machines at http://www.hp.com/go/ linux-serviceguard-docs. CAUTION: Serviceguard makes and revokes registrations and reservations during normal package startup and shutdown, or package failover. Serviceguard also provides a script to clear reservations in the event of a catastrophic cluster failure.
3.8 Responses to Failures Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits. 3.8.1 Reboot When a Node Fails The most dramatic response to a failure in a Serviceguard cluster is a system reboot. This allows packages to move quickly to another node, protecting the integrity of the data.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To release all resources related to Package2 (such as exclusive access to volume group vg02 and the Package2 IP address) as quickly as possible, SystemB halts (system reset). NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster ($SGAUTOSTART) is set to zero, the node will not attempt to join the cluster when it comes back up.
• If service_fail_fast_enabled (page 178) is set to yes in the package configuration file, Serviceguard will reboot the node if there is a failure of that specific service. • If node_fail_fast_enabled (page 171) is set to yes in the package configuration file, and the package fails, Serviceguard will halt (reboot) the node on which the package is running. For more information, see “Package Configuration Planning ” (page 100) and Chapter 6 (page 163).
4 Planning and Documenting an HA Cluster Building a Serviceguard cluster begins with a planning phase in which you gather and record information about all the hardware and software components of the configuration.
your cluster without having to bring it down, you need to plan the initial configuration carefully. Use the following guidelines: • Set the Maximum Configured Packages parameter (described later in this chapter under “Cluster Configuration Planning ” (page 81)) high enough to accommodate the additional packages you plan to add. • Networks should be pre-configured into the cluster configuration if they will be needed for packages you will add later while the cluster is running.
For more information, see the white paper Using Serviceguard for Linux with VMware Virtual Machines at http://www.hp.com/go/linux-serviceguard-docs. 4.3 Hardware Planning Hardware planning requires examining the physical hardware itself. One useful procedure is to sketch the hardware configuration in a diagram that shows adapter cards and buses, cabling, disks and peripherals.
4.3.3 Shared Storage SCSI can be used for up to four-node clusters; FibreChannel can be used for clusters of up to 16 nodes. 4.3.3.1 FibreChannel FibreChannel cards can be used to connect up to 16 nodes to a disk array containing storage. After installation of the cards and the appropriate driver, the LUNs configured on the storage unit are presented to the operating system as device files, which can be used to build LVM volume groups.
You can obtain information about available disks by using the following commands; your system may provide other utilities as well. • ls /dev/sd* (Smart Array cluster storage) • ls /dev/hd* (non-SCSI/FibreChannel disks) • ls /dev/sd* (SCSI and FibreChannel disks) • du • df • mount • vgdisplay -v • lvdisplay -v See the manpages for these commands for information about specific usage. The commands should be issued from all nodes after installing the hardware and rebooting the system.
Power Supply Enter the power supply unit number of the UPS to which the host or other device is connected. Be sure to follow UPS, power circuit, and cabinet power limits as well as SPU power limits. 4.4.1 Power Supply Configuration Worksheet The Power Supply Planning worksheet (page 273) will help you organize and record your specific power supply configuration. Make as many copies as you need.
IP Address The IP address(es) by which the quorum server will communicate with the cluster nodes. Supported Node Names The name (39 characters or fewer) of each cluster node that will be supported by this quorum server. These entries will be entered into qs_authfile on the system that is running the quorum server process.
4.7.1 Easy Deployment: cmpreparecl The cmpreparecl script allows you to ease the process of setting up the servers participating in the cluster. It also checks for the availability of ports used by Serviceguard Linux, starts the xinetd services, updates specific files, and sets up the firewall. As of Serviceguard A.11.20.10, the cmpreparecl script is supported. NOTE: After you run the cmpreparecl script, you can start the cluster configuration.
f. Sets AUTOSTART_CMCLD=1.
g. In a SUSE Linux Enterprise Server 11 environment, sets the RUN_PARALLEL parameter in the /etc/sysconfig/boot file to "NO".
h. Validates the hostnames of the nodes (and of the quorum server, if specified) and their IP addresses, and updates them in the /etc/hosts file.
i. Updates the /etc/lvm/lvm.conf and /etc/lvm/lvm_$(uname -n).conf files to enable VG Activation Protection.
Creates and deploys the firewall rules.
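As a sketch, an invocation for a two-node cluster might look like this (the -n options for naming the nodes are an assumption; see cmpreparecl (1m) for the exact syntax your version supports):

cmpreparecl -n ftsys9 -n ftsys10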
(1m); see “Specifying the Address Family for the Cluster Hostnames” (page 148). The default is IPV4. See the subsections that follow for more information and important rules and restrictions. 4.7.3.1 What Is IPv4–only Mode? IPv4 is the default mode: unless you specify IPV6 or ANY (either in the cluster configuration file or via cmquerycl -a) Serviceguard will always try to resolve the nodes' hostnames (and the Quorum Server's, if any) to IPv4 addresses, and will not try to resolve them to IPv6 addresses.
NOTE: This also applies if HOSTNAME_ADDRESS_FAMILY is set to ANY. See “Allowing Root Access to an Unconfigured Node” (page 130) for more information. • If you use a Quorum Server, you must make sure that the Quorum Server hostname (and the alternate Quorum Server address specified by QS_ADDR, if any) resolve to IPv6 addresses, and you must use Quorum Server version A.04.00 or later. See the latest Quorum Server release notes for more information; you can find them at http://www.hp.
IPv4 address in this file (in the case of /etc/hosts, the IPv4 loopback address cannot be removed). In addition, the file must contain the following entry: ::1 localhost ipv6-localhost ipv6-loopback For more information and recommendations about hostname resolution, see “Configuring Name Resolution” (page 131). • You must use $SGCONF/cmclnodelist, not ~/.rhosts or /etc/hosts.equiv, to provide root access to an unconfigured node.
All other characters are legal. The cluster name can contain up to 39 characters. CAUTION: Make sure that the cluster name is unique within the subnets configured on the cluster nodes; under some circumstances Serviceguard may not be able to detect a duplicate name and unexpected problems may result.
Can be changed while the cluster is running; see “What Happens when You Change the Quorum Configuration Online” (page 40) for important information. QS_ADDR An alternate fully-qualified hostname or IP address for the quorum server. It must be (or resolve to) an IPv4 address on Red Hat 5 and Red Hat 6. On SLES 11, it can be (or resolve to) either an IPv4 or an IPv6 address if HOSTNAME_ADDRESS_FAMILY is set to ANY, but otherwise must match the setting of HOSTNAME_ADDRESS_FAMILY.
the documents listed under “Cross-Subnet Configurations” (page 27) for more information. You can define multiple SITE_NAMEs. SITE_NAME entries must precede any NODE_NAME entries. See also SITE. IMPORTANT: SITE_NAME must be 39 characters or less, and are case-sensitive. Duplicate SITE_NAME entries are not allowed. NODE_NAME The hostname of each system that will be a node in the cluster.
the documents listed under “Cross-Subnet Configurations” (page 27) for more information. If SITE is used, it must be used for each node in the cluster (that is, all the nodes must be associated with some defined site, though not necessarily the same one). If you are using SITEs, you can restrict the output of cmviewcl (1m) to a given site by means of the -S option. In addition, you can configure a site_preferred or site_preferred_manual failover_policy (page 172) for a package.
NOTE: Any subnet that is configured in this cluster configuration file as a SUBNET for IP monitoring purposes, or as a monitored_subnet in a package configuration file (or SUBNET in a legacy package; see “Package Configuration Planning ” (page 100)) must be specified in the cluster configuration file via NETWORK_INTERFACE and either STATIONARY_IP or HEARTBEAT_IP.
Considerations for cross-subnet: IP addresses for a given heartbeat path are usually on the same subnet on each node, but it is possible to configure the heartbeat on multiple subnets such that the heartbeat is carried on one subnet for one set of nodes and another subnet for others, with the subnets joined by a router.
subnets here. You can identify any number of subnets to be monitored. A stationary IP address can be either an IPv4 or an IPv6 address. For more information about IPv6 addresses, see “IPv6 Address Types” (page 277). For information about changing the configuration online, see “Changing the Cluster Networking Configuration while the Cluster Is Running” (page 221). CAPACITY_NAME, CAPACITY_VALUE Node capacity parameters. Use the CAPACITY_NAME and CAPACITY_VALUE parameters to define a capacity for this node.
MEMBER_TIMEOUT The amount of time, in microseconds, after which Serviceguard declares that the node has failed and begins re-forming the cluster without this node. Default value: 14 seconds (14,000,000 microseconds). This value leads to a failover time of between approximately 18 and 22 seconds, if you are using a quorum server, or a Fiber Channel cluster lock, or no cluster lock. Increasing the value to 25 seconds increases the failover time to between approximately 29 and 39 seconds.
rebooted, if a system hang or network load spike prevents the node from sending a heartbeat signal within the MEMBER_TIMEOUT value. More than one node could be affected if, for example, a network event such as a broadcast storm caused kernel interrupts to be turned off on some or all nodes while the packets are being processed, preventing the nodes from sending and processing heartbeat messages.
If NETWORK_POLLING_INTERVAL is defined to be 9,000,000 (9 seconds), then the polling happens at 9th second, 18th second and so on. • Serviceguard also uses this parameter to calculate the number of consecutive packets that each LAN interface can miss/receive to mark a LAN interface DOWN/UP. When an interface is monitored at IP-Level, and the NETWORK_POLLING_INTERVAL is defined to be 8 seconds or more, then the number of consecutive packets that each LAN interface can miss/receive to be marked DOWN/UP is 2.
NOTE: CONFIGURED_IO_TIMEOUT_EXTENSION is supported only with iFCP switches that allow you to get their R_A_TOV value. • For switches and routers connecting an NFS server and cluster-node clients that can run packages using the NFS-mounted file system; see “Planning for NFS-mounted File Systems” (page 101). To set the value for the CONFIGURED_IO_TIMEOUT_EXTENSION, you must first determine the Maximum Bridge Transit Delay (MBTD) for each switch and router. The value should be in the vendors' documentation.
See “Monitoring LAN Interfaces and Detecting Failure: IP Level” (page 63) for more information. Can be changed while the cluster is running; must be removed, with its accompanying IP_MONITOR and POLLING_TARGET entries, if the subnet in question is removed from the cluster configuration. IP_MONITOR Specifies whether or not the subnet specified in the preceding SUBNET entry will be monitored at the IP layer. To enable IP monitoring for the subnet, set IP_MONITOR to ON; to disable it, set it to OFF.
name for a weight that exactly corresponds to a CAPACITY_NAME specified earlier in the cluster configuration file. (A package has weight; a node has capacity.) The rules for forming WEIGHT_NAME are the same as those spelled out for CAPACITY_NAME earlier in this list. These parameters are optional, but if they are defined, WEIGHT_DEFAULT must follow WEIGHT_NAME, and must be set to a floating-point value between 0 and 1000000.
4.8 Package Configuration Planning Planning for packages involves assembling information about each group of highly available services. NOTE: As of Serviceguard A.11.18, there is a new and simpler way to configure packages. This method allows you to build packages from smaller modules, and eliminates the separate package control script and the need to distribute it manually; see Chapter 6: “Configuring Packages and Their Services ” (page 163), for complete instructions.
• If a package moves to an adoptive node, what effect will its presence have on performance? • What hardware/software resources need to be monitored as part of the package? You can then configure these as generic resources in the package and write appropriate monitoring scripts for monitoring the resources. NOTE: Generic resources influence the package based on their status. The actual monitoring of the resource should be done in a script and this must be configured as a service.
• Networking among the Serviceguard nodes must be configured in such a way that a single failure in the network does not cause a package failure. • Only NFS client-side locks (local locks) are supported. Server-side locks are not supported. • Because exclusive activation is not available for NFS-imported file systems, you must take the following precautions to ensure that data is not accidentally overwritten.
When adding packages, be sure not to exceed the value of max_configured_packages as defined in the cluster configuration file (see “Cluster Configuration Parameters ” (page 86)). You can modify this parameter while the cluster is running if you need to. 4.8.4 Choosing Switching and Failover Behavior To determine the failover behavior of a failover package (see “Package Types” (page 41)), you define the policy that governs where Serviceguard will automatically start up a package that is not running.
• generic_resource_name: defines the logical name used to identify a generic resource in a package. • generic_resource_evaluation_type: defines when the status of a generic resource is evaluated. This can be set to during_package_start or before_package_start. If not specified, during_package_start is the default. ◦ during_package_start means the status of generic resources is evaluated during the course of package start.
NOTE: Generic resources must be configured to use the monitoring script. It is the monitoring script that contains the logic to monitor the resource and set the status of a generic resource accordingly by using cmsetresource(1m). These scripts must be written by end-users according to their requirements. The monitoring script must be configured as a service in the package if the monitoring of the resource is required to be started and stopped as a part of the package.
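A minimal monitoring-script sketch follows (the resource name sfm_disk matches the example below; the health check is a placeholder, and the cmsetresource options shown should be verified against cmsetresource (1m)):

#!/bin/bash
# Loop forever, refreshing the status of the generic resource
# from a resource-specific health check.
while true
do
    if ls /dev/sdc >/dev/null 2>&1    # placeholder health check
    then
        cmsetresource -r sfm_disk -s up
    else
        cmsetresource -r sfm_disk -s down
    fi
    sleep 30
done

Configured as a package service, a script like this is started and stopped along with the package, as described above.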
ATTRIBUTE_NAME    ATTRIBUTE_VALUE
Style             modular
Priority          no_priority

The cmviewcl -v -f line output (snippet) will be as follows:

cmviewcl -v -f line -p pkg1 | grep generic_resource
generic_resource:sfm_disk|name=sfm_disk
generic_resource:sfm_disk|evaluation_type=during_package_start
generic_resource:sfm_disk|up_criteria="N/A"
generic_resource:sfm_disk|node:node1|status=unknown
generic_resource:sfm_disk|node:node1|current_value=0
generic_resource:sfm_disk|node:node2|status=unknown
generic_resource:sfm_disk|n
4.8.6.2 Online Reconfiguration of Generic Resources Online operations such as addition, deletion, and modification of generic resources in packages are supported. The following operations can be performed online: • Addition of a generic resource of generic_resource_evaluation_type set to during_package_start, whose status is not down. When you add a generic resource, ensure that the corresponding monitoring script is available; if it is not, add the monitor at the same time as the generic resource.
Serviceguard adds two new capabilities: you can specify broadly where the package depended on must be running, and you can specify that it must be down. These capabilities are discussed later in this section under “Extended Dependencies” (page 112). You should read the next section, “Simple Dependencies” (page 108), first. 4.8.7.1 Simple Dependencies A simple dependency occurs when one package requires another to be running on the same node.
• A package cannot depend on itself, directly or indirectly. That is, not only must pkg1 not specify itself in the dependency_condition (page 174), but pkg1 must not specify a dependency on pkg2 if pkg2 depends on pkg1, or if pkg2 depends on pkg3 which depends on pkg1, etc.
NOTE: Keep the following in mind when reading the examples that follow, and when actually configuring priorities: 1. auto_run (page 170) should be set to yes for all the packages involved; the examples assume that it is. 2. Priorities express a ranking order, so a lower number means a higher priority (10 is a higher priority than 30).
If pkg1 depends on pkg2, and pkg1’s priority is lower than or equal to pkg2’s, pkg2’s node order dominates. Assuming pkg2’s node order is node1, node2, node3, then: • On startup: ◦ • pkg2 will start on node1, or node2 if node1 is not available or does not at present meet all of its dependencies, etc. – pkg1 will start on whatever node pkg2 has started on (no matter where that node appears on pkg1’s node_name list) provided all of pkg1’s other dependencies are met there.
Note that the nodes will be tried in the order of pkg1’s node_name list, and pkg2 will be dragged to the first suitable node on that list whether or not it is currently running on another node. • • On failover: ◦ If pkg1 fails on node1, pkg1 will select node2 to fail over to (or node3 if it can run there and node2 is not available or does not meet all of its dependencies; etc.) ◦ pkg2 will be dragged to whatever node pkg1 has selected, and restart there; then pkg1 will restart there.
• You can specify whether the package depended on must be running or must be down. You define this condition by means of the dependency_condition, using one of the literals UP or DOWN (the literals can be upper or lower case). We'll refer to the requirement that another package be down as an exclusionary dependency; see “Rules for Exclusionary Dependencies” (page 113).
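For example, the relevant entries in pkg1’s package configuration file might look like this (package names are illustrative):

# pkg1 requires pkg0 to be up on the same node:
dependency_name         pkg0_up
dependency_condition    pkg0 = up
dependency_location     same_node

# pkg1 also requires that pkg_test not be running (an exclusionary dependency):
dependency_name         pkg_test_down
dependency_condition    pkg_test = down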
4.8.7.4.2 Rules for different_node and any_node Dependencies These rules apply to packages whose dependency_condition is UP and whose dependency_location is different_node or any_node. For same-node dependencies, see Simple Dependencies (page 108); for exclusionary dependencies, see “Rules for Exclusionary Dependencies” (page 113). • Both packages must be failover packages whose failover_policy (page 172) is configured_node.
• these are failover packages, and • the failing package can “drag” these packages to a node on which they can all run. Otherwise the failing package halts and the packages it depends on continue to run. 4. Starts the packages the failed package depends on (those halted in step 3, if any).
4.8.10.3 Simple Method Use this method if you simply want to control the number of packages that can run on a given node at any given time. This method works best if all the packages consume about the same amount of computing resources. If you need to make finer distinctions between packages in terms of their resource consumption, use the Comprehensive Method (page 117) instead. To implement the simple method, use the reserved keyword package_limit to define each node's capacity.
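For example, to give node1 a package_limit capacity of ten, the cluster configuration file would contain entries along these lines (shown schematically; the CAPACITY parameters appear under the node's NODE_NAME entry):

NODE_NAME node1
  CAPACITY_NAME package_limit
  CAPACITY_VALUE 10

With package_limit, each package consumes a weight of 1 by default, so this node can run at most ten packages at a time; a package that should count for more can be given a larger weight_value in its package configuration file.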
you wanted to ensure that the larger packages, pkg2 and pkg3, did not run on node1 at the same time, you could raise the weight_value of one or both so that the combination exceeded 10 (or reduce node1's capacity to 8). 4.8.10.3.2 Points to Keep in Mind The following points apply specifically to the Simple Method (page 116). Read them in conjunction with the Rules and Guidelines (page 121), which apply to all weights and capacities.
memory weight does not exceed 1000. But Serviceguard has no knowledge of the real-world meanings of the names processor and memory; there is no mapping to actual processor and memory usage and you would get exactly the same results if you used the names apples and oranges. For example, suppose you have the following configuration: • A two node cluster running four packages. These packages contend for resource we'll simply call A and B. • node1 has a capacity of 80 for A and capacity of 50 for B.
NOTE: You do not have to define capacities for every node in the cluster. If any capacity is not defined for any node, Serviceguard assumes that node has an infinite amount of that capacity.
NOTE: Option 4 means that the package is “weightless” as far as this particular capacity is concerned, and can run even on a node on which this capacity is completely consumed by other packages. (You can make a package “weightless” for a given capacity even if you have defined a cluster-wide default weight; simply set the corresponding weight to zero in the package's cluster configuration file.
to move; see “How Package Weights Interact with Package Priorities and Dependencies” (page 121)). This is true whenever a package has a weight that exceeds the available amount of the corresponding capacity on the node. 4.8.10.5 Rules and Guidelines The following rules and guidelines apply to both the Simple Method (page 116) and the Comprehensive Method (page 117) of configuring capacities and weights. • You can define a maximum of four capacities, and corresponding weights, throughout the cluster.
its priority is set to the default, no_priority) will not be halted to make room for a down package that has no priority. Between two down packages without priority, Serviceguard will decide which package to start if it cannot start them both because there is not enough node capacity to support their weight. 4.8.10.7.1 Example 1 • pkg1 is configured to run on nodes turkey and griffon. It has a weight of 1 and a priority of 10. It is down and has switching disabled.
• 0 - indicating success. • 1 - indicating the package will be halted, and should not be restarted, as a result of failure in this script. • 2 - indicating the package will be restarted on another node, or halted if no other node is available. NOTE: In the case of the validate entry point, exit values 1 and 2 are treated the same; you can use either to indicate that validation failed.
while (( i < ${#SG_SERVICE_NAME[*]} ))
do
    case ${SG_SERVICE_CMD[i]} in
        *monitor.sh*)    # a monitoring service was found; act on it here
            ;;
    esac
    (( i = i + 1 ))
done
4.8.11.2 Determining Why a Package Has Shut Down You can use an external script (or CUSTOMER DEFINED FUNCTIONS area of a legacy package control script) to find out why a package has shut down.
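For instance, a halt-time external script might branch on the reason for the shutdown. The sketch below assumes the SG_HALT_REASON environment variable and the values shown, which should be verified against your version's documentation:

case "$SG_HALT_REASON" in
    failure)         echo "package halted because of a failure" ;;
    user_halt)       echo "package halted by a cmhaltpkg/cmhaltnode command" ;;
    automatic_halt)  echo "package halted by automatic failover or failback" ;;
esac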
monitored_subnet_access unconfigured for a monitored subnet is equivalent to FULL). (For legacy packages, see “Configuring Cross-Subnet Failover” (page 231)). • You should not use the wildcard (*) for node_name in the package configuration file, as this could allow the package to fail over across subnets when a node on the same subnet is eligible; failing over across subnets can take longer than failing over on the same subnet. List the nodes in order of preference instead of using the wildcard.
Assuming nodeA is pkg1’s primary node (where it normally starts), create node_name entries in the package configuration file as follows: node_name nodeA node_name nodeB node_name nodeC node_name nodeD 4.8.12.2.2 Configuring monitored_subnet_access In order to monitor subnet 15.244.65.0 or 15.244.56.0, depending on where pkg1 is running, you would configure monitored_subnet and monitored_subnet_access in pkg1’s package configuration file as follows: monitored_subnet 15.244.65.
If you intend to remove a node from the cluster configuration while the cluster is running, ensure that the resulting cluster configuration will still conform to the rules for cluster locks described above. See “Cluster Lock Planning” (page 80) for more information. If you are planning to add a node online, and a package will run on the new node, ensure that any existing cluster-bound volume groups for the package have been imported to the new node.
5 Building an HA Cluster Configuration This chapter and the next take you through the configuration tasks required to set up a Serviceguard cluster. You carry out these procedures on one node, called the configuration node, and Serviceguard distributes the resulting binary file to all the nodes in the cluster. In the examples in this chapter, the configuration node is named ftsys9, and the sample target node is called ftsys10.
SGROOT=/opt/cmcluster # SG root directory SGCONF=/opt/cmcluster/conf # configuration files SGSBIN=/opt/cmcluster/bin # binaries SGLBIN=/opt/cmcluster/bin # binaries SGLIB=/opt/cmcluster/lib # libraries SGRUN=/opt/cmcluster/run # location of core dumps from daemons SGAUTOSTART=/opt/cmcluster/conf/cmcluster.rc # SG Autostart file Throughout this document, system filenames are usually given with one of these location prefixes.
Serviceguard consults it only when configuring a node into a cluster for the first time; it is ignored after that. It does not exist by default, but you will need to create it. You may want to add a comment such as the following at the top of the file: ########################################################### # Do not edit this file! # Serviceguard uses this file only to authorize access to an # unconfigured node. Once the node is configured, # Serviceguard will not consult this file.
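Each entry consists of a hostname (or IP address) followed by the user root, one node per line; for example:

ftsys9.mydomain   root
ftsys10.mydomain  root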
Serviceguard nodes can communicate over any of the cluster’s shared networks, so the network resolution service you are using (such as DNS, NIS, or LDAP) must be able to resolve each of their primary addresses on each of those networks to the primary hostname of the node in question. In addition, HP recommends that you define name resolution in each node’s /etc/hosts file, rather than rely solely on a service such as DNS.
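For example, each node's /etc/hosts could contain entries like these for the cluster nodes (addresses are illustrative):

192.168.1.10  ftsys9.mydomain   ftsys9
192.168.1.11  ftsys10.mydomain  ftsys10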
IMPORTANT: Serviceguard does not support aliases for IPv6 addresses. For information about configuring an IPv6–only cluster, or a cluster that uses a combination of IPv6 and IPv4 addresses for the nodes' hostnames, see “About Hostname Address Families: IPv4-Only, IPv6-Only, and Mixed Mode” (page 83).
NOTE: HP recommends that you also make the name service itself highly available, either by using multiple name servers or by configuring the name service into a Serviceguard package. 5.1.6 Ensuring Consistency of Kernel Configuration Make sure that the kernel configurations of all cluster nodes are consistent with the expected behavior of the cluster during failover.
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no

For Red Hat 5 and Red Hat 6 only, add the following line to the ifcfg-bond0 file:

BONDING_OPTS='miimon=100 mode=1'

2. Create an ifcfg-ethn file for each interface in the bond. All interfaces should have SLAVE and MASTER definitions.
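A slave file, ifcfg-eth0, would typically look like this (repeat for eth1 with DEVICE=eth1):

DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no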
5.1.8.3 Viewing the Configuration You can test the configuration and transmit policy with ifconfig. For the configuration created above, the display should look like this:

/sbin/ifconfig
bond0    Link encap:Ethernet  HWaddr 00:C0:F0:1F:37:B4
         inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
REMOTE_IPADDR=''
STARTMODE='onboot'
BONDING_MASTER='yes'
BONDING_MODULE_OPTS='miimon=100 mode=1'
BONDING_SLAVE0='eth0'
BONDING_SLAVE1='eth1'

The above example configures bond0 with mii monitor equal to 100 and active-backup mode. Adjust the IP, BROADCAST, NETMASK, and NETWORK parameters to correspond to your configuration. As you can see, you are adding the configuration options BONDING_MASTER, BONDING_MODULE_OPTS, and BONDING_SLAVE.
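On either distribution, you can also inspect the state of the bond, its active slave, and the link status of each slave interface through the bonding driver's status file:

cat /proc/net/bonding/bond0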
Respond to the fdisk prompts as shown in the following table to set up the lock LUN partition:
Table 7 Changing Linux Partition Types
Prompt                       Response   Action Performed
1. Command (m for help):     n          Create new partition
2. Partition number (1-4):   1          Partition affected
3.
NOTE: fdisk may not be available for SUSE on all platforms. In this case, using YAST2 to set up the partitions is acceptable.
Support for Lock LUN Devices
The following table describes the support for lock LUN devices on udev and device mapper:
• If a udev device is selected as the lock LUN: this is supported, but the same udev rules must be used across all nodes in the cluster, for the whole LUN or the partitioned LUN.
• If /dev/disk/by-id, /dev/: this is not supported on a whole LUN or a partitioned LUN.
• Building Volume Groups: Example for Smart Array Cluster Storage (MSA 2000 Series) (page 143) • Building Volume Groups and Logical Volumes (page 144) • Distributing the Shared Configuration to all Nodes (page 144) • Testing the Shared Configuration (page 145) • Storing Volume Group Configuration Data (page 146) • Setting up Disk Monitoring (page 147) CAUTION: The minor numbers used by the LVM volume groups must be the same on all cluster nodes.
1. Run fdisk, specifying your device file name in place of <device>:
# fdisk <device>
Respond to the prompts as shown in the following table, to define a partition:
Prompt                                  Response   Action Performed
1. Command (m for help):                n          Create a new partition
2. Command action
   e   extended
   p   primary partition (1-4)          p          Create a primary partition
3. Partition number (1-4):              1          Create partition 1
4. First cylinder (1-nn, default 1):    Enter      Accept the default starting cylinder 1
5.
Disk /dev/sdc: 64 heads, 32 sectors, 4067 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot    Start     End    Blocks   Id  System
/dev/sdc1             1    4067   4164592   8e  Linux LVM

Command (m for help): w
The partition table has been altered!
3. Repeat this process for each device file that you will use for shared storage.
fdisk /dev/sdd
fdisk /dev/sdf
fdisk /dev/sdg
4.
5. Run vgscan: vgscan NOTE: At this point, the setup for volume-group activation protection is complete. Serviceguard adds a tag matching the uname -n value of the owning node to each volume group defined for a package when the package runs and deletes the tag when the package halts. The command vgs -o +tags vgname will display any tags that are set for a volume group.
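For example, for a volume group named vgpkgA owned by ftsys9, the check might look like this (a sketch; the volume group name and sizes are hypothetical, and the exact columns vary with the LVM version):
vgs -o +tags vgpkgA
  VG       #PV #LV #SN Attr   VSize VFree VG Tags
  vgpkgA     1   1   0 wz--n- 3.97g 3.47g ftsys9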
5.1.12.5 Building Volume Groups and Logical Volumes 1. Use Logical Volume Manager (LVM) to create volume groups that can be activated by Serviceguard packages. For an example showing volume-group creation on LUNs, see “Building Volume Groups: Example for Smart Array Cluster Storage (MSA 2000 Series)” (page 143). (For Fibre Channel storage you would use device-file names such as those used in the section “Creating Partitions” (page 140)). 2. 3.
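A minimal sketch of the volume-group creation described in step 1, assuming a single shared partition /dev/sdc1 (the device name, volume-group name, and sizes are hypothetical):
pvcreate /dev/sdc1
vgcreate vgpkgA /dev/sdc1
lvcreate -L 500M -n lvol1 vgpkgA
mkfs -t ext3 /dev/vgpkgA/lvol1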
NOTE: Use vgchange --deltag only if you are implementing volume-group activation protection. Remember that volume-group activation protection, if used, must be implemented on each node.
2. To get the node ftsys10 to see the new disk partitioning that was done on ftsys9, reboot:
reboot
The partition table on the rebooted node is then rebuilt using the information placed on the disks when they were partitioned on the other node.
NOTE: You must reboot at this time.
3.
2. On ftsys10, activate the volume group, mount the file system, write a date stamp on to the shared file, and then look at the content of the file:
vgchange --addtag $(uname -n) vgpkgB
vgchange -a y vgpkgB
mount /dev/vgpkgB/lvol1 /extra
echo "Written by `hostname` on `date`" >> /extra/datestamp
cat /extra/datestamp
You should see something like the following, including the date stamp written by the other node:
Written by ftsys9.mydomain on Mon Jan 22 14:23:44 PST 2006
Written by ftsys10.
NOTE: Be careful if you use YAST or YAST2 to configure volume groups, as that may cause all volume groups to be activated. After running YAST or YAST2, check that volume groups for Serviceguard packages not currently running have not been activated, and use LVM commands to deactivate any that have. For example, use the command vgchange -a n /dev/sgvg00 to deactivate the volume group sgvg00.
Red Hat
It is not necessary to prevent vgscan on Red Hat.
5.2.1 cmquerycl Options 5.2.1.1 Speeding up the Process In a larger or more complex cluster with many nodes, networks or disks, the cmquerycl command may take several minutes to complete. To speed up the configuration process, you can direct the command to return selected information only by using the -k and -w options: -k eliminates some disk probing, and does not return information about potential cluster lock volume groups and lock physical volumes.
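For example, a sketch of a faster query that skips some disk probing and limits network probing (see the cmquerycl (1m) manpage for the exact meaning of each -w level):
cmquerycl -v -k -w local -n ftsys9 -n ftsys10 -C $SGCONF/clust1.conf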
cmquerycl -v -h ipv6 -C $SGCONF/clust1.conf -n ftsys9 -n ftsys10 • -h ipv4 tells Serviceguard to discover and configure only IPv4 subnets. If it does not find any eligible subnets, the command will fail. • -h ipv6 tells Serviceguard to discover and configure only IPv6 subnets. If it does not find any eligible subnets, the command will fail.
To specify an alternate hostname or IP address by which the Quorum Server can be reached, use a command such as (all on one line):
cmquerycl -q <QS_Host> <QS_Addr> -n ftsys9 -n ftsys10 -C <ClusterName>.conf
Enter the QS_HOST (IPv4 or IPv6 on SLES 11; IPv4 only on Red Hat 5 and Red Hat 6), optional QS_ADDR (IPv4 or IPv6 on SLES 11; IPv4 only on Red Hat 5 and Red Hat 6), QS_POLLING_INTERVAL, and optionally a QS_TIMEOUT_EXTENSION; and also check the HOSTNAME_ADDRESS_FAMILY setting, which defaults to IPv4.
IPv4:
15.13.165.0      lan2 (nodeA)   lan2 (nodeB)
15.13.182.0      lan2 (nodeC)   lan2 (nodeD)
15.244.65.0      lan3 (nodeA)   lan3 (nodeB)
15.244.56.0      lan4 (nodeC)   lan4 (nodeD)

IPv6:
3ffe:1111::/64   lan3 (nodeA)   lan3 (nodeB)
3ffe:2222::/64   lan3 (nodeC)   lan3 (nodeD)

Possible Heartbeat IPs:
15.13.164.0      15.13.164.1 (nodeA)     15.13.164.2 (nodeB)
15.13.172.0      15.13.172.158 (nodeC)   15.13.172.159 (nodeD)
15.13.165.0      15.13.165.1 (nodeA)     15.13.165.2 (nodeB)
15.13.182.0      15.13.182.158 (nodeC)   15.13.182.159 (nodeD)
5.2.6 Specifying Maximum Number of Configured Packages This value must be equal to or greater than the number of packages currently configured in the cluster. The count includes all types of packages: failover, multi-node, and system multi-node. The maximum number of packages per cluster is 300. The default is the maximum. NOTE: Remember to tune kernel parameters on each node to ensure that they are set high enough for the largest number of packages that will ever run concurrently on that node. 5.2.
Figure 27 Access Roles 5.2.8.3 Levels of Access Serviceguard recognizes two levels of access, root and non-root: • Root access: Full capabilities; only role allowed to configure the cluster. As Figure 27 shows, users with root access have complete control over the configuration of the cluster and its packages. This is the only role allowed to use the cmcheckconf, cmapplyconf, cmdeleteconf, and cmmodnet -a commands.
IMPORTANT: Users on systems outside the cluster can gain Serviceguard root access privileges to configure the cluster only via a secure connection (rsh or ssh). • Non-root access: Other users can be assigned one of four roles: ◦ Full Admin: Allowed to perform cluster administration, package administration, and cluster and package view operations. These users can administer the cluster, but cannot configure or create a cluster. Full Admin includes the privileges of the Package Admin role.
Access control policies are defined by three parameters in the configuration file:
• Each USER_NAME can consist either of the literal ANY_USER, or a maximum of 8 login names from the /etc/passwd file on USER_HOST. The names must be separated by spaces or tabs, for example:
# Policy 1:
USER_NAME john fred patrick
USER_HOST bit
USER_ROLE PACKAGE_ADMIN
• USER_HOST is the node where USER_NAME will issue Serviceguard commands.
USER_HOST bit
USER_ROLE PACKAGE_ADMIN
If this policy is defined in the cluster configuration file, it grants user john the PACKAGE_ADMIN role for any package on node bit. User john also has the MONITOR role for the entire cluster, because PACKAGE_ADMIN includes MONITOR. If the policy is defined in the package configuration file for PackageA, then user john on node bit has the PACKAGE_ADMIN role only for PackageA.
Plan the cluster’s roles and validate them as soon as possible.
5.2.8.5 Package versus Cluster Roles Package configuration will fail if there is any conflict in roles between the package configuration and the cluster configuration, so it is a good idea to have the cluster configuration file in front of you when you create roles for a package; use cmgetconf to get a listing of the cluster configuration file.
# Warning: Neither a quorum server nor a lock lun was specified.
# A Quorum Server or a lock lun is required for clusters of only two nodes.
If you attempt to configure both a quorum server and a lock LUN, the following message appears on standard output when issuing the cmcheckconf or cmapplyconf command:
Duplicate cluster lock, line 55. Quorum Server already specified. 5.2.
3.
4. Verify that nodes leave and enter the cluster as expected using the following steps:
• Halt the node. You can use Serviceguard Manager or the cmhaltnode command.
• Check the cluster membership to verify that the node has left the cluster. You can use the Serviceguard Manager main page or the cmviewcl command.
• Start the node. You can use Serviceguard Manager or the cmrunnode command.
• Verify that the node has returned to operation.
NOTE: The /sbin/init.d/cmcluster file may call files that Serviceguard stores in $SGCONF/rc. (See “Understanding the Location of Serviceguard Files” (page 129) for information about Serviceguard directories on different Linux distributions.) This directory is for Serviceguard use only! Do not move, delete, modify, or add files in this directory. 5.3.
If you must disable identd, do the following on each node after installing Serviceguard but before each node rejoins the cluster (for example, before issuing a cmrunnode or cmruncl). For Red Hat and SUSE:
1. Change the value of the server_args parameter in the file /etc/xinetd.d/hacl-cfg from -c to -c -i
2. Restart xinetd:
/etc/init.d/xinetd restart
5.3.6 Deleting the Cluster Configuration You can delete a cluster configuration by means of the cmdeleteconf command.
6 Configuring Packages and Their Services Serviceguard packages group together applications and the services and resources they depend on. The typical Serviceguard package is a failover package that starts on one node but can be moved (“failed over”) to another if necessary. For more information, see “What is Serviceguard for Linux? ” (page 19), “How the Package Manager Works” (page 41), and “Package Configuration Planning ” (page 100).
When you have made these decisions, you are ready to generate the package configuration file; see “Generating the Package Configuration File” (page 185). 6.1.1 Types of Package: Failover, Multi-Node, System Multi-Node There are three types of packages: • Failover packages. This is the most common type of package. Failover packages run on one node at a time.
and start the package for the first time. But if you then halt the multi-node package via cmhaltpkg, it can be re-started only by means of cmrunpkg, not cmmodpkg. • If a multi-node package is halted via cmhaltpkg, package switching is not disabled. This means that the halted package will start to run on a rebooted node, if it is configured to run on that node and its dependencies are met.
Table 8 Base Modules

Module Name: failover
Parameters (page): package_name (page 169) *, module_name (page 169) *, module_version (page 169) *, package_type (page 169), package_description (page 169) *, node_name (page 170), auto_run (page 170), node_fail_fast_enabled (page 171), run_script_timeout (page 171), halt_script_timeout (page 171), successor_halt_script_timeout (page 172)
Comments: Base module. Use as primary building block for failover packages.
Table 9 Optional Modules

Module Name: dependency
Parameters (page): dependency_name (page 174) *, dependency_condition (page 174), dependency_location (page 174)
Comments: Add to a base module to create a package that depends on one or more other packages.

Module Name: weight
Parameters (page): weight_name (page 175) *, weight_value (page 175) *
Comments: Add to a base module to create a package that has weight that will be counted against a node's capacity.
Table 9 Optional Modules (continued)

Module Name: acp
Parameters (page): user_name (page 184), user_host (page 184), user_role (page 184)
Comments: Add to a base module to configure Access Control Policies for the package.

Module Name: all
Parameters (page): all parameters
Comments: Use if you are creating a complex package that requires most or all of the optional parameters; or if you want to see the specifications and comments for all available parameters.
NOTE: For more information, see the comments in the editable configuration file output by the cmmakepkg command, and the cmmakepkg (1m) manpage.
6.1.4.6 node_name The node on which this package can run, or a list of nodes in order of priority, or an asterisk (*) to indicate all nodes. The default is *. For system multi-node packages, you must specify node_name *. If you use a list, specify each node on a new line, preceded by the literal node_name, for example:
node_name <node1>
node_name <node2>
node_name <node3>
The order in which you specify the node names is important.
6.1.4.8 node_fail_fast_enabled Can be set to yes or no. The default is no. yes means the node on which the package is running will be halted (reboot) if the package fails; no means Serviceguard will not halt the system.
If a timeout occurs: • Switching will be disabled. • The current node will be disabled from running the package. If a halt-script timeout occurs, you may need to perform manual cleanup. See Chapter 8: “Troubleshooting Your Cluster” (page 241). 6.1.4.11 successor_halt_timeout Specifies how long, in seconds, Serviceguard will wait for packages that depend on this package to halt, before halting this package. Can be 0 through 4294, or no_timeout. The default is no_timeout.
• configured_node means Serviceguard will attempt to start the package on the first available node in the list you provide under node_name (page 170). • min_package_node means Serviceguard will start the package on whichever node in the node_name list has the fewest packages running at the time. • site_preferred means Serviceguard will try all the eligible nodes on the local SITE before failing the package over to a node on another SITE.
If you assign a priority, it must be unique in this cluster. A lower number indicates a higher priority, and a numerical priority is higher than no_priority. HP recommends assigning values in increments of 20 so as to leave gaps in the sequence; otherwise you may have to shuffle all the existing priorities when assigning priority to a new package. IMPORTANT: Because priority is a matter of ranking, a lower number indicates a higher priority (20 is a higher priority than 40).
6.1.4.21 weight_name, weight_value These parameters specify a weight for a package; this weight is compared to a node's available capacity (defined by the CAPACITY_NAME and CAPACITY_VALUE parameters in the cluster configuration file) to determine whether the package can run there. Both parameters are optional, but if weight_value is specified, weight_name must also be specified, and must come first. You can define up to four weights, corresponding to four different capacities, per cluster.
6.1.4.23 monitored_subnet_access In cross-subnet configurations, specifies whether each monitored_subnet is accessible on all nodes in the package’s node_name list (page 170), or only some. Valid values are PARTIAL, meaning that at least one of the nodes has access to the subnet, but not all; and FULL, meaning that all nodes have access to the subnet. The default is FULL, and it is in effect if monitored_subnet_access is not specified.
6.1.4.25 ip_subnet_node In a cross-subnet configuration, specifies which nodes an ip_subnet is configured on. If no ip_subnet_nodes are listed under an ip_subnet, it is assumed to be configured on all nodes in this package’s node_name list (page 170). Can be added or deleted while the package is running, with these restrictions: • The package must not be running on the node that is being added or deleted.
NOTE: Be careful when defining service run commands. Each run command is executed in the following way: • The cmrunserv command executes the run command. • Serviceguard monitors the process ID (PID) of the process the run command creates. • When the command exits, Serviceguard determines that a failure has occurred and takes appropriate action, which may include transferring the package to an adoptive node.
• generic_resource_name
• generic_resource_evaluation_type
• generic_resource_up_criteria
See the descriptions that follow. The following is an example of defining generic resource parameters:
generic_resource_name cpu_monitor
generic_resource_evaluation_type during_package_start
generic_resource_up_criteria <50
See the package configuration file for more examples. 6.1.4.33 generic_resource_evaluation_type Defines when the status of a generic resource is evaluated.
NOTE: Operators other than the ones mentioned above are not supported. This attribute does not accept more than one up criterion. For example, >> 10, << 100 are not valid.
fs_fsck_opt ""
fs_type "ext3"
A logical volume must be built on an LVM volume group. Logical volumes can be entered in any order. A gfs file system can be configured using only the fs_name, fs_directory, and fs_mount_opt parameters; see the configuration file for an example. Additional rules apply for gfs as explained under fs_type. NOTE: Red Hat GFS is not supported in Serviceguard A.11.20.00. For an NFS-imported file system, see the discussion under fs_name (page 181) and fs_server (page 182).
For an NFS-imported file system, the additional parameters required are fs_server, fs_directory, fs_type, and fs_mount_opt; see fs_server (page 182) for an example. CAUTION: Before configuring an NFS-imported file system into a package, make sure you have read and understood the rules and guidelines under “Planning for NFS-mounted File Systems” (page 101), and configured the cluster parameter CONFIGURED_IO_TIMEOUT_EXTENSION, described under “Cluster Configuration Parameters ” (page 86).
NOTE: A package using gfs (Red Hat Global File System, or GFS) cannot use any other file systems of a different type. vg and vgchange_cmd (page 180) are not valid for GFS file systems. For more information about using GFS with Serviceguard, see Clustering Linux Servers with the Concurrent Deployment of HP Serviceguard for Linux and Red Hat Global File Systems for RHEL5 at http://www.hp.com/go/linux-serviceguard-docs.
If more than one external_pre_script is specified, the scripts will be executed on package startup in the order they are entered into the package configuration file, and in the reverse order during package shutdown. See “About External Scripts” (page 122), as well as the comments in the configuration file, for more information and examples. 6.1.4.51 external_script The full pathname of an external script.
PATH
Specifies the path to be used by the script.
SUBNET
Specifies the IP subnets that are to be monitored for the package.
RUN_SCRIPT and HALT_SCRIPT
Use the full pathname of each script. These two parameters allow you to separate package run instructions and package halt instructions for legacy packages into separate scripts if you need to. In this case, make sure you include identical configuration information (such as node names, IP addresses, etc.) in both scripts.
• To generate a configuration file that contains all the optional modules:
cmmakepkg $SGCONF/pkg1/pkg1.conf
• To create a generic failover package (that could be applied without editing):
cmmakepkg -n pkg1 -m sg/failover $SGCONF/pkg1/pkg1.conf
NOTE: cmcheckconf and cmapplyconf check for missing mount points, volume groups, etc.
4. Halt the package.
5. Configure package IP addresses and application services.
6. Run the package and ensure that applications run as expected and that the package fails over correctly when services are disrupted. See “Testing the Package Manager ” (page 241).
• If this package will depend on another package or packages, enter values for dependency_name, dependency_condition, dependency_location, and optionally priority. See “About Package Dependencies” (page 107) for more information. NOTE: The package(s) this package depends on must already be part of the cluster configuration by the time you validate this package (via cmcheckconf; see “Verifying and Applying the Package Configuration” (page 189)); otherwise validation will fail.
• If the package needs to mount LVM volumes to file systems (other than Red Hat GFS; see fs_type (page 182)), use the vg parameters to specify the names of the volume groups to be activated, and select the appropriate vgchange_cmd. Use the fs_ parameters (page 181) to specify the characteristics of file systems and how and where to mount them. See the comments in the FILESYSTEMS section of the configuration file for more information and examples.
• Configured resources are available on cluster nodes.
• File systems and volume groups are valid.
• Services are executable.
• Any package that this package depends on is already part of the cluster configuration.
For more information, see the manpage for cmcheckconf (1m) and “Checking Cluster Components” (page 201). When cmcheckconf has completed without errors, apply the package configuration, for example:
cmapplyconf -P $SGCONF/pkg1/pkg1.conf
the MD /dev/md0, and for some reason /dev/hpdev/my_disk2 becomes inaccessible. If the email_id specified in the package configuration file is sguser@xyz.com, the following e-mail notification is sent to sguser@xyz.com:
Date: Tue, 9 Oct 2012 23:18:01 -0700
From: root
Message-Id: <201210100618.q9A6I1d9023167@node1.hp.com>
To: sguser@xyz.com
7 Cluster and Package Maintenance This chapter describes the cmviewcl command, then shows how to start and halt a cluster or an individual node, how to perform permanent reconfiguration, and how to start, halt, move, and modify packages during routine maintenance of the cluster.
• starting - The cluster is in the process of determining its active membership. At least one cluster daemon is running. • unknown - The node on which the cmviewcl command is issued cannot communicate with other nodes in the cluster. 7.1.4 Node Status and State The status of a node is either up (active as a member of the cluster) or down (inactive in the cluster), depending on whether its cluster daemon is running or not.
• detached - A package is said to be detached from the cluster or node where it was running, when the cluster or node is halted with the -d option. Serviceguard no longer monitors this package. The last known status of the package before it was detached from the cluster is up.
• unknown - Serviceguard could not determine the status at the time cmviewcl was run. A system multi-node package is up when it is running on all the active cluster nodes.
7.1.6 Package Switching Attributes cmviewcl shows the following package switching information: • AUTO_RUN: Can be enabled or disabled. For failover packages, enabled means that the package starts when the cluster starts, and Serviceguard can switch the package to another node in the event of failure. For system multi-node packages, enabled means an instance of the package can start on a new node joining the cluster (disabled means it will not).
Failover packages can also be configured with one of two values for the failback_policy parameter (page 173), and these are also displayed in the output of cmviewcl -v: • automatic: Following a failover, a package returns to its primary node when the primary node becomes available again. • manual: Following a failover, a package will run on the adoptive node until moved back to its original node by a system administrator. 7.1.
NOTE: The Script_Parameters section of the PACKAGE output of cmviewcl shows the Subnet status only for the node that the package is running on. In a cross-subnet configuration, in which the package may be able to fail over to a node on another subnet, that other subnet is not shown (see “Cross-Subnet Configurations” (page 27)). 7.1.11.
UNOWNED_PACKAGES

PACKAGE   STATUS   STATE     AUTO_RUN   NODE
pkg2      down     unowned   disabled   unowned

  Policy_Parameters:
  POLICY_NAME   CONFIGURED_VALUE
  Failover      configured_node
  Failback      manual

  Script_Parameters:
  ITEM               STATUS   NODE_NAME   NAME
  Service            down                 service2
  Generic Resource   up       ftsys9      sfm_disk1
  Subnet             up                   15.13.168.0
  Generic Resource   up       ftsys10     sfm_disk1

  Node_Switching_Parameters:
  NODE_TYPE   STATUS   SWITCHING   NAME
  Primary     up       enabled     ftsys9
  Alternate   up       enabled     ftsys10
Policy_Parameters:
POLICY_NAME   CONFIGURED_VALUE
Failover      configured_node
Failback      manual

Script_Parameters:
ITEM               STATUS   MAX_RESTARTS   RESTARTS   NAME
Service            up       0              0          service2
Service            up       0              0          sfm_disk_monitor
Subnet             up                                 15.13.168.0
Generic Resource   up                                 sfm_disk

Node_Switching_Parameters:
NODE_TYPE   STATUS   SWITCHING   NAME
Primary     up       enabled     ftsys10 (current)
Alternate   up       enabled     ftsys9

NODE ftsys10   STATUS up   STATE running

Network_Parameters:
INTERFACE   STATUS   NAME
PRIMARY     up       eth0
PRIMARY     up       eth1
7.1.11.
7.1.11.7 Viewing Information about Unowned Packages The following example shows packages that are currently unowned, that is, not running on any configured node.
Table 10 Verifying Cluster Components (continued) Component (Context) Tool or Command; More Information Comments Quorum Server (cluster) cmcheckconf (1m), cmapplyconf (1m). Commands check that the quorum server, if used, is running and all nodes are authorized to access it; and, if more than one IP address is specified, that the quorum server is reachable from all nodes through both the IP addresses.
7.2 Managing the Cluster and Nodes This section describes the following tasks: • Starting the Cluster When all Nodes are Down (page 203) • Adding Previously Configured Nodes to a Running Cluster (page 204) • Removing Nodes from Participation in a Running Cluster (page 204) • Halting the Entire Cluster (page 204) • Automatically Restarting the Cluster (page 205) • Halting a Node or the Cluster while Keeping Packages Running (page 205) In Serviceguard A.11.
7.2.2 Adding Previously Configured Nodes to a Running Cluster You can use Serviceguard Manager, or HP Serviceguard commands as shown, to bring a configured node up within a running cluster. Use the cmrunnode command to add one or more nodes to an already running cluster. Any node you add must already be a part of the cluster configuration. The following example adds node ftsys8 to the cluster that was just started with only nodes ftsys9 and ftsys10:
cmrunnode ftsys8
the cluster to halt even when packages are running. This command can be issued from any running node. Example: cmhaltcl -f -v This halts all the cluster nodes. 7.2.5 Automatically Restarting the Cluster You can configure your cluster to automatically restart after an event, such as a long-term power failure, which brought down all nodes in the cluster. This is done by setting AUTOSTART_CMCLD to 1 in the $SGAUTOSTART file (see “Understanding the Location of Serviceguard Files” (page 129)). 7.
• Extended Distance Cluster (serviceguard-xdc) supports LAD for modular failover packages. For more information, see “Creating a serviceguard-xdc Modular Package” in chapter 5 of HP Serviceguard Extended Distance Cluster for Linux A.11.20.10 Deployment Guide at http:// www.hp.com/go/linux-serviceguard-docs. • Live Application Detach is supported only with modular failover packages and modular multi-node packages.
7.3.3 Additional Points To Note Keep the following points in mind: • When packages are detached, they continue to run, but without high availability protection. Serviceguard does not detect failures of components of detached packages, and packages are not failed over. IMPORTANT: This means that you will need to detect any errors that occur while the package is detached, and take corrective action by running cmhaltpkg to halt the detached package and cmrunpkg (1m) to restart the package on another node.
2. Halt any packages that do not qualify for Live Application Detach, such as legacy and system multi-node packages. For example:
cmhaltpkg -n node1 legpak1 legpak2
NOTE: If you do not do this, the cmhaltnode in the next step will fail.
3. Halt the node with the -d (detach) option:
cmhaltnode -d node1
NOTE: -d and -f are mutually exclusive. See cmhaltnode (1m) for more information.
2. Halt the cluster, detaching the remaining packages:
cmhaltcl -d
3. Upgrade the heartbeat networks as needed.
4. Restart the cluster, automatically re-attaching pkg6 through pkgn and starting any other packages that have auto_run (page 170) set to yes in their package configuration file:
cmruncl
5. Start the remaining packages; for example:
cmmodpkg -e pkg1 pkg2 pkg3 pkg4 pkg5
7.
Halting a package has a different effect from halting the node. When you halt the node, its packages may switch to adoptive nodes (assuming that switching is enabled for them); when you halt the package, it is disabled from switching to another node, and must be restarted manually on another node or on the same node. System multi-node packages run on all cluster nodes simultaneously; halting these packages stops them running on all nodes.
Now suppose you run the command cmhaltpkg pkgA. If a failure is detected in non-sg-module2, the package halt process is aborted at this point and the package is moved to the halt_aborted state. The command exits and does not proceed further to halt the sg-module3 and sg-module4 modules. After fixing the error, if you re-run cmhaltpkg pkgA, halt begins from sg-module1 and proceeds.
You can change package switching behavior either temporarily or permanently using Serviceguard commands. To temporarily disable switching to other nodes for a running package, use the cmmodpkg command. For example, if pkg1 is currently running, and you want to prevent it from starting up on another node, enter the following: cmmodpkg -d pkg1 This does not halt the package, but will prevent it from starting up elsewhere.
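For example, to re-enable switching for the package later, or to allow it to run on a particular node again, you might use commands like these (a sketch; see the cmmodpkg (1m) manpage):
cmmodpkg -e pkg1              # re-enable package switching for pkg1
cmmodpkg -e -n ftsys10 pkg1   # allow pkg1 to run on node ftsys10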
NOTE: But a failure in the package control script will cause the package to fail. The package will also fail if an external script (or pre-script) cannot be executed or does not exist. • The package will not be automatically failed over, halted, or started. • A package in maintenance mode still has its configured (or default) weight, meaning that its weight, if any, is counted against the node's capacity; this applies whether the package is up or down.
• Generic resources configured in a package must be available (status 'up') before taking the package out of maintenance mode. • You cannot do online configuration as described under “Reconfiguring a Package” (page 232). • You cannot configure new dependencies involving this package; that is, you cannot make it dependent on another package, or make another package depend on it. See also “Dependency Rules for a Package in Maintenance Mode or Partial-Startup Maintenance Mode ” (page 214).
7.5.2.1 Procedure Follow these steps to perform maintenance on a package's networking components. In this example, we'll call the package pkg1 and assume it is running on node1. 1. Place the package in maintenance mode: cmmodpkg -m on -n node1 pkg1 2. Perform maintenance on the networks or resources and test manually that they are working correctly. NOTE: If you now run cmviewcl, you'll see that the STATUS of pkg1 is up and its STATE is maintenance. 3.
7. If everything is working as expected, bring the package out of maintenance mode: cmmodpkg -m off pkg1 8. Restart the package: cmrunpkg pkg1 7.5.3.2 Excluding Modules in Partial-Startup Maintenance Mode In the example above, we used cmrunpkg -m to run all the modules up to and including package_ip, but none of those after it. But you might want to run the entire package apart from the module whose components you are going to work on.
Table 11 Types of Changes to the Cluster Configuration (continued) Change to the Cluster Configuration Required Cluster State Reconfigure IP addresses for a NIC used by the cluster Must delete the interface from the cluster configuration, reconfigure it, then add it back into the cluster configuration. See “What You Must Keep in Mind” (page 221). Cluster can be running throughout. Change NETWORK_POLLING_INTERVAL Cluster can be running.
• cmhaltnode [–t] [–f] • cmrunnode [–t] • cmhaltpkg [–t] • cmrunpkg [–t] [-n node_name] • cmmodpkg { -e [-t] | -d } [-n node_name] • cmruncl –v [–t] NOTE: You cannot use the -t option with any command operating on a package in maintenance mode; see “Maintaining a Package: Maintenance Mode” (page 212). For more information about these commands, see their respective manpages.
1. Use cmviewcl -v -f line to write the current cluster configuration out to a file.
2. Edit the file to include the events or changes you want to preview.
3. Using the file from Step 2 as input, run cmeval to preview the results of the changes.
For example, assume that pkg1 is a high-priority package whose primary node is node1, and which depends on pkg2 and pkg3 to be running on the same node. These lower-priority packages are currently running on node2.
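A sketch of this workflow (the file name is hypothetical):
cmviewcl -v -f line > clstate.in
# edit clstate.in to reflect the planned changes, for example
# marking node1 as available so pkg1 could fail back to it
cmeval -v clstate.in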
If there are also packages that depend upon that node, the package configuration must also be modified to delete the node. This all must be done in one configuration request (cmapplyconf command). • The access control list for the cluster can be changed while the cluster is running. Changes to the package configuration are described in a later section. The following sections describe how to perform dynamic reconfiguration tasks. 7.6.3.
1. Use the following command to store a current copy of the existing cluster configuration in a temporary file:
cmgetconf -c cluster1 temp.conf
2. Specify the new set of nodes to be configured (omitting ftsys10) and generate a template of the new configuration:
cmquerycl -C clconfig.conf -c cluster1 -n ftsys8 -n ftsys9
3. Edit the file clconfig.conf to check the information about the nodes that remain in the cluster.
4.
• You cannot change the designation of an existing interface from HEARTBEAT_IP to STATIONARY_IP, or vice versa, without also making the same change to all peer network interfaces on the same subnet on all other nodes in the cluster.
1. Run cmquerycl to get a cluster configuration template file that includes networking information for interfaces that are available to be added to the cluster configuration: cmquerycl -c cluster1 -C clconfig.conf NOTE: As of Serviceguard A.11.18, cmquerycl -c produces output that includes commented-out entries for interfaces that are not currently part of the cluster configuration, but are available. The networking portion of the resulting clconfig.
1. Halt any package that uses this subnet and delete the corresponding networking information (monitored_subnet, ip_subnet, ip_address; see the descriptions for these parameters starting with monitored_subnet (page 175)). See “Reconfiguring a Package on a Running Cluster ” (page 232) for more information. 2. Run cmquerycl to get the cluster configuration file: cmquerycl -c cluster1 -C clconfig.conf 3.
7.7 Configuring a Legacy Package IMPORTANT: You can still create a new legacy package. If you are using a Serviceguard Toolkit such as Serviceguard NFS Toolkit, consult the documentation for that product. Otherwise, use this section to maintain and re-work existing legacy packages rather than to create new ones.
3. Apply the configuration.
4. Run the package and ensure that it can be moved from node to node.
5. Halt the package.
6. Configure package IP addresses and application services in the control script.
7. Distribute the control script to all nodes.
8. Run the package and ensure that applications run as expected and that the package fails over correctly when services are disrupted. 7.7.1.2.
IMPORTANT: Each subnet specified here must already be specified in the cluster configuration file via the NETWORK_INTERFACE parameter and either the HEARTBEAT_IP or STATIONARY_IP parameter. See “Cluster Configuration Parameters ” (page 86) for more information. See also “Stationary and Relocatable IP Addresses and Monitored Subnets” (page 59) and monitored_subnet (page 175). IMPORTANT: (page 231).
7.7.2.1 Customizing the Package Control Script You need to customize as follows; see the relevant entries under “Package Parameter Explanations” (page 168) for more discussion. • Update the PATH statement to reflect any required paths needed to start your services. • Specify the Remote Data Replication Method and Software RAID Data Replication method if necessary. CAUTION: If you are not using the serviceguard-xdc or CLX products, do not modify the REMOTE DATA REPLICATION DEFINITION section.
: # do nothing instruction, because a function must contain some command. date >> /tmp/pkg1.datelog echo 'Starting pkg1' >> /tmp/pkg1.datelog test_return 51 } # This function is a place holder for customer defined functions. # You should define all actions you want to happen here, before the service is # halted. function customer_defined_halt_cmds { # ADD customer defined halt commands. : # do nothing instruction, because a function must contain some command. date >> /tmp/pkg1.
The following items are checked (whether you use Serviceguard Manager or cmcheckconf command): • Package name is valid, and at least one NODE_NAME entry is included. • There are no duplicate parameter entries. • Values for parameters are within permitted ranges. • Run and halt scripts exist on all nodes in the cluster and are executable. • Run and halt script timeouts are less than 4294 seconds. • Configured resources are available on cluster nodes.
7.7.5 Configuring Cross-Subnet Failover To configure a legacy package to fail over across subnets (see “Cross-Subnet Configurations” (page 27)), you need to do some additional configuration. Suppose that you want to configure a package, pkg1, so that it can fail over among all the nodes in a cluster comprising NodeA, NodeB, NodeC, and NodeD. NodeA and NodeB use subnet 15.244.65.0, which is not used by NodeC and NodeD; and NodeC and NodeD use subnet 15.244.56.0, which is not used by NodeA and NodeB.
7.7.5.3.1 Control-script entries for nodeA and nodeB
IP[0]=15.244.65.82
SUBNET[0]=15.244.65.0
IP[1]=15.244.65.83
SUBNET[1]=15.244.65.0
7.7.5.3.2 Control-script entries for nodeC and nodeD
IP[0]=15.244.56.100
SUBNET[0]=15.244.56.0
IP[1]=15.244.56.101
SUBNET[1]=15.244.56.0
7.
See “Allowable Package States During Reconfiguration” to determine whether this step is needed.
2. If it is not already available, you can obtain a copy of the package's configuration file by using the cmgetconf command, specifying the package name:
cmgetconf -p pkg1 pkg1.conf
3. Edit the package configuration file.
IMPORTANT: Restrictions on package names, dependency names, and service names have become more stringent as of A.11.18.
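After editing, you would typically re-verify and re-apply the package configuration, continuing the example above:
cmcheckconf -P pkg1.conf
cmapplyconf -P pkg1.conf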
To create the package, follow the steps in the chapter Chapter 6: “Configuring Packages and Their Services ” (page 163). Then use a command such as the following to verify the configuration of the newly created pkg1 on a running cluster: cmcheckconf -P $SGCONF/pkg1/pkg1conf.conf Use a command such as the following to distribute the new package configuration to all nodes in the cluster: cmapplyconf -P $SGCONF/pkg1/pkg1conf.
CAUTION: Be extremely cautious about changing a package's configuration while the package is running.
If you reconfigure a package online (by executing cmapplyconf on a package while the package itself is running) it is possible that the package will fail, even if the cmapplyconf succeeds, validating the changes with no errors. For example, if a file system is added to the package while the package is running, cmapplyconf does various checks to verify that the file system and its mount point exist.
Table 12 Types of Changes to Packages (continued) Change to the Package Required Package State Add or delete a service: legacy package Package must not be running. Change service_restart: modular package Package can be running. Change SERVICE_RESTART: legacy package Package must not be running. Add or remove a SUBNET (in control script) : legacy package Package must not be running. (Also applies to cross-subnet configurations.) Package must not be running.
Table 12 Types of Changes to Packages (continued)

Change to the Package: Change a file system (modular package)
Required Package State: Package should not be running (unless you are only changing fs_umount_opt). Changing file-system options other than fs_umount_opt may cause problems because the file system must be unmounted (using the existing fs_umount_opt) and remounted with the new options; the CAUTION under “Remove a file system: modular package” applies in this case as well.
Table 12 Types of Changes to Packages (continued) Change to the Package Required Package State Add a generic resource of evaluation type before_package_start Package can be running if the status of generic resource is 'up', else package must be halted. Remove a generic resource Package can be running. Change the generic_resource_evaluation_type Package can be running if the status of generic resource is 'up'. Not allowed if changing the generic_resource_evaluation_type causes the package to fail.
The typical corrective actions to take in the event of a transfer of package include: • Determining when a transfer has occurred. • Determining the cause of a transfer. • Repairing any hardware failures. • Correcting any software problems. • Restarting nodes. • Transferring packages back to their original nodes. • Enabling package switching. 7.
8 Troubleshooting Your Cluster This chapter describes how to verify cluster operation, how to review cluster status, how to add and replace hardware, and how to solve some typical cluster problems.
You can also test the package manager using generic resources. Perform the following procedure for each package on the cluster:
1. Obtain the generic resource that is configured in a package by entering:
cmviewcl -v -p <package_name>
2. Set the status of the generic resource to DOWN using the following command:
cmsetresource -r <resource_name> -s down
3. To view the package status, enter:
cmviewcl -v
The package should be running on the specified adoptive node.
4.
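For example, assuming a package named pkg1 with a generic resource named sfm_disk (both names hypothetical):
cmviewcl -v -p pkg1
cmsetresource -r sfm_disk -s down
cmviewcl -v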
• All cables • Disk interface cards Some monitoring can be done through simple physical inspection, but for the most comprehensive monitoring, you should examine the system log file (/var/log/messages) periodically for reports on all configured HA devices. The presence of errors relating to a device will show the need for maintenance. 8.3 Replacing Disks The procedure for replacing a faulty disk mechanism depends on the type of disk configuration you are using.
part of the recovery. Use the $SGCONF/scripts/sg/pr_cleanup script to do this. (The script is also in $SGCONF/bin/. See “Understanding the Location of Serviceguard Files” (page 129) for the locations of Serviceguard directories on various Linux distributions.
7. If necessary, add the node back into the cluster using the cmrunnode command. (You can omit this step if the node is configured to join the cluster automatically.) Now Serviceguard will detect that the MAC address (LLA) of the card has changed from the value stored in the cluster binary configuration file, and it will notify the other nodes in the cluster of the new MAC address. The cluster will operate normally after this.
4. Start the quorum server as follows: • Use the init q command to run the quorum server. Or • 5. 6. Create a package in another cluster for the Quorum Server, as described in the Release Notes for your version of Quorum Server. They can be found at http://www.hp.com/ go/hpux-serviceguard-docs (Select (HP Serviceguard Quorum Server Software). All nodes in all clusters that were using the old quorum server will connect to the new quorum server.
TX packets:5741486 errors:1 dropped:0 overruns:1 carrier:896
collisions:26706 txqueuelen:100
Interrupt:9 Base address:0xdc00

eth1      Link encap:Ethernet  HWaddr 00:50:DA:64:8A:7C
          inet addr:192.168.1.106  Bcast:192.168.1.255  Mask:255.255.255.0
Dec 14 14:34:45 star04 cmcld[2048]: Examine the file /usr/local/cmcluster/pkg5/pkg5_run.log for more details.
The following is an example of a successful package starting:
Dec 14 14:39:27 star04 CM-CMD[2096]: cmruncl
Dec 14 14:39:27 star04 cmcld[2098]: Starting cluster management protocols.
It doesn't check: • The correct setup of the power circuits. • The correctness of the package configuration script. 8.7.6 Reviewing the LAN Configuration The following networking commands can be used to diagnose problems: • ifconfig can be used to examine the LAN configuration. This command lists all IP addresses assigned to each LAN interface card. • arp -a can be used to check the arp tables. • cmscancl can be used to test IP-level connectivity between network interfaces in the cluster.
Unable to halt the detached package on node as the node is not reachable. Retry once the node is reachable. In such a case, the node should be powered up and be accessible. You must then rerun the cmhaltpkg command. 8.8.3 Cluster Re-formations Caused by Temporary Conditions You may see Serviceguard error messages, such as the following, which indicate that a node is having problems: Member node_name seems unhealthy, not receiving heartbeats from it.
For more information, including requirements and recommendations, see the MEMBER_TIMEOUT discussion under “Cluster Configuration Parameters ” (page 86). 8.8.5 System Administration Errors There are a number of errors you can make when configuring Serviceguard that will not show up when you start the cluster.
specified in the package control script appear in the ifconfig output under the inet addr: in the ethX:Y block, use cmmodnet to remove them (see the example below):
cmmodnet -r -i <ip_address> <subnet>
where <ip_address> is the address indicated above and <subnet> is the result of masking the <ip_address> with the mask found in the same line as the inet address in the ifconfig output.
3. Ensure that package volume groups are deactivated. First unmount any package logical volumes which are being used for file systems.
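For example, assuming a stale relocatable address 192.168.1.12 on subnet 192.168.1.0 (both values hypothetical):
cmmodnet -r -i 192.168.1.12 192.168.1.0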
Unable to set client version at quorum server 192.6.7.2: reply timed out Probe of quorum server 192.6.7.2 timed out These messages could be an indication of an intermittent network problem; or the default quorum server timeout may not be sufficient. You can set the QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the MEMBER_TIMEOUT value. See “Cluster Configuration Parameters ” (page 86)for more information about these parameters.
8.10 Troubleshooting Serviceguard Manager The following section describes how to troubleshoot issues related to Serviceguard Manager.
Problem: “Service Temporarily Unavailable” when trying to launch Serviceguard Manager.
Solution: Ensure that a loopback address is mentioned in the /etc/hosts file:
127.0.0.1 localhost.localdomain localhost
Problem: The Tomcat process has not started.
Solution: Run the Tomcat startup command:
/opt/hp/hpsmh/tomcat/bin/startup.sh
The following message is displayed when Serviceguard 1.
A Designing Highly Available Cluster Applications This appendix describes how to create or port applications for high availability, with emphasis on the following topics: • Automating Application Operation • Controlling the Speed of Application Failover (page 258) • Designing Applications to Run on Multiple Systems (page 261) • Restoring Client Connections (page 264) • Handling Application Failures (page 265) • Minimizing Planned Downtime (page 266) Designing for high availability means reducing
• Minimize the reentry of data. • Engineer the system for reserve capacity to minimize the performance degradation experienced by users. A.1.2 Define Application Startup and Shutdown Applications must be restartable without manual intervention. If the application requires a switch to be flipped on a piece of hardware, then automated restart is impossible. Procedures for application startup, shutdown and monitoring must be created so that the HA software can perform these functions automatically.
running the application. After failover, if these data disks are filesystems, they must go through filesystems recovery (fsck) before the data can be accessed. To help reduce this recovery time, the smaller these filesystems are, the faster the recovery will be. Therefore, it is best to keep anything that can be replicated off the data filesystem. For example, there should be a copy of the application executables on each system rather than having one copy of the executables on a shared filesystem.
the beginning. This capability makes the application more robust and reduces the visibility of a failover to the user. A common example is a print job. Printer applications typically schedule jobs. When that job completes, the scheduler goes on to the next job.
A.2.7 Design for Replicated Data Sites Replicated data sites are a benefit for both fast failover and disaster recovery. With replicated data, data disks are not shared between systems. There is no data recovery that has to take place. This makes the recovery time faster. However, there may be performance trade-offs associated with replicating data. There are a number of ways to perform data replication, which should be fully investigated by the application designer.
A.3.1.1 Obtain Enough IP Addresses Each application receives a relocatable IP address that is separate from the stationary IP address assigned to the system itself. Therefore, a single system might have many IP addresses, one for itself and one for each of the applications that it normally runs. Therefore, IP addresses in a given subnet range will be consumed faster than without high availability. It might be necessary to acquire additional IP addresses.
over time if the application migrates. Applications that use gethostname() to determine the name for a call to gethostbyname(3) should also be avoided for the same reason. Also, the gethostbyaddr() call may return different answers over time if called with a stationary IP address. Instead, the application should always refer to the application name and relocatable IP address rather than the hostname and stationary IP address.
With UDP datagram sockets, however, there is a problem. The client may connect to multiple servers utilizing the relocatable IP address and sort out the replies based on the source IP address in the server’s response message. However, the source IP address given in this response will be the stationary IP address rather than the relocatable application IP address.
give up after 2 minutes and go for coffee and don't come back for 28 minutes, the perceived downtime is actually 30 minutes, not 5. Factors to consider are the number of reconnection attempts to make, the frequency of reconnection attempts, and whether or not to notify the user of connection loss. There are a number of strategies to use for client reconnection: • Design clients which continue to try to reconnect to their failed server.
Ideally, if one process fails, the other processes can wait a period of time for that component to come back online. This is true whether the component is on the same system or a remote system. The failed component can be restarted automatically on the same system and rejoin the waiting processing and continue on. This type of failure can be detected and restarted within a few seconds, so the end user would never know a failure occurred.
The trade-off is that the application software must operate with different revisions of the software. In the above example, the database server might be at revision 5.0 while the some of the application servers are at revision 4.0. The application must be designed to handle this type of situation. A.6.1.2 Do Not Change the Data Layout Between Releases Migration of the data to a new format can be very time intensive. It also almost guarantees that rolling upgrade will not be possible.
B Integrating HA Applications with Serviceguard The following is a summary of the steps you should follow to integrate an application into the Serviceguard environment: 1. Read the rest of this book, including the chapters on cluster and package configuration, and the appendix “Designing Highly Available Cluster Applications.” 2.
B.1.1 Defining Baseline Application Behavior on a Single System 1. Define a baseline behavior for the application on a standalone system: • Install the application, database, and other required resources on one of the systems. Be sure to follow Serviceguard rules in doing this: ◦ Install all shared data on separate external volume groups. ◦ Use a Journaled filesystem (JFS) as appropriate. • Perform some sort of standard test to ensure the application is running correctly.
# cmhaltpkg pkg1
# cmrunpkg -n node1 pkg1
# cmmodpkg -e pkg1
• Fail one of the systems. For example, turn off the power on node 1. Make sure the package starts up on node 2.
• Repeat failover from node 2 back to node 1.
2. Be sure to test all combinations of application load during the testing.
3. Repeat the failover processes under different application states such as heavy user load versus no user load, batch jobs versus online transactions, etc.
C Blank Planning Worksheets This appendix reprints blank versions of the planning worksheets described in the “Planning” chapter. You can duplicate any of these worksheets that you find useful and fill them in as a part of the planning process.
Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ Disk Unit __________________________ Power Supply _______________________ ============================================================================ Tape Backup Power: Tape Unit __________________________
Physical Volume Name: _________________ Physical Volume Name: _________________ Physical Volume Name: _________________ ============================================================================= Volume Group Name: ___________________________________ Physical Volume Name: _________________ Physical Volume Name: _________________ Physical Volume Name: _________________ C.
Package AutoRun Enabled? ______ Node Failfast Enabled? ________ Failover Policy:_____________ Failback_policy:___________________________________ Access Policies: User:_________________ From node:_______ Role:_____________________________ User:_________________ From node:_______ Role:______________________________________________ Log level____ Log file:_______________________________________________________________________________________ Priority_____________ Successor_halt_timeout____________ dependency_n
D IPv6 Network Support This appendix describes some of the characteristics of IPv6 network addresses, specifically: • IPv6 Address Types • Network Configuration Restrictions (page 280) • Configuring IPv6 on Linux (page 280) D.1 IPv6 Address Types Several IPv6 types of addressing schemes are specified in the RFC 2373 (IPv6 Addressing Architecture). IPv6 addresses are 128-bit identifiers for interfaces and sets of interfaces. There are various address formats for IPv6 defined by the RFC 2373.
D.1.2 IPv6 Address Prefix IPv6 Address Prefix is similar to CIDR in IPv4 and is written in CIDR notation. An IPv6 address prefix is represented by the notation:
IPv6-address/prefix-length
where ipv6-address is an IPv6 address in any notation listed above and prefix-length is a decimal value representing how many of the leftmost contiguous bits of the address comprise the prefix.
Example: fec0:0:0:1::1234/64
The first 64 bits of the address, fec0:0:0:1, form the address prefix.
Table 16
80 bits                                  16 bits   32 bits
zeros                                    FFFF      IPv4 address

Example: ::ffff:192.168.0.1
D.1.4.3 Aggregatable Global Unicast Addresses The global unicast addresses are globally unique IPv6 addresses. This address format is very well defined in the RFC 2374 (An IPv6 Aggregatable Global Unicast Address Format). The format is:
Table 17
3    13       8     24       16       64 bits
FP   TLA ID   RES   NLA ID   SLA ID   Interface ID
where FP = Format prefix. Value of this is “001” for Aggregatable Global unicast addresses.
“FF” at the beginning of the address identifies the address as a multicast address. The “flags” field is a set of 4 flags “000T”. The higher order 3 bits are reserved and must be zero. The last bit ‘T’ indicates whether it is permanently assigned or not. A value of zero indicates that it is permanently assigned otherwise it is a temporary assignment. The “scop” field is a 4-bit field which is used to limit the scope of the multicast group.
D.3.1 Enabling IPv6 on Red Hat Linux
Add the following lines to /etc/sysconfig/network:
NETWORKING_IPV6=yes    # Enable global IPv6 initialization
IPV6FORWARDING=no      # Disable global IPv6 forwarding
IPV6_AUTOCONF=no       # Disable global IPv6 autoconfiguration
IPV6_AUTOTUNNEL=no     # Disable automatic IPv6 tunneling
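These settings enable IPv6 globally; individual interfaces are then given addresses in their per-interface configuration files. The following is a minimal sketch of a static IPv6 address in /etc/sysconfig/network-scripts/ifcfg-eth0 on Red Hat; the interface name and the address are illustrative, so adapt both to your network:

DEVICE=eth0
ONBOOT=yes
IPV6INIT=yes                  # Enable IPv6 on this interface
IPV6ADDR=fec0:0:0:1::10/64    # Static IPv6 address, with prefix length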
D.3.5 Configuring a Channel Bonding Interface with Persistent IPv6 Addresses on SUSE
Configure the following parameters in /etc/sysconfig/network/ifcfg-bond0:
BOOTPROTO=static
BROADCAST=10.0.2.255
IPADDR=10.0.2.10
NETMASK=255.255.0.0
NETWORK=10.0.2.0
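The parameters above establish the bond itself with its IPv4 address. Persistent IPv6 addresses can then be added as additional address entries in the same file. A minimal sketch, in which the _ipv6 label and the address are illustrative (on SUSE, each IPADDR_<label> entry in an ifcfg file configures an extra address on the interface; verify the exact syntax against the ifcfg(5) manpage for your release):

IPADDR_ipv6=fec0:0:0:1::10/64    # Additional persistent IPv6 address on bond0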
E Using Serviceguard Manager
HP Serviceguard Manager is a web-based HP System Management Homepage (HP SMH) tool that replaces the functionality of the earlier Serviceguard management tools. Serviceguard Manager allows you to monitor, administer, and configure a Serviceguard cluster from any system with a supported web browser. The Serviceguard Manager Main Page provides a summary of the health of the cluster, including the status of each node and its packages.
1. Enter the standard URL http://<hostname>:2301/, where <hostname> is the name of the server. For example: http://clusternode1.cup.hp.com:2301/
2. When the System Management Homepage login screen appears, enter your login credentials and click Sign In. The System Management Homepage for the selected server appears.
3. From the Serviceguard Cluster box, click the name of the cluster.
NOTE: If a cluster is not yet configured, you will not see the Serviceguard Cluster section on this screen.
NOTE: Serviceguard Manager can be launched by HP Systems Insight Manager version 5.10 or later if Serviceguard Manager is installed on an HP Systems Insight Manager Central Management Server. For a Serviceguard A.11.19 cluster, Systems Insight Manager will attempt to launch Serviceguard Manager B.02.00 from one of the nodes in the cluster; for a Serviceguard A.11.18 cluster, Systems Insight Manager will attempt to launch Serviceguard Manager B.01.01 from one of the nodes in the cluster.
F Maximum and Minimum Values for Parameters
Table 21 shows the range of possible values for cluster configuration parameters.

Table 21 Minimum and Maximum Values of Cluster Configuration Parameters

Cluster Parameter    Minimum Value                 Maximum Value                 Default Value
Member Timeout       See MEMBER_TIMEOUT under      See MEMBER_TIMEOUT under
                     “Cluster Configuration        “Cluster Configuration
                     Parameters” in Chapter 4.     Parameters” in Chapter 4.
G Monitoring Script for Generic Resources
Monitoring scripts are written by the end user and must contain the core logic to monitor a resource and set the status of a generic resource. These scripts are started as part of package startup.
• You can set the status of a simple resource, or the value of an extended resource, using the cmsetresource(1m) command.
• You can define the monitoring interval in the script.
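As an illustration, the core of such a script is typically a loop that performs a health check and reports the result with cmsetresource. The following is a minimal sketch, not a production script: the resource name sfm_disk, the ping-based health check, and the 30-second interval are all illustrative, and you should verify the cmsetresource options against the cmsetresource(1m) manpage for your release:

#!/bin/bash
# Illustrative monitoring loop for a simple generic resource.
RESOURCE=sfm_disk    # hypothetical generic resource name
INTERVAL=30          # monitoring interval, in seconds

while true
do
    # Replace this ping with the real health check for your resource.
    if ping -c 1 -w 2 10.0.2.1 > /dev/null 2>&1
    then
        cmsetresource -r $RESOURCE -s up      # report the resource as up
    else
        cmsetresource -r $RESOURCE -s down    # report the resource as down
    fi
    sleep $INTERVAL
done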
For resources of evaluation_type before_package_start:
• Monitoring scripts can also be launched outside of the Serviceguard environment, for example by init or rc scripts; in that case, Serviceguard does not monitor them.
• The monitoring scripts for all before_package_start resources in a cluster can be configured in a single multi-node package, using the services functionality; any package that requires one of these resources then names the generic resource in its package configuration file, as the example below shows.
generic_resource_name               lan1
generic_resource_evaluation_type    before_package_start

dependency_name                     generic_resource_monitors
dependency_condition                generic_resource_monitors = up
dependency_location                 same_node

Thus, the monitoring scripts for all generic resources of type before_package_start are configured in one multi-node package, and any package that requires such a resource simply configures the generic resource name.
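While testing a monitor, you can read a generic resource's current status back with cmgetresource. A minimal check, assuming the resource is named lan1 as above (verify the options against the cmgetresource(1m) manpage for your release):

# cmgetresource -r lan1

A template for the monitoring script itself follows.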
# * ---------------------------------------------------------------- *
# * The following utility functions are sourced in from $SG_UTILS     *
# * ($SGCONF/scripts/mscripts/utils.sh) and available for use:        *
# *                                                                   *
# *     sg_log                                                        *
# *                                                                   *
# * By default, only log messages with a log level of 0 will          *
# * be output to the log file.                                        *
function start_command
{
    sg_log 5 "start_command"

    # ADD your service start steps here

    return 0
}

#########################################################################
#
# stop_command
#
# This function should define actions to take when the package halts
#
#########################################################################
function stop_command
{
    sg_log 5 "stop_command"

    # ADD your halt steps here

    exit 1
}

################
# main routine
################

sg_log 5 "customer defined monitor script"

#########################################################################
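In the package configuration file, a monitoring script such as this is typically run as a package service, so that Serviceguard starts it with the package and halts it when the package halts. A minimal sketch, in which the service name and script path are illustrative:

service_name             monitor_lan1
service_cmd              $SGCONF/scripts/mscripts/monitor_lan1.sh
service_halt_timeout     10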
H HP Serviceguard Toolkit for Linux
The HP Serviceguard Toolkits, such as the Contributed Toolkit, NFS, and Oracle Toolkits, are used to integrate applications such as Apache, MySQL, NFS, and Oracle Database with the Serviceguard for Linux environment. The Toolkit documentation describes how to customize a package for your needs. For more information, see HP Serviceguard Contributed Toolkit Suite on Linux Release Notes Version A.04.02.02 at http://www.hp.com/go/linux-serviceguard-docs.
Index A Access Control Policies, 152 active node, 20 adding a package to a running cluster, 233 adding cluster nodes advance planning, 127 adding nodes to a running cluster, 204 adding packages on a running cluster, 191 administration adding nodes to a running cluster, 204 halting a package, 209 halting the entire cluster, 204 moving a package, 211 of packages and services, 209 of the cluster, 203 reconfiguring a package while the cluster is running, 232 reconfiguring a package with the cluster offline, 233
cluster node parameter, 86, 88 defined, 36 dynamic re-formation, 37 heartbeat subnet parameter, 90 initial configuration of the cluster, 36 main functions, 36 maximum configured packages parameter, 99 member timeout parameter, 94 monitored non-heartbeat subnet, 92 network polling interval parameter, 95, 99 planning the configuration, 86 quorum server parameter, 88 testing, 242 cluster node parameter in cluster manager configuration, 86, 88 cluster parameters initial configuration, 36 cluster re-formation sc
planning for, 102 explanations package parameters, 168 F failback policy used by package manager, 47 FAILBACK_POLICY parameter used by package manager, 47 failover controlling the speed in applications, 258 defined, 20 failover behavior in packages, 103 failover package, 41, 164 failover policy used by package manager, 44 FAILOVER_POLICY parameter used by package manager, 44 failure kinds of responses, 72 network communication, 74 response to hardware failures, 73 responses to package and service failures,
integrating HA applications with Serviceguard, 269 introduction Serviceguard at a glance, 19 understanding Serviceguard hardware, 25 understanding Serviceguard software, 31 IP in sample package control script, 228 IP address adding and deleting in packages, 60 for nodes and packages, 59 hardware planning, 77, 81 portable, 59 reviewing for packages, 246 switching, 43, 44, 67 IP_MONITOR defined, 98 J JFS, 259 K kernel hang, and TOC, 72 safety timer, 32 kernel consistency in cluster configuration, 134 kernel
redundant subnets, 77 networks binding to IP addresses, 263 binding to port addresses, 263 IP addresses and naming, 261 node and package IP addresses, 59 packages using IP addresses, 262 supported types in Serviceguard, 25 writing network applications as HA services, 258 no cluster lock choosing, 40 node basic concepts, 25 halt (TOC), 72 in Serviceguard cluster, 19 IP addresses, 59 timeout and TOC example, 72 node types active, 20 primary, 20 NODE_FAIL_FAST_ENABLED effect of setting, 74 NODE_NAME parameter
for expansion, 102 hardware configuration, 77 high availability objectives, 75 overview, 75 package configuration, 100 power, 79 quorum server, 80 SPU information, 77 volume groups and physical volumes, 81 worksheets, 79 planning and documenting an HA cluster, 75 planning for cluster expansion, 75 planning worksheets blanks, 273 point of failure in networking, 26 POLLING_TARGET defined, 98 ports dual and single aggregated, 62 power planning power sources, 79 worksheet, 80, 274 power supplies blank planning
install, 129 introduction, 19 Serviceguard at a Glance, 19 Serviceguard behavior in LAN failure, 25 in monitored resource failure, 25 in software failure, 25 Serviceguard commands to configure a package, 225 Serviceguard Manager, 22 overview, 22 Serviceguard software components figure, 31 serviceguard WBEM provider, 34 shared disks planning, 78 shutdown and startup defined for applications, 258 single point of failure avoiding, 19 single-node operation, 160, 239 size of cluster preparing for changes, 127 SM
VGChange, 185 volume group for cluster lock, 38, 39 planning, 81 volume group and physical volume planning, 81 W WEIGHT_DEFAULT defined, 98 WEIGHT_NAME defined, 98 What is Serviceguard?, 19 worksheet blanks, 273 cluster configuration, 99, 275 hardware configuration, 79, 273 package configuration, 275, 276 power supply configuration, 80, 273, 274 use in planning, 75