PowerEdge MX7000 Management Module Redundancy Dell EMC Technical White Paper
Revisions

Date      Description
Jan 2019  Initial release

Acknowledgements

This paper was produced by the following members of the Dell EMC storage engineering team:

Authors: Prakash Nara, Jitendra Jagasia, Deepa Hegde, Venkat Donepudi
Table of contents

Revisions
Acknowledgements
Table of contents
Introduction

This white paper describes the MX7000 Management Module (MM) high availability feature provided by dual MMs. It covers manual (user-initiated) and automatic (system-initiated) failovers, physical identification of the active and standby MMs for part replacement scenarios, and troubleshooting of redundancy health.

MM Redundancy

In the recommended configuration, the PowerEdge MX7000 has dual MMs, each occupying a slot accessible from the back of the chassis.
Establishing redundancy

In a dual MM configuration, on power-up, one of the MMs claims and wins the active role (the MM in slot 1 has more affinity for it) and initiates boot-up. The active MM orchestrates cluster initialization: it assumes the active node role and brings up all monitored resources (services) in active mode. It then onboards the other MM as the standby node, with all of its monitored resources (services) in standby mode.
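The role assignment described above can be illustrated with a minimal Python sketch. It is not the MM firmware's election logic: the function name `assign_roles` and the constant `PREFERRED_SLOT` are hypothetical, and the slot 1 "affinity" is simplified here to a strict preference, whereas on a real chassis the MM in slot 2 may still win the active role.

```python
from typing import Dict, Iterable

# Assumption: affinity for slot 1 is modelled as a strict preference.
PREFERRED_SLOT = 1


def assign_roles(present_slots: Iterable[int]) -> Dict[int, str]:
    """Return a {slot: role} map for the MMs present at power-up."""
    slots = sorted(set(present_slots))
    if not slots:
        raise ValueError("at least one MM must be present")

    # One MM claims the active role; prefer slot 1 when it is populated.
    active = PREFERRED_SLOT if PREFERRED_SLOT in slots else slots[0]
    roles = {active: "active"}

    # The active MM then onboards any remaining MM as the standby node.
    for slot in slots:
        if slot != active:
            roles[slot] = "standby"
    return roles
```

For example, `assign_roles([1, 2])` marks slot 1 as active and slot 2 as standby, while `assign_roles([2])` models a single-MM chassis with no standby.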
Inherent benefits of redundancy

On MM failure:

- Continued access to OME-M with an approximate downtime of 2.5 minutes. The downtime accounts for detection of failures, promotion of the standby to active, reconciliation of inter-device communications, and network readiness.
- The management network, including the OME Modular IP addresses, continues to function on the new active MM.
- All the data (device inventory, configuration, jobs, alerts, logs, etc.
Failovers

operations if persistent issues are observed on the active MM and the user wants to remedy them by switching to the standby MM.

- Example racadm command: racadm changeover
- Example racadm command: racadm racreset

Automatic (system-initiated) failover occurs in the following scenarios:

• A long-running active MM may eventually develop and manifest failures (software and/or hardware).
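The two racadm examples quoted in this section can be wrapped in a small helper for scripted maintenance. The sketch below only builds the command line and does not run it. It assumes the MM accepts racadm subcommands as a remote command over SSH; the function name `build_racadm_over_ssh`, the host, and the user are hypothetical placeholders.

```python
from typing import List

# The only subcommands allowed are the two examples quoted in this paper.
MANUAL_COMMANDS = ("changeover", "racreset")


def build_racadm_over_ssh(host: str, user: str, subcommand: str) -> List[str]:
    """Build an argv list that would run a racadm subcommand over SSH.

    Assumption: racadm is reachable as a remote SSH command on the MM.
    """
    if subcommand not in MANUAL_COMMANDS:
        raise ValueError(f"unsupported racadm subcommand: {subcommand!r}")
    return ["ssh", f"{user}@{host}", "racadm", subcommand]
```

The resulting list can be passed to `subprocess.run(..., check=True)` once the target and credentials have been verified. Restricting the helper to an allow-list keeps a typo from sending an unintended command to the active MM.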
Moving/Swapping MMs between chassis

Moving or swapping MMs between chassis can be a typical use case during maintenance and troubleshooting.

Use case 1 (see Figure 5): A chassis with a dual MM configuration and redundancy health OK fully supports moving a single MM without any loss of configuration or device history data.
Figure: Dual MM remove or swap

Identify Active and Standby MMs

The following are two of several ways to identify which MM is active:

• Via the OME Modular GUI (Figure 7)
• Via the physical Identify Combo LED on the back of the chassis (Figure 8). For more details on the Identify Combo LED, refer to the “PowerEdge MX7000 At-the-box System Identify” white paper.
Figure: Identifying Active MM via GUI

Figure: Identify Active MM via LED on the MM

Troubleshooting Redundancy Health Alerts

The following critical redundancy alerts are generated and displayed on OME Modular:

SEL1501: Chassis management controller/Management Module (CMC/MM) redundancy is lost
Reason for the alert: One MM is removed from a dual MM configuration chassis.
Recommended action: Insert another MM with the same firmware version to restore redundancy.
SEL1524: Management Module in Slot [1/2] is offline
Reasons for the alert: One of the MMs in a dual MM configuration chassis is not performing at its optimal level. The affected MM is shown as offline and should self-heal.
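For scripted triage, the alerts above can be kept in a small lookup table. The following Python sketch is illustrative only: the names `AlertInfo`, `REDUNDANCY_ALERTS`, and `describe` are hypothetical, the texts are taken from this section, and no recommended action is recorded for SEL1524 because none is quoted here.

```python
from typing import Dict, NamedTuple, Optional


class AlertInfo(NamedTuple):
    message: str
    reason: str
    recommended_action: Optional[str]


# Texts are taken from this section; nothing is added beyond them.
REDUNDANCY_ALERTS: Dict[str, AlertInfo] = {
    "SEL1501": AlertInfo(
        message="CMC/MM redundancy is lost",
        reason="One MM is removed from a dual MM configuration chassis",
        recommended_action="Insert another MM with the same firmware version",
    ),
    "SEL1524": AlertInfo(
        message="Management Module in Slot [1/2] is offline",
        reason="One MM is not performing at its optimal level and should self-heal",
        recommended_action=None,  # no action is quoted in this section
    ),
}


def describe(alert_id: str) -> str:
    """Return a one-line summary for a redundancy alert ID."""
    key = alert_id.strip().upper()
    info = REDUNDANCY_ALERTS.get(key)
    if info is None:
        return f"{key}: not covered by this sketch"
    action = info.recommended_action or "none recorded in this sketch"
    return f"{key}: {info.message}. Reason: {info.reason}. Action: {action}"
```

Calling `describe("sel1501")` returns the message, reason, and recommended action on one line, which is convenient when annotating exported alert logs.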
Configuration  Deploy (Templates)                        Preserved  Not preserved
Configuration  Identity Pools, Networks                  Preserved  Not preserved
Configuration  Firmware Baseline                         Preserved  Not preserved
Configuration  Alert Policies                            Preserved  Not preserved
Configuration  Chassis Address (IPv4, IPv6, DNS, etc.)   Preserved  Preserved
Configuration  Time (NTP, Timezone)                      Preserved  Preserved
Configuration  Chassis "root" user password              Preserved  Preserved