Dell EMC Storage Systems Events and Alerts Troubleshooting Guide for the metro node appliance 7.
Notes, cautions, and warnings
NOTE: A NOTE indicates important information that helps you make better use of your product.
CAUTION: A CAUTION indicates either potential damage to hardware or loss of data and tells you how to avoid the problem.
WARNING: A WARNING indicates a potential for property damage, personal injury, or death.
© 2021 Dell Inc. or its subsidiaries. All rights reserved. Dell, EMC, and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Contents
Chapter 1: Troubleshooting
Alerts and Logs
Issue: Do not see a generated alert in Live Alerts for a generated event
1 Troubleshooting
This guide covers the troubleshooting issues that a user can face while working with Events and Alerts.
Topics:
• Alerts and Logs
• Issue: Do not see a generated alert in Live Alerts for a generated event.
Logs
The notifications log can be used to verify event processing. Events that are received from NSFW can be verified by checking the Kafka logs. Alerts also contain fields that narrow down the cause, component, and resource of an event. In addition, the properties panel lists the corrective action that can be taken to resolve the issue.
Issue: Do not see a generated alert in Live Alerts for a generated event.
Solution
Check the following:
● Alert for condition_id is supported.
● Alert for condition_id is enabled.
● Alert for a component is enabled.
● The notifications service is enabled.
● In troubleshooting, see the notification stack.

Issue: Mapping the conditionID with firmware debug events
Solution
Run cd /etc/opt/dell/vplex.
● cat firmware_events.yaml provides a brief description of each conditionID.
service@director-1-1-a:/etc/opt/dell/vplex>
● cat firmware_events.yaml | grep legacyDbgEvents lists the mapping of supported conditionIDs to legacy debug events.
service@director-1-1-a:~> cd /etc/opt/dell/vplex
service@director-1-1-a:/etc/opt/dell/vplex>
service@director-1-1-a:/etc/opt/dell/vplex> cat firmware_events.
legacyDbgEvents: [stdf/23]
legacyDbgEvents: [stdf/24]
legacyDbgEvents: [stdf/56]
legacyDbgEvents: [stdf/34]
legacyDbgEvents: [stdf/31]
legacyDbgEvents: [stdf/19]
legacyDbgEvents: [sfp/9]
legacyDbgEvents: [sfp/7]
legacyDbgEvents: [sfp/11, sfp/12]
legacyDbgEvents: [sfp/11, sfp/12]
legacyDbgEvents: [sfp/11, sfp/12]
legacyDbgEvents: [sfp/11, sfp/12]
legacyDbgEvents: [dios/20]
legacyDbgEvents: [dios/13]
legacyDbgEvents: [utl/16]
legacyDbgEvents: [vmg/1, vmg/2, vmg/3]
legacyDbgEvents: [vmg/29]
legacyDbgEvents: [n
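The grep output above can be post-processed when the mapping has to be searched or compared. The following is a minimal sketch, assuming the output keeps the `legacyDbgEvents: [a, b]` shape shown above; the function name `parse_legacy_events` is hypothetical and is not part of the product. Lines that are cut off before the closing bracket are skipped.

```python
import re

# Matches complete lines such as "legacyDbgEvents: [sfp/11, sfp/12]".
LINE_RE = re.compile(r"legacyDbgEvents:\s*\[([^\]]*)\]")


def parse_legacy_events(grep_output: str) -> list:
    """Return one list of legacy debug event names per matched line."""
    groups = []
    for match in LINE_RE.finditer(grep_output):
        # Split the bracket contents on commas and drop empty entries.
        events = [e.strip() for e in match.group(1).split(",") if e.strip()]
        groups.append(events)
    return groups


# Sample lines copied from the output shown above.
sample = """\
legacyDbgEvents: [stdf/23]
legacyDbgEvents: [sfp/11, sfp/12]
legacyDbgEvents: [vmg/1, vmg/2, vmg/3]
"""

for events in parse_legacy_events(sample):
    print(events)
```

A regular expression is used instead of a YAML parser so that the sketch depends only on the standard library and works on the filtered grep output rather than on the full file.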
Question: What is the expected service status?
Answer
See the following:
sudo systemctl status notifications
service@director-2-1-b:~> sudo systemctl status notifications
● notifications.service - NotificationService
   Loaded: loaded (/usr/lib/systemd/system/notifications.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-10-29 05:19:26 UTC; 6 days ago
 Main PID: 5040 (vplex_launch_no)
    Tasks: 188 (limit: 4915)
   CGroup: /system.slice/notifications.
Nov 05 00:37:49 director-2-1-b kafka-server-start.sh[2114]: INFO [GroupMetadataManager brokerId=0] Removed 0 expired >
Nov 05 00:47:49 director-2-1-b kafka-server-start.sh[2114]: INFO [GroupMetadataManager brokerId=0] Removed 0 expired >
Nov 05 00:57:49 director-2-1-b kafka-server-start.sh[2114]: INFO [GroupMetadataManager brokerId=0] Removed 0 expired >
Nov 05 01:07:49 director-2-1-b kafka-server-start.
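The service status output shown above can also be checked from a script. The following is a minimal sketch that extracts the state from the `Active:` line of captured `systemctl status` text; the function name `parse_active_state` is hypothetical, and the sample text is copied from the notifications output above.

```python
import re


def parse_active_state(status_output: str):
    """Return (state, sub_state) from the 'Active:' line, or None if absent."""
    match = re.search(
        r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)", status_output, re.MULTILINE
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


# Sample text copied from the status output shown above.
sample = """\
● notifications.service - NotificationService
   Loaded: loaded (/usr/lib/systemd/system/notifications.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-10-29 05:19:26 UTC; 6 days ago
"""

print(parse_active_state(sample))
```

A healthy service is expected to report `active` and `running`, as in the sample; any other pair suggests that the service needs a closer look.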
Issue: How can a user get the service dependencies of all notification-related services?
Answer
Notifications rely on high-level service dependencies such as Telegraf, Kafka, and PostgreSQL.

Issue: Database connection error in the notifications service logs at start-up
Reason
The notification service comes up before the database, and it retries the connection until a connection to the database is established.
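The start-up behavior described above is a retry-until-connected pattern, which explains why connection errors appear in the log before the service settles. The following is a minimal sketch of that pattern, not the notification service's actual code; `connect_with_retry`, `make_flaky_connect`, and the 5-second delay are assumptions made for illustration.

```python
import time


def connect_with_retry(connect, delay_seconds=5.0, max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds; return (connection, attempts_used)."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect(), attempt
        except ConnectionError as err:
            # Give up only when a limit was set and has been reached.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            print(f"attempt {attempt}: database not ready ({err}); retrying")
            sleep(delay_seconds)


def make_flaky_connect(failures):
    """Build a stand-in connect() that fails a fixed number of times first."""
    state = {"left": failures}

    def connect():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("connection refused")
        return "connection"

    return connect


# The database becomes reachable on the third attempt; sleeping is skipped here.
conn, attempts = connect_with_retry(make_flaky_connect(2), sleep=lambda s: None)
print(conn, attempts)
```

Each failed attempt logs one error line, so a few connection errors at start-up followed by normal operation match this pattern.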
Question: If NSFW is down, how can a user identify it?
Answer
If NSFW is down, the notification service generates a heartbeat alert. The heartbeat alert is closed once NSFW is up again. There is a delay of 5 minutes before the alert is generated because NSFW sends heartbeats every 5 minutes. If NSFW stays down for a longer period, the last updated time of the existing heartbeat alert is refreshed instead of creating many alerts.
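The heartbeat behavior described above can be pictured as a small state machine: one alert is opened when heartbeats stop, the same alert is refreshed while the outage lasts, and it is closed when a heartbeat arrives. The following is a minimal sketch under those assumptions; the class names and the use of plain seconds as timestamps are hypothetical and do not reflect the service's real implementation.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_INTERVAL = 300  # seconds; heartbeats are sent every 5 minutes


@dataclass
class HeartbeatAlert:
    state: str  # "OPEN" or "CLOSED"
    created: float
    last_updated: float


class HeartbeatMonitor:
    def __init__(self, now: float) -> None:
        self.last_heartbeat = now
        self.alert: Optional[HeartbeatAlert] = None
        self.alerts_created = 0

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat and close an open alert, if any."""
        self.last_heartbeat = now
        if self.alert is not None and self.alert.state == "OPEN":
            self.alert.state = "CLOSED"
            self.alert.last_updated = now

    def check(self, now: float) -> None:
        """Open one alert when heartbeats are overdue; refresh it afterwards."""
        if now - self.last_heartbeat <= HEARTBEAT_INTERVAL:
            return
        if self.alert is None or self.alert.state == "CLOSED":
            self.alert = HeartbeatAlert("OPEN", created=now, last_updated=now)
            self.alerts_created += 1
        else:
            # Long outage: update the existing alert instead of adding a new one.
            self.alert.last_updated = now


monitor = HeartbeatMonitor(now=0)
monitor.check(now=200)    # within the interval, nothing happens
monitor.check(now=400)    # overdue, one alert is opened
monitor.check(now=1000)   # still down, the same alert is refreshed
print(monitor.alerts_created, monitor.alert)
monitor.on_heartbeat(now=1100)
print(monitor.alert.state)
```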
Question: When are the email notifications sent?
Answer
Email notifications must be enabled (they are enabled by default). When an alert is generated, the email is sent along with the generation of the alert.

Question: What happens if the customer provides the wrong email ID?
Answer
The notifications service is not intended to verify the email ID. If an email is sent to a wrong email ID, the user does not receive it.
FC023 System Health SCSI Target FibreChannel Port encountered an error, resulting in the generation of excessive cores for the HBA. Disabling port .
FC024 System Health SCSI Target FibreChannel Port encountered an error and was unable to load firmware for the HBA. Disabling port .
FC025 System Health SCSI Target FibreChannel Port encountered an error and was unable to access HBA resources. Disabling port .
Database name | Tables | Used by | Purpose
Tables:
● idrac_alert
● notification_action
● notification_status
● system_alert
● system_event
● scope_incarnation
● monitor_alert
● config_data
Used by:
● NSFW
● iDRAC

Detail of the tables
Table name | Purpose
alert_definition | It stores the Static Alert Data Condition IDs.
disabled_idrac_alert | Disabled Alerts by user.
disabled_platform_alerts | Disabled Alerts by user.
Question: Is there a mapping for port level alerts between legacy and Voyager?
Answer
Implemented Voyager Call Homes
The following Voyager call homes include SFP check/reset/replace as part of remedial action:

ID: 0x0009000e 0x00110001
Name: UnintentionalFrontEndPortLinkDown
Called home: Yes
Legacy events: stdf/19
External RCA: An enabled FE port has gone down as a result of FC cable pull, switch reboot, or disabling switch port.
External remedy: 1. Check the FE port status in the ports context of VPlexcli /clusters/
External remedy (continued):
corresponding port is enabled.
4. If the link remains down, contact Dell Customer Support.

ID: 0x00120003
Name: PathDisconnected
Called home: Yes
Legacy events: udcom/3
External RCA: A communications path has been disconnected due to network connectivity issues.
External remedy: Check the WAN COM or LOCAL COM path that is disconnected, then check the switch logs for errors that help point to the root cause. If the errors show hardware issues, check/clean/replace the cables and SFPs along the path.
Relevant to voyager | ID | Name | Called home | Firmware event | External RCA | External Remedy

External Remedy (continued):
array and LUN masking.
2. Check array configuration and the physical connections between:
   a. The VPLEX BE port and the back-end switch
   b. The back-end switch and the array
3. Try cleaning or replacing the cable(s) or SFPs in the path to the storage volume. If the issue persists, contact Dell Customer Support.
External RCA (continued): connectivity problem.
External Remedy (continued): of the port in the engines/*/directors/*/hardware/ports context, and verify if the port is still enabled but in 'no-link' state. If the issue persists, engage Dell Customer Service to check the physical connectivity and LOCAL COM switches.
External RCA (continued): bandwidth problem.
External Remedy (continued): replace the cables and SFPs along the path. Otherwise investigate if there are bandwidth/congestion/oversubscription issues leading to the degradation.

Relevant to voyager: No
ID: 0x8a459018
Name: IPC_SFP_IS_NOT_DETECTED
Called home: Yes
Firmware event: ipc/24
External RCA: An SFP is missing, inserted incorrectly, or faulty.
External Remedy: Apply the following measures until the issue has been resolved:
1. Identify the physical port specified in the event.
2.
External Remedy (continued):
3. Reseat the SFP.
4. Follow the optical cable to the switch, write down the switch/port information.
5. Inspect the switch side cabling/SFP issues.
6. Verify the switch port configuration.

Relevant to voyager: No
ID: 0x8a45302e
Name: TOO_MANY_IPC_PATHS_DOWN
Called home: Yes
Firmware event: ipc/46
External RCA: The last redundant IP path to the specified director is down. It could be due to a problem with the IP WANCOM switch.
External RCA (continued): VPLEX BE port and the switch.
External Remedy (continued): port, especially anything that has been changed recently. Reseat/clean/replace the hardware to resolve the problem. If the problem persists and the cause cannot be determined, contact Dell Customer Service.

Relevant to voyager: Yes
ID: 0x8a343013
Name: stdf_19_WARNING
Called home: Yes
Firmware event: stdf/19
External RCA: An enabled FE port has gone down as a result of FC cable pull, switch reboot, or disabling switch port.
External Remedy: 1.
External Remedy (continued):
4. Replace SFPs on the I/O path.
5. Check the target storage device to see if it is oversubscribed; if so, add more target ports to serve I/O.
For any help, contact Dell Customer Support.

Relevant to voyager: No
ID: 0x8a369015
Name: tach_21_CRITICAL
Called home: Yes
Firmware event: tach/21
External RCA: Some amount of I/O fails because frames timed out.
External Remedy: Do the following steps (take in order until the issue is solved):
1.
External Remedy (continued):
2. Clean and reseat the cable.
3. Reseat the SFP on both ends.
4. Replace the SFP on both ends.
5. Try to use a different switch port if available.
6. Contact Dell Customer Support for FRU to replace the I/O SLIC.

Relevant to voyager: No
ID: 0x8a36301a
Name: FC_PORT_RX_POWER_LOW
Called home: Yes
Firmware event: tach/26
External RCA: The RX Power level on a Fibre Channel port is below the low limit and I/O timeouts have occurred.
External RCA (continued): this product, but the vendor part number is not on the recognized list.
External Remedy (continued): this product. If not, replace the SFP.

Relevant to voyager: No
ID: 0x8a36901f
Name: tach_31_CRITICAL
Called home: Yes
Firmware event: tach/31
External RCA: An SFP is missing, inserted incorrectly, or faulty.
External Remedy: Take the following steps (take in order until the issue is solved):
1. Identify the physical port specified in the event.
2. If the SFP is missing, insert a new SFP and attach the appropriate cable.
3.
External RCA (continued): must be replaced ASAP.
External Remedy (continued):
1. Identify the physical port specified in the event.
2. Reseat the SFP on both ends.
3. Replace the SFP on the specified port.
4. Contact Dell for FRU to replace the I/O SLIC.

Relevant to voyager: No
ID: 0x8a363034
Name: IF_RX_RXEOFA_ALARM
Called home: Yes
Firmware event: tach/52
External RCA: Interface received frames with End of Frame Abort (EOFA) delimiter in the last minute. There might be faulty hardware on the I/O path.
External Remedy (continued):
2. Replace SFPs on the interface.

Question: Where can a user get the notification REST APIs for all UI functionality?
Answer
UI Functionality | REST API | HTTP Operation | JSON for Patch Operation | Sample Response

UI Functionality: Platform Alerts Live Listing
REST API: notification/v1/platform_alerts?offset=0&limit=100&sort_by=lastModified&enabled=true
HTTP Operation: GET
Sample Response:
id: 3
description: The IP port state has changed.
Sample Response (continued):
state: CLOSED

UI Functionality: Platform Alerts Historical All (To get log details)
REST API: notification/v1/platform_alerts/historical/all?state=CLOSED&severity=ERROR&conditionId=0x10006&resource=TEST&enabled=true&fromDate=10-21-2020&toDate=10-23-2020
HTTP Operation: GET
Sample Response:
additionalData: {name: {type: "str", value: "0x1", format: "%s", category: "string", supplementalKey: "1"}}
aggregatedResources: ""
category: "OPERATIONAL"
component: "director-1-1-A"
c
Sample Response (continued):
message: The NIC in Slot 3 Port 2 network link is down.
appname: dsm_ism_srvmgrd
facility: daemon
host: director-1-1-a
hostname: director-1-1a
severity: WARNING
lastModified: 2020-10-23T05:46:24.
Sample Response (continued):
lastModified: "2020-10-23T05:46:24.703+0000"
message: "The NIC in Slot 3 Port 1 network link is down.
UI Functionality: Monitor Alerts Historical All (To get log details)
REST API: notification/v1/hardware_alerts/monitor_alerts/historical/all?messageId=HWMHRT102&fromDate=10-21-2020&toDate=10-23-2020
HTTP Operation: GET
Sample Response:
appName: "vplex-peerheartbeat"
category: "vplex_monitor"
count: 1
created: "2020-10-23T05:39:45.134+0000"
day: "2020-09-14T00:00:00.
Sample Response (continued):
category: ALARM

UI Functionality: Acknowledge Platform alerts
REST API: notification/v1/platform_alerts/state
HTTP Operation: PATCH
JSON for Patch Operation: [{"path": 1, "op": "replace", "value": "ACK"}]
Sample Response:
id: 1
description: Storage Array is not seen by this director.
Sample Response (continued):
category: ALARM
userNote: User note for platform alert is added

UI Functionality: Open iDRAC alerts
REST API: notification/v1/hardware_alerts/idrac_alerts/state
HTTP Operation: PATCH
JSON for Patch Operation: [{"path": 6, "op": "replace", "value": "OPEN"}]
Sample Response:
id: 7
facilityCode: 3
severityCode: 4
version: 1
category: System
messageId: NIC100
message: The NIC in Slot 3 Port 2 network link is down.
Sample Response (continued):
enabled: true

UI Functionality: Close iDRAC alerts
REST API: notification/v1/hardware_alerts/idrac_alerts/state
HTTP Operation: PATCH
JSON for Patch Operation: [{"path": 6, "op": "replace", "value": "CLOSED"}]
Sample Response:
id: 7
facilityCode: 3
severityCode: 4
version: 1
category: System
messageId: NIC100
message: The NIC in Slot 3 Port 2 network link is down.
appname: dsm_ism_srvmgrd
facility: daemon
host: director-1-1-a
hostname: director-1-1a
severity: WARNING
lastModified: 2020-10-23T05:46:24.
Sample Response (continued):
enabled: true
appname: vplex-peerheartbeat
facilityCode: 1
hostname: director-1-1a
severity: CRITICAL
severityCode: 2
state: OPEN
messageId: HWMHRT102

UI Functionality: Acknowledge Monitor alerts
REST API: notification/v1/hardware_alerts/monitor_alerts/state
HTTP Operation: PATCH
JSON for Patch Operation: [{"path": 8, "op": "replace", "value": "ACK"}]
Sample Response:
id: 1
version: 1
host: director-1-1-a
facility: user
category: vplex_monitor
enabled: true
appname: vplex-peerheartbeat
facilityCode: 1
Sample Response (continued):
messageId: HWMHRT102

UI Functionality: Add Monitor alert user notes
REST API: notification/v1/hardware_alerts/monitor_alerts/user_note
HTTP Operation: PATCH
JSON for Patch Operation: [{"path": 2, "op": "replace", "value": "notes for monitor alert added"}]
Sample Response:
id: 1
version: 1
host: director-1-1-a
facility: user
category: vplex_monitor
enabled: true
appname: vplex-peerheartbeat
facilityCode: 1
hostname: director-1-1-a
severity: CRITICAL
state: OPEN
messageId: HWMHRT102
userNote: notes for monitor aler
UI Functionality: Test Alert
REST API: notification/v1/platform_alerts/trigger_test_alerts
HTTP Operation: POST
Sample Response:
id: 11
resource: TEST
component: director-1-1-A
name: Director Scope Test Operational
state: OPEN
severity: ERROR
eventSourceId: 0x1
eventSource: TEST
conditionId: 0x10006
category: OPERATIONAL

UI Functionality: Close Test Alert
REST API: notification/v1/platform_alerts/close_test_alerts
HTTP Operation: PATCH
JSON for Patch Operation: [{"path": 3, "op": "replace", "value": "CLOSED"}]
Sample Response:
id: 11
resource: TEST
compo
Sample Response (continued):
dcEnabled: true
id: 3
notification: "idrac_alerts"
emailEnabled: true
dcEnabled: true

UI Functionality: Enable Email Notifications
REST API: notification/v1/action
HTTP Operation: PATCH
JSON for Patch Operation: [{"op": "replace", "path": "/platform_alerts/email", "value": true}]
Sample Response:
id: 2
notification: platform_alerts
emailEnabled: true
dcEnabled: true

UI Functionality: Disable Email Notifications
REST API: notification/v1/action
HTTP Operation: PATCH
JSON for Patch Operation: [{"op": "replace", "path": "/platform_alerts/email", "value": false}]
Sample Response:
id: 2
not
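The REST paths and PATCH bodies listed above follow a regular shape, so they can be assembled programmatically. The following is a minimal sketch that builds a listing URL and two PATCH bodies without sending any request. The host name in `BASE_URL` is a placeholder, and the helper names `build_url` and `patch_body` are hypothetical. The guide gives only relative paths, so the real URL prefix and the authentication method are assumptions to verify against the product's API documentation.

```python
import json
from urllib.parse import urlencode

# Placeholder host; the guide lists only paths relative to the API root.
BASE_URL = "https://metro-node.example.com"


def build_url(path: str, **params) -> str:
    """Join the base URL, a relative API path, and optional query parameters."""
    query = urlencode(params)
    url = f"{BASE_URL}/{path}"
    return f"{url}?{query}" if query else url


def patch_body(path, value) -> str:
    """Serialize a single 'replace' operation in the shape used by the table."""
    return json.dumps([{"path": path, "op": "replace", "value": value}])


# Platform Alerts Live Listing (GET), with the query parameters from the table.
listing = build_url(
    "notification/v1/platform_alerts",
    offset=0,
    limit=100,
    sort_by="lastModified",
    enabled="true",
)
print(listing)

# Acknowledge Platform alerts (PATCH), using the path value shown in the table.
print(patch_body(1, "ACK"))

# Enable Email Notifications (PATCH).
print(patch_body("/platform_alerts/email", True))
```

Building the body with `json.dumps` avoids the quoting mistakes that are easy to make when the JSON is typed by hand.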
Notification service | Dream catcher service | ESE service | ESRS | CLM | Results | Comments
System Alerts + Hardware alerts.
Down | UP | UP | UP | UP | PASS | Events are not Sent back.
Test connectivity Payload and the Product topology payload
UP | Down | UP | UP | UP | FAIL | The generated events are not sent to back end.
Down | Down | UP | UP | UP | FAIL | Events are not generated and sent.
UP | UP | Down | UP | UP | FAIL | The generated events are not sent to back end.
sent to back end.
Down | Down | Down | Down | UP | FAIL | Events are not generated and sent.
UP | UP | UP | UP | Down | FAIL | The generated events are not sent to back end.
Down | UP | UP | UP | Down | FAIL | Events are not generated and sent.
UP | Down | UP | UP | Down | FAIL | The generated events are not sent to back end.
Down | Down | UP | UP | Down | FAIL | Events are not generated and sent.
Down | UP | Down | Down | Down | FAIL | Events are not generated and sent.
UP | Down | Down | Down | Down | FAIL | The generated events are not sent to back end.
Down | Down | Down | Down | Down | FAIL | Events are not generated and sent.

Services related to notification
Postgres | Kafka | Telegraf | Notification | Result | Comments
Up | Up | Up | Up | PASS | Events are received.
it restarts the Telegraf also.
Up | Down | Down | Down | FAIL | Events are not received. If Kafka, Telegraf, and Notifications crash or are killed, they restart automatically.
Down | Up | Up | Up | FAIL | Events are buffered in Kafka and pushed to the notification service once it is up. Events are not lost. Notification tries to reconnect to Postgres (database storing fails).
after the buffer limit is reached, Telegraf starts discarding the incoming events.
Down | Down | Down | Up | FAIL | Events are not received until all the services are up.
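The table comments above mention that events are buffered while a downstream service is unavailable and that incoming events are discarded once the buffer limit is reached. The following is a minimal sketch of that drop-on-full behavior, written only to illustrate the idea; the class `BoundedEventBuffer` and the limit of 3 are hypothetical and do not reflect the actual Telegraf or Kafka configuration.

```python
from collections import deque


class BoundedEventBuffer:
    """Hold events up to a limit; discard new arrivals once the limit is hit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.events = deque()
        self.discarded = 0

    def offer(self, event) -> bool:
        """Store the event, or count it as discarded if the buffer is full."""
        if len(self.events) >= self.limit:
            self.discarded += 1
            return False
        self.events.append(event)
        return True

    def drain(self) -> list:
        """Return all buffered events in arrival order and empty the buffer."""
        drained = list(self.events)
        self.events.clear()
        return drained


# Five events arrive while the consumer is down and the buffer holds three.
buffer = BoundedEventBuffer(limit=3)
for i in range(5):
    buffer.offer(f"event-{i}")

print(buffer.discarded)
print(buffer.drain())
```

In this sketch the oldest events are kept and the newest are dropped, which matches the wording that incoming events are discarded after the limit is reached.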