CHALLENGE™/Onyx™ Diagnostic Road Map Document Number 108-7045-030
Contributors Written by Kameran Kashani and Greg Morris Illustrated by Dan Young, Cheri Brown, Greg Morris, and Kameran Kashani Edited by Christina Cary Production by Lorrie Williams Engineering contributions by John Kraft, Steve Whitney, Rich Altmaier, Unmesh Agarwala, Ray Mascia, Robert Thomas, Dilip Amin, Sue Liu, Greg Wong, Ken Beck, Ken Choy, Laurent Coudrelle, Ben Fathi © Copyright 1993, 1994 Silicon Graphics, Inc.
Contents
Introduction
1. Theory of Operations
1.1 Overview
1.2 System Buses
1.2.
Clocks Are Good but the Processor LEDs Are All Lit
LEDs Show Failure Code After Starting Boot Pattern
The Enable Register Is Incorrect
IP19 LEDs Show a Static 0xe Pattern
IP21 LEDs Show a Static 0x14 Pattern
3. Power Subsystem
3.1 Overview
3.2 Power Fault Indicator Descriptions and Locations
3.2.1 System Controller and Offline Switchers (OLSs)
3.2.2 System and Power Boards
5.6 5.7 5.8 5.9 6. Interactive Diagnostics Environment (IDE) ...................................................... 6-1 6.1 6.2 6.3 6.4 6.5 vi IP19 PROM Error and Status Messages................................................. 5-15 5.6.1 IP19 PROM Messages (Short Form) ..................................... 5-17 5.6.2 IP19 PROM Messages (Long Form) ..................................... 5-18 5.6.3 Diagnostic Codes and Their Meanings................................ 5-19 CPU Cache ......................
7. IRIX Error Reporting
7.1 Overview
7.2 Panic Messages
7.2.1 Interpreting Panic Messages
IP19-Specific Messages
Figures
Figure 1-1 Everest Functional Block Diagram
Figure 1-2 IO4 Functional Block Diagram
Figure 3-15 Power Subsystem Voltage Monitoring
Figure 4-1 System Controller Input/Output Signals
Figure 4-2 Deskside System Controller Sensors
Tables
Table 1-1 Bus Types in Everest Deskside and Rackmount Systems
Table 2-1 Likely Causes of Common System Problems
Table 2-2 System Controller Commands...
Introduction

This document describes the diagnostic tools available with the Everest board set: the physical area of the system that each tool tests, the error messages it can produce, and how the tools relate to the Everest system components and to one another.

The information contained in this document is organized as follows:

• Chapter 1, “Theory of Operations,” provides the theory of operations for the Everest board set.
Chapter 1

1. Theory of Operations

1.1 Overview

This chapter introduces the Everest (POWERpath-2) board set and shows how those boards communicate over the various system and board buses. Figure 1-1 is a high-level functional block diagram of the Everest board set installed in a typical system. Figure 1-2 illustrates the IO4 board architecture in greater detail. As the individual diagnostic tools are discussed, similar diagrams highlight the areas affected by that tool (where appropriate).
Figure 1-1 Everest Functional Block Diagram
(Diagram labels: System Controller, status panel, cooling fans, offline switchers, power boards; MC3 1, MC3 2, IP19/IP21 1 (Master CPU), IP19/IP21 2, and IO4 on the Everest Data Bus (256 bits) and Everest Address Bus (40 bits); IO4 connections to Ethernet, VCAM, FMezz, SCSI Mezz, VME bus, graphics board set, flat cable interface, and SCSI-1/SCSI-2 buses with SCSI devices.)
Figure 1-2 IO4 Functional Block Diagram
(Diagram labels: Everest Data Bus (256 bits), Everest Address Bus (40 bits), IA chip, ID chips, Map RAM, IBus (64 bits), EPC chip, PROM, NVRAM, timer, S1 chip, PBus, F chips, mezzanine slots, FCI chips, VCAM, graphics board set, four serial ports, keyboard/mouse port, parallel port, Ethernet controller, and two Fast/Wide SCSI-2 controllers with SCSI devices.)

The available diagnostic tools are separated into six groups: parity checkers
sequence and display an error message. Error information is also provided by a series of LEDs on the off-line switchers and the system boards, but these are not visible without opening the cabinet. The power-on tests execute whenever the system is powered on or reset. Those tests verify enough of the system’s basic hardware functionality to load the standalone diagnostics from the IDE. The power-on diagnostics (POD) is a special command interpreter that is a subset of the power-on tests.
Table 1-1 (continued) Bus Types in Everest Deskside and Rackmount Systems

Bus Category       Bus Types                    Description
Peripheral buses   SCSI buses                   Connect a variety of storage devices to the IO4 board. The IO4 board has two built-in SCSI buses. By adding IO4 boards and SCSI mezzanine cards, an Everest system can support up to 32 SCSI buses (depending upon the specific system configuration).
                   Flat cable interface (FCI)   Connects the graphics cards or VMEbus to the IO4 board.
Figure 1-3 Everest System Buses

Each bus category is described in the following sections.

1.2.
Figure 1-4 Everest Buses and Interface ASICs

1.2.2 Polled Serial Bus

This dedicated bus is embedded in the system’s backplane. It connects the system’s CPU boards with the System Controller. During the boot process, the System Controller polls the CPU boards over this bus, requesting a bootmaster CPU.
Figure 1-5 Polled Serial Bus

The polled serial bus provides a shortened error-reporting path to the System Controller display on the front panel. The path from the bootmaster CPU to the RS-232 system console port involves the Ebus and the IO4’s Ibus.
IO4. The fourth digit represents a specific drive; the fifth digit is the partition on the selected drive. See Figure 1-6 for an example of SCSI drive addressing.

Figure 1-6 SCSI Drive Addressing

1.2.5 FCI Bus

The flat cable interface (FCI) is a Silicon Graphics proprietary interface that connects a variety of local and remote peripheral resources, such as graphics controllers, VMEbus adapters, and FDDI adapters.
Chapter 2

2. Diagnostic Procedures

2.1 Overview

This chapter describes how to

• examine a frozen CHALLENGE, POWER CHALLENGE, Onyx, or POWER Onyx system
• determine whether a problem is caused by hardware or software
• determine where hardware problems occur in a system
• use various diagnostic tools and techniques for manufacturing and field service technicians

A scenario of a frozen system demonstrates how to use the debugging tools.
2.2 Examining a Frozen System

When a system is frozen, it is either hung or the kernel has panicked. It is important to determine which case has occurred in the system you are diagnosing because the procedures for fixing a hang are different from those for fixing a kernel panic.

2.2.1 What to Do If the System Is Still Frozen

Note: The best way to diagnose a system hang or panic is to examine the system while it is still frozen.
If the system is still hung, follow these steps to help isolate the problem:

1. Examine the serial port console, if available. Do the last messages look like normal activity, or is the serial port console showing a panic or sitting at a DBG: prompt? If there are no signs of a kernel panic or crash, type a few characters on the serial console and see if they echo. If they echo, then the kernel is still ticking at interrupt level; this is probably a software bug.
If no error bits are set, then it is probably a software problem. Corrective action: File a bug report.

6. If there is no response to the second NMI, use the procedure in Section 2.5.5, “Procedure to Cause a Hung System to Enter POD Mode,” to try to reset into POD. If there are no error bits set, then it is still probably a software problem, with corruption of kernel memory. Entering POD depends on a few words of memory being correct. Corrective action: File a bug report.

Note: 2.2.
To examine the messages stored in the compressed kernel core dump file, use the uncompvm(1M) command. For example,

    /usr/etc/uncompvm -h vmcore.N.comp

The -h option uncompresses only the header of the file vmcore.N.comp, where the kernel panic messages are stored. Panic messages are indicated by the string pb followed by the message number.
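Once the header has been uncompressed, the pb lines can be pulled out of the text mechanically. The following is a minimal Python sketch, not part of the SGI tool set. It assumes each panic message sits on its own line in the form "pb", a message number, a colon, and the message text; the function name extract_panic_messages and the sample text are hypothetical.

```python
import re

# Assumed line shape: "pb <number>: <message text>".
# The exact spacing in real uncompvm output is an assumption.
PB_LINE = re.compile(r"^\s*pb\s+(\d+):\s?(.*)$")


def extract_panic_messages(header_text):
    """Return (message_number, text) pairs for every pb line, in file order."""
    messages = []
    for line in header_text.splitlines():
        match = PB_LINE.match(line)
        if match:
            messages.append((int(match.group(1)), match.group(2).rstrip()))
    return messages


if __name__ == "__main__":
    # Hypothetical header text; lines without the pb prefix are skipped.
    sample = (
        "other header text\n"
        "pb  0: + IP19 in slot 3\n"
        "pb  1: +   CC in IP19 Slot 3, cpu 1\n"
    )
    for number, text in extract_panic_messages(sample):
        print(number, text)
```

Keeping the message number alongside the text preserves the original ordering, which matters because the HARDWARE ERROR STATE display is hierarchical (board, then ASIC, then register bits).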
A difficult fault to trace is one that occurs in an IP19-based system during a memory write. If an IP19 issues a memory or PIO write, and an error occurs, an error interrupt is sent to one of the CPUs. The CPU receiving the interrupt may not be the same CPU that issued the write operation. The difficulty is compounded when the error occurred during a transaction that originated in a DMA controller. The Ebus is highly pipelined, and an operation, once initiated, may not be completed until some time later.
Figure 2-1 Everest Bus Parity Checkpoints

2.4.1 Error Messages

When the hardware detects an error, IRIX and the diagnostics display it in a format called the HARDWARE ERROR STATE display.
2.4.1.1 IP19 CPU Board

Figure 2-2 is a functional block diagram of the IP19 board with the error detection points called out. Figure 2-3 shows the physical layout of the board and the locations of the error detection logic.
Figure 2-3 IP19 Board Component Locations
(Diagram labels: four CPUs, four CC chips, four D ASICs, one A ASIC.)
IP19 Board Error Messages HARDWARE ERROR STATE: IP19 in slot 1 + A Chip Error Register: 0xffff + 0:CPU 0 CC->A parity error + 1:CPU 1 CC->A parity error + 2:CPU 2 CC->A parity error + 3:CPU 3 CC->A parity error + 4:ADDR_ERROR on EBUS 2 2 2 2 1 + 5:My ADDR_ERROR on EBUS 1 + + 8:CPU 0 CC->D parity error 7 + + + + + + + + + + + 1 2 3 0 CC->D parity error CC->D parity error CC->D parity error ADDR_HERE not asserted 13:CPU 1 ADDR_HERE not asserted 14:CPU 2 ADDR_HERE not asserted 15:CPU 3 ADDR_HERE not
Note: The numbers following each error message correspond to the interface where the error detection logic is located. These registers are duplicated for each installed processor.

2.4.1.2 IP21 CPU Board

Figure 2-4 is a functional block diagram of the IP21 board with the error detection points called out. Figure 2-5 shows the physical layout of the board and the locations of the error detection logic.
Figure 2-4 IP21 Board Error Detection Logic
(Legend: sdb: store data bus; ldb: load data bus; wb channel: write back channel. Addr/Cmd: 48 + 2 parity. Data: 256 + 8 parity.)
Figure 2-5 IP21 Board Component Placement
(Diagram labels: cache SIMMs, bus tag RAMs, processor tag RAMs, CC ASICs, IUs, FPUs, DB ASICs, D ASICs, A ASIC.)
IP21 Board Error Messages

HARDWARE ERROR STATE:
+ IP21 in slot 1
+ A Chip Error Register: 0xffff
+ 0:CPU 0 CC->A Channel 0 parity error  1
+ 1:CPU 0 CC->A Wback Channel parity error  1
+ 2:CPU 1 CC->A Channel 0 parity error  1
+ 3:CPU 1 CC->A Wback Channel parity error  1

Error in the path between the indicated CC Chip and the A chip.
D chips on the IP21 board. Look for error indicators on other boards to help isolate the source of the problem. If none of bits 7, 8 or 9 are set, look for error indicators on other boards for a possible source of the error. If none of bits 7, 8 or 9 are set and no errors are indicated on other boards, then the IP21 board introduced the error in the path between the EBUS and the DB chip.
was prevented from sending to the EBus by some other board.

+ 5:Addr Error on MyReq on EBus Channel 0
+ 6:Addr Error on MyReq on EBus Wback Channel  7

If bit 2 is also asserted (A Chip Addr Parity), then the error came in from the EBUS. Look for error indicators on other boards to find the source of the error.

+ 7:Data Sent Error Channel 0  5,6
+ 8:Data Sent Error Wback Channel  5,6

If this bit is asserted without DB error (bits 0 or 1), then the error is in the path between D chip and DB chip.
the bus, but never got a response.

Note: It is possible for the requesting CC chip to be the cause of the error. It may have dirty sectors that could be the target of the request (for self-intervention). Possible problem with CC Chip, A Chip or Bus Tag.

+ 12:A Chip MyIntrvention Data Resources Time Out  2

Timed on the A-chip. Read response but no read resource allocated in the A-chip. Either an erroneous response generated by some other CC chip or a failure on the A chip.
2.4.1.3 IO4 Interface Board

Figure 2-6 is a functional block diagram of the IO4 board and VCAM with the error detection points called out. Figure 2-7 shows the physical layout of the IO4 board and the locations of the error detection logic. The error messages are listed in the following section.
Figure 2-7 IO4/VCAM Component Locations
(Diagram labels: Map RAM, S1, EPC, two F chips, four D ASICs, one A ASIC.)
IO4/VCAM Error Messages

+ IO4 board in slot 5
+ IA IBUS Error Register: 0x7ffff
+ 0: Sticky Error
      More than one occurrence of one or more of the following
+ 1: First Level Map Error for 2-Level Mapping  12
      MAPRAM data parity error detected by ID
+ 2: 2-Level Address Map Response Command Error  4
      F chip detected bad parity on IBus operation from IA
+ 3: 1-Level Map Data Error  12
      MAPRAM data parity error detected by ID
+ 4: 1-Level Address MapResponse Command Erro
+ 3: Non Existent IOA  3
      No F/S1/EPC configured at specified address, probable software error
+ 4: Illegal PIO  3
      CC Write Gatherer block write only allowed to F+FCG, probable software error
+ 5: My ADDR_ERROR Received  3
      one or more boards detected parity error in IA emitted address, or ADDR_HERE not asserted (no board decoded IA emitted address)
+ 6: EBUS_TIMEOUT Received  3
      IA was not able to get EBus access, to emit its request
+ 7: Invalidate
  7
      Error from EPC to IA
+ 15..12: EPC Sent- DMA (Enet) Write Request Command Error to IA  7
      Error from EPC to IA
+ 15..12: EPC Sent- DMA (Enet) Write Request Data Error to IA  7
      Error from EPC to IA
+ 15..12: EPC Sent- Interrupt Request Command Error to IA  7
      Error from EPC to IA
+ 15..12: EPC Sent- PIO (thru EPC) Read Response Command Error to IA  7
      Error from EPC to IA
+ 15..
+ on 13: Address Map Request IBus Command Error 8 IA on 14: Interrupt IBus Command Error 8 IA + 15: Load Address Write FCI Error 9 + 16: Unknown FCI Command Error 9 + 17: Address Map Request Timeout Error 8 + 18: Address Map Response Data Error 8 + + 19: PIO F Internal Write IBus Data Error 8 20: IBus Surprise 8 + 21: DMA Write IBus Data Error 8 + + + 22: System FCI Reset 23: Software FCI Reset 24: Master Reset 9 + + + + + + + 25: F Error FCI Reset 26: F Chip Reset in Progress 27: Dr
+ + + + + + + + + + + + + + + + + + + + + + + + 2.4.1.
Figure 2-8 MC3 Memory Board Error Detection Logic
(Diagram labels: Ebus Data, Ebus Addr, MA ASIC, four MD ASICs, Leaf 0 DRAM and Leaf 1 DRAM with Data (512 + 64 ECC). Addr/Cmd: 48 + 2 parity. Data: 256 + 8 parity.)
Figure 2-9 MC3 Board Component Locations

Figure 2-10 MC3 SIMM Bank and Leaf Organization
MC3 Memory Board Error Messages

+ MC3 in slot 3
+ MA Ebus Error register: 0xf
+ My EBus Address Error
+ My EBus Data Error
+ EBus Address Error
+ EBus Data Error

2.
Note: If the 5.0V brick has failed, it will still show approximately 2.5V, owing to the 3.3V brick pulling up through the ASICs.

2.5.1.2 Power Levels OK but the Processor LEDs Are All Off

1. There is a power or ground short.
2. The LED controller PAL has failed.
3. One or more of the R4KIO PALs has failed. Check the PALs on the failing slices. The PALs are at locations I5H1, J1M0, I3J9, and F2E1 for slices 0 through 3, respectively.

2.5.1.
4. Check the reset line (pin 20) to the CC chip.
5. If the reset line does not pulse when the SCLR line is enabled, check pin 17 on the A chip.
6. If the A chip line doesn’t pulse either, check for a failure in the System Controller.

2.5.1.7 LEDs Show Failure Code After Starting Boot Pattern

This procedure assumes that the processor is basically functional. It can fetch and execute EPROM instructions, but may be having trouble reaching the bus or ASICs beyond the CC chip.

1.
2.5.1.12 IP21 LEDs Show a Static 0x12 Pattern

System is stuck in bootmaster arbitration. Diagnose per Section 2.5.1.11, “IP19 LEDs Show a Static 0xc Pattern.”

2.5.1.13 LEDs Boot to Master/Slave Patterns but UART Output is Garbled or Missing

1. Check UART cable and connections.
2. Verify that the serial clock is matched to the UART speed. When connecting to the UART, match the speed with the System Controller speed.

Note: The standard System Controller produces a 9600 baud clock.

2.5.2
Select the Debug Settings menu and toggle bit 7 (the Manu-Mode bit) to select the System Controller UART. Refer to Section 4.5.2, “Key Switch in the Manager Position” for more information on the Debug Settings menu.

2.5.3.2 Communicating with the System over the System Controller Port

TTT provides some commands you can use over the System Controller port.
Return to the default debug settings by simultaneously pressing the Menu and Scroll Down buttons while cycling the key switch. Cycle the key switch to return to normal controller operation.

2.5.3.4 Communicating With a Disabled Processor

When a processor fails, it is disabled by the system. Because a disabled processor is unable to talk to the system bus, you cannot use IDE to diagnose the cause of the fault. Enable the processor by first turning the key switch to the Manager position.
2. Enter the following command at the DBG prompt:

    stop

3. Next, enter the cpu command to display a list of processors:

    cpu

You should see a list of stopped processors. Normally, all of the CPUs in the system should be listed. A hardware hang problem often shows up as a processor missing from this list. If this is the case, suspect a problem with the CPU.

IP19 only: If a CPU is missing from the hardware list and the IP19 board is not at revision -007 or later, a CPU problem is especially likely.

4.
9. When you are finished using POD mode, repeat steps 1 through 7, and reset bits 4 and 5 so that the system boots normally.

2.5.6 Using POD to Diagnose MC3 Clock Jitter

The following procedure, run from POD mode, can determine if an MC3 board has a clock jitter problem:

1. Enter POD mode. To enter POD mode, use the procedures outlined in Section 4.5.2, “Key Switch in the Manager Position” or Section 2.5.5, “Procedure to Cause a Hung System to Enter POD Mode.”

2.
2.6 Error Message Syntax

Everest hardware errors are displayed following IRIX kernel panics, in the IDE stand-alone diagnostics, and in some of the PROM-based power-on tests. The display format of the error messages is referred to as the HARDWARE ERROR STATE, and is defined as follows:

• The only bits displayed are those indicating an error has been detected. Normal bits are not displayed.
• The display walks through all the boards in the system and through every ASIC on each board.
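The rule that only error bits are displayed can be illustrated with a short Python sketch. This is an illustration of the display convention, not the code that IRIX or the PROM uses. The function names are hypothetical, and the bit-name table holds only the IP19 A Chip Error Register bits 0 through 5 as listed in Section 2.4.1.1.

```python
def set_bits(value):
    """Return the positions of all bits set in a register value, lowest first."""
    bits = []
    position = 0
    while value:
        if value & 1:
            bits.append(position)
        value >>= 1
        position += 1
    return bits


def format_error_state(register_name, value, bit_names):
    """Build display lines for one register, listing only the bits that are set."""
    lines = [f"+ {register_name}: {value:#x}"]
    for bit in set_bits(value):
        # Bits missing from the table are flagged rather than guessed at.
        name = bit_names.get(bit, "(name not in table)")
        lines.append(f"+   {bit}:{name}")
    return lines


# Bit names copied from the IP19 A Chip Error Register list (bits 0 to 5 only).
IP19_A_CHIP_BITS = {
    0: "CPU 0 CC->A parity error",
    1: "CPU 1 CC->A parity error",
    2: "CPU 2 CC->A parity error",
    3: "CPU 3 CC->A parity error",
    4: "ADDR_ERROR on EBUS",
    5: "My ADDR_ERROR on EBUS",
}

if __name__ == "__main__":
    # 0x21 has bits 0 and 5 set, so only those two messages are printed.
    for line in format_error_state("A Chip Error Register", 0x21, IP19_A_CHIP_BITS):
        print(line)
```

Going the other way, a register value printed in a panic message can be checked against the bit lines beneath it: a value of 0x10 has only bit 4 set, so exactly one bit line, numbered 4, should follow it.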
2.7 Known Problems

This section lists some known hardware and software problems, as of this writing.

2.7.1 IRIX 5.0.1

Bugs in IRIX 5.0.1 can cause software hangs. Try to move the customer to the latest release of the IRIX operating system.

2.7.2 Paging and File Quotas

If the customer is not using paging and file quotas, make sure the customer’s system is running IRIX 5.2 (plus patch 0 and patch 22) or later.

2.7.3 MC3 Clock Jitter

A likely cause of hangs is the MC3 board clock jitter problem.
This problem also shows up as a system hang, where the system does not echo characters on the serial console, does not respond to network pings, and shows no disk or power meter activity. The procedure in Section 2.5.5 can also help to pinpoint the bad MC3. To deal with the clock jitter problem, there is an MC3 board with a KL-1 daughter card. When this board is not available, there is a valuable workaround, which eliminates the problem in many MC3 boards, called the “MC3 voltage reduction workaround.
2.7.7 IP19 EAROM Corruption There is an EAROM for each CPU on an IP19 board. A board problem, in IP19 boards earlier than -008 (-008 is fixed), could occasionally alter the EAROM at power-on. The fixed board also contains an EAROM updated with a checksum. With IRIX 5.1 or later, the IO4 PROM Version 1.09 automatically corrects common cases of corruption. It also displays a checksum of all EAROMs, in this message: Starting processor #1 Starting processor #2 ... Comparing EAROM checksums...
If you suspect this, check the termination voltage supplied by the System Controller. Use the Voltage Status menu on the front panel. It should read 1.6 volts. If it does not, it is likely that backplane voltage droop is causing the errors. You can also use the sysctrlrd command to examine the voltages.
pb 0: + IP19 in slot 3
pb 1: + CC in IP19 Slot 3, cpu 1
pb 2: + CC ERTOIP Register: 0x10
pb 3: + 4:Parity Error on Data from D-chip
pb 4: + IO4 board in slot 11
pb 5: + IA IBUS Error Register: 0x30800
pb 6: + 11: PIO ReadResponse Data Error
pb 7: + 18..
Chapter 3

3. Power Subsystem

3.1 Overview

The power subsystem consists of the offline switchers (OLSs), the midplane and backplane power buses, the DC-to-DC converters (power bricks) on the Everest CPU, VCAM, and memory boards, and the various power boards. The following power boards can be installed: 505 (Cardcage 3 only), 512, dual 505 (also known as the 505x2), 512S (in the SCSIbox 2), and System Controller. See Figure 3-1 for a block diagram illustrating the power subsystem components.
PFWA OLS 1 PFW L & XL Chassis RI_L BLOA,B_TACH RIA BLOA,B_MTR OLS 2 PFW RI_L XL Chassis OLS 3 PFW RI_L XL w/CC3 PFWB RIB PFWC First SCSI Box RIC SERIAL_CLK, SERIAL_BRDIN, SERIAL_ADDR<5..0> SERIAL_CLK, SERIAL_BRDIN, SERIAL_ADDR<5..
When the system is turned on, the power subsystem goes through a series of voltage checks before the boot process is allowed to start. Power is applied to the various system components in the following order: +/-5 V and +/-12 V power bricks (power for the SCSI drives in the deskside systems), 1.5 V and 3.3 V power bricks, 5 V and 12 V power bricks (power for the first internal SCSIBox in the rackmount systems), and 5 V and 12 V for external SCSI.
Figure 3-2 Rackmount and Deskside Status Panel and OLS Power and Fault Indicators
(Diagram labels: rackmount and deskside status panels with fault LED, power-on LED, display, and key switch with Off, On, and Mgr positions; offline switchers with power switch, power LED, and fault LED.)
3.2.2 System and Power Boards This section provides the locations of the power fault indicators on each of the system and power boards. 3.2.2.1 IP19 and IP21 CPU Board The IP19 and IP21 boards have two power bricks that step down the 48 volts from the midplane/backplane to 5.0 and 3.3 volts. Each brick has a corresponding power fault LED, as shown in Figure 3-3. The LEDs are red and indicate a fault when lit. Note: 3.3 V Fault LED (POKB_FAIL) Fault LEDs (one bank of four) 3.3V Power Brick MSB 5.
LED Reference Designation   Color / Meaning When Lit   Description
N8P2 (POKB_FAIL)            Red/Fault                  Bad 3.3 V power brick
B4P2 (POKA_FAIL)            Red/Fault                  Bad 5.0 V power brick

Table 3-1 MC3 Board Fault LEDs

Figure 3-4 (diagram labels: 3.3V Fail LED (POKB_FAIL), not present in this version; provision for 3.3V Power Brick, not present in this version; 5.0V Power Brick; 5.0V Fail LED (POKB_FAIL))

3.2.2.
LED Reference Designation   Color / Meaning When Lit   Description
M2P6 (POKB_FAIL)            Red/Fault                  Bad 1.5 V regulator A (near top) on the IO4 board
M1P6 (POKB_FAIL)            Red/Fault                  Bad 1.5 V regulator B (near bottom) on the IO4 board
M0P6 (POKB_FAIL)            Red/Fault                  Bad 1.5 V regulator on the VCAM
L9P6 (POKA_FAIL)            Red/Fault                  Bad -12 V to -5.2 V regulator on the VCAM
L8P6 (POKA_FAIL)            Red/Fault                  Bad +12.0 V to -12.0 V regulator on the VCAM

Table 3-2 IO4 Board Fault LEDs

VCAM Regulator A (+5V to +1.5V) 1.
LED Reference Designation   Color / Meaning When Lit   Description
M0P6                        Red/Fault                  Bad +1.5 V regulator on the RMT_VCAM
L9P6                        Red/Fault                  Bad -5.2 V regulator on the RMT_VCAM
L8P6                        Red/Fault                  Bad -12 V regulator on the RMT_VCAM
L7P6                        Green/Good                 5 V input (V5_AUX) from System Controller to RMT_VCAM (should always be on)
L6P6                        Amber/Fault                Bad +12 V input from the backplane
L5P6                        Amber/Fault                Bad +5 V input (VCC) from the backplane
L4P6                        Amber/Fault                Bad +1.
Figure 3-7 F Mezzanine Board Fault Indicator and Voltage Regulator Locations
(Diagram labels: voltage fault LED at G6M1 for the standard F mezzanine or A1K5 for the short F mezzanine; 1.5 V regulator.)

3.2.2.6 Power Boards

There are five different power boards that can be installed in the Challenge/Onyx deskside and rackmount systems: the System Controller, 505, 505x2, 512, and 512S. Each board has one or more power fault LEDs. Figure 3-8 shows power-fault LEDs for the System Controller, 505, 505x2, and 512 boards.
Figure 3-9 SCSIBox Fault Indicators
(Diagram labels: power fault LEDs, drive sled connector, SCSIBox access door.)

3.3 Power-On Sequence

Note: The power switch (main circuit breaker), located in the lower right corner of the front of the system cabinet, must be turned on.

Turning the System Controller key switch to the On or Manager position enables the OLSs to output 48 volts to the midplane/backplane.
• PEND: controls 5 and 12 volts for the optional, second SCSIBox (rackmount systems only)
• PENE: controls 5 and 12 volts for any external cabinets

Each time a signal is asserted, a corresponding power-OK (POKx) signal is tested, indicating to the System Controller that the voltage levels are correct. If any voltage-enable line does not generate an OK signal, the System Controller will stop the power-on sequence at the point of the failure.
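The enable-then-check behavior described above can be sketched in a few lines of Python. This models the sequencing logic only and is not System Controller firmware. The callable read_pok is a hypothetical stand-in for sampling a POKx line, and the sketch assumes each PENx line pairs with the POKx line of the same letter.

```python
# Power-enable lines in the order the System Controller asserts them.
POWER_ENABLE_LINES = ["PENA", "PENB", "PENC", "PEND", "PENE"]


def pok_name(pen_name):
    """Map an enable line to its power-OK line, for example PENC to POKC."""
    return "POK" + pen_name[-1]


def run_power_on_sequence(read_pok):
    """Assert each enable line in order and stop at the first missing power-OK.

    read_pok is a hypothetical callable that takes a POK line name and returns
    True when that line reports good voltage.
    Returns (enabled_lines, failed_line), where failed_line is None on success.
    """
    enabled = []
    for pen in POWER_ENABLE_LINES:
        enabled.append(pen)              # assert the enable line
        if not read_pok(pok_name(pen)):  # test the matching power-OK line
            return enabled, pen          # halt at the point of failure
    return enabled, None


if __name__ == "__main__":
    # Simulated fault: POKC never reports OK, so PEND and PENE are never enabled.
    enabled, failed = run_power_on_sequence(lambda pok: pok != "POKC")
    print("enabled:", enabled)
    print("halted at:", failed)
```

Because the sequence halts at the first failure, the last enable line asserted identifies which voltage stage to examine first.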
Main power switch on. Status panel key switch in ON position. 48 VDC applied to backplane. Green Status Panel LED lit No 48 V present at midplane/backplane. System Controller generates 5V_AUX line. Blower(s) Run System Controller comes up No 48 V from OLSs. Check key switch, status panel, cable to backplane, System Controller, cable between backplanes. Check incoming AC, backplane voltage, OLS power, LEDs, OLS cabling. Blower(s) Off. Check fuses/cabling. No Check V5_AUX, V12+, DIS_VEE.
A Looks for good system clock. Enables 5.0 V and 12V for 1st SCSIBox (housing system disk). (PENC) System shutdown initiated. System Controller displays "No System Clock" message No Clock POKC signal received. Clock OK Yes Power−on halted, failing voltage displayed by System Controller. Check voltages on 512S power board in 1st SCSIBox. No POKC Enables 5.0 V and 12 V for optional 2nd SCSIBox (PEND) POKD signal received Yes Power−on halted, failing voltage displayed by System Controller.
See Figure 3-13 for the power-enable/power-ok signal timing. Note that the enable signals are PENx (TTL_H), and that the fault signals are POKx (TTL_L).

Figure 3-13 Power-On Signal Timing
(Signals shown: 48V, V5_AUX, PENA, POKA_0D, PENB, POKB_0D, PENC, POKC_0D, PEND, POKD_0D, PENE, PENE_0D.)

Note: The POKx signals (with the exception of POKE) remain high unless there is a fault. The System Controller treats a low TTL level on a POK line as a power fault.

3.3.1
all OR-tied together, so a failure sensed by any of the POK lines indicates the failing voltage but cannot isolate the specific FRU. Also, in systems with more than one of each type of power board, identical voltages are ganged together. Finally, there are secondary regulators; one on the backplane, two on the IO4 board, one each on the MC3 and VCAM boards, and one on each of the FMezz boards, whose output voltages are only POKed.
Monitors backplane voltages generated by VCAM, OLSs, 505 and 512 power boards. Also monitors internal 1.5V power brick. Does not monitor IP19, IP21, MC3, IO4 or SCSIBox. Shuts system down if voltages are out of range. Monitors the 48 V at output of OLSs. Shuts down if below 45 V or above 50 V and sends Power Fail Warning (PFW). Also shuts down for loss of AC.
Table 3-4 provides the voltage ranges monitored by the System Controller, and two sets of voltage thresholds: the upper and lower thresholds at which a voltage warning is issued, and the upper and lower thresholds at which the system is shut down.

Maximum Undervoltage (a)   Warning (b)   Nominal    Warning (c)   Maximum Overvoltage (d)
45 V                       -----------   47.49 V    -----------   54 V
10.2 V                     10.97 V       +12.2 V    13.02 V       14.3 V
4.35 V                     4.59 V        +5.15 V    5.46 V        5.85 V
1.05 V                     1.23 V        +1.5 V     1.77 V        1.99 V
-3.63 V                    -4.
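The two-level threshold scheme in Table 3-4 can be expressed as a small Python classifier. This is a sketch for working through readings by hand, not System Controller firmware. The rail labels ("48 V", "12 V", "5 V", "1.5 V") are hypothetical names chosen from the nominal column, the classify function is hypothetical, and treating a reading exactly at a threshold as having crossed it is an assumption. Only the four complete rows of the table are included.

```python
# (shutdown low, warning low, warning high, shutdown high) in volts,
# copied from the complete rows of Table 3-4. None means the table lists
# no warning threshold for that rail.
THRESHOLDS = {
    "48 V": (45.0, None, None, 54.0),
    "12 V": (10.2, 10.97, 13.02, 14.3),
    "5 V": (4.35, 4.59, 5.46, 5.85),
    "1.5 V": (1.05, 1.23, 1.77, 1.99),
}


def classify(rail, volts):
    """Return "ok", "warning", or "shutdown" for one voltage reading."""
    low_shutdown, low_warn, high_warn, high_shutdown = THRESHOLDS[rail]
    # Assumption: a reading exactly at a threshold counts as crossing it.
    if volts <= low_shutdown or volts >= high_shutdown:
        return "shutdown"
    if low_warn is not None and volts <= low_warn:
        return "warning"
    if high_warn is not None and volts >= high_warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for rail, volts in [("5 V", 5.15), ("5 V", 4.5), ("12 V", 14.5)]:
        print(rail, volts, classify(rail, volts))
```

Checking the shutdown limits before the warning limits ensures that a reading far out of range is never reported as a mere warning.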
Chapter 4
4. Using the System Controller
4.1 Overview
The Everest System Controller is a microprocessor with a battery-backed clock and RAM. The System Controller performs three basic functions:
• The System Controller manages the system’s power-on, power-off, and bootmaster arbitration processes. It also displays a running account of the status of the boot procedure and notifies the bootmaster CPU when a system event, such as power off, is initiated.
Backplane Monitored Levels 48 V, 12 V, 5V , 1.5 V, −5.2 V, 12 V, BP Clock Power Control PENx, POKx, Blower(s) Status Panel Keypad/Keyswitch LCD Display System Controller Firmware−EPROM Real−time Clock NVRAM Battery Backup CPU Control PCLR_OD SCLR_OD BPNMI_L OLS Rlx_H 48 V PFWx_L Figure 4-1 4.2 Temperature Sensors Board/Chassis Sensors Inlet Temp Sensor CPU_COMMUN. SERIAL_ADDRx SERIAL_BRDOUT SERIAL_BRDIN SERIAL_CLK BLOx_MTR BLOx_TACH V5_AUX V12+ VEE_DIS Secondary Serial Ports.
stopped, a “Blower Failure” message is displayed. All three conditions result in a system shutdown. 4.2.3 Power-On, Boot, and Reset Sequences The System Controller plays an active role in the power-on, boot, and reset processes. The power-on process begins when the System Controller enables the OLS outputs, supplying 48 volts to the midplane. Next, the blower(s) are turned on and their speed monitored. Then the System Controller sequentially turns on a series of power-enable lines (PENA through PENE).
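The sequential enable/check behavior can be sketched as follows. This is a simplified model, not the System Controller firmware: `pok_ok` is a hypothetical stand-in for sampling a POKx line, and the sketch assumes the power-on halts at the first rail whose POK signal is not received, as the power-on flowchart indicates. PENE is left out because a POKE fault is only logged and does not start a power-off.

```python
RAILS = ["A", "B", "C", "D"]  # PENA through PEND; PENE is handled differently


def power_on(pok_ok):
    """Enable each rail in order and stop at the first POK failure.

    pok_ok: hypothetical callable taking a rail letter and returning True
            when the matching POKx signal reports a good voltage.
    Returns (list of enables asserted, name of failing POK line or None).
    """
    enabled = []
    for rail in RAILS:
        enabled.append("PEN" + rail)   # assert the enable line
        if not pok_ok(rail):           # no power-OK: halt the sequence
            return enabled, "POK" + rail
    return enabled, None
```

For example, `power_on(lambda rail: rail != "C")` stops after asserting PENC and reports POKC as the failing signal.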
4.2.5 Initiating a System Power-Off If a condition is detected that calls for a system shutdown, the System Controller issues an alarm. If the situation is not immediately dangerous, the System Controller will wait until it receives a “Set System Off” message or until its internal timer counts down. This delay in the shutdown sequence is designed to give UNIX ample time to perform an orderly software shutdown and to sync the system disks before power is removed.
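The delayed power-off decision can be pictured as a countdown that ends early when the message arrives. The sketch below is illustrative only: the tick-based timing and the `timeout_ticks` parameter are assumptions, since the length of the internal timer is not given here.

```python
SET_SYSTEM_OFF = "Set System Off"  # message name taken from the text


def power_off_tick(messages, timeout_ticks):
    """Return (tick, reason) at which power would be removed.

    messages: list with one entry per timer tick; None means no message.
    timeout_ticks: hypothetical countdown length in ticks.
    """
    for tick in range(timeout_ticks):
        received = messages[tick] if tick < len(messages) else None
        if received == SET_SYSTEM_OFF:
            return tick, "host requested power-off"
    return timeout_ticks, "timer expired"
```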
4.3 Error Messages
There are six categories of error messages displayed by the System Controller:
• bootmaster arbitration problems at power-on or reset
• bootmaster CPU messages (described in Chapter 5, “PROM Monitor”)
• system events – immediate power-off
• system events – delayed power-off
• system events – informative
• System Controller internal problems
Table 4-1 through Table 4-5 describe five of the six categories of error messages listed above.
Error Message Failure Area/Possible Solution POKD FAIL Same as above. POKE FAIL The System Controller detects a power supply fault. The condition is logged but no power-off sequence is initiated. BRD/CHASSIS OVR TEMP The System Controller detects an overtemperature condition and initiates a power-off sequence. POWER FAIL WARNING The System Controller detects an AC power failure.
Error Message Failure Area/Possible Solution -12 VDC LOW WARNING Same as above. 48 VDC HIGH WARNING Same as above. 48 VDC LOW WARNING Same as above. POWER CYCLE The System Controller receives a command to perform a power-off, followed by a power-on, from the System Controller serial port. Table 4-2 (continued) System Events – Immediate Power-Off Error Message Failure Area/Possible Solution AMBIENT OVER TEMP The System Controller detects an overtemperature condition.
• Plug your laptop into the System Controller port, labeled External Controller Serial, using the cable permanently attached to the port. On rack-mounted systems, this port is located in the lower left corner of the midplane (when facing the front of the chassis). Deskside systems have the port located in the lower right corner of the backplane (when facing the rear of the chassis). Error Message Error Meaning SYSTEM ON The System Controller reports the power-on sequence completed.
Error Message Error Meaning COP FAILURE The Computer Operating Properly (COP) timer has exceeded time limits. The System Controller firmware must write to a COP timer port before it times out. If the firmware exceeds the time allowed between writes to a COP port, an interrupt is generated. The System Controller firmware may have entered an endless loop. COP MONITOR FAILURE A Computer Operating Properly (COP) clock monitor failure was detected.
Error Message   Error Meaning
INTERRUPT REQUEST
EXTEND INT REQUEST
CPU NOT RESPONDING
BAD WARNING/ALARM
BAD ALARM TYPE
BAD WARNING TYPE
FP READ FAULT
Table 4-5 System Controller Internal Problems
Note: Internal errors will cause an error message to be displayed, but will not shut down the system.
4.4 Sensor Locations
The locations of the System Controller sensors for both the deskside and rack-mounted systems are shown in Figure 4-3 and Figure 4-4, respectively.
Terminator Secondary Backplane Air Inlet Temperature Sensor External SCSI Rack System Controller Voltage Termination Terminator Midplane 1.5 Volt Regulation 4.
Figure 4-4 System Status Panel (Deskside and Rackmount Versions)
Both versions show the Power On LED, the Fault LED, the System Controller LCD screen, the Menu, Scroll Up, Scroll Down, and Execute buttons, and the key switch with its Off, On, and Mgr positions.
4.5.1 Key Switch in the On Position There are four menus that are accessible when the key switch is in the On position. Figure 4-5 describes these menus. Checking Memory B+++ Scroll 1. Master CPU Selection Menu Scroll 2. Event History Log Scroll 3. CPU Activity This message is displayed following a successful boot. The second line indicates that the first of the four installed processors is the bootmaster (B) and that the remaining processors (+) are on-line.
4. Boot Status Scroll 5. Turn External Cabinet Off/On Pressing the Execute button toggles between turning the external cabinet on and off. Each time the button is pressed, the menu prompt toggles between "Off" and "On." Scroll 6. Turn Internal SCSI B Off/On Pressing the Execute button toggles between turning the optional second SCSIBox on and off. Each time the button is pressed, the menu prompt toggles between "Off" and "On." Scroll 7.
• choosing whether or not to clear memory on system reset • resetting the non-volatile RAM (NVRAM) configuration • choosing whether or not to run system power-on diagnostics • entering power-on diagnostics (POD) mode • choosing whether or not the System Controller selects the bootmaster CPU • setting “manual mode,” where all CPU (IP19 and IP21) PROM console output is sent to the external UART (serial port) on the System Controller Figure 4-7 describes how to enter the Debug Menu and set the vari
To access the Debug Settings menu:
1. Turn the key switch to the On position.
2. Press the Scroll Up and Scroll Down buttons simultaneously.
3. Turn the key switch to the Manager position.
4. Press both scroll buttons simultaneously again.
5. Scroll through the menus until the Debug Settings menu appears.
If the menu is not selected within 30 seconds, it disappears.
Chapter 5
5. PROM Monitor
5.1 Overview
This chapter describes the power-on tests, describes how the Everest boards are configured, and explains the Monitor boot commands.
5.2 Power-On Tests
The power-on tests are initiated when the System Controller sends the SCLR signal, resetting the processors. This series of tests begins with the CPU logic supporting each individual processor and expands to test and configure the entire system.
Set up R4400 registers Test A chip registers Fail Flash LEDs Fail CC chip local test Flash LEDs Pass Configure the local A chip Pass CC chip config Test Fail Flash LEDs E-bus test no.
1 Slave Code Configure the CC chipConfig Regs. Send "I'm Alive" Interrupt Begin Bootmaster Arbitration Wait for interrupt Wait for time determined by slot/CPU number Receive MPCONF interrupt Am I the bootmaster? Receive interrupt from another CPU Receive rearbitrate interrupt Send "Trying to talk to mem" interrupt My time expired Master Code Am I the bootmaster Fill in our slot in MPCONF Communicate with System Controller E-bus Test No.
2 Configure cache as stack - jump to C code Configure console port Check NVRAM for things to disable Test main IO4 Fail Display message on System Controller - stop Are we disabled? Pass Configure IO4s Yes Jump to bootmaster arbitration No Test raw memory and store results* Check main EPC Fail Display message on System Controller - stop Configure memory Pass Read NVRAM Test configured memory and store results Fail Check EPC UART Display message on System Controller - continue Pass Test PROM
3 Load IO4 PROM Test IO4 PROM Fail Go into POD mode Pass Test caches and bus tags Make slaves test their caches and bus tags Test I/O devices Initialize I/O PROM and drivers Call PROM Command Monitor (or GUI) Figure 5-4 5.2.
. IP21 Power−On and Configuration Tests Fail Initialize LED values, SR, and trap registers Flash LEDs Test CC config registers Pass Fail Test Icache Check EAROM Flash LEDs Configure UART Pass Check Scache Size Print IP21 PROM Header Initialize CPU Test FPU/IU data path (read/write test) Test A chip Flash LEDs Fail Flash LEDs Pass Pass Fail Check Ebus Fail Test CC local registers Flash LEDs Flash LEDs Pass Pass A Initialize CC local reg.
Bootmaster Arbitration A Configure stack in dcache for C code Wait for time determined by slot number Allow 4−set associative and write−back in cache Am I the bootmaster? No B Slave Code Yes (time expired) Connect to System Controller Jump to C code C Invalidate dcache Invalidate gcache Figure 5-6 IP21 Power-On Test Sequence (2 of 4)
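The arbitration step in the flowcharts, in which each processor waits for a time determined by its slot (and, on IP19, CPU) number and the first one whose time expires becomes the bootmaster, can be modeled as below. The delay formula is purely hypothetical; the only property the sketch relies on is that every processor gets a distinct wait time.

```python
def arbitration_delay(slot, cpu, cpus_per_board=4):
    """Hypothetical wait time: unique for every (slot, cpu) pair."""
    return slot * cpus_per_board + cpu


def elect_bootmaster(processors):
    """Pick the processor whose wait expires first.

    processors: list of (slot, cpu) tuples for processors that passed
                their local tests and reached arbitration.
    Returns (bootmaster, slaves in order of increasing delay).
    """
    if not processors:
        raise ValueError("no processor reached arbitration")
    ordered = sorted(processors, key=lambda p: arbitration_delay(*p))
    return ordered[0], ordered[1:]
```

With this model, a processor that fails its early tests never enters the list, so the next processor in line becomes the bootmaster.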
B Slave Code Invalidate I and D caches Send "I'm alive" interrupt Pass Wait for interrupt from master Test CC join registers Invalidate I & D cache Fail Report to Master Pass Print diagnostic Pass Test CC write gatherer register Invalidate gcache Fail Fail Print diagnostic Fail Report to Master Pass Pass Allow 4−set associative and write−back in cache Initialize process MPCONF Tell master we're OK Test slave gcache Fail Report to Master B Pass Fail Invalidate and test D cache Figure 5-
C Fail Enter POD mode Initialize IO4 Test CC write gatherer register Fail Print diagnostic Pass Init EPC UART Fail Print diagnostic Check EAROM Print IP21 PROM Vers.
5.3 Power-On Test Status Messages
This section lists all of the status messages that are displayed by the System Controller during the normal power-on process for IP19-based systems. IP21 systems vary slightly. The messages are listed in the order in which they appear.
1. Starting System...
Displayed once bootmaster arbitration has completed. Indicates that the master processor has started up correctly and is capable of communicating with the System Controller.
2. EBUS diags 2...
13. Reading inventory... Displayed before we attempt to read the system inventory out of the IO4 NVRAM. If the inventory is invalid or we cannot read it for some reason, we initialize the inventory fields with appropriate default values. 14. Running BIST... Displayed before we run the memory hardware’s built-in self test. 15. Configuring memory... Displayed before we actually configure the banks into a legitimate. 16. Testing memory... Printed before we start executing the memory post-configuration tests.
31. Initing graphics... Displayed when we initialize the graphics device (if any). 32. Starting slaves... Displayed when we kick the slave processors into the IO4 PROM slave loop. 33. Startup complete... Displayed when we’ve finished initializing everything and we’re ready to display the main menu. At this point, either the boot menu appears or the system autoboots. 5.
Turn off individual banks of memory: Type disable x y, where x is the slot number of the selected memory board and y is the bank number. Note: The system must be left with enough enabled memory to successfully boot. If you attempt to disable too much memory, the command will fail. If memory is disabled, use the reconf command to reset the interleaving. Reconfigure the enabled memory: Type reconf to reconfigure the memory using the currently enabled banks. The configuration will be displayed.
Test Description COUNTER Runs until a certain instruction count is reached and passed. The count is proportional to the Niblet process ID. MPMON Verifies that repetitive Everest reads and writes are identical. MPINTADD Two processors add values to a common variable, hit a barrier, and compare the final sum. MPINTADD_4 Four-processor version of MPINTADD. MPSLOCK A software locking protocol test.
Test Description niblet 6 Runs MPSLOCK, MPMON, INVALID, MPSLOCK, MPMON. Test takes disproportionately longer on single-processor compared to multi-processor machines. niblet 7 Runs MPROVE, MPROVE. niblet 8 Runs INVALID, MPMON, MPMON, MPROVE, MPROVE, MPROVE, MPINTADD, MPINTADD, MPHLOCK, MPHLOCK (total of 10 processes). niblet 9 Runs MPINTADD_4, MPINTADD_4, MPINTADD_4, MPINTADD_4, INVALID, MPROVE, MPROVE, MPROVE, MPHLOCK, MPHLOCK, MPSLOCK, MPSLOCK (total of 12 processes).
the IP19 PROM causes the appropriate error message to be displayed. If the system is a server, the error message is also displayed on the terminal. Both status and error messages are displayed in the same format: A short status or error message appears near the top of the display. Immediately below it, a longer more descriptive version of the message scrolls by. This longer message is followed by a three-digit diagnostic code that corresponds to the displayed message.
5.6.
5.6.2 IP19 PROM Messages (Long Form) 040 Memory board configuration has failed. Cannot load IO PROM. 041 All memory banks had to be disabled to test failures. 042 The address line self-test failed. Cannot continue. 043 Memory board configuration has failed. Cannot load IO PROM. 044 Memory board configuration has failed. Cannot load IO PROM. 047 Memory board configuration has failed. Cannot load IO PROM. 048 Memory board configuration has failed. Cannot load IO PROM.
5.6.3 Diagnostic Codes and Their Meanings The following diagnostic codes provide information on these areas of the system: • CPU cache • Memory • Ebus • IO4 • CPU • CC registers • FPU • Miscellaneous areas 5.6.3.1 000 001 002 003 004 005 006 007 008 009 010 011 012 013 Device passed diagnostics. Failed dcache1 data test. Failed dcache1 addr test. Failed scache1 data test. Failed scache1 addr test. Failed icache data test. Failed icache addr test. Dcache test hung. Scache/gcache test hung.
5.6.3.3
060 CPU doesn’t get interrupts from CC.
061 Group interrupt test failed.
062 Lost a loopback interrupt.
063 Bit in HPIL register stuck.
5.6.3.4 IO4 Error Codes
070 071 072 073 074 075 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 No working IO4 is present. Bad checksum on IO4 PROM. Bad entry point in IO4 PROM. IO4 PROM claims to be too long. Bad entry point in IO4 PROM. Bad magic number in IO4 PROM. Bus error while downloading IO4 PROM. No EPC chip found on master IO4.
240 242 243 244 245 246 247 248 249 250 251 252 253 253 254 255 5.6.4 CPU writing configuration info. Error in POD command. Starting dcache test. Starting icache test. Starting scache (gcache) test. Invalidate I & D caches. Invalidate S (G) cache. Testing CC join counter register. Testing CC write gatherer register. CPU returning from master’s code to POD mode. Unexpected exception; PROM panic. A nonmaskable interrupt (NMI) occurred. POD mode switch set or POD key pressed.
Once the system enters POD mode, you should display the error registers. Enter the following command at the POD prompt (POD 03/00>):
    devc all
The command devc all displays the boards in each EBus slot. The output looks like this:
    Memory size: 64 M
    Bus clock frequency: 47 MHz
    Virtual dip switches: 0x0000a400
    Slot 0x01: Type = 0x31, Name = MC3 Rev: 16 Inventory: 0x00000031 Diag Value: 0x00000000, Enabled
    Bank 0: IP 0, IF 0, SIMM type 1, Bloc 0x00000000 Inventory 0x01, DiagVal 0x00, Enabled
    Bank 1: Not populated.
Locate all the IP19 boards and their slots. Then enter a dc command for each IP19 slot, where the first argument is the hex slot number and the second argument is 6. This displays the A Chip Error Register message as described in Section 2.4, “ASIC Error Detection.
Locate all the MC3 boards and their slots. Then enter a dmc command for each MC3 slot, where the argument is the hex slot number.
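When a system has many boards, the list of dc and dmc commands can be worked out from a saved copy of the devc all output. The helper below is a sketch of that bookkeeping, not a POD feature. It assumes the slot lines follow the "Slot 0x01: Type = 0x31, Name = MC3" form shown earlier, and it prints the slot number in hex without a 0x prefix, which is an assumption about what POD accepts.

```python
import re

# Matches slot lines such as "Slot 0x01: Type = 0x31, Name = MC3 ...".
SLOT_RE = re.compile(r"Slot (0x[0-9a-fA-F]+): Type = 0x[0-9a-fA-F]+, Name = (\w+)")


def pod_error_commands(devc_output):
    """Build the error-register commands for every IP19 and MC3 board found."""
    commands = []
    for slot_text, name in SLOT_RE.findall(devc_output):
        slot = int(slot_text, 16)
        if name == "IP19":
            commands.append(f"dc {slot:x} 6")   # second argument 6: A chip error register
        elif name == "MC3":
            commands.append(f"dmc {slot:x}")
    return commands


# The MC3 line is taken from the sample output; the IP19 line is hypothetical.
sample = (
    "Slot 0x01: Type = 0x31, Name = MC3 Rev: 16\n"
    "Slot 0x03: Type = 0x11, Name = IP19 Rev: 2\n"
)
print(pod_error_commands(sample))
```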
Leaf Error Field Value   Description
Error = 0000 0008        Multiple occurrence of Read correctable (Single Bit) Error
Table 5-4 Leaf Error Field Values and Descriptions
5.7 CPU Board Fault/Status Indicators
The IP19 and IP21 CPU boards have one bank of six LEDs for each processor on the board. Thus, a four-processor CPU board has four banks, for a total of twenty-four fault and status LEDs. Figure 5-9 shows the location and orientation of the LEDs on a four-processor IP19 board.
Error codes are displayed if a fatal error prevents the power-on tests from completing. The LEDs will flash the error value until the system is powered down or reset. Error codes have the prefix “FLED” (Flashing LED) attached to their descriptions. For IP19 systems, see Section 5.7.2, “IP19 LED Error Codes.” For IP21 systems, see Section 5.7.4, “IP21 LED Error Codes.” 5.7.1 IP19 LED Status Codes These binary error codes apply to all of the microprocessors resident on the board.
LED Pattern Displayed (X=Lit, O=Unlit)   Description (Constant Value Displayed)
LSB O O O O X O MSB   PLED_BMASTER (16) - This processor is the bootmaster.
LSB X O O O X O MSB   PLED_CKEBUS2 (17) - Running second Ebus test. Run only by the bootmaster.
LSB O X O O X O MSB   PLED_POD (18) - Setting up this CPU slice for POD mode.
LSB X X O O X O MSB   PLED_PODLOOP (19) - Entering POD loop.
LSB O O X O X O MSB   PLED_CKPDCACHE1 (20) - Checking the primary data cache.
LED Pattern Displayed (X=Lit, O=Unlit)   Description (Constant Value Displayed)
LSB X X X O O X MSB   PLED_CKSCACHE2 (39) - Checking secondary data cache writeback mechanism.
LSB O O O X O X MSB   PLED_CKBT (40) - Check the bus tags.
LSB X O O X O X MSB   PLED_BTINIT (41) - Clearing the bus tags.
LSB O X O X O X MSB   PLED_CKPROM (42) - Checksumming the I/O PROM.
LSB X X O X O X MSB   PLED_INSLAVE (43) - This CPU is entering slave mode.
LSB O O X X O X MSB   PLED_PROMJUMP (44) - Jumping to the I/O PROM.
LED Pattern Displayed (X=Lit, O=Unlit)   Description (Flashing Value Displayed)
LSB X X X O X X MSB   FLED_BADCACHE (55) - CPU’s primary data cache test failed.
LSB O O O X X X MSB   FLED_BADIO4 (56) - IO4 board is bad (can’t get to console).
LSB X O O X X X MSB   FLED_UTLBMISS (57) - Took a TLB refill exception.
LSB O X O X X X MSB   FLED_XTLBMISS (58) - Took an extended TLB refill exception.
LSB X X O X X X MSB   FLED_CACHE (59) - Unused.
LSB O O X X X X MSB   FLED_GENERAL (60) - Took a general exception.
LED Pattern Displayed X=Lit O=Unlit Description (Constant Value Displayed) LSB X O X X O O MSB PLED_NOCLOCK_INITUART (13) - CC clock isn’t running init uart anyway LSB O X X X O O MSB PLED_CCINIT2 (14) - Init CC chip config registers LSB X X X X O O MSB PLED_UARTINIT (15) - Init CC chip UART. Hanging in this test usually means that the UART clock is bad. Check the connection to the system controller.
LED Pattern Displayed (X=Lit, O=Unlit)   Description (Constant Value Displayed)
LSB X O O X X X MSB   PLED_SCACHE_TAG_DATA (57)
LSB O X O X X X MSB   PLED_SCACHE_ADDR (58)
LSB X X O X X X MSB   PLED_SCACHE_DATA (59)
LSB O O X X X X MSB   PLED_SCACHE_INIT (60)
LSB X O X X X X MSB   PLED_SCACHE_INIT (61)
Table 5-7 IP21 Board Test Status LED Codes
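Each LED pattern in the status and error tables is a six-bit binary number written least significant bit first, with a lit LED (X) counting as 1. The following sketch converts between a pattern as printed in the tables and the decimal value shown in parentheses. The function names are illustrative and are not part of any PROM or IDE interface.

```python
LED_COUNT = 6  # one bank of six LEDs per processor


def decode_led_pattern(pattern):
    """Convert 'LSB X O O X X X MSB' to its decimal value."""
    tokens = pattern.split()
    if len(tokens) != LED_COUNT + 2 or tokens[0] != "LSB" or tokens[-1] != "MSB":
        raise ValueError("expected 'LSB', six X/O states, then 'MSB'")
    value = 0
    for position, state in enumerate(tokens[1:-1]):
        if state == "X":                 # lit LED contributes 2**position
            value |= 1 << position
        elif state != "O":
            raise ValueError("LED state must be X or O")
    return value


def encode_led_value(value):
    """Convert a decimal value (0 to 63) to the table notation."""
    if not 0 <= value < (1 << LED_COUNT):
        raise ValueError("value does not fit in six LEDs")
    states = ["X" if (value >> bit) & 1 else "O" for bit in range(LED_COUNT)]
    return "LSB " + " ".join(states) + " MSB"
```

For example, the pattern listed for PLED_BMASTER, LSB O O O O X O MSB, has only bit 4 lit and therefore decodes to 16.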
5.7.4 IP21 LED Error Codes Table 5-8 lists the IP21 board power-on test failure LED codes: LED Pattern Displayed X=Lit O=Unlit Description (Flashing Value Displayed) LSB X O X O O X MSB FLED_ICACHE_FAIL (37) LSB O X X O O X MSB FLED_CANTSEEMEM (38) - Flashed by slave processors if they take an exception while trying to write their evconfig entries. Often means the processor’s getting D-chip parity errors.
LED Pattern Displayed X=Lit O=Unlit Description (Flashing Value Displayed) LSB O O O X X X MSB PLED_SCACHE_TAG_ADDR (56) LSB X O O X X X MSB PLED_SCACHE_TAG_DATA (57) LSB O X O X X X MSB PLED_SCACHE_ADDR (58) LSB X X O X X X MSB PLED_SCACHE_DATA (59) LSB O O X X X X MSB PLED_SCACHE_INIT (60) LSB X O X X X X MSB PLED_SCACHE_INIT (61) Table 5-8 5.7.
Check the current system configuration using either the hinv command, or two variations of it: hinv -b and hinv -b -v: • hinv performs exactly as it did in previous releases. • hinv -b is similar to the info command in POD and provides additional information, such as: the number of processors present, the amount of memory installed, and whether or not an IO4 board is present. • hinv -b -v supplies additional information about each processor, memory bank, and I/O adapter.
diskless=0 dbaud=9600 sgilogo=y netaddr=192.48.150.68 ConsoleOut=multi(0)serial(0) ConsoleIn=multi(0)serial(0) cpufreq=50 Note: The lines in bold contain the values that must be changed before the system will boot from the new disk. In this example, the address of the new system disk is “4.” 4.
Chapter 6
6. Interactive Diagnostics Environment (IDE)
6.1 Overview
This chapter describes the Everest board tests that are presently supported by the interactive diagnostics environment (IDE) and explains the various types of error messages.
6.1.1 Available IDE Tests
The general sets of IDE tests are:
• IO4 tests, described in Section 6.3, “IO4 IDE Tests”
• IP19 tests, described in Section 6.4, “IP19 IDE Tests”
• MC3 tests, described in Section 6.
4. Select the appropriate testing modes. The different testing modes change the way that the tests run. Not all modes are available for all tests. For example, to enable quickmode, enter
    setenv quickmode 1
See the specific board-test section for available modes.
5. Run the specific test. For example, to check all memory addresses to see if they are writable, type the following:
    mem4
6. Interpret the results and either take action to correct the problem or run more tests to obtain more information.
6.3
After setting the report level, choose a test mode (if desired). The following modes are available: quickmode Runs the tests slightly faster than usual. Note that currently there is little difference in testing time with quickmode enabled. To enable quickmode, enter setenv quickmode 1 To disable quickmode, enter unsetenv quickmode continue-on-error Normally, IO4 tests stop after the first error. Enabling continue-on-error mode causes the tests to continue even after an error is encountered.
6.3.1 IO4 Interface Table 6-2 shows the tests available for the IO4 interface. Test Function Description check_iocfg Checks the IO4 configuration against the nonvolatile RAM (NVRAM) Compares the actual setup of the IO4 board to the values specified in the NVRAM. Each IO4 board in the system is checked to see that it has all the adapters specified in the NVRAM and that they are of the specified types.
6.3.2 VME Adapter Table 6-3 lists the VME adapter tests Test Function Description fregs Test the VMECC F chip Checks version number for correctness.
Test Function Description vmeregs Test the VMECC registers Performs a register test on the following vmecc registers: VMECC_RMWMASK VMECC_RMWSET VMECC_RMWADDR VMECC_RMWAM VMECC_RMWTRIG VMECC_ERRADDRVME VMECC_ERRXTRAVME VMECC_ERRORCAUSES VMECC_ERRCAUSECLR VMECC_DMAVADDR VMECC_DMAEADDR VMECC_DMABCNT VMECC_DMAPARMS VMECC_CONFIG VMECC_A64SLVMATCH VMECC_A64MASTER VMECC_VECTORERROR VMECC_VECTORIRQ1 VMECC_VECTORIRQ2 VMECC_VECTORIRQ3 VMECC_VECTORIRQ4 VMECC_VECTORIRQ5 VMECC_VECTORIRQ6 VMECC_VECTORIRQ7 VMECC_VE
Test Function Description vmelpbk Test the VMECC loopback capability This test performs VME accesses in A24 PIO Loopback mode and A32 PIO Loopback Mode. Halfword accesses and word accesses are tested for each of these cases. cddata Test the cdsio interrupts The cdsio loopbacks are performed at several different baud rates. The received data from the loopback is tested for accuracy. The status bits from the serial port are tested for framing, overrun and parity errors.
6.3.3 SCSI Adapter Table 6-4 lists the IO4 SCSI adapter tests Test Function Description S1_regtest Read/write test for the S1 chip registers Tests and performs address-in-address testing for the following S1 chip registers: S1_INTF_R_SEQ_REGS 0 - 0xF S1_INTF_R_OP_BR_0 S1_INTF_R_OP_BR_1 S1_INTF_W_SEQ_REGS 0 - 0xF S1_INTF_W_OP_BR_0 S1_INTF_W_OP_BR_1 A total of 36 registers are tested.
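Address-in-address testing writes each location's own address into it and then reads everything back, so that two addresses that alias to the same storage are caught. The sketch below shows the general technique against a hypothetical `read`/`write` pair. It does not access S1 registers and is not the IDE implementation.

```python
def address_in_address_test(read, write, addresses):
    """Write each address into its own location, then verify all of them.

    read(addr) and write(addr, value) are hypothetical accessors for the
    device under test. Returns a list of (addr, expected, actual, xor)
    tuples, one per mismatch.
    """
    addresses = list(addresses)
    for addr in addresses:
        write(addr, addr)
    errors = []
    for addr in addresses:
        actual = read(addr)
        if actual != addr:
            errors.append((addr, addr, actual, addr ^ actual))
    return errors
```

The XOR column makes the failing bits easy to see: a single set bit points at one stuck data or address line.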
6.3.4 Everest Peripheral Controller (EPC)
Table 6-5 lists tests for the Everest peripheral controller (EPC) on the IO4 board.
Test Function Description
epc_regtest Read/write test for EPC chip registers. Performs basic read/write tests on EPC chip registers, including the parallel port registers.
Test Function Description
epc_rtcreg
Function: Read/write test for the real-time clock (RTC) chip and NVRAM
Description: Tests the RTC registers and a small amount of NVRAM in the RTC address-space portion of the RTC chip. Tests the following registers: NVR_SEC NVR_SECALRM NVR_MI NVR_MINALRM NVR_HOUR NVR_HOURALRM NVR_WEEKDAY NVR_DAY NVR_MONTH NVR_YEAR. NVRAM tested is in the range 0xE – 0x3F.
epc_rtcinc
Function: RTC increment test
Description: Tests the ability of the RTC chip to handle time-of-day transitions.
6.4 IP19 IDE Tests The IP19 IDE tests are divided into four categories: • IP tests, described in Section 6.4.1, “IP Tests” • Translation lookaside buffer (TLB) tests, described in Section 6.4.2, “Translation Lookaside Buffer (TLB) Tests” • Floating-point unit (FPU) tests, described in Section 6.4.3, “Floating-Point Unit (FPU) Tests” • Cache tests, described in Section 6.4.4, “Cache Tests” To start an IP19 IDE test, boot IDE from the Command Monitor. See Section 6.2, “Running an IDE Test.
There are several commands that run a battery of tests. These commands are:
ipall           Invokes tests ip1 through ip8.
tlball          Invokes tests tlb1 through tlb9.
fpuall          Invokes tests fpu1 through fpu14.
cacheall        Invokes tests cache1 through cache48.
ip19            Invokes all IP, TLB, FPU, and CACHE tests.
cache49         Invokes a short version of cache48.
cstate[0 - 21]  Invokes individual cache state tests in cache48.
quickfpu        Invokes tests fpu1 through fpu13, skipping fpu14.
6.4.1 IP Tests There are eight IP tests. These test components that are not covered by the TLB, FPU, and CACHE tests. Table 6-7 summarizes the IP test commands: Test Function Description ip1 (local_regtest) Checks cache-coherency (CC) local registers Performs read/write tests on some CC registers and read tests on some read-only registers. ip2 (cfig_regtest) Checks configuration registers Performs read/write tests of the configuration registers.
The following sections provide details about each test. ip1 (local_regtest) - Check CC Local Registers Basic write/read test for the local registers.
ip2 (cfig_regtest) - Check Configuration Registers Basic write/read test for the configuration registers. The registers tested are limited to the following: EV_PGBRDEN Write gatherer destination EV_PROC_DATARATE Write gatherer control EV_WGRETRY_TOUT Interrupts 0 - 63 EV_CACHE_SZ Interrupts 64 - 127 EV_CMPREGO - 3 Timer comparator registers Note: The timer comparator registers are checked via the read-only RTC compare register.
ip5 (intr_level0) - Check IP19 Level 0 Interrupt This test generates level-0 interrupts at different priority values and execution levels. It also checks multiple level-0 interrupts occurring at the same time.
ip8 (intr_group) - Check IP19 Processor Group Interrupt This test generates level-0 interrupts using different processor groups at different priority levels, including broadcast interrupts. Possible errors: 010301e: Group interrupt pending not set correctly in EV_IP0 : Expected 0x%llx Got 0x%llx 010301f: 0103020: 0103021: 0103022: 0103023: 0103024: 0103025: 6.4.
tlb4 (tlb_valid) - Check TLB Valid Exception Tests to see if TLB invalid accesses generate exceptions. Maps the TLB entries to invalid addresses in k2seg and attempts to access them. Possible errors: 0108016: TLB entry %d invalid exception VADDR error : Expected 0x%x Got 0x%x 0108017: TLB entry %d invalid exception didn’t occur tlb5 (tlb_mod) - Check TLB Modification Exception This test sets up the TLB to map each page as nonwritable, then attempts to write to each of the mapped pages.
tlb8 (tlb_c) - Check C Bits In TLB Entry Attempts to access TLB-mapped memory in both cached and uncached modes. Tests all slots by writing and reading back a pattern, first in cached mode, then in uncached mode. This test checks basic functionality, and does not attempt to detect cached/uncached interactions.
fpu2 (fpmem) - FPU Load/Store Memory Test Loads FPU from memory and stores memory from FPU. Possible errors: 010901c: Load/store FP reg %d data error : Expected 0x%x Got 0x%x 010901d: Load/store FP reg %d inverted data error : Expected 0x%x, Got 0x%x fpu3 (faddsubs) - FPU Add/Subtract (Single Precision) Tests addition and subtraction using simple single-precision arithmetic.
fpu7 (fmulsubs) - FPU Multiply/Subtract (Single Precision) Tests multiplication and subtraction using simple single-precision arithmetic. Possible errors: 0109016: FP single mul/div result error : Expected 0x%x Got 0x%x 0109017: Fixed to single conversion failed : Before 0x%x After 0x%x 0109018: FP single mul/div status error : 0x%x fpu8 (fmulsubd) - FPU Multiply/Subtract (Double Precision) Tests multiplication and subtraction using simple double-precision arithmetic.
fpu12 (funderflow) - FPU Underflow Test Generates a single-precision underflow by dividing an at-the-limit small value by 2. After the exception, the floating-point status register is checked to make sure the underflow flag was set.
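The arithmetic behind the underflow case can be reproduced on any host. The sketch below rounds a double-precision quotient to single precision and reports underflow when the result is both tiny and inexact, which is the IEEE 754 default condition for raising the flag. It is an illustration of the arithmetic only: Python cannot read the FPU status register, and the operand that the IDE test actually uses is not stated here, so the smallest single-precision subnormal is an assumption.

```python
import struct

FLT_MIN_NORMAL = 2.0 ** -126     # smallest normal single-precision value
FLT_MIN_SUBNORMAL = 2.0 ** -149  # smallest subnormal single-precision value


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_divide(a, b):
    """Divide in single precision; return (result, underflow_flag).

    The quotient is first formed in double precision, which is exact for
    the power-of-two operands used in this sketch.
    """
    exact = a / b
    result = to_f32(exact)
    tiny = exact != 0.0 and abs(exact) < FLT_MIN_NORMAL
    inexact = result != exact
    return result, tiny and inexact


result, underflow = f32_divide(FLT_MIN_SUBNORMAL, 2.0)
print(result, underflow)
```

Halving the smallest subnormal leaves no representable nonzero value, so the result rounds to zero and the underflow condition is met.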
6.4.4 Cache Tests There are forty-eight tests to check the primary and secondary cache of the MIPS R4000/R4400. They are described in the following sections. cache1 (Taghitst) - TagHi Register Test This tests the data integrity of the TagHi register. A sliding-one and a sliding-zero pattern are used.
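A sliding-one pattern sets exactly one bit and moves it across the register; a sliding-zero pattern clears exactly one bit in an otherwise all-ones word. The sketch below generates both and applies them through a hypothetical `read`/`write` pair. The 32-bit default width is an assumption about the register, and the code does not touch real hardware.

```python
def sliding_one(width=32):
    """Patterns with a single 1 bit, from bit 0 up to bit width-1."""
    return [1 << bit for bit in range(width)]


def sliding_zero(width=32):
    """Patterns with a single 0 bit in an all-ones word."""
    mask = (1 << width) - 1
    return [mask ^ (1 << bit) for bit in range(width)]


def register_test(read, write, width=32):
    """Write each pattern and read it back.

    read() and write(value) are hypothetical accessors for the register.
    Returns a list of (expected, actual) pairs for every mismatch.
    """
    failures = []
    for pattern in sliding_one(width) + sliding_zero(width):
        write(pattern)
        actual = read()
        if actual != pattern:
            failures.append((pattern, actual))
    return failures
```

Together the two patterns show whether each bit can hold a 1 while its neighbors are 0 and a 0 while its neighbors are 1.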
cache5 (PdTagKh) - Primary Data TAG Knaizuk Hartmann Test This tests the data integrity of the primary data cache TAG RAM with the Knaizuk Hartmann algorithm. It treats the TAG RAM array as an ordinary memory array. The parity bit is not checked in this test. Note: This algorithm is used to perform a fast but nonexhaustive memory test. It will test a memory subsystem for stuck-at faults in both the address lines as well as the data locations.
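The Knaizuk Hartmann procedure divides the array into three partitions by address modulo 3 and alternates fills and checks so that a stuck address or data bit leaves a partition with the wrong background. The sketch below follows the eight-step form in which the algorithm is commonly described. Whether the IDE tests use exactly this step order, and the all-zeros and all-ones backgrounds used here, are assumptions, and `read` and `write` are hypothetical accessors.

```python
def knaizuk_hartmann(read, write, size, zeros=0x00000000, ones=0xFFFFFFFF):
    """Run a three-partition stuck-at test over addresses 0..size-1.

    Returns a list of (step, addr, expected, actual) for each mismatch.
    """
    errors = []

    def fill(partition, value):
        for addr in range(partition, size, 3):
            write(addr, value)

    def check(step, partition, expected):
        for addr in range(partition, size, 3):
            actual = read(addr)
            if actual != expected:
                errors.append((step, addr, expected, actual))

    fill(1, zeros)          # step 1: zeros in partitions 1 and 2
    fill(2, zeros)
    fill(0, ones)           # step 2: ones in partition 0
    check(3, 1, zeros)      # step 3: partition 1 must still be zeros
    fill(1, ones)           # step 4: ones in partition 1
    check(5, 2, zeros)      # step 5: partition 2 must still be zeros
    check(6, 0, ones)       # step 6: partitions 0 and 1 must be ones
    check(6, 1, ones)
    fill(0, zeros)          # step 7: zeros in partition 0, then verify
    check(7, 0, zeros)
    fill(2, ones)           # step 8: ones in partition 2, then verify
    check(8, 2, ones)
    return errors
```

Every location is written and read only a few times, which is why the test is fast but not exhaustive.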
cache8 (PiTagKh) - Primary Instruction TAG RAM Knaizuk Hartmann Test This tests the data integrity of the primary instruction cache TAG RAM with the Knaizuk Hartmann algorithm. It treats the TAG RAM array as an ordinary memory array. The parity bit is not checked in this test.
cache12 (d_tagparity) - Primary Data TAG RAM Parity Test This tests the functionality of the parity bit in the primary data cache tag. For each tag, a stream of ones and zeros is shifted into the tag to check whether the parity bit changes state accordingly.
cache15 (d_slide_data) - Primary Data RAM Data Line Test Possible errors: 0104021: D-cache tag functional error in PTAG field PTag field does not contain correct tag bits Cache line address: 0x%08x Expected PTag: 0x%06x Actual PTag: 0x%06x TAGLO Register %x Re-read DTAG %x 0104022: D-cache tag functional cache state error Cache line address: 0x%08x Expected cache state: 0x%08x Actual cache state: 0x%08x TAGLO Register %x Re-read DTAG %x cache15 (d_slide_data) - Primary Data RAM Data Line Test This tests th
cache17 (d_kh) - Primary Data RAM Knaizuk Hartmann Test This tests the data integrity of the D-cache with the Knaizuk Hartmann algorithm. The data patterns 0x55555555 and 0xaaaaaaaa are used.
cache20 (d_function) - Primary Data Functionality Test This tests the functionality of the entire data cache. It checks the block fill, write back on a dirty line replacement, and no write back on a clean line replacement function of the data cache lines.
cache23 (i_tagcmp) - Primary Instruction TAG RAM Comparator Test This tests the comparator at the I-cache tag for hit and miss detection.
cache26 (i_aina) - Primary Instruction Data RAM Address In Address Test Performs an address in address test on the primary instruction cache. Possible error: 0107041: I-cache address in address error addr %x, exp %x, act %x, XOR %x cache27 (i_function) - Primary Instruction Functionality Test This tests the functionality of the entire instruction cache. It checks the block fill and hit write back of the instruction cache lines.
cache30 (i_hitwb) - Primary Instruction Hit Writeback Test This tests the Hit Writeback cache operation on the instruction cache.
cache33 (d_hitwb) - Primary Data Hit Writeback Test This tests the Hit Writeback cache operation on the data cache.
cache35 (d_refill) - Primary Data Refill from Secondary Cache Test This verifies the block write/read mode in the data cache. It writes to K0 (0x80020000) cached space, causing the cache to become dirty. Then it replaces the cache line by reading 0x80022000, which is a different cache line with the same offset. This causes the data in the primary data cache to be written back to the secondary. The address 0x80020000 is reread and compared. There should be a cache hit in the secondary cache.
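The replacement step in cache35 relies on two K0 addresses selecting the same line of the primary cache while carrying different tags. The sketch below shows that arithmetic for a direct-mapped cache. The 8 KB cache size and 16-byte line size are assumptions chosen for illustration (the two addresses differ by 0x2000, and a four-word block is 16 bytes); other sizes give different indices. The function names are hypothetical.

```python
def kseg0_to_physical(vaddr):
    """Strip the segment bits from a 32-bit MIPS K0 (KSEG0) address."""
    return vaddr & 0x1FFFFFFF


def cache_index(vaddr, cache_bytes=8 * 1024, line_bytes=16):
    """Line index of vaddr in a direct-mapped cache (sizes are assumed)."""
    return (kseg0_to_physical(vaddr) % cache_bytes) // line_bytes


def cache_tag(vaddr, cache_bytes=8 * 1024):
    """Tag of vaddr in a direct-mapped cache (size is assumed)."""
    return kseg0_to_physical(vaddr) // cache_bytes


first, second = 0x80020000, 0x80022000
same_line = cache_index(first) == cache_index(second)
different_tag = cache_tag(first) != cache_tag(second)
print(f"same line: {same_line}, different tag: {different_tag}")
```

Under these assumed sizes, reading the second address evicts the dirty line written through the first, which is the condition the test description depends on.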
cache37 (sd_dirtywbh) - Secondary Dirty Writeback (Half-word) Test This verifies the block (four words) write mode in the data cache. It writes to K0 (0x80020000) cached space, causing the cache to become dirty. Then it replaces the cache line by reading 0x80022000, which is a different cache line with the same offset. This causes the data in 0x80020000 to be written back to memory, which then holds the same data as the cached copy of 0x80020000. Multiple cache lines are tested back to back. Half-word transactions are tested.
cache40 (sdd_hitinv) - Secondary Hit Invalidate Test This verifies the Hit Invalidate cache operation.
cache41 (sd_hitwb) - Secondary Hit Writeback Test This verifies the hit writeback cache operation. It verifies that the data can be written back from the secondary or, where the primary data is more current, from the primary to memory. It also checks that the cache lines are not invalidated, as they are with the hit writeback invalidate cache operation, but are instead set to the clean exclusive state.
cache42 (sd_hitwbinv) - Secondary Hit Writeback Invalidate Test This verifies the hit writeback invalidate cache operation. It verifies that the data can be written back from the secondary or, where the primary data is more current, from the primary to memory. It also checks that the cache lines are invalidated.
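The difference between the two operations checked by cache41 and cache42 can be summarized in a small state model. This is a simplified sketch built only from the descriptions above: the Line class, the dictionary standing in for memory, and the function name are assumptions, and real hardware behavior involves details not modeled here.

```python
from dataclasses import dataclass

INVALID, CLEAN_EXCLUSIVE, DIRTY_EXCLUSIVE = "I", "CE", "DE"


@dataclass
class Line:
    state: str
    data: int


def hit_writeback(primary, secondary, memory, addr, invalidate=False):
    """Model Hit Writeback, or Hit Writeback Invalidate when invalidate=True."""
    if primary.state == DIRTY_EXCLUSIVE:
        # The primary copy is the most current, so it goes to memory.
        # Updating the secondary copy keeps this model coherent (assumption).
        secondary.data = primary.data
        memory[addr] = primary.data
    elif secondary.state == DIRTY_EXCLUSIVE:
        memory[addr] = secondary.data
    # Hit Writeback leaves valid lines clean exclusive;
    # Hit Writeback Invalidate leaves them invalid.
    final = INVALID if invalidate else CLEAN_EXCLUSIVE
    for line in (primary, secondary):
        if line.state != INVALID:
            line.state = final
```

A test in the style of cache41 would start with a dirty primary line, call hit_writeback, and then check both that memory holds the primary data and that both lines report CE rather than I.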
cache43 (cluster) - Secondary Cluster Test
Possible errors:
0105075: SCache data incorrectly written to memory during a dirty writeback operation 1st mem block
Mem Address 0x%08x Expected 0x%08x, Actual 0x%08x, XOR 0x%08x
0105076: SCache data incorrectly written to memory during a dirty writeback operation 2nd mem block
Mem Address 0x%08x Expected 0x%08x, Actual 0x%08x, XOR 0x%08x
cache44 (clusterwb) - Secondary Cluster Writeback Test
Possible errors:
0105077: SCache data incorrectly written to memory duri
cache48 (cache_states) - Complete Cache-State Transitions Test There are twenty-two individual cache tests. Table 6-8 lists the tests and describes them.
Cache State Transition Test    Description
cstate0 (RHH_CE_CE)    Read hit primary (CE) and secondary (CE). Check that the value is correct (the physmem addr) and that both tags are still CE.
cstate1 (RHH_DE_DE)    Read hit primary (DE) and secondary (DE). Check value and that both are still DE.
cstate2 (WHH_CE_CE)    Write hit primary (CE) and secondary (CE). Check that secondary and memory still have old value and that both cache lines are now DE.
cstate11 (WMH_DE_DE)    Write miss primary (DE) and hit secondary (DE).
cstate12 (RMM_I_I)    Read miss primary (I) and secondary (I). Check that value is correct, that secondary and memory still have old value and that both lines are CE.
cstate13 (RMM_I_CE)    Read miss primary (I) and miss secondary (CE). Check that value is correct, that secondary and memory still have old value and that both lines are CE.
cstate19 (WMM_I_DE)    Write miss primary (I) and miss secondary (DE). Check that secondary line matches memory, that both tags are DE, that the addr tags on both lines are correct, and that the dirty altaddr secondary line was flushed to memory.
cstate20 (WMM_CE_CE)    Write miss primary (CE) and miss secondary (CE).
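The test names in Table 6-8 follow a regular pattern that can be read off the descriptions: the first letter is the access type (R or W), the next two letters give hit (H) or miss (M) in the primary and secondary caches, and the two suffixes give the starting state of each cache line. The following sketch decodes a name under that reading; the function name and the returned dictionary layout are assumptions for illustration.

```python
ACCESS = {"R": "read", "W": "write"}
RESULT = {"H": "hit", "M": "miss"}
STATE = {"I": "invalid", "CE": "clean exclusive", "DE": "dirty exclusive"}


def decode_cstate(name):
    """Decode a cache-state test name such as 'WMH_DE_DE'."""
    op, primary_state, secondary_state = name.split("_")
    return {
        "access": ACCESS[op[0]],
        "primary": (RESULT[op[1]], STATE[primary_state]),
        "secondary": (RESULT[op[2]], STATE[secondary_state]),
    }


print(decode_cstate("RMM_I_CE"))
```

For example, WMH_DE_DE decodes to a write that misses a dirty exclusive primary line and hits a dirty exclusive secondary line, which matches the cstate11 description.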
6.5 MC3 IDE Tests
To start an MC3 IDE test, boot IDE from the Command Monitor. See Section 6.2, “Running an IDE Test.” Set the desired report level. The default report level is 2. Available report levels are shown in Table 6-9.
Report Level    Function    Comments
Level 5    Displays debugging messages.    Too much detail for most testing scenarios.
Level 4    Prints out memory locations as they are written.    Increases testing time.
Level 3    Prints out one-line functional descriptions within tests.
After setting the report level, choose a test mode (if desired). The following modes are available:
quickmode    For the memory tests, quickmode tests every nth byte instead of every byte. The goal is to test 16 GB in about 10 minutes. The value of n varies from 96 to 7680, depending on how fast or slow each test was timed to run.
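The effect of quickmode on coverage is simple arithmetic: with a stride of n, only one byte in n is visited. The sketch below shows the stride on a simulated buffer and counts the bytes visited in 16 GB for the two extremes of n quoted above. The function names and the 0xA5 test byte are assumptions for illustration; the real tests choose n per test.

```python
def quickmode_offsets(size_bytes, n):
    """Byte offsets visited when only every nth byte is tested."""
    return range(0, size_bytes, n)


def quickmode_check(mem, n, pattern=0xA5):
    """Write and verify one byte pattern at every nth offset of mem."""
    errors = []
    for off in quickmode_offsets(len(mem), n):
        mem[off] = pattern
    for off in quickmode_offsets(len(mem), n):
        if mem[off] != pattern:
            errors.append((off, pattern, mem[off]))
    return errors


GIB = 1 << 30
for n in (96, 7680):
    visited = len(quickmode_offsets(16 * GIB, n))
    print(f"n={n}: {visited} of {16 * GIB} bytes visited")
```

A slower test is given a larger n so that it touches fewer bytes and still finishes within the time budget.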
Table 6-10 lists and describes the available MC3 diagnostic commands.
Test    Function    Description
mem1    Read the MC3 configuration registers (very fast test). The mem1 test is very similar to the mem14 test, which is the POD DMC command.    This test reads (probes) the following MC3 configuration registers: 00 - Bank enable
mem4    Write/Read data patterns (ported from the IP17 mem3 test) (4 minutes/128 MB)    This test does word read/writes of all-1’s and all-0’s patterns. It shows if all addresses appear to be writable, and that all bits may be set to both 1 and 0. However, it provides no address error or adjacent-bits-shorted detection.
mem6    Walking ones and zeros memory test (slow; 40 minutes/32 MB)    Another traditional test: walking ones and walking zeros through memory. This is a whole-memory test that is very good at shaking out shorted data bits, but provides little protection for addressing errors.
mem9    Memory with ECC test (ported from the IP17 mem6 test)    This test writes to memory via uncached space and reads back through cached space (ECC exceptions enabled). Although it provides a simple level of ECC checking, its main function is to verify that cached and uncached memory addresses are accessing the same area of physical memory.
mem11    User-specified pattern/location write/read test (ported from the IP17 mem7 test)    Typing mem11 with no arguments displays a usage message: Usage: mem11 [-b|h|w] [-r] [-l] [-v 0xpattern] RANGE. This test allows the technician to fill a range of memory with a specified test value and read it back, done as a series of byte (-b), half-word (-h), or word (-w) writes and reads.
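The mem4 and mem6 descriptions in Table 6-10 differ in what they can detect: an all-1's/all-0's pass cannot see two data bits shorted together, while a walking pattern can. The following sketch shows why, using simulated memory. The function names, the ShortedBits class (a wired-OR short between two bits), and the 32-bit word width are assumptions for illustration and are not taken from the IDE source.

```python
def _run(mem, patterns):
    """Write each pattern to every word, then verify; collect mismatches."""
    errors = []
    for pattern in patterns:
        for i in range(len(mem)):
            mem[i] = pattern
        for i in range(len(mem)):
            if mem[i] != pattern:
                errors.append((i, pattern, mem[i]))
    return errors


def fixed_pattern_test(mem, width=32):
    """mem4-style check: all-1's, then all-0's."""
    mask = (1 << width) - 1
    return _run(mem, [mask, 0])


def walking_test(mem, width=32):
    """mem6-style check: walk a single 1, then a single 0, through each word."""
    mask = (1 << width) - 1
    patterns = []
    for bit in range(width):
        patterns.append(1 << bit)            # walking one
        patterns.append(~(1 << bit) & mask)  # walking zero
    return _run(mem, patterns)


class ShortedBits(list):
    """Simulated RAM in which two data bits are shorted (hypothetical)."""

    def __init__(self, size, bit_a, bit_b):
        super().__init__([0] * size)
        self.pair = (1 << bit_a) | (1 << bit_b)

    def __setitem__(self, index, value):
        # If either shorted bit is driven high, both read back high.
        if value & self.pair:
            value |= self.pair
        super().__setitem__(index, value)
```

With ShortedBits(4, 3, 4), fixed_pattern_test finds nothing because both bits always carry the same value, whereas walking_test fails on every pattern that drives bits 3 and 4 to different values.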
Chapter 7
7. IRIX Error Reporting
7.1 Overview
This section describes the various types of UNIX kernel messages displayed by the console. These messages may also appear in /var/adm/SYSLOG, where they are prefixed by “unix:”. Not all kernel messages appear in the SYSLOG file because a daemon must be running to transfer the error message from the kernel to the file. If the system panics, the kernel messages appear only on the console and in a system core dump.
7.2.1 Interpreting Panic Messages
The following message usually indicates a hardware problem:
WARNING: Kernel Bus Error Exception ...
HARDWARE ERROR STATE: ...
PANIC: CPU n: Kernel Bus Error Exception ...
This kind of message also indicates a hardware problem:
WARNING: Bus Error Exception in User mode ...
HARDWARE ERROR STATE: ...
PANIC: CPU n: Bus Error Exception in User mode ...
There are some cases in which this message displays because of software bugs. This is discussed in further detail below.
7.2.1.1 IP19-Specific Messages
The following message means that the R4400 detected a problem in its interface to the CC chip, or in the secondary cache SIMMs:
CPU 26: ECC PANIC: Uncorrectable HARDWARE ECC error...
PANIC MSG: ...
XXX: ...
HARDWARE ERROR STATE Caused by Software
The following message can be caused by software that mistakenly generates a non-existent address:
A Chip ADDR_HERE not asserted
A non-existent address on the EBus causes the A Chip Error Register bit ADDR_HERE not asserted message to be displayed. For example:
pb 8: <4>WARNING: CPU 3 Bus Error Exception in User mode...
7.4 Driver Messages
The driver message syntax is: dddn: xxxx, where “ddd” is a two- or three-character string indicating the driver name, “n” is a number indicating the controller, and “xxxx” is the string indicating the general area of the fault. These messages are sometimes embedded inside a warning message. Driver messages are generally hardware specific and will not directly cause a kernel panic. An example of a message from the SCSI driver is: dks0d1s6: invalid partition.
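When scanning a log for driver messages, the dddn: xxxx syntax can be matched with a regular expression. The sketch below is one way to do that. The function name, the field names, and the extra “unit” group (which absorbs trailing characters such as the d1s6 in the SCSI example) are assumptions for illustration; real logs may wrap these messages in other text.

```python
import re

# driver: two or three lowercase letters; controller: a number;
# unit: any trailing device characters (an assumption); text: the fault area.
_DRIVER_MSG = re.compile(
    r"^(?P<driver>[a-z]{2,3})(?P<controller>\d+)(?P<unit>\w*):\s*(?P<text>.+)$"
)


def parse_driver_message(line):
    """Split a 'dddn: xxxx' driver message into its parts, or return None."""
    match = _DRIVER_MSG.match(line.strip())
    if match is None:
        return None
    return {
        "driver": match.group("driver"),
        "controller": int(match.group("controller")),
        "unit": match.group("unit"),
        "text": match.group("text"),
    }


print(parse_driver_message("dks0d1s6: invalid partition."))
```

Lines that do not fit the pattern return None, so the function can be used as a filter over a larger set of kernel messages.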
Table 7-1 (continued) System Controller Alarms and Warnings from SYSLOG
SYSLOG Message Type    Message Meaning
Firmware compensating for blower RPM problem.    The system controller has set the blower speed to a higher speed than should be required to maintain adequate cooling. This usually indicates a blower problem.
a. All of these cause a controlled shutdown if the cleanpower flag is set to “on” using the chkconfig command.
b. These are logged, but no action is taken.