Spontaneous domain reboot on SunFire 6800
Because the respective case at Sun was closed, I want to add this note for future reference, just in case. So… One day I came to my desk and found that one of the domains on our SF6800 had rebooted for no apparent reason; at least that was the very first impression. A quick look through /var/adm/messages and the prtdiag output revealed no hardware or software issues. The next step was to log in to the SC to dig a bit deeper into the problem. Thus showboards, showfru, showchs, showplatform: everything was fine. But the output of showlogs, and especially showlogs -d C, put me on my guard:
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 757768 local6.crit] ErrorMonitor: Domain C has a SYSTEM ERROR
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8
May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 894554 local6.error] /partition1/domain0/IB8/ar0: >>>
    L2CheckError[0x6150] : 0x06068606
    CMDVSyncErr    [12:09] : 0x3   Ports [9:6] command valid mismatched against internal expected command valid
    PreqSyncErr    [04:01] : 0x3   Ports [9:6] prereq mismatched against internal expected prereq
    AccCMDVSyncErr [28:25] : 0x3   accumulated valid command mismatch
    FE             [15:15] : 0x1
    AccPreqSyncErr [20:17] : 0x3   accumulated prerequisite mismatch
May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 612655 local6.error] /partition1/RP2/sdc0: >>>
    SafariPortError8[0x280] : 0x00088008
    FE            [15:15] : 0x1
    AccParL2ErrDT [19:19] : 0x1
    ParL2ErrDT    [03:03] : 0x1   L2 parity error for DTransID
May 15 07:38:52 SF6900-1-sc0 Domain-C.SC: [ID 286372 local6.error] [AD] Event: SF6800.ASIC.SDC.PAR_L2_ERR_DT.60143038
    CSN: 0344MM204E DomainID: C ADInfo: 1.SCAPP.20.3 Time: Fri May 15 07:38:52 MSD 2009
    FRU-List-Count: 2;
    FRU-PN: 5014404; FRU-SN: 046286; FRU-LOC: /N0/IB8
    FRU-PN: 5016418; FRU-SN: 004613; FRU-LOC: RP2
    Recommended-Action: Service action required
Does it look like a bunch of cryptic messages that only those initiated into Sun's engineering secrets could decipher? Well, as always, the truth is somewhere in between: in our case we could only make an assumption about which part of our big system was faulty, or had just gone off the beam for a jiffy. So, let's go forward…
First, we see two errors that took place simultaneously:
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8
Since we have FE (First Error) [15:15] : 0x1 in both records, these two alerts indeed happened at the same time. But keep in mind that they are unrelated to each other: the FE bit is only valid within a single ASIC and has no relation to errors reported by other ASICs in the system. Next:
/partition1/domain0/IB8/ar0: >>>
    L2CheckError[0x6150] : 0x06068606
    CMDVSyncErr    [12:09] : 0x3   Ports [9:6] command valid mismatched against internal expected command valid
    PreqSyncErr    [04:01] : 0x3   Ports [9:6] prereq mismatched against internal expected prereq
    AccCMDVSyncErr [28:25] : 0x3   accumulated valid command mismatch
    FE             [15:15] : 0x1
    AccPreqSyncErr [20:17] : 0x3   accumulated prerequisite mismatch
It just tells us that ports 6 through 9 of the AR (Address Repeater) on IO board 8 recorded CMDVSyncErr and PreqSyncErr. More details could be found here.
The value 0x3 is a hint that RP2/RP3 were involved. "Acc" stands for "accumulated", so the Acc[CMDVSyncErr|PreqSyncErr] lines simply tell us that these errors occurred more than once.
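The field values above are just bit ranges sliced out of the raw register word. As a minimal sketch (the helper is hypothetical, not a Sun tool; the [hi:lo] boundaries are copied from the showlogs output), the L2CheckError value can be decoded by hand:

```python
def field(value, hi, lo):
    """Extract the inclusive bit range [hi:lo] from a register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

l2_check_error = 0x06068606  # L2CheckError[0x6150] on /partition1/domain0/IB8/ar0

fields = {
    "PreqSyncErr":    (4, 1),    # ports [9:6] prereq mismatch
    "CMDVSyncErr":    (12, 9),   # ports [9:6] command valid mismatch
    "FE":             (15, 15),  # first error seen by this ASIC
    "AccPreqSyncErr": (20, 17),  # accumulated prereq mismatches
    "AccCMDVSyncErr": (28, 25),  # accumulated command valid mismatches
}

for name, (hi, lo) in fields.items():
    print(f"{name:<15} [{hi:02}:{lo:02}] : {field(l2_check_error, hi, lo):#x}")
```

Running this reproduces exactly the values from the log: 0x3 for both sync errors and their accumulated counterparts, and 0x1 for FE.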
Continue with the second error.
/partition1/RP2/sdc0: >>>
    SafariPortError8[0x280] : 0x00088008
    FE            [15:15] : 0x1
    AccParL2ErrDT [19:19] : 0x1
    ParL2ErrDT    [03:03] : 0x1   L2 parity error for DTransID
This is a clear indication of the parity error on port 8 of SDC (Serengeti Data Controller), on RP2. Consulting “Sun Fire™ 6800/4800/4810/3800 Systems Troubleshooting Manual” revealed that port 8 connects to IB8.
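The same bit-slicing arithmetic confirms this reading of the SDC register; again, the `field` helper below is only an illustrative sketch, with bit positions taken from the log:

```python
def field(value, hi, lo):
    """Extract the inclusive bit range [hi:lo] from a register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

safari_port_error8 = 0x00088008  # SafariPortError8[0x280] on /partition1/RP2/sdc0

print(field(safari_port_error8, 3, 3))    # ParL2ErrDT: L2 parity error for DTransID -> 1
print(field(safari_port_error8, 15, 15))  # FE: this ASIC logged its first error -> 1
print(field(safari_port_error8, 19, 19))  # AccParL2ErrDT: the error repeated -> 1
```

So only three bits are set in the whole register, and all three point at the same thing: a recurring L2 parity error on the DTransID, first seen by this SDC.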
In the end, we have a list of suspected FRUs:
- RP2
- IB8
What's next? With 99% probability, you will be given a recommendation to monitor your box for a couple of weeks; only if the same error knocks your server down again will one of those parts be replaced and the investigation spun up at a deeper level.