Identifying a broken disk in HP DL360
Since I don’t have much experience with the HP DL series, I was puzzled for a bit when I found out that one of the disks was broken. How did I find out? Easy, by gazing at the “Faulty LED”, which was steadily lit. Not bad, but I wanted to be able to grab more detailed information from the console. Since the DL360 has a built-in “HP Smart Array” controller, you won’t be able to squeeze much information out of the system with ordinary tools such as fdisk, because the system sees only the logical drive presented by the array controller.
The solution lay on the surface: I headed to www.hp.com, downloaded and installed the hpacucli RPM, and that was it. Now I could see everything I wanted:
# hpacucli ctrl all show

Smart Array 6i in Slot 0 (Embedded)

# hpacucli ctrl all show detail

Smart Array 6i in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   RAID 6 (ADG) Status: Disabled
   Controller Status: OK
   Chassis Slot:
   Hardware Revision: Rev B
   Firmware Version: 2.36
   Rebuild Priority: Low
   Expand Priority: Low
   Surface Scan Delay: 15 secs
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 100% Read / 0% Write
   Total Cache Size: 64 MB
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
   SATA NCQ Supported: False

# hpacucli ctrl slot=0 logicaldrive all show

Smart Array 6i in Slot 0 (Embedded)
   array A (Failed)
      logicaldrive 1 (136.7 GB, RAID 1, Interim Recovery Mode)

# hpacucli ctrl slot=0 physicaldrive all show

Smart Array 6i in Slot 0 (Embedded)
   array A (Failed)
      physicaldrive 1:0 (port 1:id 0 , Parallel SCSI, ??? GB, Failed)
      physicaldrive 1:1 (port 1:id 1 , Parallel SCSI, 146.8 GB, Predictive Failure)
Sorted.
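If you would rather not eyeball this output every time, the drive status can be pulled out of it with a few lines of script. Here is a minimal Python sketch that works on the physicaldrive listing shown in this post. The function name unhealthy_drives is my own invention, and the assumption that a healthy drive reports "OK" as its last field is just that, an assumption, so check it against the output on your own box.

```python
import re

# Sample text copied from the `physicaldrive all show` output in this post.
SAMPLE = """\
Smart Array 6i in Slot 0 (Embedded)
   array A (Failed)
      physicaldrive 1:0 (port 1:id 0 , Parallel SCSI, ??? GB, Failed)
      physicaldrive 1:1 (port 1:id 1 , Parallel SCSI, 146.8 GB, Predictive Failure)
"""

# Matches "physicaldrive <id> (<comma-separated details>)".
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)\s*$")


def unhealthy_drives(text):
    """Return (drive_id, status) pairs for every drive whose status is not OK.

    Assumption: a healthy drive shows "OK" as the last field in parentheses.
    """
    result = []
    for line in text.splitlines():
        match = DRIVE_RE.search(line)
        if not match:
            continue
        drive_id, details = match.groups()
        # The status is the last comma-separated field inside the parentheses.
        status = details.split(",")[-1].strip()
        if status != "OK":
            result.append((drive_id, status))
    return result


if __name__ == "__main__":
    for drive_id, status in unhealthy_drives(SAMPLE):
        print(f"physicaldrive {drive_id}: {status}")
```

In real life you would feed it the captured output of hpacucli instead of SAMPLE and hook the result into whatever alerting you already have.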
Missing volboot file
If running “vxdctl enable” gives you the following error:
VxVM vxdctl ERROR V-5-1-1589 enable failed: Volboot file not loaded
then this sequence may help you resolve the problem:
vxiod set 10
vxconfigd -d
vxdctl init
vxdctl enable
Good luck.
Steadfast tin soldier
Last weekend was saturated with events of every sort and kind. First of all, since summer has finally arrived, I moved away from stuffy and noisy Moscow to the countryside. The drawback is quite noticeable: since I will be bouncing between Moscow and my new place daily for the next three months, I will also have to spend at least 4-5 hours per day sitting locked in a car in traffic jams. Gosh! Anyway, the overall benefits vastly outweigh all the inconveniences; plus, apart from living in a quiet and neat place, all the nice and beautiful sightseeing spots near Moscow have become closer. We set off for one such place, Borodino, on Sunday.
This is one of the most memorable and exciting places I’ve ever been to. Endless, picturesque scenery, profusely sprinkled with the blood of soldiers defending the Fatherland.
Here, every last Sunday of May, a kids’ festival called “Steadfast tin soldier” is held. During the festival everyone can observe an old military bivouac of Napoleon’s era and admire the marching soldiers dressed in old military attire, with shakos on their heads and armed with muskets. Many people in quaint costumes can also be seen hanging around. Purely fantastic!
Frankly speaking, this event is very rich in every aspect: the atmosphere, the mood and the overall openness, the moral and the historical significance. It is very touching, and it’s simply impossible to stay indifferent.
The culmination of the festival is a reenactment of episodes of the Battle of Borodino, which took place here almost 200 years ago, in 1812, with cannons, cavalry, real blasts, clods of soil and smoke from musket fire. A spectacular sight!
Unified Storage Simulator
Recently I had a chance to fiddle with the Sun Storage 7000 Simulator and was totally amazed by this product. It’s absolutely fantastic because it gives everyone an opportunity to study and get familiar with the Sun 7xxx storage appliance just by means of VirtualBox (or VMware). Once you boot it and go through the initial configuration, you will be presented with 15 virtual disks on which you can create filesystems and/or LUNs and share them in whatever manner you prefer: iSCSI, WebDAV, NFS, CIFS, FTP, NDMP.
Initially I was thinking about writing a step-by-step installation and configuration review, but once I went through it myself I cast the idea aside because the process is so simple and plain. It’s just that easy and straightforward. More than that, it comes with easy-to-understand documentation, and if you prefer the CLI you won’t feel deprived: tab completion and “help command” just don’t give you a single chance to get lost or confused.
The web interface is certainly friendlier, and in my opinion it is your day-to-day assistant and the place where you will do most of your work. But all the background and cron job scripts will definitely be pumped through the CLI.
And of course I just can’t pass over in silence the famous Analytics feature. It’s epic! From a single menu, even with no idea what DTrace is, you can drill down to the very source of your problem and identify the culprit no matter which tier it is on. View the data online and in real time, and analyze CPUs, caches, disks and protocols broken down by dozens of metrics, e.g. type of operation (read or write), clients, files, latency and much, much more. Just see it and spend some time playing with it.
OpenSolaris 2009.06 is here
Today, during the ongoing CommunityOne conference, the new OpenSolaris release was announced with a bunch of compelling new features, e.g. Crossbow, ClearView, COMSTAR, SPARC support and a lot more. Release details can be found on the OpenSolaris web site.
Utterly depressed
Online petition – Let Alexandra Come Back to Portugal
Spontaneous domain reboot on SunFire 6800
Because the respective case at Sun has been closed, I want to add this note for future reference, just in case. So… One day I came to my desk and found that one of the domains on our SF6800 had rebooted for no reason, or at least that was the very first impression. A quick, superficial look at /var/adm/messages and the prtdiag output revealed no hardware or software issues. The next step was to log in to the SC and dig a bit deeper. There, showboards, showfru, showchs and showplatform all looked fine, but the showlogs output, and especially showlogs -d C, put me on my guard:
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 757768 local6.crit] ErrorMonitor: Domain C has a SYSTEM ERROR
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8
May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 894554 local6.error] /partition1/domain0/IB8/ar0: >>> L2CheckError[0x6150] : 0x06068606
    CMDVSyncErr [12:09] : 0x3 Ports [9:6] command valid mismatched against internal expected command valid
    PreqSyncErr [04:01] : 0x3 Ports [9:6] prereq mismatched against internal expected prereq
    AccCMDVSyncErr [28:25] : 0x3 accumulated valid command mismatch
    FE [15:15] : 0x1
    AccPreqSyncErr [20:17] : 0x3 accumulated prerequisite mismatch
May 15 07:38:51 SF6900-1-sc0 Domain-C.SC: [ID 612655 local6.error] /partition1/RP2/sdc0: >>> SafariPortError8[0x280] : 0x00088008
    FE [15:15] : 0x1
    AccParL2ErrDT [19:19] : 0x1
    ParL2ErrDT [03:03] : 0x1 L2 parity error for DTransID
May 15 07:38:52 SF6900-1-sc0 Domain-C.SC: [ID 286372 local6.error] [AD] Event: SF6800.ASIC.SDC.PAR_L2_ERR_DT.60143038
    CSN: 0344MM204E DomainID: C ADInfo: 1.SCAPP.20.3
    Time: Fri May 15 07:38:52 MSD 2009
    FRU-List-Count: 2; FRU-PN: 5014404; FRU-SN: 046286; FRU-LOC: /N0/IB8
    FRU-PN: 5016418; FRU-SN: 004613; FRU-LOC: RP2
    Recommended-Action: Service action required
Does it look like a bunch of cryptic messages that only those initiated into Sun’s engineering secrets could decipher? Well, as always, the truth is somewhere in between, because in our case we can only make an assumption about which part of our big system is faulty or just went off the beam for a jiffy. So, let’s go forward…
First, we see two errors that took place simultaneously:
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 346505 local6.error] RP2 encountered the first error
May 15 07:38:50 SF6900-1-sc0 Domain-C.SC: [ID 628870 local6.error] ArAsic reported first error on /N0/IB8
Since (First Error) FE [15:15] : 0x1 is set in both errors, these two alerts did indeed happen at the same time. But keep in mind that this bit does not relate them to each other: the FE bit is only valid within a single ASIC and has no relation to errors reported by other ASICs in the system. Next:
/partition1/domain0/IB8/ar0: >>> L2CheckError[0x6150] : 0x06068606
    CMDVSyncErr [12:09] : 0x3 Ports [9:6] command valid mismatched against internal expected command valid
    PreqSyncErr [04:01] : 0x3 Ports [9:6] prereq mismatched against internal expected prereq
    AccCMDVSyncErr [28:25] : 0x3 accumulated valid command mismatch
    FE [15:15] : 0x1
    AccPreqSyncErr [20:17] : 0x3 accumulated prerequisite mismatch
It just tells us that ports 6 through 9 of the AR (Address Repeater) on IO board 8 received CMDVSyncErr and PreqSyncErr. More details can be found here.
0x3 is a hint that tells us that RP2/RP3 were involved. Acc stands for “accumulated”, and hence the Acc[CMDVSyncErr|PreqSyncErr] lines just inform us that these errors occurred more than once.
Let’s continue with the second error.
/partition1/RP2/sdc0: >>> SafariPortError8[0x280] : 0x00088008
    FE [15:15] : 0x1
    AccParL2ErrDT [19:19] : 0x1
    ParL2ErrDT [03:03] : 0x1 L2 parity error for DTransID
This is a clear indication of a parity error on port 8 of the SDC (Serengeti Data Controller) on RP2. Consulting the “Sun Fire™ 6800/4800/4810/3800 Systems Troubleshooting Manual” revealed that port 8 connects to IB8.
In the end we have a list of suspected FRUs:
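By the way, there is no magic in the [hi:lo] notation: each line is simply a slice of bits cut out of the raw 32-bit register value. Below is a small Python sketch of that arithmetic, using only the two register values and the field positions printed by showlogs. The helper names field and decode are mine; the SC does all of this for you, so this is purely to show how the numbers fit together.

```python
def field(value, hi, lo):
    """Extract bits hi..lo (inclusive, bit 0 is the least significant)."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def decode(value, fields):
    """Map every field name to the value of its bit range."""
    return {name: field(value, hi, lo) for name, (hi, lo) in fields.items()}


# Register values and field positions copied from the showlogs output.
L2_CHECK_ERROR = 0x06068606
L2_CHECK_FIELDS = {
    "CMDVSyncErr": (12, 9),
    "PreqSyncErr": (4, 1),
    "AccCMDVSyncErr": (28, 25),
    "FE": (15, 15),
    "AccPreqSyncErr": (20, 17),
}

SAFARI_PORT_ERROR8 = 0x00088008
SAFARI_PORT_FIELDS = {
    "FE": (15, 15),
    "AccParL2ErrDT": (19, 19),
    "ParL2ErrDT": (3, 3),
}

if __name__ == "__main__":
    for name, value in decode(L2_CHECK_ERROR, L2_CHECK_FIELDS).items():
        print(f"{name}: {value:#x}")
    for name, value in decode(SAFARI_PORT_ERROR8, SAFARI_PORT_FIELDS).items():
        print(f"{name}: {value:#x}")
```

The decoded values match what the SC printed: 0x3 for all four sync error fields, and 0x1 for FE and both parity bits.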
- RP2
- IB8
What’s next? With 99% probability you will be advised to monitor your box for a couple of weeks, and only if the same error knocks your server down again will one of those parts be replaced and the investigation move to a deeper level.
Maximum number of processes
If you have some Linux background under your belt, then the first command you would think of is probably ulimit -a. The same command exists under Solaris:
root@root # ulimit -a
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
open files                    (-n) 32768
pipe size          (512 bytes, -p) 10
stack size            (kbytes, -s) 8192
cpu time             (seconds, -t) unlimited
max user processes            (-u) 19995
virtual memory        (kbytes, -v) unlimited
But there is a small difference. Whilst under Linux you are free to use it to change the maximum number of processes available to a single user, under Solaris this won’t work and ulimit complains:
ulimit: max user processes: cannot modify limit: Invalid argument
So what’s next? Remember that the maximum size of the process table depends on the total amount of physical memory installed in the system. This dependence is reflected in an internal kernel variable called maxusers, which is determined at boot time.
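#define MIN_DEFAULT_MAXUSERS 8u
#define MAX_DEFAULT_MAXUSERS 2048u
#define MAX_MAXUSERS 4096u

/* Only derive maxusers if it has not been set explicitly. */
if (maxusers == 0) {
        /* Physical memory and free kernel heap, both in megabytes. */
        pgcnt_t physmegs = physmem >> (20 - PAGESHIFT);
        pgcnt_t virtmegs = vmem_size(heap_arena, VMEM_FREE) >> 20;
        /* Take the smaller of the two, clamped to [8, 2048]. */
        maxusers = MIN(MAX(MIN(physmegs, virtmegs),
            MIN_DEFAULT_MAXUSERS), MAX_DEFAULT_MAXUSERS);
}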
It is also used to set two other kernel variables, max_nprocs and maxuprc, which describe the maximum number of processes system-wide and the maximum number of processes an ordinary user can have, respectively.
if (max_nprocs == 0)
        max_nprocs = (10 + 16 * maxusers);
if (platform_max_nprocs > 0 && max_nprocs > platform_max_nprocs)
        max_nprocs = platform_max_nprocs;
if (max_nprocs > maxpid)
        max_nprocs = maxpid;
if (maxuprc == 0)
        maxuprc = (max_nprocs - reserved_procs);
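To get a feel for the numbers, here is the same arithmetic replayed in a few lines of Python. It is only a sketch of the two kernel fragments quoted in this post, not the kernel code itself. The default values maxpid=30000 and reserved_procs=5 are my assumptions, since neither fragment shows them, and the function names are made up for the example.

```python
MIN_DEFAULT_MAXUSERS = 8
MAX_DEFAULT_MAXUSERS = 2048


def default_maxusers(physmegs, virtmegs):
    """Smaller of physical memory and free kernel heap (MB), clamped to [8, 2048]."""
    return min(max(min(physmegs, virtmegs), MIN_DEFAULT_MAXUSERS),
               MAX_DEFAULT_MAXUSERS)


def process_limits(maxusers, max_nprocs=0, maxuprc=0,
                   platform_max_nprocs=0, maxpid=30000, reserved_procs=5):
    """Return (max_nprocs, maxuprc) following the logic quoted above.

    A value of 0 means "not set explicitly", as in the kernel code.
    maxpid=30000 and reserved_procs=5 are assumed defaults.
    """
    if max_nprocs == 0:
        max_nprocs = 10 + 16 * maxusers
    if platform_max_nprocs > 0 and max_nprocs > platform_max_nprocs:
        max_nprocs = platform_max_nprocs
    if max_nprocs > maxpid:
        max_nprocs = maxpid
    if maxuprc == 0:
        maxuprc = max_nprocs - reserved_procs
    return max_nprocs, maxuprc


if __name__ == "__main__":
    # A box with 16 GB of RAM and 4 GB of free kernel heap (made-up figures).
    users = default_maxusers(physmegs=16384, virtmegs=4096)
    print("maxusers:", users)
    print("limits:", process_limits(users))
    # Same box with max_nprocs explicitly set to 20000.
    print("limits:", process_limits(users, max_nprocs=20000))
```

With maxusers at its ceiling of 2048 the formula gives 10 + 16 * 2048 = 32778, which is then capped by the assumed maxpid. If max_nprocs is set to 20000 explicitly, maxuprc comes out as 19995, the same figure ulimit -a reported above.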
To display the current values from the console, just run mdb against the live kernel (mdb -k) and explore these variables:
> maxusers/D
maxusers:
maxusers:       2048
> max_nprocs/D
max_nprocs:
max_nprocs:     20000
> maxuprc/D
maxuprc:
maxuprc:        19995
To set the maximum number of processes a non-root user can have, just update the maxuprc value through either mdb or the /etc/system file. Keep in mind that:
- maxuprc must be less than max_nprocs
- If you want your settings to persist across reboots, use the /etc/system file.
Whilst what I’ve said here is true for both Solaris 9 and 10, in Solaris 10 you can use “Resource Management” to create more refined constraints on the way a user can run his/her processes.
What’s new in OpenSolaris 2009.06
If you’re curious about the new features and technologies that are going to be introduced in the upcoming OpenSolaris release, then this presentation prepared by Peter Dennis is a must-read.
No way I want to make the same mistakes again
To avoid stepping on the same rake again and to fix the issue described in this post, I came up with a simple expect script that saves the current configuration of QLogic SANbox switches.
#!/usr/local/bin/expect -f

# Switches to back up and the credentials to use (placeholders).
set switches "switch1 switch2"
set user {user}
set pass {pass}
set ftp_user {ftp_user}
set ftp_pass {ftp_pass}

set timeout 10
log_user 0

set prompt "(%|#|\\$) $"
catch {set prompt $env(EXPECT_PROMPT)}

# Today's date stamps the new backup; the date 7 days back
# identifies the old backup that gets deleted.
set sec [clock seconds]
set date [clock format $sec -format %d%m%Y]
set back [clock add $sec -7 days]
set bdate [clock format $back -format %d%m%Y]

for {set x 0} {$x<[llength $switches]} {incr x} {
    set current_switch [lindex $switches $x]

    # Log in over telnet and create the backup on the switch.
    spawn telnet $current_switch
    expect {
        timeout {puts "timeout while connecting to $current_switch"; exit 1}
        "login:"
    }
    send "$user\r"
    expect {
        timeout {puts "timed out waiting for the password prompt"; exit 1}
        "Password:"
    }
    send "$pass\r"
    expect {
        timeout {puts "timed out after login"; exit 1}
        "#>"
    }
    send "admin start\r"
    expect {
        timeout {puts "timed out waiting for admin mode"; exit 1}
        "(admin) #>"
    }
    send "config backup\r"
    expect { "(admin) #>" }
    send "admin end\r"
    expect { "#>" }
    send "quit\r"

    # Fetch the backup file from the same switch over FTP.
    spawn ftp $current_switch
    expect {
        timeout {puts "timed out waiting for ftp login request"; exit 1}
        "Name"
    }
    send "$ftp_user\r"
    expect {
        timeout {puts "timed out waiting for ftp password request"; exit 1}
        "Password:"
    }
    send "$ftp_pass\r"
    expect {
        timeout {puts "timed out waiting for ftp prompt"; exit 1}
        "ftp>"
    }
    send "get configdata /path_to_backup_directory/configdata_$current_switch-$date\r"
    expect "ftp>"
    send "quit\r"

    # Remove the backup taken exactly 7 days ago, if it exists.
    if {[file exists /path_to_backup_directory/configdata_$current_switch-$bdate]} {
        exec /usr/bin/rm /path_to_backup_directory/configdata_$current_switch-$bdate
    }
}
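The only part of the script that is not plain send/expect is the housekeeping: a file name stamped with the date in ddmmYYYY form, and removal of the copy made exactly seven days earlier. For those who are not fluent in Tcl, here is a rough Python sketch of just that naming and rotation logic. The function names and the keep_days parameter are hypothetical, and the backup directory is whatever you pass in.

```python
import datetime
import os


def backup_names(switch, today, keep_days=7):
    """Return (new_name, expired_name) using the ddmmYYYY stamp from the script."""
    stamp = today.strftime("%d%m%Y")
    expired = (today - datetime.timedelta(days=keep_days)).strftime("%d%m%Y")
    return f"configdata_{switch}-{stamp}", f"configdata_{switch}-{expired}"


def rotate(backup_dir, switch, today, keep_days=7):
    """Delete the backup taken exactly keep_days ago; return True if one was removed."""
    _, expired_name = backup_names(switch, today, keep_days)
    path = os.path.join(backup_dir, expired_name)
    if os.path.exists(path):
        os.remove(path)
        return True
    return False


if __name__ == "__main__":
    new_name, expired_name = backup_names("switch1", datetime.date.today())
    print("today's backup:", new_name)
    print("backup to expire:", expired_name)
```

Note one property this shares with the expect script: only the file from exactly seven days back is removed, so if the job skips a day, the backup that should have expired on that day stays on disk until somebody cleans it up by hand.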