How to live with disabled disks
Recently, I was working with a customer’s Sun StorageTek 2540 array which mysteriously had all twelve disks in a disabled state.
Details
Name: t85d01
ID: Tray.85.Drive.1
Array:
Array Type: 2540
Tray: 85
Slot Number: 1
Role: Unassigned
Virtual Disk: –
State: Disabled
Status: Optimal
Capacity: 279.396 GB
Type: SAS
Speed (RPM): 15000 RPM
Firmware: SA04
Serial number: 000834C6K8PC JHV6K8PC
Disk World Wide Name: 50:00:CC:A0:0D:0B:EC:60
The funniest thing was that SupportData showed the opposite. So if you find yourself in the same situation, don’t waste your time googling or resetting the controllers and the array’s configuration: it won’t help. All you’ll see is the following error:
“The operation cannot be completed because either there is a problem communicating with one or more of the disks in the storage array, or no disks are currently connected. Correct the problem, then retry the operation.”
The only possible salvation was opening a support case with Sun. After that, the engineer simply wiped out all the data by running sysReset and sysReboot from VxWorks, the OS that drives this disk array. To get into it, you have to know a super-duper password, which Sun keeps secret.
What was the root cause? My take is that during the initial configuration someone did a really bad thing to this array, e.g. abruptly terminated a firmware upgrade or tried to downgrade it from CAM, which corrupted the drive configuration databases (DACstore). So be careful and prudent, and double-check that your backups are up to date.
P.S. This also applies to 6140/6540 arrays.
Brendan Gregg talks about DTrace
Another fantastic presentation from Brendan and, just like all the previous ones, it’s a must-see video.
The end of “Fair Play”
After Thierry’s handball there is no trust left either in the pompous gentry at FIFA or in their already fscked slogan. It’s such a shame…
Plugging HP-UX into SAN
Our task for today is to connect HP-UX (11.31 release) to an MSA2312fc through a SAN with two Brocade switches in between. First, we need to find all the FC cards installed in our server; once we have that piece of information we can dig out their WWNs and map them later on the disk array.
Let’s start from the beginning. ioscan is our best friend in revealing the inner life of our server:
bash-4.0# ioscan -funC fc
Class     I  H/W Path     Driver  S/W State   H/W Type     Description
===================================================================
fc        0  0/3/0/0/0/0  fcd     CLAIMED     INTERFACE    HP 4Gb Dual Port PCIe Fibre Channel Mezzanine (FC Port 1)
                          /dev/fcd0
fc        1  0/3/0/0/0/1  fcd     CLAIMED     INTERFACE    HP 4Gb Dual Port PCIe Fibre Channel Mezzanine (FC Port 2)
                          /dev/fcd1
The output above says everything we need to know. So now, I’m going to use fcmsutil to display WWN information from our cards:
bash-4.0# fcmsutil /dev/fcd0
Vendor ID is = 0x1077
Device ID is = 0x2432
PCI Sub-system Vendor ID is = 0x103C
PCI Sub-system ID is = 0x1705
PCI Mode = PCI Express x4
ISP Code version = 4.2.2
ISP Chip version = 3
Topology = PTTOPT_FABRIC
Link Speed = 4Gb
Local N_Port_id is = 0x020b01
Previous N_Port_id is = None
N_Port Node World Wide Name = 0x5001438004c2e159
N_Port Port World Wide Name = 0x5001438004c2e158
Switch Port World Wide Name = 0x200b00051e868762
Switch Node World Wide Name = 0x100000051e868762
N_Port Symbolic Port Name = oamdwh1_fcd0
N_Port Symbolic Node Name = oamdwh1_HP-UX_B.11.31
Driver state = ONLINE
Hardware Path is = 0/3/0/0/0/0
Maximum Frame Size = 2048
Driver-Firmware Dump Available = NO
Driver-Firmware Dump Timestamp = N/A
Driver Version = @(#) fcd B.11.31.0803 Jan 20 2008
bash-4.0#
bash-4.0# fcmsutil /dev/fcd1
Vendor ID is = 0x1077
Device ID is = 0x2432
PCI Sub-system Vendor ID is = 0x103C
PCI Sub-system ID is = 0x1705
PCI Mode = PCI Express x4
ISP Code version = 4.2.2
ISP Chip version = 3
Topology = PTTOPT_FABRIC
Link Speed = 4Gb
Local N_Port_id is = 0x010b01
Previous N_Port_id is = None
N_Port Node World Wide Name = 0x5001438004c2e15b
N_Port Port World Wide Name = 0x5001438004c2e15a
Switch Port World Wide Name = 0x200b00051e855e8b
Switch Node World Wide Name = 0x100000051e855e8b
N_Port Symbolic Port Name = oamdwh1_fcd1
N_Port Symbolic Node Name = oamdwh1_HP-UX_B.11.31
Driver state = ONLINE
Hardware Path is = 0/3/0/0/0/1
Maximum Frame Size = 2048
Driver-Firmware Dump Available = NO
Driver-Firmware Dump Timestamp = N/A
Driver Version = @(#) fcd B.11.31.0803 Jan 20 2008
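fcmsutil prints the WWNs as 0x-prefixed hex strings, while the zoning commands further down use the colon-separated form. Retyping sixteen hex digits by hand is a nice way to make a typo, so here is a minimal sketch of a shell helper that does the conversion. The function name wwn_to_colon is my own invention, not part of any HP-UX or Brocade tool:

```shell
# Hypothetical helper (name is an assumption, not a standard tool):
# turn 0x5001438004c2e15a into 50:01:43:80:04:c2:e1:5a
wwn_to_colon() {
    # strip the 0x prefix, put a colon after every two characters,
    # then drop the trailing colon
    printf '%s\n' "$1" | sed -e 's/^0[xX]//' -e 's/../&:/g' -e 's/:$//'
}

# Port WWN of /dev/fcd1 as reported by fcmsutil
wwn_to_colon 0x5001438004c2e15a
```

Note that it is the N_Port Port World Wide Name that goes into the alias, not the Node one.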
With this information we can proceed with the zone configuration on the FC switches:
alicreate "hpux_fcd1", "50:01:43:80:04:c2:e1:5a"
alicreate "MSA2312_A1", "20:70:00:c0:ff:d8:bb:a4"
zonecreate "hpux_msa2312", "hpux_fcd1;MSA2312_A1"
cfgadd "HPUX_cfg", "hpux_msa2312"
For brevity I omitted the steps required to configure the second switch since they are almost the same. The only differences are the alias names, the zone names and the WWNs.
Don’t forget to save and enable the newly created configuration on the switch:
cfgsave HPUX_cfg
cfgenable HPUX_cfg
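Since the second switch needs the same commands with different names and WWNs, they can be generated instead of typed. This is only a sketch: zone_cmds is a hypothetical helper of mine that prints the commands for one host port and one array port, and you still have to paste them into the switch CLI yourself:

```shell
# Hypothetical helper: print the zoning commands for one host port / array port pair.
# Arguments: host alias, host WWN, array alias, array WWN, zone name, config name
zone_cmds() {
    printf 'alicreate "%s", "%s"\n' "$1" "$2"
    printf 'alicreate "%s", "%s"\n' "$3" "$4"
    printf 'zonecreate "%s", "%s;%s"\n' "$5" "$1" "$3"
    printf 'cfgadd "%s", "%s"\n' "$6" "$5"
}

# The first switch, with the values used in this post
zone_cmds hpux_fcd1 50:01:43:80:04:c2:e1:5a \
          MSA2312_A1 20:70:00:c0:ff:d8:bb:a4 \
          hpux_msa2312 HPUX_cfg
```

For the second switch you would pass the port WWN of /dev/fcd0 and the WWN of the array port visible on that fabric, with alias and zone names of your own choosing.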
All that we have to do next, apart from creating and mapping LUNs on our storage, is to tell the system to scan for new disks and to create special files for them:
# insf -C disk
To double-check that the new disks have been successfully added, do the following:
# ioscan -m dsf
Persistent DSF           Legacy DSF(s)
========================================
/dev/rdisk/disk1         /dev/rdsk/c0t0d0
/dev/rdisk/disk1_p1      /dev/rdsk/c0t0d0s1
/dev/rdisk/disk1_p2      /dev/rdsk/c0t0d0s2
/dev/rdisk/disk1_p3      /dev/rdsk/c0t0d0s3
/dev/pt/pt4              /dev/rscsi/c3t0d0
/dev/pt/pt5              /dev/rscsi/c2t0d0
/dev/rdisk/disk6         /dev/rdsk/c4t0d1
                         /dev/rdsk/c5t0d1
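On a box with many LUNs this list gets long, and a few lines of awk can point out the persistent DSFs that have several legacy DSFs behind them, which is what a LUN seen through two paths looks like in the output above. A rough sketch follows. The function name count_legacy_paths is hypothetical, and I’m assuming that additional legacy DSFs are printed on continuation lines that start with whitespace:

```shell
# Hypothetical helper: read "ioscan -m dsf" output on stdin and print
# each persistent DSF followed by the number of legacy DSFs mapped to it.
# Assumption: extra legacy DSFs appear on indented continuation lines.
count_legacy_paths() {
    awk '
        # a line starting with /dev/ opens a new persistent DSF record
        /^\/dev\//       { cur = $1; order[++k] = cur; n[cur] = NF - 1; next }
        # an indented /dev/ line adds more legacy DSFs to the current record
        /^[ \t]+\/dev\// { if (cur != "") n[cur] += NF }
        END              { for (i = 1; i <= k; i++) print order[i], n[order[i]] }
    '
}

# Usage on the HP-UX host:
#   ioscan -m dsf | count_legacy_paths
```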
/dev/rdisk/disk6 is our new lovely friend. To confirm that, use the scsimgr command:
# scsimgr inquiry -D /dev/rdisk/disk6
INQUIRY INFORMATION FOR LUN: /dev/rdisk/disk6
Peripheral Device Type: 0 (Direct Access)
Peripheral Qualifier: 0 (Peripheral Device Connected)
Removable Media: No
ANSI Version: 5 (Complies to SPC-3)
Normal ACA Support: No
Hierarchical Support: 0x1 (Hierarchical addressing model used)
Response Data Format: 2 (SPC-3)
Additional Length: 155
SCC Support: 0 (No Embedded Storage Array Controller)
Access Controls Coordinator: No
Target Port Group Support: 0x1 (Implicit Asymmetric Access Support)
Third-Party Copy: No
Protect: 0 (Protection Information NOT Supported)
Basic Queuing: 0
Command Queuing: 0x1 (Full Task Management Model (SAM-3))
Enclosure Services: No
Multi-port Device Support: Yes
Medium Changer Support: No
Supports 16-bit wide SCSI addresses: No
Support for 16 Bit Transfers: No
Synchronous Data Transfers: No
Linked Command Support: No
Vendor Identification: "HP "
Product Identification: "MSA2312fc "
Product Revision Level: "M110"
Vendor Specific Data:
20 20 20 20 20 20 20 20 "        "
43 41 50 49 20 20 41 41 "CAPI  AA"
66 20 20 20 "f   "
Clocking: 0 (Supports only ST)
Quick Arbitration & Selection support: No
Information Unit transfers Support: No
Finally, when you decide to create an LVM configuration on top of the newly added disk, don’t use the legacy DSFs (all those c#t#d#) but use the persistent DSF instead, i.e. /dev/rdisk/disk6, and HP-UX will deal with multipathing on its own:
# scsimgr lun_map -D /dev/rdisk/disk6
LUN PATH INFORMATION FOR LUN : /dev/rdisk/disk6
Total number of LUN paths = 2
World Wide Identifier(WWID) = 0x600c0ff000d8d17c1b1bf84a01000000
LUN path : lunpath4
Class = lunpath
Instance = 4
Hardware path = 0/3/0/0/0/0.0x247000c0ffd8bba4.0x4001000000000000
SCSI transport protocol = fibre_channel
State = STANDBY
Last Open or Close state = STANDBY
LUN path : lunpath5
Class = lunpath
Instance = 5
Hardware path = 0/3/0/0/0/1.0x207000c0ffd8bba4.0x4001000000000000
SCSI transport protocol = fibre_channel
State = ACTIVE
Last Open or Close state = ACTIVE
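If all you care about is which path is active and which one is on standby, the lun_map output can be boiled down to one line per path. Again just a sketch: lun_path_states is a name I made up, and it relies on the "LUN path :" and "State =" lines looking exactly like the ones above:

```shell
# Hypothetical helper: read "scsimgr lun_map" output on stdin and print
# one "lunpathN STATE" line per LUN path.
lun_path_states() {
    awk '
        # remember the name of the path we are currently in
        /^[ \t]*LUN path[ \t]*:/ { path = $NF }
        # "State = ..." but not "Last Open or Close state = ..."
        /^[ \t]*State[ \t]*=/    { print path, $NF }
    '
}

# Usage on the HP-UX host:
#   scsimgr lun_map -D /dev/rdisk/disk6 | lun_path_states
```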
Enjoy.
Stranger in HP-UX
For the last couple of days I’ve been playing heavily with HP-UX, an operating system that was unknown to me and that I had never seen before. That’s true: I had never touched it. But those days are over and now I’m full of new experience and… mixed feelings. You see, on the one hand it’s UNIX, and most of the tools and concepts have a lot in common with other UNIX-based OSes. But… on the flip side, there is a lot of stuff that works differently. I’m not talking about things that are HP-UX specific, e.g. pseudo-swap, LVM or persistent DSFs, but about things widely used and adopted by the others. For example, ifconfig. You can’t use it to view your networking devices. Or another one: the -h option for the ls or df commands. No luck. Not to mention the fact that features that come for free in, let’s say, Solaris, Linux or *BSD would require a special license, e.g. MirrorDisk/UX if you want to mirror disks. Even kmeminfo, the analog of ::memstat from Solaris mdb, is not publicly available. By no means am I trying to pin a label on it; I only want to show that HP-UX is a bit different from the other UNIXes that I, at least, have gotten used to.
Anyway, with this post I’m creating a new category devoted to HP-UX, cherishing the hope that one day someone will learn from my pain. I must admit that at times it’s quite amusing to plunge into something new, because it usually comes with a mix of pure anger and coarse wording, especially when something well-known, as I mentioned before, doesn’t work as expected, and a childish joy when you’ve finally understood the logic and the idea behind it. And that brings enormous fun into life. So let it keep prevailing!
Little Shop of Performance Horrors
Watch a fantastic presentation by Brendan Gregg, who dwells upon different performance aspects and issues with real-life examples. This 2.5-hour discussion has been split into three parts: Part 1, Part 2 and Part 3.
Don’t be frightened off by its length: with Brendan’s energy, charisma and ingenious way of giving speeches you definitely won’t notice it.
Replacing broken disk in SVM RAID5
It’s ineluctable that one day the metacheck script (I strongly encourage you to use it if you don’t already) will report a metadevice problem. In nine cases out of ten the root cause will be a failed disk.
# metastat d90
d90: RAID
    State: Needs Maintenance
    Invoke: metareplace d90 c0t13d0s2
    Interlace: 512 blocks
    Size: 335421567 blocks (159 GB)
Original device:
    Size: 335424000 blocks (159 GB)
    Device      Start Block  Dbase  State        Reloc  Hot Spare
    c0t11d0s2   8019         No     Okay         Yes
    c0t12d0s2   7874         No     Okay         Yes
    c0t13d0s2   10218        No     Maintenance  Yes
    c0t14d0s2   10218        No     Okay         Yes
    c0t15d0s2   8019         No     Okay         Yes
    c0t10d0s2   8019         No     Okay         Yes
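To pull just the sick components out of a listing like this, the State column can be filtered with awk. A small sketch, assuming the column order shown above (device, start block, dbase, state) and a single-word state; failed_components is a hypothetical name and has nothing to do with the metacheck script itself:

```shell
# Hypothetical helper: read "metastat" output on stdin and print the
# components whose State column reads "Maintenance".
# Assumption: the state is the fourth whitespace-separated field.
failed_components() {
    awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+s[0-9]+$/ && $4 == "Maintenance" { print $1 }'
}

# Usage on the Solaris host:
#   metastat d90 | failed_components
```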
It’s not a big deal to fix it properly without unmounting the file system. Just plug a new disk into an empty slot and do metareplace. But what if all the slots are occupied? Well, theoretically it shouldn’t be a problem either:
- Do cfgadm -c unconfigure c#::dsk/c#t#d# to unconfigure the disk.
- Replace it with a new one and use cfgadm -c configure c#::dsk/c#t#d# to make it visible to the system.
- Don’t forget about metadevadm -u c#t#d#
- Finally, type metareplace -e to initiate re-syncing.
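The steps above can be sketched as a small shell function that only prints the commands, so nothing happens until you run them yourself. The name replace_plan and its arguments are hypothetical. I’m assuming that cfgadm wants the attachment point in the c#::dsk/c#t#d# form and that metadevadm -u takes the disk name without a slice:

```shell
# Hypothetical helper: print (not run) the commands for the hot-swap procedure.
# Arguments: controller (c0), disk (c0t13d0), slice number (2), metadevice (d90)
replace_plan() {
    ctrl=$1 disk=$2 slice=$3 md=$4
    echo "cfgadm -c unconfigure ${ctrl}::dsk/${disk}"
    echo "# ... physically swap the disk ..."
    echo "cfgadm -c configure ${ctrl}::dsk/${disk}"
    # assumption: metadevadm -u is given the disk name without a slice
    echo "metadevadm -u ${disk}"
    echo "metareplace -e ${md} ${disk}s${slice}"
}

# Example with the failed component from this post
replace_plan c0 c0t13d0 2 d90
```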
But in my case these steps just didn’t work, and I understand why: my broken disk was still part of an SVM metadevice. Actually, the following error message explained it clearly:
# cfgadm -c unconfigure c0::dsk/c0t13d0
cfgadm: Component system is busy, try again: failed to offline:
     Resource              Information
------------------  ---------------------------------------
/dev/dsk/c0t13d0s2  component of RAID "/dev/md/dsk/d90"
/dev/md/dsk/d90     mounted filesystem "/usr/fsrv/archive"
Since I couldn’t take the filesystem offline and cause downtime, my last resort was to skip the cfgadm steps and brusquely pull the disk from its slot. No sooner said than done. Thankfully, it worked smoothly, and once a new disk found its shelter inside the server the rest was trivial…
# metareplace -e d90 c0t13d0s2
# metastat d90
d90: RAID
    State: Resyncing
    Resync in progress: 0.0% done
    Interlace: 512 blocks
    Size: 335421567 blocks (159 GB)
Original device:
    Size: 335424000 blocks (159 GB)
    Device      Start Block  Dbase  State        Reloc  Hot Spare
    c0t11d0s2   8019         No     Okay         Yes
    c0t12d0s2   7874         No     Okay         Yes
    c0t13d0s2   10218        No     Resyncing    Yes
    c0t14d0s2   10218        No     Okay         Yes
    c0t15d0s2   8019         No     Okay         Yes
    c0t10d0s2   8019         No     Okay         Yes
Studying HDS 9990 and 9985 (THI1570)
Tomorrow I’m leaving for Saint Petersburg for a 5-day training course titled “Installing, Configuring and Maintaining Hitachi Universal Storage Platform™ V and Universal Storage Platform™ VM”. Once I’m done with it, my CTS profile will be 100% complete. Not bad at all ;-)
Back to childhood
Having just returned from a children’s theater (sorry, only in Russian) I can’t wait to express the feelings that overflowed me. The play, called “A hedgehog in the fog”, was awesome and brilliantly performed. And to tell the truth, sitting there in the dark with my kid on my lap I felt like a little boy myself, as if those grown-up years had never happened at all. And my eyes were almost wet… Maybe I’m too sentimental, or maybe it’s because it’s my birthday today, or maybe both. God knows!
Veritas romp
Just came across a really funny piece of code in OpenSolaris.
/*
 * XXX - Don't port this to new architectures
 * A 3rd party volume manager driver (vxdm) depends on the symbol romp.
 * 'romp' has no use with a prom with an IEEE 1275 client interface.
 * The driver doesn't use the value, but it depends on the symbol.
 */
void *romp;	/* veritas driver won't load without romp 4154976 */