Monday, November 16, 2009

Vxplex DISABLED RECOVER state

After a NetApp 6080 hosting FCP LUNs failed this weekend, we came into the office to find that many of the servers using those LUNs had offline volumes and disk groups.


Here was the state of the volume in question:

v szdbor006du02 - DISABLED ACTIVE 2727606272 SELECT - fsgen
pl szdbor006du02-01 szdbor006du02 DISABLED RECOVER 2727606272 CONCAT - RW
sd szdbor006ddg01-01 szdbor006du02-01 szdbor006ddg01 0 209646560 0 c1t500A098187197B34d10 ENA
sd szdbor006ddg02-02 szdbor006du02-01 szdbor006ddg02 209648096 943459616 209646560 c1t500A098187197B34d11 ENA
sd szdbor006ddg03-01 szdbor006du02-01 szdbor006ddg03 0 1153107712 1153106176 c3t500A098287197B34d15 ENA
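Output like the above comes from vxprint; a rough sketch of the commands for checking state (diskgroup is a stand-in for the real disk group name):

```shell
# Show the volume/plex/subdisk hierarchy with kernel and plex states
# ("diskgroup" is a placeholder for the actual disk group name)
vxprint -g diskgroup -ht szdbor006du02

# Check whether any recovery task is actually running
vxtask list
```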
sd szdbor006ddg04-01 szdbor006du02-01 szdbor006ddg04 0 421392384 2306213888 c1t500A09828759382Fd50 ENA

I issued vxrecover on the volume and plex, but the state never changed, and I couldn't find a vxrecover task with ps or vxtask list. The recovery task was somehow confused, I'm guessing, so here is what I needed to do to fix it:
vxplex -g diskgroup det szdbor006du02-01
This put the plex into a DETACHED STALE state.
vxmend -g diskgroup fix clean szdbor006du02-01
This put the plex back into a DETACHED CLEAN state, at which point I could do a
vxvol -g diskgroup startall (I could have specified just the volume name instead)
This enabled and started the volume. FSCK'd and remounted the FS.
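The recovery sequence above can be sketched as a script. Treat this as an outline, not a drop-in fix: the disk group, device path, and mount point are placeholders, and the fsck/mount syntax assumes VxFS on Solaris.

```shell
#!/bin/sh
# Detach the stuck plex (DISABLED RECOVER -> DETACHED STALE)
vxplex -g diskgroup det szdbor006du02-01

# Mark the plex clean (DETACHED STALE -> DETACHED CLEAN)
vxmend -g diskgroup fix clean szdbor006du02-01

# Start the volume(s) in the disk group
vxvol -g diskgroup startall    # or: vxvol -g diskgroup start szdbor006du02

# Check and remount the filesystem (paths are illustrative)
fsck -F vxfs /dev/vx/rdsk/diskgroup/szdbor006du02
mount -F vxfs /dev/vx/dsk/diskgroup/szdbor006du02 /mountpoint
```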

Now to figure out why exactly the FAS6080 crashed just because of an HBA hiccup.
Hope this may be useful if you ever run into the same scenario.

Tuesday, November 3, 2009

NetApp fun for 24 hours

I've been working at our ORC Datacenter (Off-Site Records Center) installing 2 NetApp filers that I moved from our downtown DC. WOW... it was all going so well until I booted up the new FAS3050 filers to replace the older 960 and 980 heads.

1st, the 3050's complained about not seeing any disks they could own. Fixed that by booting into maintenance mode and assigning the disks to the new 3050's.
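The maintenance-mode fix looks roughly like this in 7-mode OnTap (from memory; the disk name is a placeholder, so check the syntax against your OnTap version's docs):

```shell
# From the boot menu, enter maintenance mode, then:
disk show -n        # list disks with no owner
disk assign all     # assign all unowned disks to this controller
# or assign one specific disk (name is illustrative):
disk assign 0a.16
```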

2nd, the 3050's complained that the disks had a mismatched OnTap version on them: 7.3.2 on the disks, while the 3050's had 7.2.# on them.

3rd, a netboot of the 3050's blew up every time. The NIC would just go offline and hence kill the netboot. I tried a netboot from downtown, from hou, and directly connected to my laptop. None of them worked!

4th, I decided to just reuse the 900's. The first 960 booted up and complained it couldn't grab ownership of the disks (because the 3050 had grabbed them before). So now I had to re-rack the 3050, plug in the disks, and remove_ownership.
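The ownership cleanup step is roughly this, run in maintenance mode on the 3050 that grabbed the disks (again from memory, with an illustrative disk name; verify against your OnTap release):

```shell
# In maintenance mode on the controller that currently owns the disks:
disk show -a                  # confirm which disks this head owns
disk remove_ownership 0a.16   # release a disk so the 960 can claim it
```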

5th, the network port locations weren't communicated correctly, so the filers weren't on the correct VLAN's.

24 hours later (9:30 AM today) the 2 replication filers are back online in the ORC datacenter.

This freed up 8kW of power in the downtown DC, so the DC manager is happy again.

(sorry for the bad grammar and capitalization)