Wednesday, March 25, 2009

VxVM Recovering from a Power Failure


While sitting at my desk on a Friday afternoon (around 6pm), my email client chimed to alert me to the following message:

To: root@system
Subject: Volume Manager failures on host winnie
Content-Length: 240

Failures have been detected by the VERITAS Volume Manager:

failed volumes:
oravol01

I immediately logged into winnie and received the following error when trying to list the directory that was on the failed volume oravol01:

$ ls -la /u01

.: I/O error

A quick check of the system logs revealed numerous SCSI error messages:

$ tail /var/adm/messages

Jul 26 17:49:37 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@1,0 (sd1):
Jul 26 17:49:37 winnie SCSI transport failed: reason 'incomplete': retrying command
Jul 26 17:49:37 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@4,0 (sd5):
Jul 26 17:49:37 winnie SCSI transport failed: reason 'incomplete': retrying command
Jul 26 17:49:40 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@1,0 (sd1):
Jul 26 17:49:40 winnie disk not responding to selection
Jul 26 17:49:40 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@4,0 (sd5):
Jul 26 17:49:40 winnie disk not responding to selection
Jul 26 17:49:42 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@1,0 (sd1):
Jul 26 17:49:42 winnie disk not responding to selection
Jul 26 17:49:44 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@1,0 (sd1):
Jul 26 17:49:44 winnie disk not responding to selection
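When a log is full of repeated warnings like the ones above, it helps to tally which disk instances are actually complaining. The sketch below inlines a few of the sample warning lines from this incident; on the live system you would point the pipeline at /var/adm/messages instead of the sample file:

```shell
#!/bin/sh
# Sketch: count SCSI warnings per disk instance. Sample log lines from the
# incident are inlined here; on a real system, read /var/adm/messages.
cat <<'EOF' > /tmp/messages.sample
Jul 26 17:49:37 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@1,0 (sd1):
Jul 26 17:49:37 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@4,0 (sd5):
Jul 26 17:49:42 winnie scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1/scsi@4/sd@1,0 (sd1):
EOF
# Pull the (sdN) instance name out of each warning line, then tally and
# sort so the noisiest disk ends up on top.
grep -o '(sd[0-9]*)' /tmp/messages.sample | sort | uniq -c | sort -rn
```

Here sd1 shows up twice and sd5 once, matching the pattern in the log excerpt above.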

The Veritas vxprint(1m) utility was also unable to display the disk group configuration (since the configuration database was unavailable):

$ vxprint -g oradg -ht
$

I raced to the machine room to check the hardware, speculating that a SCSI cable or the storage array itself had failed. When I got there, the storage array was powered off while the equipment in the surrounding racks was running normally, which pointed to a failed power cord. Once I replaced the cord, the storage array came back to life, but the oradg disk group I needed to access was left in the disabled state:

$ vxdg list

NAME STATE ID
oradg disabled 1123603158.13.winnie

A quick check of the disks showed that they were online and associated with the disabled disk group:

$ vxdisk list

DEVICE TYPE DISK GROUP STATUS
c0t0d0s2 auto:none - - online invalid
c0t1d0s2 auto:none - - online invalid
c1t1d0s2 auto:cdsdisk c1t1d0 oradg online dgdisabled
c1t2d0s2 auto:cdsdisk c1t2d0 oradg online dgdisabled
c1t3d0s2 auto:cdsdisk c1t3d0 oradg online dgdisabled
c1t4d0s2 auto:cdsdisk c1t4d0 oradg online dgdisabled
c1t5d0s2 auto:cdsdisk c1t5d0 oradg online dgdisabled
c1t6d0s2 auto:cdsdisk c1t6d0 oradg online dgdisabled

In some situations, Veritas may report offline devices as "failed was: cXtXdXs2". When this happens, the vxreattach(1m) command can reconnect Veritas Volume Manager to the "lost" devices. Luckily, in our case, Veritas was able to reconnect to the devices, so I unmounted the file system, then deported and re-imported the disk group to re-enable oradg:
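Before deciding whether vxreattach(1m) is needed, it is handy to pull just the affected devices out of the vxdisk list output. This is an illustrative sketch that inlines a sample of the output above; on a live system you would pipe vxdisk list straight into awk:

```shell
#!/bin/sh
# Sketch: list devices whose disk group is disabled. Sample `vxdisk list`
# output is inlined; on a real system: vxdisk list | awk '/dgdisabled/ ...'
cat <<'EOF' > /tmp/vxdisk.sample
DEVICE       TYPE          DISK     GROUP   STATUS
c0t0d0s2     auto:none     -        -       online invalid
c1t1d0s2     auto:cdsdisk  c1t1d0   oradg   online dgdisabled
c1t2d0s2     auto:cdsdisk  c1t2d0   oradg   online dgdisabled
EOF
# Print the DEVICE column for every row in the dgdisabled state.
awk '/dgdisabled/ {print $1}' /tmp/vxdisk.sample
```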

$ umount /u01

$ vxdg deport oradg

$ vxdg import oradg

The deport and import operations are required to fix a disabled disk group and will validate that the disk group configuration records are consistent. Once the disk group was imported, I ran vxinfo(1m) to view the volume and plex status:

$ vxinfo -g oradg -p

vol oravol01 fsgen Startable
plex oravol01-03 ACTIVE
vol oravol01-L01 fsgen Startable
plex oravol01-P01 ACTIVE
plex oravol01-P02 ACTIVE
vol oravol01-L02 fsgen Startable
plex oravol01-P03 ACTIVE
plex oravol01-P04 ACTIVE
vol oravol01-L03 fsgen Startable
plex oravol01-P05 ACTIVE
plex oravol01-P06 ACTIVE

The "Startable" flag indicates that the volume and plexes are in a startable state, so I executed the vxvol(1m) utility to start the volume:

$ vxvol -g oradg start oravol01

Once the volume came online, I ran fsck(1m) to replay the transactions in the VxFS journal:

$ fsck -F vxfs /dev/vx/dsk/oradg/oravol01

log replay in progress
replay complete - marking super-block as CLEAN

After fsck(1m) finished the consistency check, I mounted the file system, applied the archive logs, and brought the database back to an operational state. Thanks to the recovery features built into Veritas Volume Manager, Veritas File System, and Oracle, we avoided a full file system restore! And since I received the failure notification the moment Veritas detected a problem with the volume, the time it took to recover the faulted system was kept to a minimum.
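For reference, the steps above can be collected into one script. The disk group, volume, and mount point names are the ones from this incident; DRYRUN=echo is my own addition so the script prints each command instead of running it, since you would want to review the sequence before executing it on a real system:

```shell
#!/bin/sh
# Recap of the recovery sequence as a dry-run sketch. DG, VOL, and MNT are
# the names from this incident; DRYRUN=echo prints each command instead of
# running it -- clear it only after reviewing on a real system.
DG=oradg
VOL=oravol01
MNT=/u01
DRYRUN=echo

$DRYRUN umount $MNT                        || exit 1  # release the dead mount
$DRYRUN vxdg deport $DG                    || exit 1  # deport the disabled group
$DRYRUN vxdg import $DG                    || exit 1  # re-import to re-enable it
$DRYRUN vxvol -g $DG start $VOL            || exit 1  # start the volume
$DRYRUN fsck -F vxfs /dev/vx/dsk/$DG/$VOL  || exit 1  # replay the VxFS journal
$DRYRUN mount -F vxfs /dev/vx/dsk/$DG/$VOL $MNT       # remount the file system
```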
