Linux Administration Notes - Disk Errors

I (Sergio) have recently gained quite a bit more experience with Linux disk recovery than I would have liked. For the statistics, 3 out of 3 were Maxtor hard disks. But we all know about statistics...

General
Recovery without RAID (or failed RAID)
Cross your fingers, first of all. Make sure that you have your backup at hand; get yourself a beer (to relax) and a coffee (so that you don't relax too much).
Recovery with RAID and SMART
A complete recovery session, after a bad block on a RAID disk caused the affected partition to be dropped from the array.

Check RAID
First, check the RAID and see which disk has problems:
[sash@srv log]$ cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc2[0](F) hda2[1]
20482752 blocks [2/1] [_U]
md1 : active raid1 hdc5[1] hda5[0]
19052992 blocks [2/2] [UU]
md0 : active raid1 hdc1[0] hda1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
[sash@srv log]$ sudo mdadm --detail /dev/md2
/dev/md2:
Version : 00.90.01
Creation Time : Sun Nov 30 11:26:14 2003
Raid Level : raid1
Array Size : 20482752 (19.53 GiB 20.97 GB)
Device Size : 20482752 (19.53 GiB 20.97 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Tue Nov 7 16:04:59 2006
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Number Major Minor RaidDevice State
0 0 0 -1 removed
1 3 2 1 active sync /dev/hda2
2 22 2 -1 faulty /dev/hdc2
UUID : 820a815b:27648685:40b1e72e:06d1fd6b
Events : 0.507758
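
The check above can also be scripted. What follows is only a minimal sketch, not part of the original session: the function name degraded_arrays is made up for illustration, and it assumes the /proc/mdstat layout shown above, where the status field (for example [_U]) carries one underscore per missing member.

```shell
#!/bin/sh
# Sketch: print the name of every md array with a missing member.
# degraded_arrays is a hypothetical helper name; it reads mdstat-style
# text on stdin and assumes the layout shown in the session above.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/     { name = $1 }                    # array header line
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }   # status such as [_U]
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

Fed the mdstat text from this session, the function should name only md2, which mdadm --detail can then inspect.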
Check the SMART status and fail the partition
Look at the error log and self-test logs, and identify the partition in which the problem happened (necessary if the RAID has not reported an error yet; otherwise the RAID status already gave you that information).
[sash@srv log]$ sudo smartctl -a /dev/hdc
...
SMART overall-health self-assessment test result: PASSED
...
5 Reallocated_Sector_Ct 0x0033 252 252 063 Pre-fail Always - 9
..
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 2
198 Offline_Uncorrectable 0x0008 252 252 000 Old_age Offline - 1
...
Error 329 occurred at disk power-on lifetime: 9610 hours (400 days + 10 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e5 2f 93 e0 Error: UNC 8 sectors at LBA = 0x00932fe5 = 9646053
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 e5 2f 93 e0 08 11d+14:18:27.232 READ DMA
c8 00 08 e5 5f 93 e0 08 11d+14:18:27.232 READ DMA
c8 00 08 ad 49 93 e0 08 11d+14:18:27.216 READ DMA
c8 00 50 9d 56 7f e1 08 11d+14:18:27.216 READ DMA
c8 00 08 1d 30 7f e1 08 11d+14:18:27.200 READ DMA
...
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 9061 -
...
If you do not read PASSED, back up the data and throw the disk in the bin! NOW!
There is a successful self-test in the log, but it is older than the error (9061 versus 9610 power-on hours), so it is not significant. What is significant is that there are reallocated sectors, others pending reallocation, and most of all one that cannot be automatically reallocated! That is most probably the one that caused the RAID to fail, and the one recorded in the error log as Error 329.
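
The three attributes singled out above (5, 197 and 198) can be pulled out of the attribute table automatically. This is a minimal sketch under stated assumptions, not the author's procedure: smart_nonzero is a hypothetical helper name, and the ten-column table layout, with the ID in column 1 and the raw value in column 10, is taken from the smartctl 5.33 output shown in this session.

```shell
#!/bin/sh
# Sketch: report non-zero raw values of the three sector-health attributes.
# smart_nonzero is a hypothetical helper name; it reads `smartctl -A`
# output on stdin and assumes the ten-column attribute table shown above,
# with the attribute ID in column 1 and the raw value in column 10.
smart_nonzero() {
    awk '
        NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) {
            if ($10 + 0 > 0) print $2, $10    # attribute name and raw count
        }
    '
}

# Usage on a live system (needs root):
#   smartctl -A /dev/hdc | smart_nonzero
```

Empty output means all three raw counters are zero. Any printed line is a reason to look at the error and self-test logs.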
You can check which partition corresponds to the given sector by comparing its number to the start and end sectors given by fdisk:
[sash@srv log]$ sudo fdisk -u -l /dev/hdc
Disk /dev/hdc: 41.1 GB, 41110142976 bytes
255 heads, 63 sectors/track, 4998 cylinders, total 80293248 sectors
Units = sectors of 1 * 512 = 512 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *          63      208844      104391   fd  Linux raid autodetect
/dev/hdc2          208845    41174594    20482875   fd  Linux raid autodetect
/dev/hdc3        41174595    42186689      506047+  82  Linux swap
/dev/hdc4        42186690    80292869    19053090    5  Extended
/dev/hdc5        42186753    80292869    19053058+  fd  Linux raid autodetect
In this case, LBA 9646053 from the error log lies between the start (208845) and end (41174594) sectors of /dev/hdc2.
Once you know the partition, and before running badblocks if possible, force a fail of the partition in the RAID, if it has not failed yet. You can use
[sash@srv log]$ sudo mdadm /dev/md2 --fail /dev/hdc2
mdadm: failed /dev/hdc2
Execute a short test:
[sash@srv log]$ sudo smartctl -t short /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Nov 7 16:03:07 2006
Use smartctl -X to abort test.
now wait a bit...
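
The partition lookup described earlier in this step is a plain range comparison, which can be sketched as follows. The function name lba_to_partition is hypothetical, and the sketch assumes sector-unit fdisk output (fdisk -u -l) in which the start and end sectors directly follow the device name, or the boot flag when one is present.

```shell
#!/bin/sh
# Sketch: find which partition(s) contain a given absolute LBA.
# lba_to_partition is a hypothetical helper name; it takes the LBA as $1
# and reads `fdisk -u -l` output (units = sectors) on stdin.
lba_to_partition() {
    awk -v lba="$1" '
        $1 ~ /^\/dev\// {
            i = ($2 == "*") ? 3 : 2          # skip the boot flag column if present
            first = $i + 0                   # start sector of this partition
            last  = $(i + 1) + 0             # end sector of this partition
            if (lba + 0 >= first && lba + 0 <= last) print $1
        }
    '
}

# Usage on a live system (needs root):
#   fdisk -u -l /dev/hdc | lba_to_partition 9646053
```

Note that an LBA inside a logical partition also falls inside the enclosing extended partition, so both names are printed in that case.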
[sash@srv log]$ date
Tue Nov 7 16:03:37 CET 2006
[sash@srv log]$ sudo smartctl --log=selftest /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 40% 9663 21717
# 2 Extended offline Completed without error 00% 9061 -
# 3 Short offline Completed without error 00% 9060 -
# 4 Extended offline Completed: read failure 40% 8963 1115
# 5 Short offline Completed: read failure 50% 8960 1168

Write on the bad block
The basic idea is that a fresh write to the block will "give permission" to the drive to reallocate the block to a good area of the disk. For some reason, the LBA did not seem to correspond to the partition, if I follow the recipe in the BadBlock HowTo; and a check with
[sash@srv log]$ sudo badblocks -sv -n /dev/hdc2
I am not completely sure if badblocks -n is actually working well to overwrite, with SATA disks; so you may have to use ...
After that, you can run
[sash@srv log]$ sudo smartctl -t offline /dev/hdc
and check that
[sash@srv log]$ sudo smartctl -A /dev/hdc
...
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 1
198 Offline_Uncorrectable 0x0008 253 252 000 Old_age Offline - 0
...
Then run
[sash@srv log]$ sudo smartctl -t long /dev/hdc
and at the end verify
[sash@srv log]$ sudo smartctl -A -l selftest /dev/hdc
...
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 253 252 000 Old_age Offline - 0
...
On Maxtor HDs, for some reason, ...

Reactivate the RAID
Now, when SMART checks fine, you can remove and re-add the partition to the RAID array, and watch it rebuild :-)
[sash@srv log]$ sudo mdadm /dev/md2 --remove /dev/hdc2
mdadm: hot removed /dev/hdc2
[sash@srv log]$ sudo mdadm /dev/md2 --add /dev/hdc2
mdadm: hot added /dev/hdc2
[sash@srv log]$ mdadm /dev/md2 --detail
....
[sergio@daq-pc sergio]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2] sdb2[1]
40957632 blocks [2/1] [_U]
[>....................] recovery = 2.8% (1174592/40957632) finish=12.4min speed=53390K/sec
md0 : active raid1 sda1[0] sdb1[1]
15358016 blocks [2/2] [UU]
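
To follow the rebuild without reading the whole file each time, the progress line can be extracted. This is a minimal sketch only: rebuild_progress is a hypothetical helper name, and it assumes a progress line of the form shown above ("recovery = 2.8% ...") following the header line of the array it belongs to.

```shell
#!/bin/sh
# Sketch: print "<array> <percent>" for every array that is rebuilding.
# rebuild_progress is a hypothetical helper name; it reads mdstat-style
# text on stdin and assumes a progress line like the one shown above.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # array header line
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i    # the field ending in "%"
        }
    '
}

# Usage on a live system:
#   rebuild_progress < /proc/mdstat
```

No output means no array is currently rebuilding.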

Conclusions
In the end, the disk that caused the error is back to work, with no (apparent) bad blocks. Now, if it were not a RAID disk, I would throw it away immediately: the fact that the sector was not reallocated before hitting a hard error is a very bad sign, suggesting that the error rate is so high (or that errors happen so suddenly) that the on-disk controller cannot cope. But, as part of an array, it can still be very useful, as the probability of both disks going bad in the same data area is still very low. So I'll keep using it, but it must stay under control: from now on I will be running weekly a ...

Update, a few months later
The same Maxtor disk had some trouble, and the RAID did not help. I do not really know what happened, because the ...