PSIgroup | Computing / Linux Administration Notes


I (Sergio) have recently gained quite a bit more experience with Linux disk recovery than I would have liked. For the record, 3 out of 3 failures were Maxtor hard disks. But we all know about statistics...

General

for SMART, see
- http://smartmontools.sourceforge.net/BadBlockHowTo.txt
- http://www.linuxjournal.com/article/6983
CERN policy on disk replacement for SMART errors
recovery tools for damaged disks:
- gparted can find info about partitions when the partition table is damaged
- ddrescue
  - Interpreting logfile of restore to get list of corrupted files
for arrays, use mdadm --remove to remove a failed disk from an array, and mdadm --add to add it back after recovering (after checking that the error IS recovered!)
use smartctl to check the SMART status.

Recovery without RAID (or failed RAID)

Cross your fingers, first of all. Make sure that you have your backup at hand; get yourself a beer (to relax) and a coffee (so that you don't relax too much).

Boot from CD or boot in single mode
- single mode works only to fix non-system partitions
if you think you've lost data that you can't replace, back up what you can NOW
- the most important point is not writing to a damaged disk, as that could worsen things:
  - if the damage is in the FileSystem structural data, the FS might get messed up
  - if the damage is in the data areas, you might lose more...
- Your best chance to save the data is to use ddrescue to copy the whole partition to another hard disk, either into a spare partition or into a file.
- Note that this is one of the main reasons why I don't like huge partitions
- When you have a complete copy, you can try to mount it read-only and back up the data, or run e2fsck on the copy to try to fix it.
- Otherwise, you can try to find out what file corresponds to the bad block, if it's only a few of them
while ddrescue runs, read the section below on using SMART to force a reallocate-on-write
- read more from the Bad Block Howto
do what it says there
- just skip the parts about mdadm
- try your best to avoid clearing data with dd, unless you know that you won't need any of the data on the disk
run fsck
- (good luck)
go and buy a new disk, and put them in RAID
if it was the system disk, it's much better to reinstall the system
restore the data from your backup.
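
The copy-first approach from the list above can be sketched as a small shell function. This is only a sketch, assuming GNU ddrescue (not the similarly named dd_rescue); the device, image path, mapfile and mount point are placeholders, and RUN is a hypothetical variable that defaults to echo, so the commands are printed for review instead of being executed.

```shell
#!/bin/sh
# Sketch: image a failing partition with GNU ddrescue, then mount the copy read-only.
# RUN defaults to "echo", so nothing is executed unless RUN is overridden (e.g. RUN=sudo).
rescue_partition() {
    src=$1      # damaged partition, e.g. /dev/hdc2 (placeholder)
    img=$2      # image file on a healthy disk
    map=$3      # ddrescue mapfile; lets an interrupted run resume
    mnt=$4      # mount point for the copy
    run=${RUN-echo}
    # Pass 1: copy everything that reads cleanly, without lingering on bad areas.
    $run ddrescue -n "$src" "$img" "$map"
    # Pass 2: return to the bad areas with direct disk access and up to 3 retry passes.
    $run ddrescue -d -r3 "$src" "$img" "$map"
    # Work on the copy only, never on the damaged disk.
    $run mount -o loop,ro "$img" "$mnt"
}
```

Calling rescue_partition /dev/hdc2 /mnt/spare/hdc2.img /mnt/spare/hdc2.map /mnt/rescue prints the three commands; with RUN=sudo it would actually run them.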

Recovery with RAID and SMART

This is a complete recovery session after a bad block on a RAID disk caused that partition to be kicked out of the array.
If instead you got a SMART error, but the RAID has not failed yet, skip to the next point.

Check RAID

First, check the RAID and see which disk has problems:

[sash@srv log]$ cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc2[0](F) hda2[1]
      20482752 blocks [2/1] [_U]

md1 : active raid1 hdc5[1] hda5[0]
      19052992 blocks [2/2] [UU]

md0 : active raid1 hdc1[0] hda1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>

[sash@srv log]$ sudo mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90.01
  Creation Time : Sun Nov 30 11:26:14 2003
     Raid Level : raid1
     Array Size : 20482752 (19.53 GiB 20.97 GB)
    Device Size : 20482752 (19.53 GiB 20.97 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Tue Nov  7 16:04:59 2006
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0


    Number   Major   Minor   RaidDevice State
       0       0        0       -1      removed
       1       3        2        1      active sync   /dev/hda2
       2      22        2       -1      faulty   /dev/hdc2
           UUID : 820a815b:27648685:40b1e72e:06d1fd6b
         Events : 0.507758
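
The check on /proc/mdstat shown above can also be scripted. The following is a minimal sketch: degraded_arrays is a hypothetical helper name, and it assumes the status brackets ([UU], [_U]) are the last field of the "blocks" line, as in the output above.

```shell
#!/bin/sh
# Sketch: list md arrays that have a missing or failed member.
# Reads /proc/mdstat-style text on stdin and prints one array name per line.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the array we are in
        /blocks/ && name != "" {
            if ($NF ~ /_/) print name        # "_" marks a missing member, e.g. [_U]
            name = ""
        }'
}
```

Typical use would be degraded_arrays < /proc/mdstat; with the output above it prints md2.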

Check the SMART status and fail the partition

Look at the error log and selftest logs, and identify the partition in which the problem happened (necessary if the RAID has not given an error yet - otherwise that already gave you the information).

[sash@srv log]$ sudo smartctl -a /dev/hdc
...
SMART overall-health self-assessment test result: PASSED
...
  5 Reallocated_Sector_Ct   0x0033   252   252   063    Pre-fail  Always       -       9
..
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       2
198 Offline_Uncorrectable   0x0008   252   252   000    Old_age   Offline      -       1
...
Error 329 occurred at disk power-on lifetime: 9610 hours (400 days + 10 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 e5 2f 93 e0  Error: UNC 8 sectors at LBA = 0x00932fe5 = 9646053

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 e5 2f 93 e0 08  11d+14:18:27.232  READ DMA
  c8 00 08 e5 5f 93 e0 08  11d+14:18:27.232  READ DMA
  c8 00 08 ad 49 93 e0 08  11d+14:18:27.216  READ DMA
  c8 00 50 9d 56 7f e1 08  11d+14:18:27.216  READ DMA
  c8 00 08 1d 30 7f e1 08  11d+14:18:27.200  READ DMA
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9061         -
...

If you do not read PASSED, back up the data and throw the disk in the bin! NOW!

There is a successful self-test, but it is older than the error, so it is not significant. What is significant is that there are reallocated sectors, others pending reallocation, and most of all one that cannot be automatically reallocated! That one is most probably the sector that caused the RAID to fail, and the one recorded in the error log as Error 329.
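
To keep an eye on just the three counters discussed here, the attribute table can be filtered. This is a minimal sketch: smart_sector_counts is a hypothetical helper name, and it assumes the ten-column attribute layout shown above, where the raw value is the last column.

```shell
#!/bin/sh
# Sketch: extract the sector-health counters from `smartctl -A` output on stdin.
# Prints NAME=RAW_VALUE for each of the three attributes, in the order they appear.
smart_sector_counts() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}
```

Typical use would be sudo smartctl -A /dev/hdc | smart_sector_counts. Any non-zero Current_Pending_Sector or Offline_Uncorrectable value is the signal to continue with the steps below.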

You can check which partition corresponds to the given sector by comparing its number to the start and end sectors given by fdisk:

[sash@srv log]$ sudo fdisk -u -l /dev/hdc

Disk /dev/hdc: 41.1 GB, 41110142976 bytes
255 heads, 63 sectors/track, 4998 cylinders, total 80293248 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *          63      208844      104391   fd  Linux raid autodetect
/dev/hdc2          208845    41174594    20482875   fd  Linux raid autodetect
/dev/hdc3        41174595    42186689      506047+  82  Linux swap
/dev/hdc4        42186690    80292869    19053090    5  Extended
/dev/hdc5        42186753    80292869    19053058+  fd  Linux raid autodetect

which tells, in this case, that it's /dev/hdc2 - which also corresponds to the failure information from RAID.
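
The comparison against the fdisk table can be automated. The sketch below assumes the column layout of the fdisk -u -l output shown above (an optional "*" boot flag, then start and end sectors); partition_for_lba is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: find which partition(s) contain a given LBA.
# $1 = LBA in decimal; stdin = output of `fdisk -u -l <disk>`.
partition_for_lba() {
    awk -v lba="$1" '
        $1 ~ /^\/dev\// {
            i = ($2 == "*") ? 3 : 2          # skip the boot flag column if present
            start = $i + 0
            end = $(i + 1) + 0
            if (lba >= start && lba <= end) print $1
        }'
}
```

For example, sudo fdisk -u -l /dev/hdc | partition_for_lba 9646053 prints /dev/hdc2 for the table above. The error log gives the LBA in hex as well; the shell can convert it with echo $((0x00932fe5)). A sector inside a logical partition is reported twice, once for the extended partition and once for the logical one.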

Once you know the partition, and before running badblocks if possible, force a fail of the partition in the RAID, if it has not failed yet. You can use cat /proc/mdstat to find which array contains the troublesome partition.

[sash@srv log]$ sudo mdadm /dev/md2 --fail /dev/hdc2
mdadm: failed /dev/hdc2

Execute a short test

[sash@srv log]$ sudo smartctl -t short /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Nov  7 16:03:07 2006

Use smartctl -X to abort test.

now wait a bit...

[sash@srv log]$ date
Tue Nov  7 16:03:37 CET 2006

[sash@srv log]$ sudo smartctl --log=selftest /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       40%      9663         21717
# 2  Extended offline    Completed without error       00%      9061         -
# 3  Short offline       Completed without error       00%      9060         -
# 4  Extended offline    Completed: read failure       40%      8963         1115
# 5  Short offline       Completed: read failure       50%      8960         1168
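
If the result has to be picked up by a script, the most recent entry of the self-test log can be parsed. This is a minimal sketch, assuming the log layout shown above, where entry "# 1" is the latest test and the last column is the LBA of the first error; latest_selftest_error_lba is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the LBA of the first error of the most recent self-test.
# Reads `smartctl -l selftest` output on stdin; prints nothing if test # 1 did not fail.
latest_selftest_error_lba() {
    awk '/^# *1[ \t]/ && /failure/ { print $NF }'
}
```

Typical use would be sudo smartctl -l selftest /dev/hdc | latest_selftest_error_lba, which prints 21717 for the log above.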

Write on the bad block

The basic idea is that a fresh write to the block will "give permission" to SMART to do a reallocation of the block to a good area of disk.

For some reason, the LBA did not seem to correspond to the partition, if I follow the recipe in the BadBlock HowTo; and a check with badblocks on the partition that I get from it confirmed that everything was ok there. So I instead use badblocks in non-destructive write mode.

[sash@srv log]$ sudo badblocks -sv -n /dev/hdc2

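I am not completely sure whether badblocks -n actually manages to overwrite the block on SATA disks, so you may have to use dd if=/dev/zero of=/dev/hdc2 bs=1k seek=# count=# to force a reallocate-on-write of the block. Note that it has to be seek=, which positions the write on the output device; skip= would only move the read position in /dev/zero, and the write would land at the start of the partition. This destroys whatever was stored in those blocks.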
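
For the dd route, the block offset inside the partition has to be computed from the disk LBA. The sketch below uses the formula from the Bad Block HowTo, block = (LBA - partition start) * 512 / block size. The numbers in the usage note are taken from the error log and the fdisk table on this page purely as an illustration (as said above, the LBA did not match in my case); both function names are hypothetical, and the second one only prints the dd command so that it can be checked before anything is overwritten.

```shell
#!/bin/sh
# Sketch: convert an absolute disk LBA (512-byte sectors) into a block offset
# inside a partition. $1 = LBA, $2 = partition start sector, $3 = block size in bytes.
lba_to_block() {
    lba=$1
    start=$2
    bs=${3:-1024}
    echo $(( ($lba - $start) * 512 / $bs ))
}

# Print (do not run) a dd command that zeroes one block of the partition.
# seek= positions the write on the output device; the block content is lost.
print_dd_overwrite() {
    part=$1
    block=$2
    bs=${3:-1024}
    echo "dd if=/dev/zero of=$part bs=$bs seek=$block count=1"
}
```

With LBA 9646053 and /dev/hdc2 starting at sector 208845, lba_to_block 9646053 208845 1024 gives block 4718604, and print_dd_overwrite /dev/hdc2 4718604 prints the matching dd line.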

After that, you can run

[sash@srv log]$ sudo smartctl -t offline /dev/hdc

and check that Offline_Uncorrectable goes to 0, meaning that the offline check will be able to remap the bad sector:

[sash@srv log]$ sudo smartctl -A /dev/hdc
...
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       1
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
...

Then run

[sash@srv log]$ sudo smartctl -t long /dev/hdc

and at the end verify

[sash@srv log]$ sudo smartctl -A -l selftest /dev/hdc
...
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
...

On Maxtor HDs, for some reason, Reallocated_Sector_Ct goes to 0 after a successful long test.

Reactivate the RAID

Now, when SMART checks fine, you can remove and re-add the partition to the RAID array, and watch it rebuild :-)

[sash@srv log]$ sudo mdadm /dev/md2 --remove /dev/hdc2
mdadm: hot removed /dev/hdc2
[sash@srv log]$ sudo mdadm /dev/md2 --add /dev/hdc2
mdadm: hot added /dev/hdc2
[sash@srv log]$ sudo mdadm --detail /dev/md2
....
[sergio@daq-pc sergio]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2] sdb2[1]
      40957632 blocks [2/1] [_U]
      [>....................]  recovery =  2.8% (1174592/40957632) finish=12.4min speed=53390K/sec
md0 : active raid1 sda1[0] sdb1[1]
      15358016 blocks [2/2] [UU]
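
To follow the rebuild without reading the whole file each time, the progress line can be extracted. This is a minimal sketch, assuming the "recovery = N%" (or "resync = N%") line format shown above; rebuild_progress is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report rebuild progress per array.
# Reads /proc/mdstat-style text on stdin, prints "<array> <percent>" for each rebuild.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # remember the array we are in
        /(recovery|resync) =/ {
            for (i = 1; i < NF; i++)
                if ($i == "=") { print name, $(i + 1); break }
        }'
}
```

Typical use would be rebuild_progress < /proc/mdstat, which prints "md1 2.8%" for the output above and nothing once all arrays are in sync.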

Conclusions

In the end, the disk that caused the error is back at work, with no (apparent) bad blocks. Now, if it were not a RAID disk, I would throw it away immediately: the fact that the sector was not reallocated before hitting a hard error is a very bad sign that the error rates are so high (or the errors happen so suddenly) that the on-disk controller cannot cope. But, as part of an array, it can still be very useful, as the probability of both disks going bad in the same data area is still very low. So I'll keep using it, but it must stay under control: from now on I will run smartctl -t long on all disks every week.
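
The weekly long test could be started from cron with something like the following. This is only a sketch: the device list, the script path in the cron line and the SMARTCTL override (there to allow a dry run) are all assumptions, not something set up on the machines described here.

```shell
#!/bin/sh
# Sketch: start a long SMART self-test on each disk given as an argument.
# SMARTCTL is a hypothetical override, e.g. SMARTCTL="echo smartctl" for a dry run.
start_long_tests() {
    smartctl_cmd=${SMARTCTL:-smartctl}
    for disk in "$@"; do
        # The test runs inside the drive; this call returns immediately.
        $smartctl_cmd -t long "$disk"
    done
}

# Example weekly cron entry, Sundays at 03:00 (script path is a placeholder):
# 0 3 * * 0  root  /usr/local/sbin/smart-long-test.sh
```

A wrapper script would call start_long_tests /dev/hda /dev/hdc; the results then have to be read later with smartctl -l selftest, since the command only starts the test.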

Update, a few months later

The same Maxtor disk had some trouble again, and the RAID did not help. I do not really know what happened, because the / mount (from /dev/md0) went read-only, so there is no log and no emergency mail. I just see an error in the SMART log and that /dev/sda has been kicked out of both RAID arrays. I had a lot of trouble getting things restarted: the filesystem had a lot of bad inodes, and I still do not know how much of it is damaged (apparently the system is ok, but it's hard to tell with thousands of system files...). Anyway, so much for my idea that a bad disk in a RAID was better than nothing - I very much suspect that this would not have happened without that disk.

Linux Administration Notes - Disk Errors