Linux Administration Notes - Disk Errors
I (Sergio) have recently gained quite a bit more experience with Linux disk recovery than I would have liked. For the statistics: 3 out of 3 were Maxtor hard disks. But we all know about statistics...
1. General
- for SMART, see
- CERN policy on disk replacement for SMART errors
- recovery tools for damaged disks:
- gparted can find info about partitions when the partition table is damaged
- ddrescue
- for arrays, use mdadm --remove to remove a failed disk from an array, and mdadm --add to add it back after recovering (after checking that the error IS recovered!)
- use smartctl to check the SMART status
2. Recovery without RAID (or failed RAID)
Cross your fingers, first of all. Make sure that you have your backup at hand; get yourself a beer (to relax) and a coffee (so that you don't relax too much).
- Boot from CD or boot in single mode
- single mode works only to fix non-system partitions
- if you think you've lost data that you can't replace, backup what you can NOW
- the most important point is not writing to a damaged disk, as that could worsen things:
- if the damage is in the FileSystem structural data, the FS might get messed up
- if the damage is in the data areas, you might lose more...
- Your best chance to save the data is to use ddrescue to copy the whole partition to another hard disk, either in a spare partition or in a file.
- Note that this is one of the main reasons why I don't like huge partitions
- When you have a complete copy, you can try to mount it as read only and backup the data; or run e2fsck to try to fix it.
- Otherwise, you can try to find out what file corresponds to the bad block, if it's only a few of them
- while ddrescue runs, read the section below on using SMART to force a reallocate-on-write
- read more from the Bad Block Howto
- do what it says there
- just skip the parts about mdadm
- try your best to avoid clearing data with dd, unless you know that you won't need any of the data on the disk
- run fsck
- (good luck)
- go and buy a new disk, and put the disks in a RAID array
- if it was the system disk, it's much better to reinstall the system
- restore the data from your backup.
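The copy-first workflow above can be sketched as follows, using a small scratch file in place of a real partition (the paths and device names are examples, not from an actual recovery session):

```shell
#!/bin/sh
# Sketch of the "copy first, repair the copy" workflow from the steps above.
# A scratch file stands in for the damaged partition; on a real disk you
# would use ddrescue instead of plain dd, since it retries and logs bad areas:
#   sudo ddrescue /dev/hdc2 /backup/hdc2.img /backup/hdc2.log   (example paths)
dd if=/dev/urandom of=/tmp/fake-partition bs=1k count=64 2>/dev/null

# conv=noerror keeps dd going after read errors; sync pads short reads with
# zeros so the copy keeps the original offsets.
dd if=/tmp/fake-partition of=/tmp/partition-copy.img bs=4k conv=noerror,sync 2>/dev/null

# Work only on the copy from here on: mount it read-only to salvage files,
#   sudo mount -o ro,loop /tmp/partition-copy.img /mnt/rescue
# or run e2fsck against the image instead of the dying disk:
#   e2fsck -f /tmp/partition-copy.img
cmp -s /tmp/fake-partition /tmp/partition-copy.img && echo "copy verified"
```

The point of the exercise: every repair attempt then touches the image, never the failing disk.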
3. Recovery with RAID and SMART
A complete session of recovery after a bad block on a RAID disk caused that partition to be dropped from the array.
If instead you got a SMART error, but the RAID has not failed yet, skip to the next point.
3.1 Check RAID
First, check the RAID and see which disk has problems:
[sash@srv log]$ cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc2[0](F) hda2[1]
      20482752 blocks [2/1] [_U]
md1 : active raid1 hdc5[1] hda5[0]
      19052992 blocks [2/2] [UU]
md0 : active raid1 hdc1[0] hda1[1]
      104320 blocks [2/2] [UU]
unused devices: <none>

[sash@srv log]$ sudo mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90.01
  Creation Time : Sun Nov 30 11:26:14 2003
     Raid Level : raid1
     Array Size : 20482752 (19.53 GiB 20.97 GB)
    Device Size : 20482752 (19.53 GiB 20.97 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent
    Update Time : Tue Nov  7 16:04:59 2006
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       0        0       -1      removed
       1       3        2        1      active sync   /dev/hda2
       2      22        2       -1      faulty        /dev/hdc2
           UUID : 820a815b:27648685:40b1e72e:06d1fd6b
         Events : 0.507758
3.2 Check the SMART status and fail the partition
Look at the error log and selftest logs, and identify the partition in which the problem happened (necessary if the RAID has not given an error yet - otherwise that already gave you the information).
[sash@srv log]$ sudo smartctl -a /dev/hdc
...
SMART overall-health self-assessment test result: PASSED
...
  5 Reallocated_Sector_Ct   0x0033   252   252   063    Pre-fail  Always       -       9
...
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       2
198 Offline_Uncorrectable   0x0008   252   252   000    Old_age   Offline      -       1
...
Error 329 occurred at disk power-on lifetime: 9610 hours (400 days + 10 hours)
  When the command that caused the error occurred, the device was in an unknown state.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 e5 2f 93 e0  Error: UNC 8 sectors at LBA = 0x00932fe5 = 9646053

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------------  --------------------
  c8 00 08 e5 2f 93 e0 08   11d+14:18:27.232  READ DMA
  c8 00 08 e5 5f 93 e0 08   11d+14:18:27.232  READ DMA
  c8 00 08 ad 49 93 e0 08   11d+14:18:27.216  READ DMA
  c8 00 50 9d 56 7f e1 08   11d+14:18:27.216  READ DMA
  c8 00 08 1d 30 7f e1 08   11d+14:18:27.200  READ DMA
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9061         -
...
If you do not read PASSED, backup the data and throw the disk in the bin! NOW!
There is a successful self-test in the log, but it's older than the error, so it's not significant. What is significant is that there are reallocated sectors, others pending reallocation, and most of all one that cannot be automatically reallocated! And that is most probably the one that caused the RAID to fail, and the one recorded in the error log as Error 329.
You can check which partition corresponds to the given sector by comparing its number to the start and end sectors given by fdisk:
[sash@srv log]$ sudo fdisk -u -l /dev/hdc

Disk /dev/hdc: 41.1 GB, 41110142976 bytes
255 heads, 63 sectors/track, 4998 cylinders, total 80293248 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *          63      208844      104391   fd  Linux raid autodetect
/dev/hdc2          208845    41174594    20482875   fd  Linux raid autodetect
/dev/hdc3        41174595    42186689      506047+  82  Linux swap
/dev/hdc4        42186690    80292869    19053090    5  Extended
/dev/hdc5        42186753    80292869    19053058+  fd  Linux raid autodetect
which tells, in this case, that it's /dev/hdc2
- which also corresponds to the failure information from RAID.
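The comparison can be scripted; here is a small awk sketch over the fdisk listing above (the LBA is the one from the SMART error log, and the column layout is assumed to match the fdisk -u output shown):

```shell
#!/bin/sh
# Find which partition contains a given LBA (in 512-byte sectors, as reported
# by smartctl) by comparing it against the Start/End columns of fdisk -u.
# The table below is the listing from above; on a live system you would pipe
# `sudo fdisk -u -l /dev/hdc` into the same awk program instead.
lba=9646053   # from "Error: UNC 8 sectors at LBA = 0x00932fe5 = 9646053"

part=$(awk -v lba="$lba" '
/^\/dev\// {
    # The boot-flag "*" shifts the columns by one on bootable partitions.
    start = ($2 == "*") ? $3 : $2
    end   = ($2 == "*") ? $4 : $3
    # Note: an extended partition overlaps its logical partitions, so an LBA
    # inside a logical partition would match both lines.
    if (lba+0 >= start+0 && lba+0 <= end+0) print $1
}' <<'EOF'
/dev/hdc1   *        63    208844    104391  fd  Linux raid autodetect
/dev/hdc2        208845  41174594  20482875  fd  Linux raid autodetect
/dev/hdc3      41174595  42186689    506047+ 82  Linux swap
/dev/hdc4      42186690  80292869  19053090   5  Extended
/dev/hdc5      42186753  80292869  19053058+ fd  Linux raid autodetect
EOF
)
echo "LBA $lba falls in $part"
```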
Once you know the partition, and before running badblocks if possible, force a fail of the partition in the RAID, if it has not failed yet. You can use cat /proc/mdstat to find which array contains the troublesome partition.
[sash@srv log]$ sudo mdadm /dev/md2 --fail /dev/hdc2
mdadm: failed /dev/hdc2

Execute a short test:

[sash@srv log]$ sudo smartctl -t short /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Nov  7 16:03:07 2006

Use smartctl -X to abort test.
now wait a bit...
[sash@srv log]$ date
Tue Nov  7 16:03:37 CET 2006
[sash@srv log]$ sudo smartctl --log=selftest /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       40%      9663         21717
# 2  Extended offline    Completed without error       00%      9061         -
# 3  Short offline       Completed without error       00%      9060         -
# 4  Extended offline    Completed: read failure       40%      8963         1115
# 5  Short offline       Completed: read failure       50%      8960         1168
3.3 Write on the bad block
The basic idea is that a fresh write to the block will "give permission" to SMART to do a reallocation of the block to a good area of disk.
For some reason, the LBA did not seem to correspond to the partition when I followed the recipe in the Bad Block Howto; and a check with badblocks on the partition that I got from it confirmed that everything was OK there. So instead I used badblocks in non-destructive read-write mode.
[sash@srv log]$ sudo badblocks -sv -n /dev/hdc2
I am not completely sure whether badblocks -n actually works well to overwrite on SATA disks; so you may have to use dd if=/dev/zero of=/dev/hdc2 bs=1k seek=# count=# (seek=, not skip=, since you need to position in the output) to force a reallocate-on-write of the block.
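If you go the dd route, the sector offset within the partition is just the absolute LBA minus the partition's start sector from fdisk -u. A sketch with the numbers from this session (the final dd is left commented out because it destroys the sector's contents):

```shell
#!/bin/sh
# Compute the partition-relative sector for dd, using the values above: the
# bad sector from the SMART error log and the start of /dev/hdc2 from the
# fdisk -u listing. dd's seek= positions within the OUTPUT, so with bs=512
# it lands exactly on that sector.
lba=9646053      # absolute LBA from the SMART error log
start=208845     # first sector of /dev/hdc2 per fdisk -u
offset=$((lba - start))
echo "partition-relative sector: $offset"

# DESTROYS the 512 bytes at that sector -- only do this on a partition that
# is already failed out of the RAID, or whose data is safely copied:
#   sudo dd if=/dev/zero of=/dev/hdc2 bs=512 seek="$offset" count=1
```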
After that, you can run
[sash@srv log]$ sudo smartctl -t offline /dev/hdc
and check that Offline_Uncorrectable goes to 0, meaning that the offline check will be able to remap the bad sector:
[sash@srv log]$ sudo smartctl -A /dev/hdc
...
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       1
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
...
Then run
[sash@srv log]$ sudo smartctl -t long /dev/hdc
and at the end verify
[sash@srv log]$ sudo smartctl -A -l selftest /dev/hdc
...
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   252   000    Old_age   Offline      -       0
...
On Maxtor HDs, for some reason, Reallocated_Sector_Ct goes to 0 after a successful long test.
3.4 Reactivate the RAID
Now, when SMART checks fine, you can remove and re-add the partition to the RAID array, and watch it rebuild :-)
[sash@srv log]$ sudo mdadm /dev/md2 --remove /dev/hdc2
mdadm: hot removed /dev/hdc2
[sash@srv log]$ sudo mdadm /dev/md2 --add /dev/hdc2
mdadm: hot added /dev/hdc2
[sash@srv log]$ sudo mdadm --detail /dev/md2
....
[sergio@daq-pc sergio]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[2] sdb2[1]
      40957632 blocks [2/1] [_U]
      [>....................]  recovery =  2.8% (1174592/40957632) finish=12.4min speed=53390K/sec
md0 : active raid1 sda1[0] sdb1[1]
      15358016 blocks [2/2] [UU]
3.5 Conclusions
In the end, the disk that caused the error is back at work, with no (apparent) bad blocks. If it were not a RAID disk, I would throw it away immediately: the fact that the sector was not reallocated before hitting a hard error is a very bad sign, meaning that the error rates are so high (or errors happen so suddenly) that the on-disk controller cannot cope. But as part of an array it can still be very useful, as the probability of both disks going bad in the same data area is still very low. So I'll keep using it, but it must stay under control: from now on I will be running smartctl -t long weekly on all disks.
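A sketch of that weekly job (dropping it in /etc/cron.weekly and the device list are my examples; smartd's scheduling directive would be an alternative to cron):

```shell
#!/bin/sh
# Sketch of a weekly cron job (e.g. placed in /etc/cron.weekly -- example
# path) that kicks off a SMART long self-test on every listed disk.
# Set DRYRUN=0 to actually issue the smartctl commands.
DRYRUN=${DRYRUN:-1}
started=0
for dev in /dev/hda /dev/hdc; do      # adjust to the disks actually present
    if [ "$DRYRUN" = 1 ]; then
        echo "would run: smartctl -t long $dev"
    else
        smartctl -t long "$dev"
    fi
    started=$((started + 1))
done
echo "long self-test scheduled on $started disk(s)"
```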
3.6 Update, a few months later
The same Maxtor disk had some trouble, and the RAID did not help. I do not really know what happened, because the / mount (from /dev/md0) went read-only, so there is no log and no emergency mail. I just see an error in the SMART log and that /dev/sda has been kicked out of both RAID shares; and I had a lot of trouble getting things restarted - the filesystem had a lot of bad inodes, and I still do not know how much of it is damaged (apparently the system is OK, but it's hard to tell with thousands of system files...). Anyway, so much for my idea that a bad disk in a RAID was better than nothing - I very much suspect that this would not have happened without that disk.