2012-01-10

Help, my RAID array does not complete synchronization!

Let us suppose the following situation: You have a Linux server with a software RAID1 array (md) and, for one reason or another (mostly because you are a lazy admin, admit it!), both disks are reporting unreadable sectors, either through SMART or through actual failed readout attempts.

So you installed a 3rd good disk, set it as a spare, then failed one of the 2 bad ones to initiate synchronisation onto the good new disk. However, all hell breaks loose as you find out your synchronisation doesn't complete (/proc/mdstat reports U_ or _U) and, instead of ignoring the unreadable sectors as it should, md decides that it cannot continue.

Worse, if you look at your dmesg, you find out that it is being polluted by a continuous stream of:
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda1
disk 1, wo:1, o:1, dev:sdb1
Help!!!!

OK, first of all, since this information is quite hard to find, especially if you are in a hurry, here is what the abbreviations above mean:
  • wd: working disks
  • rd: raid disks
  • wo: write-only (if set to 1, this usually indicates a problem, and that data duplication does not occur for this device)
  • o: online
Obviously, wd:1, as well as wo:1 for the second disk, is not something we want to see. Why can't our good spare disk be added as R/W to the gorram array? Heck, if the problematic disk, which single-handedly contains our up-to-date data right now, fails, we will be in big trouble. What's the point of providing redundancy, really, if md fails to synchronize as soon as there's one measly sector it cannot read!

It's a bird! It's a plane! No, it's hdparm!

Well, the sad truth of md on Linux (which may have improved with newer versions) is that it isn't resilient at all when it comes to unreadable sectors during a sync. I guess the developers decided that, since the point of redundancy is to always have at least one good set of data, they didn't need to focus on situations where the "good" set of data may also have some corruption, and therefore never planned for anything but trying to re-read an unreadable sector forever, until the disk magically repairs itself (right... fat chance!).

Now (and for the rest of this post I will mostly be following the excellent information provided by Bas on his blog), to compensate for that oversight, the trick is to have md read the problematic sectors one way or another, so that the synchronisation can complete. This may sound easier said than done, but most of the time it shouldn't be an issue, as recent disks with SMART are engineered with a set of spare sectors, to be allocated as replacements for unreadable or unwritable ones in exactly this kind of situation. The catch, however, is that reallocation of sectors only occurs on write access.

What this means then is that, while the disk has the technology to "fix" itself, as long as you are only attempting to read the problematic sectors, reallocation will not be triggered and you will continue to get read errors. Thus, you must manually issue a write to the problematic sector(s) to trigger the "recovery" mechanism (NB: I'm using "fix" and "recovery" loosely, as you can of course not recover the data from these sectors once they are reallocated, and will therefore end up with some corrupted data).

This can be confirmed by checking the Offline_Uncorrectable (#198) and Reallocated_Sector_Ct (#5) reports from SMART:
# smartctl -A /dev/sda
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       105
  2 Throughput_Performance  0x0026   054   054   000    Old_age   Always       -       2759
  3 Spin_Up_Time            0x0023   084   084   025    Pre-fail  Always       -       4989
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       11496
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   060   000    Old_age   Always       -       32 (Lifetime Min/Max 20/40)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       10
If you see a zero at the end of these attributes but the disk still reports that it has trouble reading sectors, it indicates that the sector reallocation process hasn't kicked in yet, and needs to be triggered manually.
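If you want to keep an eye on just those counters, you can filter them out of the smartctl output. A quick sketch; the sample line below is copied from the output above, while on a live system you would pipe smartctl -A into the same awk:

```shell
# Parse the raw value (last column) of a given SMART attribute.
# Sample line copied from the smartctl output above:
sample='198 Offline_Uncorrectable   0x0030   252   100   000    Old_age   Offline      -       0'
raw=$(echo "$sample" | awk '$2 == "Offline_Uncorrectable" {print $NF}')
echo "Offline_Uncorrectable raw value: $raw"
# On a live system:
#   smartctl -A /dev/sda | awk '$2 == "Offline_Uncorrectable" {print $NF}'
```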

The first order of the day then is to find the address of the sector(s) we should trigger a write to. This is fairly easy, as all you need to do is run a SMART test, with something like smartctl -t long /dev/sda and write down the first sector address where a read error is reported:
# smartctl -a /dev/sda
(...)
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%     10864         293039329
(...)
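If you want to script this, note that the sector address is simply the last column of that log line. A sketch, reusing the sample output above:

```shell
# Extract LBA_of_first_error from the self-test log line.
# Sample line from the log above; live: smartctl -a /dev/sda | grep '^# 1'
logline='# 1  Extended offline    Completed: read failure       60%     10864         293039329'
lba=$(echo "$logline" | awk '{print $NF}')
echo "first bad sector: $lba"
```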
Once we have that address, we could of course use dd, but an even simpler approach is to use a recent version of hdparm, as it adds easy support for reading/writing a single sector.
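For the record, the dd route would look something like this (a sketch, with the dangerous commands left commented out since they are destructive; bs must match the logical sector size, 512 bytes on these drives):

```shell
LBA=293039329                 # bad sector from the SMART log
offset=$((LBA * 512))         # byte offset, in case you prefer bs=1 with seek/skip
echo "sector $LBA starts at byte $offset"
# Read the sector (expect an I/O error if it is bad):
#   dd if=/dev/sda of=/dev/null bs=512 skip=$LBA count=1
# Overwrite it with zeroes to trigger reallocation (DESTROYS its data):
#   dd if=/dev/zero of=/dev/sda bs=512 seek=$LBA count=1
```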

First thing to try with hdparm then, is confirm that we have a problem accessing that sector:
# hdparm --read-sector 293039329 /dev/sda

/dev/sda: Input/Output error
This confirms what the SMART test reported. You can try a few more read attempts, to validate that the sector is busted, and then, you can issue a write so that the disk finally realizes it should reallocate that sector. Note that, because the operation obviously means destroying existing data, hdparm requires you to add a --yes-i-know-what-i-am-doing flag to issue the write, hence:
# hdparm --yes-i-know-what-i-am-doing --write-sector 293039329 /dev/sda

/dev/sda: re-writing sector 293039329: succeeded
You can then issue a read again, which will confirm that the sector has been reallocated:
# hdparm --read-sector 293039329 /dev/sda

/dev/sda:
reading sector 293039329: succeeded
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
(...)
If you issue smartctl -A again, you should also see that the sector has been reallocated:
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -        1
It's usually a good idea to use hdparm to read adjacent sectors as well, and correct them as needed, then repeat the operations above until the SMART self test completes without error and you have smoked out all the problematic sectors. At this stage, if you issue a resync of the array with the new disk, it should complete successfully and redundancy will be restored. Time to order another replacement and check your data for corruption. But at least, you are redundant again.
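The adjacent-sector check is easy to automate. A sketch; DEV, the LBA and the scan radius are assumptions you should adjust to your own case:

```shell
#!/bin/sh
DEV=/dev/sda          # the failing disk (adjust!)
LBA=293039329         # bad sector reported by the SMART self-test
RADIUS=16             # how many sectors to probe on each side
START=$((LBA - RADIUS)); END=$((LBA + RADIUS))
echo "scanning sectors $START..$END on $DEV"
for s in $(seq "$START" "$END"); do
    if ! hdparm --read-sector "$s" "$DEV" >/dev/null 2>&1; then
        echo "unreadable: $s"
        # To force reallocation of that sector (DESTROYS its data):
        #   hdparm --yes-i-know-what-i-am-doing --write-sector "$s" "$DEV"
    fi
done
```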

Addons
  • To get details of your md array, you can use mdadm --detail. Eg.
    # mdadm --detail /dev/md2
    /dev/md2:
            Version : 0.90
      Creation Time : Tue May  6 18:43:16 2008
         Raid Level : raid1
         Array Size : 130030016 (124.01 GiB 133.15 GB)
      Used Dev Size : 130030016 (124.01 GiB 133.15 GB)
       Raid Devices : 2
      Total Devices : 3
    Preferred Minor : 2
        Persistence : Superblock is persistent
    
        Update Time : Tue Jan 10 13:42:29 2012
              State : clean
     Active Devices : 2
    Working Devices : 3
     Failed Devices : 0
      Spare Devices : 1
    
               UUID : 0be47c81:ede086ae:0c460403:d81de298
             Events : 0.3658859
    
        Number   Major   Minor   RaidDevice State
           0       8        3        0      active sync   /dev/sda3
           1       8       19        1      active sync   /dev/sdb3
    
           2       8       35        -      spare   /dev/sdc3
  • You are strongly encouraged to check your syslog or messages for reports of I/O issues, especially if you want to locate the data that may have been affected.
  • This method is not guaranteed to work! Sometimes a SMART test will report a read error, but a readout of the sector using hdparm will work fine, so you won't be able to get the disk to reallocate it. However, this shouldn't matter too much for the md resync, which is what we are interested in here.
  • If your disk has a lot of unreadable sectors, it is possible that you may run out of spare sectors for reallocation. It's hard to say how many spare sectors are made available by hard drive manufacturers, but I assume it isn't that many.
  • You may have a problem recompiling a recent version of hdparm on some older Linux systems:
    fallocate.c: In function ‘do_fallocate_syscall’:
    fallocate.c:39: error: ‘__NR_fallocate’ undeclared (first use in this function)
    fallocate.c:39: error: (Each undeclared identifier is reported only once
    fallocate.c:39: error: for each function it appears in.)
    make: *** [fallocate.o] Error 1
    If that is the case, just add:
    #define __NR_fallocate 285
    in fallocate.c
  • Some disks seem to be smart enough (no pun intended) to do further corrections once they have registered Offline_Uncorrectable sectors, so you may actually find out that, after a few hours, the value of Offline_Uncorrectable falls back to zero, and that the sectors can be read and written again, with extended SMART tests not reporting any issue. Pretty neat, but I still wouldn't entirely trust the disk...

2012-01-04

Using LILO to boot disks by UUID

If you're plugging USB drives in and out and using LILO to boot a Linux distro (eg. Slackware), you may have ended up with a kernel panic because your /dev/sd# devices were shuffled around and the kernel was no longer able to find its root partition on the expected device. Of course, having Linux fail to boot just because you happened to plug in an extra drive sucks big time, so we want to fix that.

The well-known solution of course is to use UUIDs or labels, since these are fixed. However, while recent versions of LILO are supposed to support root partitions that are identified by UUID/label, in practice, this doesn't work UNLESS you are using an initrd disk. I'm not sure whether LILO or the kernel is responsible for this new layer of "suck" (I'd assume the kernel, since the expectation is that LILO is using the dev mappings that are being fed by the kernel), but I can only say that there really are some areas of Linux that could still benefit from long-awaited improvements...

Thus, to be able to use UUIDs or labels for your root partition in LILO, you must boot using an initrd. Worse, as previously documented, you will most likely need to compile a new kernel that embeds the initrd, unless you want to run into the following issue while running LILO:
Warning: The initial RAM disk is too big to fit between
the kernel and the 15M-16M memory hole.

In practice (as also illustrated by this post), this means you will need to:
  1. Create an initrd cpio image that can be embedded into a kernel with:
    cd /boot
    mkinitrd -c
    cd initrd-tree
    find . | cpio -H newc -o > ../initrd.cpio
  2. Recompile a kernel, while making sure that you have the General Setup → Initial RAM filesystem and RAM disk (initramfs/initrd) support selected, and then set General Setup → Initramfs source file(s) to /boot/initrd.cpio

  3. Edit your /etc/lilo.conf and add an append = "root=UUID=<YOUR-DISK-GUID>" to your Linux boot entry. An example of a working lilo.conf is provided below. Note that you probably also want to use a fixed ID for boot=, so that running LILO is also not dependent on the current /dev/sd# organization.

  4. Run LILO, plug drives around and watch in amazement as your system still boots the Linux partition regardless of how the drives are assigned
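To find the UUID to feed to LILO, blkid is the usual tool. A sketch; the device name and filesystem type in the sample line are assumptions, with the UUID taken from this post's example:

```shell
# On a live system: blkid /dev/sda2   (device name is an assumption)
# Typical output format; the UUID matches this post's example:
line='/dev/sda2: UUID="2cc11aaf-f838-4474-9d9a-f3881569f97c" TYPE="ext4"'
uuid=$(echo "$line" | sed 's/.*UUID="\([^"]*\)".*/\1/')
echo "append = \"root=UUID=$uuid\""
```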
Example lilo.conf:
# Start LILO global section
boot = /dev/disk/by-id/ata-ST3320620AS_ABCD1234
compact
lba32
# LILO doesn't like same volume IDs of RAID 1
disk = /dev/sdb inaccessible
default = Windows
bitmap = /boot/slack.bmp
bmp-colors = 255,0,255,0,255,0
bmp-table = 60,6,1,16
bmp-timer = 65,27,0,255
# Append any additional kernel parameters:
append=" vt.default_utf8=1"
prompt
timeout = 35
# End LILO global section

image = /boot/vmlinuz
  append = "root=UUID=2cc11aaf-f838-4474-9d9a-f3881569f97c"
  label = Linux
  read-only
image = /boot/vmlinuz.rescue
  append = "root=UUID=2cc11aaf-f838-4474-9d9a-f3881569f97c"
  label = Rescue
  read-only
other = /dev/sda
  # Windows doesn't go to S3 sleep and has issues with backup,
  # unless it sees its disk as first in BIOS...
  boot-as = 0x80
  label = Windows
other = /dev/disk/by-id/ata-ST3320620AS_ABCD1234-part4
  label = OSX
Oh, and of course, don't forget to edit your /etc/fstab as required, if you still use /dev/sdX# entries there.
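For reference, the matching UUID-based root entry in /etc/fstab would look something like this (the filesystem type here is an assumption):

```
UUID=2cc11aaf-f838-4474-9d9a-f3881569f97c   /   ext4   defaults   1   1
```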