How to securely keep a hard drive with bad blocks in a raid array

I’m using a custom-made NAS at home, with two 1 TB hard drives. Some of the partitions on these two disks are organized in a RAID-1 setup using the Linux md driver (with the mdadm user-space tool). This morning, I found some scary lines in dmesg. Any sysadmin sighs when he sees those (and cries if he doesn’t have backups, but all sysadmins have backups, right?)

# dmesg
ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/80:40:56:2b:a9/00:00:73:00:00/40 tag 8 ncq 65536 in
         res 41/40:00:90:2b:a9/00:00:73:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda]  Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
        73 a9 2b 90 
sd 0:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] CDB: Read(10): 28 00 73 a9 2b 56 00 00 80 00
end_request: I/O error, dev sda, sector 1940466576
ata1: EH complete

The “auto reallocate failed” part especially sucks. A quick look at smartctl on the faulty underlying drive of the raid1 showed a not-so-nice value of 13 for Offline_Uncorrectable. Any sysadmin in any company would then just replace the faulty drive, and happily wait for the raid to resync. But when it’s your home drive, and we’re talking about 13 faulty blocks out of several zillion, it suddenly seems a bit stupid to throw the hard drive away when 99.9999993% of the blocks are perfectly okay (yes, this is the actual ratio).

I’m using ext4 for this raid partition, so I wanted to take advantage of the bad blocks mechanism of this filesystem, as mdraid doesn’t (yet?) have any such mechanism. I read the entire partition to locate the problems, and a grep in dmesg showed that the bad blocks were between filesystem blocks 1015773 and 1015818 of the partition (which uses a block size of 4K, as reported by tune2fs -l). So I took a safety margin and decided to blacklist the blocks from 1015700 to 1015900. I first made a list of all the impacted inodes, using debugfs:

# seq 1015700 1015900 | sed -re 's/^/icheck /' | debugfs /dev/sda3 2>/dev/null | awk '/^[0-9]+[[:space:]]+[0-9]+$/ {print $2}' | tee badinodes
125022
125022
125022
[... snip ...]
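
A side note on the numbers: dmesg reports absolute 512-byte sectors of the whole disk, while debugfs wants filesystem block numbers relative to the start of the partition. Here is a minimal sketch of that conversion, assuming 512-byte sectors and 4K blocks by default. The function name sector_to_block and the sample numbers are made up for illustration; they are not the real layout of my disk.

```shell
# Map an absolute disk sector (as printed by dmesg) to a filesystem block
# number inside a partition.
# Arguments: sector, partition start sector, [sector size], [fs block size].
# On a real system the start sector can be read from
# /sys/block/sda/sda3/start and the block size from tune2fs -l.
sector_to_block() {
    sector=$1
    part_start=$2
    sector_size=${3:-512}
    block_size=${4:-4096}
    echo $(( (sector - part_start) * sector_size / block_size ))
}

# Hypothetical numbers: partition starting at sector 2048, error at sector 10240.
sector_to_block 10240 2048    # prints 1024
```

Eight 512-byte sectors share one 4K block, so several sector errors in dmesg often collapse into the same block number.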

Then searched the file names attached to those inodes:

# sort -u badinodes | sed -re 's/^/ncheck /' | debugfs /dev/sda3 2>/dev/null | awk '/^[0-9]/ { $1=""; print }' | tee badfiles
 /usr/src/linux-headers-2.6.35-28/arch/microblaze/include
 /usr/src/linux-headers-2.6.35-28/arch/microblaze/include/asm
 /var/lib/mlocate/mlocate.db
[... snip ...]

In my case, it was only non-critical system files, but if more important files are impacted, making a backup from the good partition would probably be a good idea, just in case… I rebooted into a live system to be able to work on my root filesystem, and started the array with only the good drive in it:

# mdadm --assemble /dev/md0 /dev/sdb3
mdadm: /dev/md0 has been started with 1 drive (out of 2).
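
The range I blacklist, 1015700 to 1015900, happens to be the detected range 1015773 to 1015818 widened outward to the nearest hundred. A small sketch of that rounding, in case you want to reproduce it for other numbers; the helper name widen_range and the choice of a step of 100 are just my illustration, pick whatever margin you are comfortable with.

```shell
# Widen a block range outward to multiples of a step.
# Arguments: first bad block, last bad block, step (e.g. 100).
# Prints "START END" on one line.
widen_range() {
    first=$1
    last=$2
    step=$3
    start=$(( first / step * step ))               # round down
    end=$(( (last + step - 1) / step * step ))     # round up
    echo "$start $end"
}

widen_range 1015773 1015818 100    # prints "1015700 1015900"
```

The output can be passed straight to seq, for instance `seq $(widen_range 1015773 1015818 100)`, to produce the block list.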

Then I used fsck to manually add the list of bad blocks:

# seq 1015700 1015900 > badblocks
# fsck.ext4 -C 0 -l badblocks -y /dev/md0
e2fsck 1.41.11 (14-Mar-2010)
slash: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
                                                                               
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 125022: 1015700 1015701 1015702 1015703 1015704 1015705 1015706 1015707 1015708 1015709 1015710 1015711 1015712 1015713 1015714 1015715 1015716 1015717 1015718 1015719 1015720 1015721 1015722 1015723 1015724 1015725 1015726 1015727 1015728 1015729 1015730 1015731 1015732 1015733 1015734 1015735 1015736 1015737 1015738 1015739 1015740 1015741 1015742 1015743
Multiply-claimed block(s) in inode 179315: 1015744
Multiply-claimed block(s) in inode 179316: 1015745
Multiply-claimed block(s) in inode 179317: 1015746
Multiply-claimed block(s) in inode 179318: 1015747
Multiply-claimed block(s) in inode 179319: 1015748
Multiply-claimed block(s) in inode 179320: 1015749
[... snip ...]
Multiply-claimed block(s) in inode 179376: 1015805
Multiply-claimed block(s) in inode 179377: 1015806
Multiply-claimed block(s) in inode 179378: 1015807
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 65 inodes containing multiply-claimed blocks.)

File /lib/modules/2.6.35-28-generic-pae/kernel/net/sunrpc/sunrpc.ko (inode #125022, mod time Tue Mar  1 15:57:40 2011) 
  has 44 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

File /usr/src/linux-headers-2.6.35-28/arch/cris/arch-v10/lib (inode #179315, mod time Sat Mar 19 05:32:10 2011) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

File /usr/src/linux-headers-2.6.35-28/arch/cris/boot (inode #179316, mod time Sat Mar 19 05:32:10 2011) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

File /usr/src/linux-headers-2.6.35-28/arch/cris/boot/compressed (inode #179317, mod time Sat Mar 19 05:32:10 2011) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

[... snip ...]

File /usr/src/linux-headers-2.6.35-28/arch/microblaze/lib (inode #179376, mod time Sat Mar 19 05:32:10 2011) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

File /usr/src/linux-headers-2.6.35-28/arch/microblaze/include (inode #179377, mod time Sat Mar 19 05:32:02 2011) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

File /usr/src/linux-headers-2.6.35-28/arch/microblaze/include/asm (inode #179378, mod time Sat Mar 19 05:32:10 2011) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
        <The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity                                        
Pass 4: Checking reference counts                                              
Pass 5: Checking group summary information                                     
Free blocks count wrong for group #0 (7663, counted=7555).                     
Fix? yes

Free blocks count wrong for group #30 (387, counted=495).
Fix? yes

                                                                               
slash: ***** FILE SYSTEM WAS MODIFIED *****
slash: 261146/472352 files (0.2% non-contiguous), 1733488/1886240 blocks

Look at how fsck detects the inodes that claim the same blocks. This is totally normal, as some of the bad blocks belong to files (as we found above), hence are referenced both in the bad blocks inode AND in the real files’ inodes. The fsck fix is exactly what we need: it duplicates each data block so that each inode has its own copy. … wait, did it modify the files’ inodes or the bad blocks inode?

# dumpe2fs -b /dev/md0
dumpe2fs 1.41.11 (14-Mar-2010)
1015700
1015701
[... snip ...]
1015899
1015900
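
Rather than eyeballing the list, the recorded blocks can be compared with the ones that were requested. A minimal sketch, where same_blocks is a made-up helper name and the file name recorded is arbitrary:

```shell
# Succeeds if two files hold the same set of block numbers, whatever the order.
# Arguments: file with the requested blocks, file with the recorded blocks.
same_blocks() {
    a=$(sort -n -u "$1")
    b=$(sort -n -u "$2")
    [ "$a" = "$b" ]
}

# Usage sketch (needs root and the real array, so left commented out):
#   dumpe2fs -b /dev/md0 2>/dev/null | grep -E '^[0-9]+$' > recorded
#   same_blocks badblocks recorded && echo "bad block list matches"
```

The grep only keeps purely numeric lines, so it does not matter whether the dumpe2fs version banner ends up in the output or not.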

Alright, fsck did exactly the right thing: it modified the real files’ inodes, copying the data from the blocks in the bad blocks list to new unallocated blocks. As this was done with the working disk of the raid array, the impacted files have not lost their integrity. Just for fun, let’s check that the blocks we wanted to ban are indeed no longer used by any real inode:

# seq 1015700 1015900 | sed -re 's/^/icheck /' | debugfs /dev/md0 2>/dev/null
debugfs:  Block Inode number
1015700 <block not found>
debugfs:  Block Inode number
1015701 <block not found>
debugfs:  Block Inode number
1015702 <block not found>
debugfs:  Block Inode number
1015703 <block not found>
debugfs:  Block Inode number
1015704 <block not found>
debugfs:  Block Inode number
1015705 <block not found>
debugfs:  Block Inode number
1015706 <block not found>
debugfs:  Block Inode number
1015707 <block not found>
[... snip ...]
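
Scrolling through 201 “block not found” lines is a good way to miss the one that differs. Here is a small filter, reusing the awk pattern from the first debugfs command, that prints only the blocks still owned by an inode. The name blocks_still_used is mine, and this assumes the icheck output format shown above.

```shell
# Read debugfs icheck output on stdin and print every block that still maps
# to an inode. Lines reading "<block not found>" and prompt lines are skipped.
# Exit status: 0 when no block is still in use, 1 otherwise.
blocks_still_used() {
    awk '/^[0-9]+[[:space:]]+[0-9]+$/ { print $1; found = 1 }
         END { exit (found ? 1 : 0) }'
}

# Usage sketch (needs root and the real array, so left commented out):
#   seq 1015700 1015900 | sed -re 's/^/icheck /' \
#       | debugfs /dev/md0 2>/dev/null | blocks_still_used \
#       && echo "no banned block is in use"
```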

Okay! We now just have to resync the array over the bad disk:

# mdadm --re-add /dev/md0 /dev/sda3
mdadm: re-added /dev/sda3

Wait for it to finish… (check /proc/mdstat to see the progress), then reboot! :)
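
If you’d rather not stare at /proc/mdstat, the waiting can be scripted. A rough sketch, assuming the usual mdstat progress line containing “recovery =” or “resync =” while a rebuild is running; rebuild_running is a made-up helper name and the 60-second interval is arbitrary.

```shell
# Succeeds while /proc/mdstat-style text on stdin shows a rebuild in progress
# (or one that is still queued, such as "resync=DELAYED").
rebuild_running() {
    grep -Eq '(recovery|resync) *='
}

# Poll until the rebuild is done (commented out: needs a real md array).
#   while rebuild_running < /proc/mdstat; do sleep 60; done
# mdadm can also block until the array is idle:
#   mdadm --wait /dev/md0
```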
