How OCZ Vertex 2 and OCZ Petrol SSD successfully killed my data. Twice.

Short story of my Vertex 2’s short life

 

Several months ago, I wanted to give my aging laptop a little speed boost. It also seemed like the perfect time to give a first try to the SSDs everybody had been talking about. All in all, the OCZ Vertex 2 seemed like a good choice: the Vertex 3 had just come out, so the Vertex 2's price was dropping, making it almost affordable.

vertex2

On that fateful day of 25 September 2011, I bought a Vertex 2 online and installed it in my laptop. The change was so noticeable! It was really a great buy… except that just 4 months later, when powering on my laptop, I found it hanging at the BIOS hard drive detection step, only to time out after 30 seconds and blatantly state that no hard drive was found. Nothing I could do would bring the drive back, even when trying it in another computer. After some research online, I found out that the exact same thing had happened to countless people around the globe.

My hard drive had just entered panic lock mode, a deadly and overly stupid undocumented feature of the SandForce controller it ships with, which meant that all the data stored on it was irretrievably lost and the disk itself was dead for good. Just like that. For no apparent reason.

So, I got it refunded and bought a replacement, an Intel 320 “PostVille refresh”

320pvillerefresh

…which is still alive today and works like a charm. And I swore I would never buy OCZ again.

 

Never forgive, never forget

 

It should have been my motto that day, but it wasn't. In September 2012, I decided to replace my old laptop with a full-featured PC. I bought an OEM pre-built one, as it had all the parts I wanted and was cheaper than buying them separately. It had a hybrid mass-storage configuration: a 1 TB HDD and an SSD. Unfortunately, the SSD was an OCZ. Well, the PC was cheap enough, and I had almost forgiven OCZ for their terrible Vertex 2. This one was an OCZ Petrol, and it was not powered by a SandForce controller, so I thought I was safe from the stupid panic mode.

petrol

Well, I was right… but unfortunately what I got was actually way worse than a panic lock, and way more insidious, as I only found out what it was doing months after it had started destroying my data.
This is the nightmare of any sysadmin (or anybody who cares about their data): silent data corruption.

It started with randomly crashing programs. The first one was Chromium: one day it just started to crash every time I tried to launch it. I thought it had something to do with Flash or some other broken plugin, but I didn't even bother to look into it; I just shrugged and switched to Firefox. Some time later Firefox started to crash too, so I switched to Midori, the XFCE browser, telling myself I would look into it later. Yes, I can be extremely lazy at times.

I eventually came to the conclusion that Flash had nothing to do with it when I started to get freezes and floods of unreadable-sector messages in the kernel log. Hell, the drive was (again) just 4 months old! And all the SMART indicators as seen by smartctl were green! Bah, yet another cursed drive from OCZ. I booted a live USB system to run ddrescue and get as much data back as possible from the drive before it could suffer a sudden death, as I knew anything was possible with an OCZ drive.


 

Welcome to data rescue hell

 

I had already used ddrescue and friends numerous times on mechanical hard drives (yes, I'm unlucky with hard drives), and it had always worked pretty well. This was the first time I used it on an SSD, however… well, it was a nightmare and took me three days.

I was getting read speeds of exactly 150 KB/s on very large areas of the drive, which is pretty frightening for an SSD. For the record, OCZ rates this drive at 350 MB/s sequential read. Let me spare you the maths: this is roughly 2390 times slower than the advertised speed. But even that would have been too easy: when OCZ wants to eat your data, it won't let you get it back so easily. The main problem was not even the speed: whenever the drive tried to read a problematic sector, it would freeze the SATA bus for 2 whole minutes, after which a kernel timeout fired and tried to reset the SATA bus to get it back into a working state. But the reset never worked and always timed out too, and that's when the /dev/sda block device would disappear, each and every time. ddrescue would then happily continue trying to read from the now non-existent device, and conclude that none of the remaining data was readable.

I was about to patch ddrescue to handle this situation more gracefully, but in the end I just used the option that tells it to stop after the first error, because by that point the SSD had become unreachable from the OS anyway. I could do nothing to make it reappear in a scriptable way, not even triggering a SATA bus reset or rescan… the only thing that worked was to electrically unplug the drive and plug it back in, then instruct the kernel to rescan the SATA bus for new devices. Pretty brutal, but only that or a reboot would work. It meant I had to sit there, in front of the computer, imitating a robot that pulls and pushes a plug. Really annoying. Anyway, after three days of plugging/unplugging between rare successful reads at 150 KB/s, it finally recovered 100% of my data. Supposedly, at least: strangely enough, some sectors that were unreadable one day ended up being readable the day after.
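For the record, here is a minimal sketch of the kind of loop I ended up running from the live system. The image and log paths are just examples, host0 is whichever SCSI host your drive sits on, and the exact name of the "stop after the first error" option has changed across ddrescue releases (-e / --max-errors in older versions, --max-bad-areas in newer ones), so check your version's --help:

root# ddrescue -d -e 1 /dev/sda /mnt/rescue/petrol.img /mnt/rescue/petrol.log
# ...power-cycle the drive by hand, then ask the kernel to rescan the SATA bus:
root# echo "- - -" > /sys/class/scsi_host/host0/scan
# ...then re-run the exact same ddrescue command: the log file makes it resume where it stopped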

The rescued SSD image almost passed the filesystem check, which was reassuring. After copying it to another drive and booting from it, however, I still had the Chromium and Firefox crashes. So, I instructed rpm to verify the checksums of all the files belonging to installed packages. It found a lot of mismatches, and Chromium and Firefox were of course among them. The fix was easy: reinstalling the affected packages. And indeed, it worked.
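For reference, this is roughly what the check and the fix boil down to (a '5' in the third column of the rpm verification output flags a digest mismatch; the package names are only examples):

root# rpm -Va | grep '^..5'
root# yum reinstall chromium firefox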

 

Silent corruption?

 

But what about my data? Not the stuff in /usr or /var, the stuff in $HOME? Why wouldn't it be corrupted too? How can you trust your backups when you have a drive that has been silently corrupting your data for months? Well, you can't.

As my backups are incremental and rsync-based (rsnapshot-based, actually), the decision to re-upload a modified file is made by looking at its size and modification date. If these match on both ends, the file is assumed unmodified, so there is no need to back it up again. The problem with silent corruption, you see, is that, well, it's silent. Which means that when some bytes are modified by a faulty hard drive, the size and modification date won't be. The different cases that can happen to a given file are:

  1. File backed up cleanly before I switched to the faulty drive, later corrupted by it, but never modified by me or a program before I did the data recovery. Then, the copy in the backup is safe.
  2. File backed up cleanly before I switched to the faulty drive, later corrupted by it, then modified by me or a program, hence backed up again, with these legitimate-modification / corruption / incremental-backup cycles possibly repeated several times. Here, by comparing two adjacent backups, there is no way to tell which parts of the changes are legitimate and which parts originate from the corruption.
  3. File created after I switched to the faulty drive, and maybe corrupted by it before its first backup, so there is no way to tell whether the first or indeed any of its later backups are safe.

In short, it's a pretty lame situation. The last two cases are more or less hopeless, except for files where the corruption is obvious, like images for example. For the first case, there's still hope of getting the clean data back from the backups. But how to detect this case?

New generations of filesystems like ZFS or Btrfs are supposed to mitigate this by checksumming everything, from data to metadata. There, silent corruption can't go undetected, and can even be fixed if you have enough replicas of your data, through a process called scrubbing. Anyway, in my case it was classic ext4, so no luck.
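For the record, a scrub is a one-liner on those filesystems; a minimal sketch, assuming a Btrfs filesystem mounted on /mnt/data or a ZFS pool named tank:

root# btrfs scrub start /mnt/data
root# zpool scrub tank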

 

Tentatively detect and fix silent corruption from incremental backups

 

I started rsyncd on my NAS (which holds my backups) with the following configuration:

root@nas:~# cat /etc/rsyncd.conf
[backup]
comment = nas backup
path = /raid1/hdda/rsnapshot-quasar/monthly.4/speed/home/speed/
use chroot = yes
lock file = /var/lock/rsyncd
read only = yes
list = yes
uid = root
gid = nogroup
strict modes = yes
ignore errors = no
ignore nonreadable = yes
transfer logging = no
timeout = 600

I used the incremental backup from 4 months ago, which I know is not corrupted because it predates the faulty drive. The goal was to save the files and personal documents that have been sitting on my hard drive for a long time and don't get modified much. The uid is set to root so that rsyncd is sure to be able to read all the backup files (which may have several different uids that only make sense on my original computer anyway).

root@nas:~# rsync --no-detach --daemon --verbose --config /etc/rsyncd.conf

Then, on the PC with the rescued drive image, I launched the following command so that rsync lists the files it thinks are different between the backup source and the rescued data, that is, files with different sizes and/or modification times. Those files were genuinely modified between 4 months ago and now; maybe they were corrupted too in the process, but there's no way to tell. These are the files from the second case in the list above.

speed@quasar:~$ rsync -va --dry-run rsync://nas/backup/ . | tee rsync_list_1.txt

Once the first list is complete, populate a second one, with a crucial difference: even if the size and modification dates match, rsync will no longer assume the files are the same, it will compare their full contents (checksums). This second list should contain all the files from the first list, plus the files that have been silently corrupted (for which the content comparison will fail).

speed@quasar:~$ rsync -va --dry-run --checksum rsync://nas/backup/ . | tee rsync_list_2.txt

And then, a simple diff shows that a non-negligible number of my $HOME files are indeed corrupted:

speed@quasar:~$ diff -u rsync_list_{1,2}.txt | grep ^+ | wc -l
728
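From there, getting the actual list of silently corrupted files (and, eventually, restoring them from the clean backup) is just a matter of diffing the two lists and feeding the result back to rsync. A sketch of what this looks like; the list needs a quick manual cleanup to drop rsync's header/summary lines and plain directory entries:

speed@quasar:~$ diff rsync_list_{1,2}.txt | grep '^> ' | sed 's/^> //' > RSYNC.SILENT.CORRUPTION.LIST
speed@quasar:~$ rsync -av --files-from=RSYNC.SILENT.CORRUPTION.LIST rsync://nas/backup/ .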

A visual example of a diff between the backup and the current version of a text file is self-explanatory:

corrupted1

Thank you, OCZ.

Before restoring those files from my backup, I wanted to measure how many bytes were corrupted by the drive.

speed@quasar:~$ perl -ne 'm{(.+)} or next; @out=qx{cmp -l "/home/speed/mnt/sshfs/raid1/hdda/rsnapshot-quasar/monthly.4/speed/home/speed/$1" "$1"}; $corrupted+=scalar(@out); print $corrupted."\n"' RSYNC.SILENT.CORRUPTION.LIST | tail -n1
60904419
speed@quasar:~$ perl -ne 'm{(.+)} or next; $total+=(stat($1))[7]; print $total."\n"' RSYNC.SILENT.CORRUPTION.LIST | tail -n1
1951053630

So, this wonderful drive silently corrupted roughly 58 MiB of data in my backed-up files (the fucked-up files have a total size of 1860 MiB). It also corrupted other files (system ones, fixed by package reinstalls), plus all the files I couldn't detect because they had legitimate modifications on top of the corruption. Anyway, as I couldn't measure those, let's do OCZ a favor and not even count them. This disk has a capacity of 128 GB. It wasn't anywhere near full, but let's pretend it was, to do OCZ yet another favor. Hence, we can see that it corrupted roughly at least 1 bit out of every 2101 bits.

To compare with classic consumer-grade hard drives (Seagate, Western Digital, …): their data sheets usually rate them at one corrupted bit per 10^14 bits read or written. This is 1 bit corrupted every 100 000 000 000 000 bits.

From there, it's pretty straightforward, and backed by the numbers above, to come to this conclusion: an OCZ drive is roughly 47 596 382 675 times more likely to silently corrupt your data than any other drive, if you want to put it that way.

How to securely keep a hard drive with bad blocks in a raid array

I'm using a custom-made NAS at home, with two 1 TB hard drives.
Some of the partitions of these 2 disks are organized in a RAID-1 setup using the Linux md driver (and the mdadm user-space tool).

This morning, I found some scary lines in dmesg. Any sysadmin sighs when he sees these (and cries if he doesn't have backups, but all sysadmins have backups, right?).

root# dmesg
ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/80:40:56:2b:a9/00:00:73:00:00/40 tag 8 ncq 65536 in
res 41/40:00:90:2b:a9/00:00:73:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
73 a9 2b 90
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] CDB: Read(10): 28 00 73 a9 2b 56 00 00 80 00
end_request: I/O error, dev sda, sector 1940466576
ata1: EH complete

The “auto reallocate failed” part especially sucks.
A quick look at smartctl on the faulty underlying drive of the RAID-1 showed an unpleasant value of 13 for Offline_Uncorrectable.
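For reference, this is the kind of quick check I mean, straight from the SMART attribute table:

root# smartctl -A /dev/sda | grep -i -e offline_uncorrectable -e reallocated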

Any sysadmin in any company would then simply replace the faulty drive and happily wait for the RAID to resync. But when it's your home drive, and we're talking about 13 faulty blocks out of several zillion, it suddenly seems a bit stupid to just throw the hard drive away when 99.9999993% of its blocks are perfectly okay (yes, that's the actual ratio).

I’m using EXT4 for this raid partition, so I wanted to take advantage of the badblocks mechanism of this filesystem, as mdraid doesn’t (yet?) have any such mechanism.

I ran a read of the entire partition to locate where the problems were, and a grep in dmesg determined that the bad blocks were between logical blocks 1015773 and 1015818 of the partition (whose filesystem uses a 4K block size, as reported by tune2fs -l).
So, I took a safety margin and decided to blacklist the logical blocks from 1015700 to 1015900.
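Something along these lines does the job for the scan and for the sector-to-block conversion (I used a plain sequential read, but badblocks in its default read-only mode works just as well; the partition start sector below is a made-up example, take the real one from fdisk -lu):

root# badblocks -b 4096 -sv /dev/sda3        # read-only scan of the whole partition
root# dmesg | grep 'I/O error'               # absolute 512-byte sectors of the failed reads
root# START=1234567                          # hypothetical start sector of /dev/sda3
root# echo $(( (1940466576 - START) / 8 ))   # partition-relative 4K block number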

I first made a list of all the impacted inodes, using debugfs:

root# seq 1015700 1015900 | sed -re 's/^/icheck /' | debugfs /dev/sda3 2>/dev/null | awk '/^[0-9]+[[:space:]]+[0-9]+$/ {print $2}' | tee badinodes
125022
125022
125022
[... snip ...]

Then I looked up the file names attached to those inodes:

root# sort -u badinodes | sed -re 's/^/ncheck /' | debugfs /dev/sda3 2>/dev/null | awk '/^[0-9]/ { $1=""; print }' | tee badfiles
/usr/src/linux-headers-2.6.35-28/arch/microblaze/include
/usr/src/linux-headers-2.6.35-28/arch/microblaze/include/asm
/var/lib/mlocate/mlocate.db
[... snip ...]

In my case, it was only non-critical system files, but if more important files are impacted, doing a backup from the good partition would probably be a good idea, just in case…

I rebooted on a live system to be able to work with my root filesystem, and started the array with only the good drive in it.

root@PartedMagic # mdadm --assemble /dev/md0 /dev/sdb3
mdadm: /dev/md0 has been started with 1 drive (out of 2).

And used fsck to manually add a list of the badblocks.

root@PartedMagic # seq 1015700 1015900 > badblocks
root@PartedMagic # fsck.ext4 -C 0 -l badblocks -y /dev/md0
e2fsck 1.41.11 (14-Mar-2010)
slash: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 125022: 1015700 1015701 1015702 1015703 1015704 1015705 1015706 1015707 1015708 1015709 1015710 1015711 1015712 1015713 1015714 1015715 1015716 1015717 1015718 1015719 1015720 1015721 1015722 1015723 1015724 1015725 1015726 1015727 1015728 1015729 1015730 1015731 1015732 1015733 1015734 1015735 1015736 1015737 1015738 1015739 1015740 1015741 1015742 1015743
Multiply-claimed block(s) in inode 179315: 1015744
Multiply-claimed block(s) in inode 179316: 1015745
Multiply-claimed block(s) in inode 179317: 1015746
Multiply-claimed block(s) in inode 179318: 1015747
Multiply-claimed block(s) in inode 179319: 1015748
Multiply-claimed block(s) in inode 179320: 1015749
[... snip ...]
Multiply-claimed block(s) in inode 179376: 1015805
Multiply-claimed block(s) in inode 179377: 1015806
Multiply-claimed block(s) in inode 179378: 1015807
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 65 inodes containing multiply-claimed blocks.)
File /lib/modules/2.6.35-28-generic-pae/kernel/net/sunrpc/sunrpc.ko (inode #125022, mod time Tue Mar 1 15:57:40 2011)
has 44 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/cris/arch-v10/lib (inode #179315, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/cris/boot (inode #179316, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/cris/boot/compressed (inode #179317, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
[... snip ...]
File /usr/src/linux-headers-2.6.35-28/arch/microblaze/lib (inode #179376, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/microblaze/include (inode #179377, mod time Sat Mar 19 05:32:02 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/microblaze/include/asm (inode #179378, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (7663, counted=7555).
Fix? yes
Free blocks count wrong for group #30 (387, counted=495).
Fix? yes

slash: ***** FILE SYSTEM WAS MODIFIED *****
slash: 261146/472352 files (0.2% non-contiguous), 1733488/1886240 blocks

Look at how fsck detects the inodes claiming the same blocks. This is totally normal, as some of the bad blocks belong to files (as we found above), and are therefore referenced both in the badblocks inode AND in the real files' inodes. The fsck fix is exactly what we need: it duplicates the data blocks so that each inode gets its own copy.

… wait, did it modify the files' inodes or the badblocks inode?

root@PartedMagic # dumpe2fs -b /dev/md0
dumpe2fs 1.41.11 (14-Mar-2010)
1015700
1015701
[... snip ...]
1015899
1015900

Alright, fsck did exactly the right thing: it modified the real files' inodes, copying the data from the blacklisted blocks to newly allocated ones. As this was done using the working disk of the RAID array, the impacted files have not lost their integrity.

Just for fun, let's check that the blocks we wanted to ban data from are indeed no longer used by any real inode:

root@PartedMagic # seq 1015700 1015900 | sed -re 's/^/icheck /' | debugfs /dev/md0 2>/dev/null
debugfs: Block Inode number
1015700 <block not found>
debugfs: Block Inode number
1015701 <block not found>
debugfs: Block Inode number
1015702 <block not found>
debugfs: Block Inode number
1015703 <block not found>
debugfs: Block Inode number
1015704 <block not found>
debugfs: Block Inode number
1015705 <block not found>
debugfs: Block Inode number
1015706 <block not found>
debugfs: Block Inode number
1015707 <block not found>
[... snip ...]

Okay!

We now just have to resync the array onto the bad disk:

root@PartedMagic # mdadm --re-add /dev/md0 /dev/sda3
mdadm: re-added /dev/sda3

Wait for it to finish (check /proc/mdstat to see the progress), then reboot! :)

lzop vs compress vs gzip vs bzip2 vs lzma vs lzma2/xz benchmark, reloaded

I had a couple of interesting comments on my last attempt to benchmark these algorithms.
So, here is a more complete benchmark, with hopefully more detailed results.

1) Benchmark protocol

We are benchmarking all the algorithms supported by recent tar versions (1.22 was used):

program  | extension            | version     | comment                                                  | supported compression levels
lzop     | .lzop                | 1.02rc1     | known to be very fast                                    | 1 to 9, but 2 to 6 are equivalent, 3 by default
compress | .Z                   | 4.2.4       | the legacy UNIX compression algorithm                    | not configurable
gzip     | .gz (.tgz)           | 1.3.12      | replaced compress in recent UNIX-like operating systems  | 1 to 9, 6 by default
bzip2    | .bzip2 (.tbz, .tbz2) | 1.0.5       | known to have a better compression ratio than gzip, but much slower | 1 to 9, 9 by default
lzma     | .lzma                | 4.999.9beta | new algorithm aiming at high compression ratios          | 0 to 9, 6 by default
lzma2    | .xz (.txz)           | 4.999.9beta | xz is a compression format that uses the lzma2 algorithm by default; it adds a few features over lzma, such as integrity checking | 0 to 9, 6 by default

Benchmark protocol at a glance:

  • I used the Linux 2.4.0 kernel archive contents as the data to compress. The uncompressed version takes 100 132 718 bytes of disk space (about 95.5 MiB).
  • Each algorithm has been tested with all its supported compression levels
  • The resulting archive size has of course been measured
  • Compression and decompression tests have been run 3 times per algorithm and per compression level
  • The RAM used has been measured during both compression and decompression
  • The time elapsed during compression and decompression has been measured
  • All those tests have been done in /dev/shm (i.e. in memory) to avoid disk I/O overhead
  • I tried to use the multithreading features of LZMA/LZMA2, but it is not implemented yet, as reported by the man page and as confirmed by my own tests

For reference, the following script has been used to automate the benchmark:

#! /bin/sh
NBLOOP=3
COMPRESS_OBJECT=linux-2.4.0
 
memstats()
{
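  # sample the RSS of the given command every second while it is running,
  # and keep the last sample as an approximation of its peak memory usage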
  (
  renice 19 $$ >/dev/null 2>&1
  while : ; do
    ps --no-headers -o rss -C $1 || break
    sleep 1
  done | tail -n 1
  )
}
bench()
{
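  # $1 = tar compression option (e.g. gzip), $2 = archive extension, $3 = level label used in the log lines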
  for i in $(seq 1 $NBLOOP) ; do
    trap "rm -f out.$2" EXIT
    /usr/bin/time -f "DONE: comp $1-$3 ($i) time: %e" tar cf out.$2 $COMPRESS_OBJECT --$1 2>&1 >/dev/null & sleep 1
    mem=$(memstats $1)
    size=$(stat -c '%s' out.$2)
    echo "... mem: $mem size: $size"
    echo
    mkdir tmp_extract_$$ || exit 1
    trap "rm -f out.$2 ; rm -Rf tmp_extract_$$" EXIT
    /usr/bin/time -f "DONE: decomp $1-$3 ($i) time: %e" tar xf out.$2 -C tmp_extract_$$ 2>&1 >/dev/null & sleep 1
    mem=$(memstats $1)
    echo "... mem: $mem"
    echo
    rm -f out.$2
    rm -Rf tmp_extract_$$
    trap - EXIT
  done
}
 
for level in none ; do
  echo "=== COMPRESS ==="
  bench compress Z
done
for level in 1 3 7 8 9 ; do
  echo "=== LZOP -$level ==="
  export LZOP="-$level"
  bench lzop lzo $level
done
for level in 1 2 3 4 5 6 7 8 9 ; do
  echo "=== GZIP -$level ==="
  export GZIP="-$level"
  bench gzip gz $level
done
for level in 1 2 3 4 5 6 7 8 9 ; do
  echo "=== BZIP2 -$level ==="
  export BZIP2="-$level"
  bench bzip2 bz2 $level
done
for level in 0 1 2 3 4 5 6 7 8 9 ; do
  echo "=== LZMA -$level ==="
  export XZ_OPT="-$level"
  bench lzma lzma $level
done
for level in 0 1 2 3 4 5 6 7 8 9 ; do
  echo "=== XZ (LZMA2) -$level ==="
  export XZ_OPT="-$level"
  bench xz xz $level
done

2) Benchmark results

Here are the raw (and somewhat unreadable) results:

ctime: compression time, cmem: memory used during compression
dtime: decompression time, dmem: memory used during decompression

algo      size (MB)  ctime (s)  cmem (KB)  dtime (s)  dmem (KB)
compress      39.56       2.64      1 124       1.60        548
lzop-1        36.17       1.04      1 004       0.63          ?
lzop-3        36.38       1.11        940       0.65          ?
lzop-7        27.07      13.15      1 312       0.70          ?
lzop-8        26.74      27.67      1 308       0.65          ?
lzop-9        26.73       33.3      1 308       0.60          ?
gzip-1        28.72       2.74        708       1.42        486
gzip-2        27.44       2.90        708       1.42        486
gzip-3        26.50       3.22        708       1.40        484
gzip-4        24.77       3.56        708       1.33        486
gzip-5        23.82       4.43        718       1.27        500
gzip-6        23.43       5.78        716       1.29        488
gzip-7        23.33       6.74        700       1.25        488
gzip-8        23.25       9.82        692       1.27        488
gzip-9        23.23       13.2        694       1.25        486
bzip2-1       21.81       17.5      1 554       4.62        898
bzip2-2       20.59       17.6      2 336       4.48      1 288
bzip2-3       20.02       17.8      3 120       4.43      1 700
bzip2-4       19.66       18.5      3 900       4.49      3 900
bzip2-5       19.42       20.0      4 688       4.56      2 468
bzip2-6       19.25       20.6      5 468       4.76      2 878
bzip2-7       19.07       21.9      6 256       5.07      3 250
bzip2-8       18.94       22.5      7 040       5.08      3 644
bzip2-9       18.89       22.6      7 820       5.38      4 040
lzma-0        23.16       10.3      1 980       3.42        840
lzma-1        21.94       13.1      2 000       3.34        824
lzma-2        20.08       13.1      5 476       3.11      1 272
lzma-3        17.24       60.3     13 600       2.44      1 788
lzma-4        16.64       66.8     25 376       2.40      2 814
lzma-5        16.21       69.2     48 926       2.28      4 858
lzma-6        15.62       90.5     96 030       2.21      8 952
lzma-7        15.36       97.6    190 260       2.24     17 146
lzma-8        15.17        106    378 688       2.25     33 536
lzma-9        15.04        113    689 956       2.23     66 304
xz-0          23.16       10.7      2 088       3.63        864
xz-1          21.95       11.5      2 066       3.31        875
xz-2          20.08       13.2      5 556       2.96      1 300
xz-3          17.25       63.0     13 684       2.70      1 830
xz-4          16.64       65.6     25 450       2.60      2 836
xz-5          16.21       70.0     49 012       2.48      4 886
xz-6          15.62       90.5     96 112       2.50      9 000
xz-7          15.36       97.4    190 324       2.40     17 196
xz-8          15.17        110    378 740       2.44     35 556
xz-9          15.05        117    690 060       2.46     66 326


3) Results analysis

The outsiders

The compress algorithm is completely awful: it has the worst compression ratio, and the other algorithms compress better, faster, and with less RAM. There's not much more to say: forget this one.

The lzop algorithm is indeed very fast: it can compress the whole kernel tree in about one second. Level 3 (the default) is really weird: it has a lower compression ratio and a lower compression speed than level 1! So it really has no advantage over level 1. Levels 7, 8 and 9 are totally useless: very slow compression, and still an awful compression ratio. So the only interesting level of lzop seems to be 1. Take it if you need blazing speed at the cost of a terrible compression ratio compared to the other algorithms (you'll also get low RAM usage for no additional cost).

Difference between XZ and LZMA2

Short answer: xz is a format that (currently) only uses the lzma2 compression algorithm.

Long answer: think of xz as a container for the compressed data generated by the lzma2 algorithm. We have the same paradigm for video files, for example: avi/mkv/mov/mp4/ogv are containers, and xvid/x264/theora are compression algorithms. The confusion often arises because currently the xz format only supports the lzma2 algorithm (and it will remain the default, even if other algorithms may be added some day). This confusion doesn't happen with other formats/algorithms; gzip, for example, is both a compression algorithm and a format, and to be exact, the gzip format can only encapsulate data generated by gzip… the compression algorithm. In this article I'll use “xz” to mean “the lzma2 algorithm whose data is encapsulated by the xz format”. You'll probably agree it's way simpler :)
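To make this concrete, the same xz tool can produce both formats; a quick illustration (the file name is just an example):

user$ xz file.tar                  # produces file.tar.xz, the xz container with lzma2 inside
user$ xz --format=lzma file.tar    # produces file.tar.lzma, the legacy lzma format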

Performance of LZMA vs LZMA2 (XZ)

The performance of lzma and xz is extremely close. Lzma2 doesn't outperform lzma (“lzma1”), as one might expect: there is no real difference between lzma and lzma2 in terms of compression ratio, compression/decompression speed, or RAM usage. This is because lzma2 has only a few modifications over lzma1, and most of them don't concern the compression algorithm itself: they just fix some practical issues lzma1 had (according to the xz man page). The “.lzma” format will most likely disappear in the near future in favor of the “.xz” format (which is already widely preferred over “.lzma”). And if you have read the above paragraph, yes, lzma1 was both a compression algorithm and a (messy) format. :)

Results ordered by compression ratio

In the following table, I’ve removed lzma for brevity’s sake (if you read the above paragraph, you know why).

ctime: compression time, cmem: memory used during compression
dtime: decompression time, dmem: memory used during decompression

algo      size (MB)  ctime (s)  cmem (KB)  dtime (s)  dmem (KB)
xz-9          15.05        117    690 060       2.46     66 326
xz-8          15.17        110    378 740       2.44     35 556
xz-7          15.36       97.4    190 324       2.40     17 196
xz-6          15.62       90.5     96 112       2.50      9 000
xz-5          16.21       70.0     49 012       2.48      4 886
xz-4          16.64       65.6     25 450       2.60      2 836
xz-3          17.25       63.0     13 684       2.70      1 830
bzip2-9       18.89       22.6      7 820       5.38      4 040
bzip2-8       18.94       22.5      7 040       5.08      3 644
bzip2-7       19.07       21.9      6 256       5.07      3 250
bzip2-6       19.25       20.6      5 468       4.76      2 878
bzip2-5       19.42       20.0      4 688       4.56      2 468
bzip2-4       19.66       18.5      3 900       4.49      3 900
bzip2-3       20.02       17.8      3 120       4.43      1 700
xz-2          20.08       13.2      5 556       2.96      1 300
bzip2-2       20.59       17.6      2 336       4.48      1 288
bzip2-1       21.81       17.5      1 554       4.62        898
xz-1          21.95       11.5      2 066       3.31        875
xz-0          23.16       10.7      2 088       3.63        864
gzip-9        23.23       13.2        694       1.25        486
gzip-8        23.25       9.82        692       1.27        488
gzip-7        23.33       6.74        700       1.25        488
gzip-6        23.43       5.78        716       1.29        488
gzip-5        23.82       4.43        718       1.27        500
gzip-4        24.77       3.56        708       1.33        486
gzip-3        26.50       3.22        708       1.40        484
lzop-9        26.73       33.3      1 308       0.60          ?
lzop-8        26.74      27.67      1 308       0.65          ?
lzop-7        27.07      13.15      1 312       0.70          ?
gzip-2        27.44       2.90        708       1.42        486
gzip-1        28.72       2.74        708       1.42        486
lzop-1        36.17       1.04      1 004       0.63          ?
lzop-3        36.38       1.11        940       0.65          ?
compress      39.56       2.64      1 124       1.60        548


Some algorithm+level combinations are suboptimal: they have a lower compression ratio and a higher compression time than the row immediately above them. In short, these are combinations you shouldn't use.

Also look out for big gaps between adjacent values in a column: they pinpoint the major magnitude transitions in the results.

Some highlights

As we have already seen, lzop is the fastest algorithm, but if you're looking for pure speed you might rather want to take a look at gzip and its lowest compression levels: it's also pretty fast, and achieves a much better compression ratio than lzop.

The highest level of gzip (9) and the lower levels of bzip2 (1, 2, 3) are outperformed by the lower levels of xz (0, 1, 2).

Level 0 of xz should probably not be used: the man page somewhat discourages it, because its meaning might change in a future version and select a non-lzma2 algorithm in order to achieve a higher compression speed.

The higher levels of xz (3 and above) should only be used if you want the best compression ratio and really don't care about the enormous compression time and the gigantic amount of RAM used. Levels 7 to 9 are particularly insane in this regard, while offering only a ridiculously small compression-ratio gain over the mid levels.

The bzip2 decompression time is particularly bad, whatever level is used. If you care about the decompression time, better avoid bzip2 entirely, and use gzip if you prefer speed or xz if you prefer compression ratio.

Fixing a locale-archive breakage

1) The symptoms of the problem

If you are greeted with the following errors when trying to use perl:

user$ perl -e ''
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

Or when using yum:

user$ yum help >/dev/null
Failed to set locale, defaulting to C

Or using any GTK application:

user$ gedit
(process:24839): Gtk-WARNING **: Locale not supported by C library.
Using the fallback 'C' locale

Or having your scripts failing in strange and unexpected ways:

user$ ...
/etc/profile.d/lang.sh: line 19: warning: setlocale: LC_CTYPE: cannot change locale (fr_FR.UTF-8): No such file or directory
/etc/profile.d/lang.sh: line 20: warning: setlocale: LC_COLLATE: cannot change locale (fr_FR.UTF-8): No such file or directory
/etc/profile.d/lang.sh: line 23: warning: setlocale: LC_MESSAGES: cannot change locale (fr_FR.UTF-8): No such file or directory
/etc/profile.d/lang.sh: line 26: warning: setlocale: LC_NUMERIC: cannot change locale (fr_FR.UTF-8): No such file or directory
/etc/profile.d/lang.sh: line 29: warning: setlocale: LC_TIME: cannot change locale (fr_FR.UTF-8): No such file or directory

Then you have a locale problem.

2) The solution

Here’s the command to list all the locales available on your system:

user$ locale -a
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
POSIX

C and POSIX are the two default locales, always supported even when everything else is broken. Here, obviously, something is wrong: I'm experiencing a locale-archive breakage. The file /usr/lib/locale/locale-archive is used by applications to know which locales are supported on your system. If this database is broken, you'll get lots of warnings in all your programs, and they'll always fall back to the C/POSIX locale (English, plain US-ASCII). To rebuild this database under Fedora, just issue the following command as root:

root# /usr/sbin/build-locale-archive

If you get an error when running the above command, see below. Once it completes, you can check whether it worked by issuing locale -a at the prompt: you should now get a fairly complete list, way more than C and POSIX alone.

3) If it still doesn’t work

If the build-locale-archive command failed:

root# build-locale-archive
build-locale-archive: cannot open locale archive template file "/usr/lib/locale/locale-archive.tmpl": No such file or directory

Then you can try to simply create this file empty; this works on not-too-recent Fedora versions. Then, try to rebuild the database again:

root# touch /usr/lib/locale/locale-archive.tmpl
root# build-locale-archive

If you get another error:

root# build-locale-archive
build-locale-archive: cannot read archive header

Then you’re using a recent Fedora version, and you can no longer rebuild the locale-archive yourself. The packagers decided to truncate the /usr/lib/locale/locale-archive.tmpl file after build (to save disk space). In older versions, the archive could be rebuilt anyway, but this is no longer the case. Don’t panic, you only need to reinstall the glibc-common rpm (this is pretty painless), your locale-archive will be rebuilt in the process:

root# yum reinstall glibc-common

Listening to multi-channel audio with only two speakers in stereo using mplayer

Have you ever noticed, when listening to multi-channel music through only two stereo speakers with mplayer, how low the sound is? The same thing applies to videos using the AC3 format, which often carries 5 channels.
Mplayer, being the Swiss Army knife of video players, gives you the tools to deal with multi-channel audio on your two poor laptop speakers.

Here's the command line I use when I have to deal with 5-channel audio files (or videos):

user$ mplayer file.ac3 -channels 5 -af pan=2:'1:0':'0:1':'0.7:0':'0:0.7':'0.5:0.5'

The quotes are only here for readability. Here's what this command line says:

  • -channels 5 : I want to use the 5 channels of the input file
  • -af pan=2 : I only have 2 speakers and so only need 2 output channels
  • 1:0 : route the entire front-left input channel to the left output channel
  • 0:1 : route the entire front-right input channel to the right output channel
  • 0.7:0 : route the rear-left input channel to the left output channel, lowering its volume a bit
  • 0:0.7 : route the rear-right input channel to the right output channel, lowering its volume a bit
  • 0.5:0.5 : evenly mix the center input channel into the two output channels, 50% each

The resulting left channel will be a mix of:

  • 100% of the input front-left channel
  • 70% of the input rear-left channel
  • 50% of the input center channel

The same goes for the right channel, just replace all “left” occurrences with “right” above 😉
If you want to test this yourself, I recommend using an AC3 test file, available here: 5.1 Surround Test File, thanks to Bjorne Lynne. There are some very nice tunes on this site by the way :)

Compiling the nVidia and VirtualBox modules for all available Fedora kernels

I like to keep some old kernel versions around, just in case.

The problem is that their manually compiled modules (notably the VirtualBox modules and nVidia's blob) are often not kept up to date for these old kernels. So here are the two commands I use to rebuild them manually.

nVidia's blob (as packaged by RPM Fusion) uses the akmod system, so this is pretty straightforward:

root# for D in /lib/modules/* ; do akmods --kernels $(basename $D) ; done

And now for VirtualBox, which is somewhat less nice (I had to dig through its code to find this):

root# for D in /lib/modules/* ; do KERN_DIR=$D/build MODULE_DIR=$D/misc /etc/init.d/vboxdrv setup ; done

Don't pay attention to the VirtualBox script complaining that it cannot modprobe the newly compiled modules; this is totally normal: you can only modprobe modules built for your running kernel, which is not the case here.

collectd problems with grsecurity/PaX

I recompiled collectd a few days ago to get the latest version: 4.9.1.
When I tried to restart it, I was greeted with a nice error message:

root# /etc/init.d/collectd start
Starting statistics collection and monitoring daemon: collectd
lt_dlopen (/opt/collectd/lib/collectd/netlink.so) failed: file not found
Unable to load plugin netlink.
root# ls -l /opt/collectd/lib/collectd/netlink.so
-rwxr-xr-x 1 root root 26K 2010-03-30 23:43 /opt/collectd/lib/collectd/netlink.so

*gasp*, so what’s wrong?

Well, I had to run strace to find out:

root# strace -fq /etc/init.d/collectd start >/var/tmp/strace.log 2>&1

Here’s the interesting part (unnecessary noise removed):

$ ...
[pid 29239] lstat64("/opt/collectd/lib/collectd/netlink.so", {st_mode=S_IFREG|0755, st_size=26082, ...}) = 0
[pid 29239] open("/opt/collectd/lib/collectd/netlink.so", O_RDONLY) = 4
[pid 29239] read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\25\0\0004\0\0\0"..., 512) = 512
[pid 29239] mmap2(NULL, 24836, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x527d4000
[pid 29239] mmap2(0x527d9000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x4) = 0x527d9000
[pid 29239] close(4) = 0
[pid 29239] mprotect(0x527d4000, 20480, PROT_READ|PROT_WRITE) = -1 EACCES (Permission denied)
[pid 29239] munmap(0x527d4000, 24836) = 0
[pid 29239] socket(PF_FILE, 0x80002 /* SOCK_??? */, 0) = 4
[pid 29239] write(2, "lt_dlopen (/opt/collectd/lib/collectd/netlink.so) failed: file not found", 73) = 73

It turns out it fails because of the PROT_WRITE flag passed to mprotect, which my grsecurity/PaX configuration denies. The error message that follows is extremely misleading, though…
netlink is the only plugin (in my configuration at least) that wants to use this flag. The proper way of fixing this would be to dig into the code and see whether the need for this flag can be removed.
But for now I just want a working configuration! So, I just asked PaX to allow this kind of thing for the collectd binary.

root# paxctl -cm /opt/collectd/sbin/collectd
file /opt/collectd/sbin/collectd had a PT_GNU_STACK program header, converted
root# /etc/init.d/collectd start
Starting statistics collection and monitoring daemon: collectd.

Phew, it works.
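If you want to double-check which PaX flags ended up on the binary, paxctl can also display them:

root# paxctl -v /opt/collectd/sbin/collectd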

Samba user shares broken on Fedora 12

If you're using Samba user shares on your system (that is, shares that users can mount without being root), you have probably been greeted with the following message for the last several weeks when trying to mount a share:

user$ mount nas
This mount.cifs program has been built with the ability to run as a setuid root program disabled.
mount.cifs has not been well audited for security holes. Therefore the Samba team does not recommend installing it as a setuid root program.

The Samba team does not recommend installing it as a setuid root program? Wait, no: the Samba team unilaterally decided to prevent you from running mount.cifs and umount.cifs setuid (which is needed for user mounts to work), and there's nothing you can do about it without recompiling.

They probably decided this after CVE-2009-2948. The problem is that on my home network, I need the ability to mount Samba shares without being root, and I don't really care about the above security bug. So, while they audit their code (nobody knows how long that will take), I decided to downgrade my Samba from the updated version (3.4.5 at the time of this writing) to the one found in the stock Fedora 12 install (3.4.2). Here's how to do it:

root# yum downgrade samba-client samba-common samba-winbind samba-winbind-clients

Now, let’s try to mount the share:

user$ mount nas
mount error(1): Operation not permitted
Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

Okay, the binaries are not setuid, let’s do it ourselves:

root# chmod 4755 /sbin/mount.cifs /sbin/umount.cifs

And retry:

user$ mount nas && echo mount: success ; umount nas && echo umount: success
mount: success
umount: success

It works! Now, one last thing: to prevent yum from updating your Samba again, add the following line to your /etc/yum.conf:

exclude=samba-*

Once the Samba guys have audited their code and allow setuid on the CIFS mount utilities again, just remove the exclude line from your /etc/yum.conf and run yum update, as usual.

EDIT: I've looked at the source code of the latest Samba release (3.5.2, released on April 7th), and the ability to use setuid on the CIFS mount utility is still disabled by default. There is a #define in the source code that enables or disables this functionality, so it should be up to the Samba maintainers of each Linux distribution to decide. The 'fix' is pretty simple: just change the following line:

#define CIFS_DISABLE_SETUID_CHECK 0

to:

#define CIFS_DISABLE_SETUID_CHECK 1

in the client/mount.cifs.c source file, and recompile. The above line is preceded by the following comment from the developers:

/*
* mount.cifs has been the subject of many "security" bugs that have arisen
* because of users and distributions installing it as a setuid root program.
* mount.cifs has not been audited for security. Thus, we strongly recommend
* that it not be installed setuid root. To make that abundantly clear,
* mount.cifs now check whether it's running setuid root and exit with an
* error if it is. If you wish to disable this check, then set the following
* #define to 1, but please realize that you do so at your own peril.
*/

This is probably what is scaring our maintainers… I'm no longer that confident the functionality will come back by itself. Will we have to build alternative rpms ourselves, with CIFS_DISABLE_SETUID_CHECK set to 1?

Meanwhile, the issue is spreading: Mandriva cooker is now affected too.

Apache logical OR & AND conditions with SetEnvIf

Unfortunately, Apache's SetEnvIf directive doesn't support logical conditions like OR and AND. More specifically, it is not possible to set a variable only if condition1 AND/OR condition2 is met.

For example, to log all the POST queries made from the loopback interface in a separate log file, you can’t do this:

CustomLog /var/log/apache2/loopback_posts.log combined env=posting_myself
SetEnvIf Remote_Addr "^127\.0\.0\.1$" AND Request_Method "POST" posting_myself

The first line is valid: it asks the server to log all requests to the given file, but only if the environment variable posting_myself is set.
The second line attempts to set the posting_myself variable if two conditions are met (a logical AND), which is not a supported syntax.

As a first attempt to work around this problem, I came up with this:

CustomLog /var/log/apache2/loopback_posts.log combined env=posting_myself
SetEnv loopback_ip 0
SetEnvIf Remote_Addr "^127\.0\.0\.1$" loopback_ip=1
SetEnvIf Request_Method "POST" posting_myself
SetEnvIf loopback_ip 0 !posting_myself

The first line is unchanged.
The second line unconditionally sets the variable loopback_ip to zero (meaning: false).
The third line sets the same variable to one (“true”) if the request indeed comes from the loopback IP.
The fourth line sets the variable posting_myself to true if the request method is POST.
Note that at this stage, the posting_myself variable can be true even if somebody else made the request.
This is taken care of on the last line, where posting_myself is unset if the variable loopback_ip is equal to zero (false).
After the last line, we will have the posting_myself variable set only if the request comes from the loopback IP AND the request is a POST.

Unfortunately, this doesn't work either, for a subtle reason: the SetEnv directive is executed after the SetEnvIf directives, according to the Apache documentation. This means that by the time SetEnv sets loopback_ip to zero, it's way too late.

So, I came up with this other version to emulate a SetEnvIf logical AND:

CustomLog /var/log/apache2/loopback_posts.log combined env=posting_myself
SetEnvIf Remote_Addr "^" loopback_ip=0
SetEnvIf Remote_Addr "^127\.0\.0\.1$" loopback_ip=1
SetEnvIf Request_Method "POST" posting_myself
SetEnvIf loopback_ip 0 !posting_myself

This is similar to the previous attempt; only the second line changes: I have to use SetEnvIf to artificially set loopback_ip to zero, unconditionally. To do this, I use a regex that will always match: "^". Note that I used Remote_Addr, but I could have used Request_Protocol, Request_URI or anything else: the only important thing is that it always matches and sets loopback_ip to zero.

Now, to emulate a SetEnvIf logical OR, this is way easier:

SetEnvIf Remote_Addr "^192\.168\.0\." my_networks
SetEnvIf Remote_Addr "^127\.0\.0\.1$" my_networks

Here, the variable my_networks will be set if the remote address starts with “192.168.0.” or is “127.0.0.1”. It’s that simple.
Well, I could have used a smarter regex to do this in one line, but it would have ruined my logical OR example! :)
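And just like in the AND example above, the resulting variable can then be handed to any env=-aware directive, for instance to log those requests in their own file (the log file name is just an example):

CustomLog /var/log/apache2/my_networks.log combined env=my_networks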

Yum plugin to show installed package versions when upgrading

There is a feature that apt-get has and yum misses: showing the installed package versions when asking for confirmation before upgrading. Yum lists the version numbers of the packages about to be installed, but not the versions already installed.

For example, yum might tell you that you're about to upgrade to apache v2.2.14-2, but from which version? If you're upgrading from v2.2.14-1, you should have nothing to worry about. If you're upgrading from v2.2.13-4, maybe you'll want to have a look at the changelog to check which features were added. If you're upgrading from v1.3 (!), well, you might run into a couple of problems when trying to restart it afterwards :)

Well, this is what this plugin is made for. Here's what it looks like:

root# yum upgrade
Loaded plugins: changelog, dellsysidplugin2, fastestmirror, presto, priorities, refresh-packagekit, show-upgrade-versions
[...]
Setting up Upgrade Process
Resolving Dependencies
--> Running transaction check
[...]
--> Finished Dependency Resolution
The following packages are release updates:
bind-libs (9.6.1) 13.P2.fc12 => 15.P3.fc12
bind-utils (9.6.1) 13.P2.fc12 => 15.P3.fc12
dhclient (4.1.0p1) 13.fc12 => 17.fc12
dhcp (4.1.0p1) 13.fc12 => 17.fc12
fluidsynth-libs (1.0.9) 4.fc12 => 5.fc12
policycoreutils (2.0.78) 7.fc12 => 10.fc12
policycoreutils-gui (2.0.78) 7.fc12 => 10.fc12
policycoreutils-python (2.0.78) 7.fc12 => 10.fc12
The following packages are version updates:
genisoimage 1.1.9 => 1.1.10
gocr 0.46 => 0.48
homebank 4.0.4 => 4.1
icedax 1.1.9 => 1.1.10
perf 2.6.31.9 => 2.6.31.12
wodim 1.1.9 => 1.1.10
Dependencies Resolved
[...]
Total download size: 33 M
Is this ok [y/N]:

I've made an rpm package for this plugin; if you want to try it, you can grab it here:
download yum-plugin-show-upgrade-versions-1.00-1.fc12.noarch.rpm.
It’s tagged for Fedora 12, but it should work on older versions I suppose…