How OCZ Vertex 2 and OCZ Petrol SSD successfully killed my data. Twice.

Short story of my Vertex 2’s short life

 

Several months ago, I wanted to give a little speed boost to my aging laptop. It also seemed like the perfect time to give a first try to the SSDs everybody had been talking about. All in all, the OCZ Vertex 2 seemed like a good choice: the Vertex 3 had just come out, so the price of the Vertex 2 was going down, making it almost affordable.

[Image: OCZ Vertex 2]

On that fateful day of 25th September 2011, I bought a Vertex 2 online and installed it in my laptop. The change was so noticeable! It was really a great buy… except that just 4 months later, when powering on my laptop, I found it hanging at the BIOS hard drive detection step, only to time out after 30 seconds, blatantly stating that no hard drive was found. Nothing I could do would bring the drive back, not even trying it in another computer. After some research online, I found out that the exact same thing had happened to countless people around the globe.

My drive had just entered panic lock mode, a deadly and overly stupid undocumented feature of the included SandForce controller, which meant that all the data stored on it was irretrievably lost, and the disk itself was dead for good. Just like that. For no apparent reason.

So, I got it refunded and bought a replacement, an Intel 320 “PostVille refresh”

[Image: Intel 320 “PostVille refresh”]

…which is still alive today and works like a charm. And I swore I would never buy OCZ again.

 

Never forgive, never forget

 

It should have been my motto from that day on, but it wasn’t. In September 2012, I decided to replace my old laptop with a full-featured PC. I bought an OEM pre-built one, as it had all the parts I wanted and was cheaper than buying the parts separately. It had a hybrid configuration for mass storage: a 1 TB HDD and an SSD. Unfortunately, the SSD was an OCZ. Well, the PC was cheap enough and I had almost forgiven OCZ for their terrible Vertex 2. This one was an OCZ Petrol, and it was not powered by a SandForce controller, so I thought I was safe from the stupid panic mode.

[Image: OCZ Petrol]

Well, I was right… but unfortunately what I got was way worse than a panic lock, and way more insidious, as I only found out what the drive was doing months after it had started destroying my data.
This is the nightmare of any sysadmin (or anybody who cares about their data): silent data corruption.

It started with randomly crashing programs. The first one was Chromium: one day it just started crashing every time I tried to launch it. I thought it had something to do with Flash or some other broken plugin, but I didn’t even bother to look into it; I just shrugged and switched to Firefox. Some time later Firefox started to crash too, so I switched to Midori, the XFCE browser, telling myself I would have to look into that later. Yes, I can be extremely lazy at times.

I eventually came to the conclusion that Flash had nothing to do with it when I started to get freezes and floods of unreadable-sector messages in the kernel log. Hell, the drive was (again) just 4 months old! And all the SMART indicators, as seen by smartctl, were green! Bah, yet another cursed drive from OCZ! I booted a live USB system to run ddrescue and get as much data back as possible before the drive suffered an eventual instant death, as I knew anything was possible with an OCZ drive.


 

Welcome to data rescue hell

 

I had already used ddrescue and friends numerous times on mechanical hard drives (yes, I’m unlucky with hard drives), and it had always worked pretty well. This was the first time I used it on an SSD, however… and it was a nightmare that took me three days.

I was getting read speeds of exactly 150 KB/s over very large areas of the drive, which is pretty frightening for an SSD. For the record, OCZ rates this drive at 350 MB/s sequential reads. Let me spare you the maths: that is about 2390 times slower than the nominal speed. But even that would have been too easy: when OCZ wants to eat your data, it won’t let you get it back so easily. The main problem was not even the speed: once the drive hit a problematic sector, it would freeze the SATA bus for 2 whole minutes, then a kernel timeout fired and tried to reset the SATA bus to bring it back into a working state. But the reset never worked and always timed out too, and that’s when the /dev/sda block device would disappear, each and every time. ddrescue would then happily keep trying to read from the now non-existent device, and conclude that none of the remaining data was readable.

I was about to patch ddrescue to handle this situation more gracefully, but in the end I just used the option that tells it to stop after the first error, because at that point the SSD had become unreachable from the OS anyway. I could do nothing to make it reappear in a scriptable way, not even triggering a SATA bus reset or rescan… the only thing that worked was to electrically unplug the drive and plug it back in, then instruct the kernel to rescan the SATA bus for new devices. Pretty brutal, but only that or a reboot would work. It meant I had to be there, in front of the computer, imitating a robot that pulls and pushes a plug. Really annoying. Anyway, after three days of plugging/unplugging between rare successful reads at 150 KB/s, ddrescue finally recovered 100% of my data. Supposedly, at least: strangely, some sectors that were unreadable one day ended up being readable the day after.
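
For the curious, each manual round looked roughly like this. It is a sketch from memory, not the exact commands I typed back then, and the image/mapfile names are only examples. After plugging the drive back in, the kernel can be told to rescan the SATA/SCSI hosts through sysfs, and ddrescue resumes from its mapfile; the -e option is what makes it give up at the first bad area (its exact name and semantics changed between ddrescue versions, from --max-errors to --max-bad-areas, so check your manual):

# as root: after physically re-plugging the drive, rescan all SATA/SCSI hosts
for host in /sys/class/scsi_host/host*; do
    echo "- - -" > "$host/scan"
done
# resume the rescue from its mapfile, stopping at the first bad area
ddrescue -d -e 1 /dev/sda petrol.img petrol.map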

The rescued SSD image passed the filesystem check almost cleanly, which was reassuring. After copying it to another drive and booting from it, I still had the Chromium and Firefox crashes, however. So I instructed rpm to verify the checksums of all the files belonging to installed packages. It found a lot of mismatches, and Chromium and Firefox were of course among them. The fix was easy: reinstalling the affected packages. Of course, it worked.
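
For reference, that verification step looks roughly like this (a sketch; the reinstall command itself depends on your distribution’s package manager, so I leave it out). rpm -Va re-checksums every file owned by an installed package, and a ‘5’ in the third column of its output means the on-disk digest no longer matches the one recorded in the package database:

# list the files whose digest no longer matches the rpm database
rpm -Va | grep '^..5' | awk '{print $NF}' > corrupted-system-files.txt
# map them back to the packages that own them, ready to be reinstalled
xargs -a corrupted-system-files.txt rpm -qf --qf '%{NAME}\n' | sort -u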

 

Silent corruption?

 

But what about my data? Not the stuff in /usr or /var, but the stuff in $HOME? Why wouldn’t it be corrupted too? How can you trust your backups when a drive has been silently corrupting your data for months? Well, you can’t.

As my backups are incremental and rsync-based (actually, rsnapshot-based), the decision to re-upload a modified file is made by looking at its size and modification date. If these match on both ends, the file is assumed unmodified, so there is no need to back it up again. The problem with silent corruption, you see, is that, well, it’s silent. Which means that when some bytes are modified by a faulty hard drive, neither the size nor the modification date changes. The different cases that can happen to a given file are:

  1. The file was backed up cleanly before I switched to the faulty drive, later corrupted by it, but never modified by me or a program before I did the data recovery. In that case, the copy in the backup is safe.
  2. The file was backed up cleanly before I switched to the faulty drive, later corrupted by it, then modified by me or a program and hence backed up again, with these cycles of legitimate modification, corruption and new incremental backup possibly repeated several times. Here, by comparing two adjacent backups, there is no way to tell which parts of the changes are legitimate and which come from the corruption.
  3. The file was created after I switched to the faulty drive, and may have been corrupted before its first backup, so there is no way to tell whether the first or even any of its later backups is safe.

In short, it’s a pretty lame situation. The last two cases are more or less hopeless, except for files where the corruption is obvious, like images for example. For the first case, there is still hope to get the clean data back from the backups. But how to detect that case?
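
To see how invisible this is to a size-and-date comparison, here is a tiny experiment anyone can reproduce (the file name is obviously made up): flip a single byte in the middle of a file, then put the original modification time back, which is effectively what a silently corrupting drive does. A dry-run rsync between the two copies then reports nothing to transfer, while the same run with --checksum flags the file immediately; that is exactly the trick used below.

cp important.odt important.odt.orig
# overwrite one byte in place, without changing the file size
printf 'X' | dd of=important.odt bs=1 seek=1000 count=1 conv=notrunc
# restore the original modification time
touch -r important.odt.orig important.odt
ls -l important.odt important.odt.orig    # same size, same date
cmp important.odt important.odt.orig      # ...yet the contents differ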

New generations of filesystems like ZFS or Btrfs are supposed to mitigate this by checksumming everything, from data to metadata. Silent corruption can then no longer go undetected, and it can even be fixed, provided you have enough replicas of your data, by a process called scrubbing. Anyway, in my case it was plain old ext4, so no luck.
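
For completeness, this is what scrubbing looks like in practice; the pool and mount point names below are placeholders, not anything from my setup:

# ZFS: re-read every block, verify its checksum, repair from redundancy if possible
zpool scrub tank
zpool status -v tank
# Btrfs: same idea
btrfs scrub start /mnt/data
btrfs scrub status /mnt/data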

 

Tentatively detect and fix silent corruption from incremental backups

 

I started rsyncd on my NAS (which hosts my backups) with the following configuration:

root@nas:~# cat /etc/rsyncd.conf
[backup]
comment = nas backup
path = /raid1/hdda/rsnapshot-quasar/monthly.4/speed/home/speed/
use chroot = yes
lock file = /var/lock/rsyncd
read only = yes
list = yes
uid = root
gid = nogroup
strict modes = yes
ignore errors = no
ignore nonreadable = yes
transfer logging = no
timeout = 600

I used the incremental backup from 4 months earlier, which I know is not corrupted because it predates the faulty drive. The goal was to save the files and personal documents that have been sitting on my hard drive for a long time and don’t get modified much. The uid is set to root so that rsyncd is guaranteed to be able to read all the backup files (which can have several different uids that only make sense on my original computer anyway).

root@nas:~# rsync --no-detach --daemon --verbose --config /etc/rsyncd.conf
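
As a quick sanity check (just a habit of mine, nothing required), asking the daemon for its module list from the client side should print something like the module name and comment defined above:

speed@quasar:~$ rsync rsync://nas/
backup          nas backup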

Then, on the PC with the rescued drive image, I launched the following command so that rsync lists the files it thinks differ between the backup source and the rescued data, that is, files with different sizes and/or modification times. Those files were genuinely modified between 4 months ago and now. Maybe they were corrupted too along the way, but there’s no way to tell: these are essentially the files from the second case in the list above.

speed@quasar:~$ rsync -va --dry-run rsync://nas/backup/ . | tee rsync_list_1.txt

After the first list is complete, populate a second one, but with a crucial difference: with --checksum, rsync no longer assumes that files with matching size and modification date are identical, it compares their full content (via checksums) instead. This second list should contain all the files from the first one, plus the files that were silently corrupted (where the content comparison fails).

speed@quasar:~$ rsync -va --dry-run --checksum rsync://nas/backup/ . | tee rsync_list_2.txt

Running a simple diff between the two lists then shows that a non-negligible number of my $HOME files are indeed corrupted:

speed@quasar:~$ diff -u rsync_list_{1,2}.txt | grep ^+ | wc -l
728
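
From that diff, stripping the ‘+’ prefixes and rsync’s own summary lines gives a plain list of the silently corrupted files. Something along these lines (my after-the-fact reconstruction; the exact filtering depends on rsync’s verbose output) produces the RSYNC.SILENT.CORRUPTION.LIST file used below:

speed@quasar:~$ diff -u rsync_list_{1,2}.txt \
    | sed -n 's/^+//p' \
    | grep -v -e '^+' -e '^sent ' -e '^total size' -e '/$' \
    > RSYNC.SILENT.CORRUPTION.LIST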

A visual example of a diff between the backup and the current version of a text file is self-explanatory:

[Image: diff between the backed-up and the corrupted version of a text file]

Thank you, OCZ.

Before restoring those files from my backup, I wanted to measure how many bytes were corrupted by the drive.

speed@quasar:~$ perl -ne 'm{(.+)} or next; @out=qx{cmp -l "/home/speed/mnt/sshfs/raid1/hdda/rsnapshot-quasar/monthly.4/speed/home/speed/$1" "$1"}; $corrupted+=scalar(@out); print $corrupted."\n"' RSYNC.SILENT.CORRUPTION.LIST | tail -n1
60904419
speed@quasar:~$ perl -ne 'm{(.+)} or next; $total+=(stat($1))[7]; print $total."\n"' RSYNC.SILENT.CORRUPTION.LIST | tail -n1
1951053630
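
In case those one-liners look too cryptic, here is a more readable rewrite of the same logic (mine, not what I actually ran): cmp -l prints one line per differing byte, so counting its output lines counts the corrupted bytes, and stat gives each file’s size for the total.

BACKUP=/home/speed/mnt/sshfs/raid1/hdda/rsnapshot-quasar/monthly.4/speed/home/speed
corrupted=0
total=0
while IFS= read -r file; do
    # one "cmp -l" output line per byte that differs between backup and rescued copy
    corrupted=$(( corrupted + $(cmp -l "$BACKUP/$file" "$file" | wc -l) ))
    # add up the sizes of the corrupted files for the total
    total=$(( total + $(stat -c %s "$file") ))
done < RSYNC.SILENT.CORRUPTION.LIST
echo "$corrupted corrupted bytes out of $total"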

So, this wonderful drive silently corrupted roughly 58 MiB of data in my backed-up files (the fucked-up files total 1860 MiB). It corrupted other files too (system ones, fixed by reinstalling packages), plus all the files I couldn’t detect because they had legitimate modifications on top of the corruption. As I couldn’t measure those, let’s do OCZ a favor and not even count them. The disk has a capacity of 128 GB. It was far from full, but let’s pretend it was, as yet another favor to OCZ. Even then, it corrupted at least 1 bit out of every 2101 bits.

For comparison, classical consumer-grade hard drives (Seagate, Western Digital, …) are usually rated in their data sheets at one unrecoverable bit error per 10^14 bits read or written. That is 1 bit corrupted every 100 000 000 000 000 bits.
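
Before stating the obvious conclusion, here is the back-of-the-envelope arithmetic (the factor of 8 between bytes and bits cancels out, so the byte counts above can be used directly):

speed@quasar:~$ perl -e '$r = int(128e9 / 60904419); printf("1 corrupted bit every %d bits, %.0f times worse than 1 in 10^14\n", $r, 1e14 / $r)'
1 corrupted bit every 2101 bits, 47596382675 times worse than 1 in 10^14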

From there, the conclusion is pretty straightforward and backed by the numbers above: an OCZ drive is roughly 47 596 382 675 times more likely to silently corrupt your data than any other drive, if you want to put it that way.
