Hard drives die in spring

Well, it happened again - the hard drive in my good old RedHat machine got corrupted. Strangely enough, previous failures also occurred around this time of year in the past few years. One of those failures was so bad that I had to buy hard drive recovery software to salvage my CVS repository. This time there seem to be 220+ bad blocks (just under 1MB, at 4K/block). Most of the content is still accessible, but I don't yet know the extent of the damage.
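For the curious, here is the back-of-the-envelope arithmetic behind that estimate. It is only a sketch: it takes the 220 bad blocks and the 4K block size quoted above at face value, and the names are mine.

```python
# Rough size of the damaged area, assuming 220 bad blocks of 4 KiB each
# (the "220+" count means the real figure is somewhat higher).
BLOCK_SIZE = 4 * 1024   # bytes per filesystem block
BAD_BLOCKS = 220        # lower bound on the number of bad blocks


def damaged_bytes(bad_blocks, block_size=BLOCK_SIZE):
    """Return the number of bytes covered by the given count of bad blocks."""
    return bad_blocks * block_size


total = damaged_bytes(BAD_BLOCKS)
print(f"{total} bytes = {total / 1024:.0f} KiB = {total / 1024**2:.2f} MiB")
# prints: 901120 bytes = 880 KiB = 0.86 MiB
```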

I have to say that I miss Windows' chkdsk, which reports not only bad blocks, but also the names of the files or directories affected by them. e2fsck, on the other hand, comes up with pretty cryptic messages, such as "Attempt to read block from filesystem resulted in short read" or "...Force rewrite(y)?". I'm also being more careful this time and so far haven't allowed e2fsck to auto-fix the partition - the last time I did this, it cost me the entire hard drive.

To add insult to injury, it turned out that my Vantec NAS unit cannot handle files larger than 4GB when accessed over the network, and the last full backup was just above 4GB, so I don't have a full backup of the damaged drive either.

So, I will spend a bit more time trying to recover the existing content, but if it turns out to be unrecoverable, I will probably just not ship Fedora Core binaries in the next release, as I don't have a spare machine for FC builds and, with two donations per year, I'm not in a position to buy one.

April 10th, 2008

Last night I ran dump a few times, adding bad inode numbers to the exclusion list after every run, and eventually was able to back up almost the whole drive. Once this was done, I ran e2fsck with -y. Half an hour later I had a usable hard drive with a few holes here and there. e2fsck's messages may be cryptic, but it did the job.
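The retry loop I went through by hand looks roughly like the sketch below. It is not a script I actually ran: run_backup is a hypothetical stand-in for one dump run that reports the inodes it choked on, and the inode numbers are made up. On my system the exclusion list is passed to dump as a comma-separated list of inode numbers (the -e option in the dump man page, as far as I can tell), but that wiring is left out here.

```python
from typing import AbstractSet, Callable, Iterable, Set


def backup_with_exclusions(
    run_backup: Callable[[AbstractSet[int]], Iterable[int]],
    max_runs: int = 20,
) -> Set[int]:
    """Re-run a backup, excluding newly reported bad inodes each time.

    run_backup is a hypothetical callable: it receives the current
    exclusion set and returns the inode numbers that failed on that run
    (an empty result means the run completed cleanly).
    """
    excluded: Set[int] = set()
    for _ in range(max_runs):
        bad = set(run_backup(frozenset(excluded))) - excluded
        if not bad:
            return excluded          # clean run: backup is complete
        excluded |= bad              # grow the exclusion list and retry
    raise RuntimeError("backup still failing after %d runs" % max_runs)


# Stand-in for dump: fails on the first bad inode that is not yet excluded.
BAD_INODES = [1201, 1377, 4410]      # made-up inode numbers


def fake_backup(excluded):
    for inode in BAD_INODES:
        if inode not in excluded:
            return [inode]
    return []


print(sorted(backup_with_exclusions(fake_backup)))
```

Each run only discovers the damage it trips over first, which is why it took several passes before the backup went through.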

May 3rd, 2008

I finally figured out what the problem was. One day I noticed that the CD drive was performing erratically, sometimes failing to read perfectly clean CDs. Suspicious of this, I replaced the drive with a spare I had and, voilà, the IDE interface stopped throwing the sporadic SDA errors I had been seeing once every few days.

Comments:
Posted Wed, 19 Nov 2008 13:27:30 GMT by Andre

I'm usually conservative when it comes to file systems. It is also easier to find recovery software for mainstream ones, like ext2.

Interesting technology, though. I had never heard about ZFS before and did some reading this morning. I have a couple of spare 30G hard drives, so will probably give it a try some time. Thanks for the suggestion.

Posted Wed, 19 Nov 2008 07:10:40 GMT by Seth

May I suggest ZFS (zfs-fuse) for you? It is guaranteed (at least within strong hash probabilities!) to always detect any silent disk corruption, to recover data automatically depending on redundancy (self-healing), and to report exactly which files have been affected.

The main difference from CHKDSK is that with ZFS you *do* know the extent of the damage.

It's what I use. Performance sucks though, so you might keep your long-term backup on ext2 offsite, snapshots in ZFS and run your production stuff on something fast (XFS, JFS?)
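
The checksum idea behind that kind of detection can be illustrated in a few lines. This is only a toy sketch and not how ZFS is implemented: the block size, the choice of SHA-256 and the function names are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


def checksum_blocks(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 hex digest per block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return indexes of blocks whose checksum no longer matches."""
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Four blocks of sample data, with checksums recorded while it was good.
original = bytes(range(256)) * 64
stored = checksum_blocks(original)

# Flip one byte inside the second block to simulate silent corruption.
damaged = bytearray(original)
damaged[5000] ^= 0xFF

print(find_corrupt_blocks(bytes(damaged), stored))  # prints: [1]
```

Knowing which block indexes fail is what makes it possible to map the damage back to specific files, or to repair it from a redundant copy.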