Always test your HD first

I just received a WD 500GB Blue to replace the hard drive in my wife's computer. First thing after unpacking: test it. And I was right to do so... it has bad blocks.

WD 500GB DOA

The story: I tend to replace hard drives proactively, to avoid losing too much data or being in a hurry when the current one fails. There are two ways to monitor a hard drive:

install smartmontools, the most proactive tool for detecting a hard drive that is about to fail
just monitor your logs using logcheck and logcheck-database

Logcheck is pretty straightforward to install. It scans your logs every hour (with the default Debian cron job) and sends you a report of what is happening. It is not a very precise tool, and you have to tune it a little so that it only sends you what is relevant (install extra rules to ignore what you know is not of interest). Whenever you start to see log entries like these:

Jan 24 18:16:08 foo kernel: [ 1965.343980] ata5.00: exception Emask 0x50 SAct 0x39 SErr 0x800 action 0x6 frozen
Jan 24 18:16:08 foo kernel: [ 1965.343991] ata5.00: irq_stat 0x08000000, interface fatal error
Jan 24 18:16:08 foo kernel: [ 1965.344001] ata5: SError: { HostInt }
Jan 24 18:16:08 foo kernel: [ 1965.344036] ata5.00: failed command: READ FPDMA QUEUED
Jan 24 18:16:08 foo kernel: [ 1965.344055] ata5.00: cmd 60/08:00:18:31:44/00:00:1a:00:00/40 tag 0 ncq 4096 in
Jan 24 18:16:08 foo kernel: [ 1965.344059]          res 40/00:2c:e6:c2:fd/00:00:26:00:00/40 Emask 0x50 (ATA bus error)
Jan 24 18:16:08 foo kernel: [ 1965.344071] ata5.00: status: { DRDY }
Jan 24 18:16:08 foo kernel: [ 1965.344081] ata5.00: failed command: WRITE FPDMA QUEUED

It is a good time to think about changing your hard drive, but it may already be too late.
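
If you do not want to wait for the logcheck mail, a rough way to spot such entries yourself is to count them in the log. This is only a sketch: the function name count_ata_errors is made up for illustration, and the pattern assumes kernel messages formatted like the ones above.

```shell
# Count libata error lines ("exception" or "failed command") read from stdin.
# The pattern is an assumption based on the log excerpt above; other kernels
# or drivers may word these messages differently.
count_ata_errors() {
    grep -Ec 'kernel: .*ata[0-9]+(\.[0-9]+)?: (exception|failed command)'
}

# Usage (the log path depends on your distribution):
#   count_ata_errors < /var/log/syslog
```

A count that keeps growing from one day to the next is the signal to look at, more than any single entry.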

Smartmontools (aka smartd) is a dedicated tool for monitoring hard drives, and it does a good job. I think it is not installed by default in Debian, but it should be. It checks your hard drives for SMART capabilities and monitors their health using the drives' built-in self-monitoring. In the case of bad blocks, you will start to see entries like these:

Feb 17 14:05:17 bar smartd[1268]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Feb 17 14:35:17 bar smartd[1268]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

Obviously, when you see these log entries, it is also a good time to change your hard drive.
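
To follow how the count evolves, you can pull the device name and the number of pending sectors out of these smartd lines. The helper below is a sketch, not part of smartmontools: the name pending_sectors is hypothetical and the pattern assumes the exact message format shown above.

```shell
# Print "<device> <count>" for every smartd "pending sectors" line on stdin.
# Assumes the message format shown in the log excerpt above.
pending_sectors() {
    sed -n 's/.*smartd\[[0-9]*\]: Device: \([^ ]*\) .* \([0-9][0-9]*\) Currently unreadable (pending) sectors.*/\1 \2/p'
}

# Usage (the log path depends on your distribution):
#   pending_sectors < /var/log/syslog
```

For the current value straight from the drive, smartctl -A /dev/sda lists the SMART attributes, including Current_Pending_Sector.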

In the case of my wife's HD, I only got data from logcheck. It means that the error is not that serious (a transient failure: something is wrong, but the HD can cope with it). But I still decided to get a new one for my wife.

Whenever I receive a new drive, the first thing I do is check it for errors. You can do that using the program badblocks in write mode, which overwrites everything on the drive. It takes ages (count up to 1 day for 1TB over USB), but at the end you know that you have a good candidate, one worth installing your data on.

You just have to follow this procedure:

1. dmesg | grep sd
2. try to find which drive is the one you want to test in the output of step 1
3. cfdisk /dev/sdX, sdX being the drive you want to test
4. check that what you see in cfdisk is what you expect to test: right name and capacity for the drive
5. sudo badblocks -wvs /dev/sdX
6. run sudo tail -f /var/log/syslog in parallel just in case
7. wait
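
Since badblocks -w destroys whatever is on the target, the sanity check in the steps above can be backed by a small guard before launching the test. This is a minimal sketch: check_target is a name I made up, and it assumes a Linux system where mounted filesystems are listed in /proc/mounts.

```shell
# Refuse to continue unless the target is a block device that is not mounted.
# Hypothetical helper; assumes Linux (/proc/mounts).
check_target() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device" >&2
        return 1
    fi
    # Matches the drive itself and its partitions (/dev/sdX, /dev/sdX1, ...).
    if grep -q "^$dev" /proc/mounts; then
        echo "$dev (or one of its partitions) is mounted" >&2
        return 1
    fi
    return 0
}

# Usage (destructive: -w overwrites the whole drive):
#   check_target /dev/sdX && sudo badblocks -wvs /dev/sdX
```

It does not replace looking at the name and capacity in cfdisk; it only catches the most obvious mistake, pointing badblocks at a drive that is in use.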

If errors appear in /var/log/syslog, you know something bad is happening. Even if there is only a single failing block, don't think it is OK. It is NOT OK for a HD to start its life with failing blocks. In this case, repack the drive and send it back for replacement ASAP.

In the case of my hard drive, here is the smartd mail:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 15 Currently unreadable (pending) sectors

syslog entries:

Mar  2 20:52:57 foo kernel: [ 8317.419715] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Mar  2 20:52:57 foo kernel: [ 8317.419724] ata4.00: irq_stat 0x40000008
Mar  2 20:52:57 foo kernel: [ 8317.419732] ata4.00: failed command: READ FPDMA QUEUED
Mar  2 20:52:57 foo kernel: [ 8317.419747] ata4.00: cmd 60/80:00:00:af:08/00:00:28:00:00/40 tag 0 ncq 65536 in
Mar  2 20:52:57 foo kernel: [ 8317.419750]          res 41/40:00:65:af:08/00:00:28:00:00/40 Emask 0x409 (media error) <F>
Mar  2 20:52:57 foo kernel: [ 8317.419758] ata4.00: status: { DRDY ERR }
Mar  2 20:52:57 foo kernel: [ 8317.419764] ata4.00: error: { UNC }
Mar  2 20:52:57 foo kernel: [ 8317.423959] ata4.00: configured for UDMA/133
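
If you are curious where on the disk the failure is, the "res" line carries the address of the failing sector. As far as I understand the libata format, the fields are status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and the 48-bit LBA is built from the six lba bytes. The helper below is a sketch under that assumption, and decode_res_lba is a made-up name.

```shell
# Decode the 48-bit LBA from the taskfile string of a libata "res" line,
# e.g. 41/40:00:65:af:08/00:00:28:00:00/40
# Assumed layout: status/err:nsect:lbal:lbam:lbah/hobfeat:hobnsect:hoblbal:hoblbam:hoblbah/dev
decode_res_lba() {
    tf=$1
    low=${tf#*/}        # drop the status byte
    high=${low#*/}      # second group onwards
    low=${low%%/*}      # err:nsect:lbal:lbam:lbah
    high=${high%%/*}    # hobfeat:hobnsect:hoblbal:hoblbam:hoblbah
    old_ifs=$IFS
    IFS=:
    # Split both groups into positional parameters $1..$10.
    set -- $low $high
    IFS=$old_ifs
    # Most significant byte first: hob_lbah hob_lbam hob_lbal lbah lbam lbal
    printf '%d\n' "$((0x${10}$9$8$5$4$3))"
}

# Usage:
#   decode_res_lba 41/40:00:65:af:08/00:00:28:00:00/40
```

For the log above this gives sector 671657829, which falls inside the 128-sector read starting at 0x2808af00 reported on the "cmd" line.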

I am not blaming any particular brand (like Western Digital): every computer part I have ever bought went through the same procedure, and it is a known fact that computer parts have a non-zero chance of being DOA (dead on arrival) or of failing after a few weeks. But as a consumer you should be aware of that and take action to avoid spending 10h configuring your computer only to see it fail after a week... The time spent testing is a win in the long term.

Comments

1. On Sunday, March 3 2013, 09:48 by Marius Gedminas

What's your opinion on SMART self-tests (smartctl -t long)? Why use badblocks instead?

2. On Sunday, March 3 2013, 11:46 by rjc

Hi,

If it takes a while to test the hard drive I'd use "-F" instead of "-f" with tail.
Personally I got into the habit of using it every time, unless there's a reason not to :^)

Regards,

Raf

3. On Sunday, March 3 2013, 21:06 by gildor

+Marius Gedminas:

''smartctl -t long'' tests are nice to run. But they are a little harder to set up. Let's say they are the second level when configuring smartmontools.

I will however still run ''badblocks''; it is a more direct diagnostic tool: failing blocks or no failing blocks. And contrary to all SMART analysis, the result is easy to read (you need at least one or two Internet searches to understand the result of a SMART test).

4. On Friday, March 8 2013, 13:05 by Rob

Losing is spelled l-o-s-i-n-g.

5. On Friday, March 8 2013, 15:42 by Sidicas

On modern hard disks, it doesn't matter if they have bad sectors out of the box or not. What matters is whether or not the number of bad sectors is increasing. You could have gotten a drive that was slightly damaged in transit, but the hard disk firmware will automatically detect it and relocate those bad sectors. So it's generally not a problem. I've bought hard drives with bad sectors out of the box and the drive lasted 8 years. Bad sectors do not mean a drive is defective or will fail anytime soon. Only if the bad sector count continues to increase with use does it indicate a failing drive.

6. On Friday, March 8 2013, 17:51 by Shnatsel

GNOME Disk Utility (aka palimpsest) also monitors hard disk state and notifies you if it's about to fail. Many distros ship it by default. It's handy.

7. On Friday, March 8 2013, 19:32 by gildor

+Sidicas, I agree that it is not THAT important and that the drive should "cope" with it for a while. But if the drive is still under warranty and you don't badly need it ASAP, you should return it to the manufacturer. This kind of damage is covered by the warranty and you should take advantage of it. When the drive is no longer under warranty and some bad sectors appear, it is OK to stay in this situation for a couple of months (which is what is happening to me right now).

As you said, the sectors could have been damaged during transit, but what if that is not the case? What if the drive is really defective? Will you wait 1 month to see the increase and lose the 14-day return warranty?

8. On Saturday, March 23 2013, 17:42 by andrew

So far I have never received a broken hard disk when buying new, but shit happens anyway.
I have been using smartctl before, too. And recently I found that it probably does not work when the drive is connected to your computer via a USB adapter. Again, this is just guesswork; it is also possible that the old Fujitsu IDE notebook drive simply does not support SMART at all. I have no way of re-checking this right now.

9. On Saturday, March 23 2013, 21:05 by gildor

On this HD + the USB adapter I use, SMART didn't work for this error. So I suppose SMART doesn't always work over USB (maybe on a high-end USB adapter, but I think you should not rely on a standard USB adapter for that).

And good news: I got the replacement yesterday and was able to fully check it, no errors.

Blog of Sylvain Le Gall
