I just received a WD 500GB Blue, to replace the hard drive of my wife's computer. First thing after unpacking: start to test. And I was right to do it... there are bad blocks on it.

WD 500GB DOA

The story: I tend to do proactive replacement of hard drive, to avoid losing too much data or be in hurry when the current one fails. There are two ways to monitor a hard drive:

Logcheck is pretty straightforward to install. It scans your log every two hours and send you a report of what is happening. It is not a very precise tool and you have to tune it a little bit to just send you what is relevant (install extra rules to ignore what you know is not of interest). Whenever you start to see log entry like that:

Jan 24 18:16:08 foo kernel: [ 1965.343980] ata5.00: exception Emask 0x50 SAct 0x39 SErr 0x800 action 0x6 frozen
Jan 24 18:16:08 foo kernel: [ 1965.343991] ata5.00: irq_stat 0x08000000, interface fatal error
Jan 24 18:16:08 foo kernel: [ 1965.344001] ata5: SError: { HostInt }
Jan 24 18:16:08 foo kernel: [ 1965.344036] ata5.00: failed command: READ FPDMA QUEUED
Jan 24 18:16:08 foo kernel: [ 1965.344055] ata5.00: cmd 60/08:00:18:31:44/00:00:1a:00:00/40 tag 0 ncq 4096 in
Jan 24 18:16:08 foo kernel: [ 1965.344059]          res 40/00:2c:e6:c2:fd/00:00:26:00:00/40 Emask 0x50 (ATA bus error)
Jan 24 18:16:08 foo kernel: [ 1965.344071] ata5.00: status: { DRDY }
Jan 24 18:16:08 foo kernel: [ 1965.344081] ata5.00: failed command: WRITE FPDMA QUEUED

It is a good time to think about changing your hardrive -- but it is maybe too late.

Smartmontools (aka smartd) is dedicated tool to monitor hard drives and do a good job. I think it is not installed by default in Debian, but it should be. It scans your hard drive for SMART capabilities and monitor the health of the HD using internal tools. In the case of bad blocks, you will start to see entry like that:

Feb 17 14:05:17 bar smartd[1268]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Feb 17 14:35:17 bar smartd[1268]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

Obviously when you see this log, it is also a good time to change your hard drive.

In the case of my wife's HD, I just got data from logcheck. It means that the error is not that important (transient failure, something is wrong but the HD can cope with it). But I still decided to get a new one for my wife.

Whenever, I receive a new drive, the first thing I do is to check it for errors. You can do that using the program badblocks in write mode. It takes ages to test (count up to 1 day for 1TB on USB), but at the end you know that you have a good candidate -- where it is worth install your data.

You just have to follow this procedure

  1. dmesg | grep sd
  2. try to find what drive is the one you want to test in the output of 1.
  3. cfdisk /dev/sdX, sdX being the drive you want to test
  4. check that what you see in cfdisk is what you expect to test: right name and capacity for the drive
  5. sudo badblocks -wvs /dev/sdX
  6. run sudo tail -f /var/log/syslog in parallel just in case
  7. wait

If some errors appears in /var/log/syslog, you know something bad is happening. Whenever you have a single failing block, don't think it is ok. It is NOT ok for a HD to start its life with failing blocks. In this case, repack the drive and send it for replacement ASAP.

In the case of my hard drive: smartd mail:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 15 Currently unreadable (pending) sectors

syslog entries:

Mar  2 20:52:57 foo kernel: [ 8317.419715] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Mar  2 20:52:57 foo kernel: [ 8317.419724] ata4.00: irq_stat 0x40000008
Mar  2 20:52:57 foo kernel: [ 8317.419732] ata4.00: failed command: READ FPDMA QUEUED
Mar  2 20:52:57 foo kernel: [ 8317.419747] ata4.00: cmd 60/80:00:00:af:08/00:00:28:00:00/40 tag 0 ncq 65536 in
Mar  2 20:52:57 foo kernel: [ 8317.419750]          res 41/40:00:65:af:08/00:00:28:00:00/40 Emask 0x409 (media error) <F>
Mar  2 20:52:57 foo kernel: [ 8317.419758] ata4.00: status: { DRDY ERR }
Mar  2 20:52:57 foo kernel: [ 8317.419764] ata4.00: error: { UNC }
Mar  2 20:52:57 foo kernel: [ 8317.423959] ata4.00: configured for UDMA/133

I am not blaming any particular brand (like Western Digital), all computer parts I have ever bought had to follow the same procedure and it is a known fact that computer parts as a non-zero percentage of chance to be DOA (dead on arrival) or after a few weeks. But as a consumer you should be aware of that and take action to avoid spending 10h configuring your computer to see it failing after a week... The waste of time to test is a win on the long term.