Andre's Blog
Perfection is when there is nothing left to take away
Hard drive controller errors for USB-attached drives

Some time ago I noticed a bunch of disk controller error messages in Windows' Event Viewer and got alarmed that one of my disks or disk controllers might be corrupting data.

The actual messages looked like this:

The driver detected a controller error on \Device\Harddisk1\DR4.
The driver detected a controller error on \Device\Harddisk2\DR5.

, and physical drive numbers corresponded to hard drives in the Mediasonic ProBox enclosure HF2-SU3S2. I contacted Mediasonic Support and went through the usual troubleshooting hoops replacing drivers, checking SMART output for affected drives, changing USB sleep power-off features, etc. None of these steps eliminated error messages or explained them in any way.

Updated 2021-01-09

The original blog post erroneously described that HF2-SU3S2 was powering down drives on its own and causing disk read errors in the system event log. Capturing and analyzing more USB traces revealed more complex interaction between Windows and HF2-SU3S2 leading to those system event log errors.

The part of the trace below shows USB traffic after the last successful operation against HF2-SU3S2, which is shown as frame 10212 at the top and is set as a zero time point.

After HF2-SU3S2 idled for 45 seconds, Windows (host) sent a START STOP UNIT command (START bit set to 0), to each drive (LUN) in HF2-SU3S2, and I could see the LED for each drive flash momentarily and hear the sound of drives spinning down one by one.

Frame  Time       Src    Dst    Len   Info (SCSI)
10212  *REF*      3.1.1  host   40    Response LUN: 0x00 (Write(10))
10213  45.495501  host   3.1.2  58    Synchronize Cache(10) LUN: 0x01
10216  45.495963  3.1.1  host   40    Response LUN: 0x01 (Synchronize Cache(10))
10217  45.495981  host   3.1.2  58    Start Stop Unit LUN: 0x01 
10220  45.496087  3.1.1  host   40    Response LUN: 0x01 (Start Stop Unit)
10221  45.711346  host   3.1.2  58    Synchronize Cache(10) LUN: 0x02
10224  45.711618  3.1.1  host   40    Response LUN: 0x02 (Synchronize Cache(10))
10225  45.711624  host   3.1.2  58    Start Stop Unit LUN: 0x02 
10228  45.711717  3.1.1  host   40    Response LUN: 0x02 (Start Stop Unit)
10229  45.927148  host   3.1.2  58    Synchronize Cache(10) LUN: 0x00
10232  45.927570  3.1.1  host   40    Response LUN: 0x00 (Synchronize Cache(10))
10233  45.927597  host   3.1.2  58    Start Stop Unit LUN: 0x00 
10236  45.927710  3.1.1  host   40    Response LUN: 0x00 (Start Stop Unit)
10237  46.143562  host   3.1.2  58    Synchronize Cache(10) LUN: 0x03
10240  46.143860  3.1.1  host   40    Response LUN: 0x03 (Synchronize Cache(10))
10241  46.143879  host   3.1.2  58    Start Stop Unit LUN: 0x03 
10244  46.143974  3.1.1  host   40    Response LUN: 0x03 (Start Stop Unit)

Windows hard drive timeout in Windows Power Options is set to 15 minutes, so this 45-second timeout originates in some other Windows layer that is clearly ignoring the system power options settings.

The following part of the trace shows USB traffic after a subsequent read operation against the idling HF2-SU3S2. Notice that the START STOP UNIT command (START bit set to 1) is issued against LUN 0 drive and the read operation is issued against LUN 1 drive, which is still in the powered down state.

Frame  Time       Src    Dst    Len   Info (SCSI)
10245  *REF*      host   3.1.2  58    Start Stop Unit LUN: 0x00 
10248  0.000083   3.1.1  host   40    Response LUN: 0x00 (Start Stop Unit)
10249  0.000086   host   3.1.2  58    Read(10) LUN: 0x01
10252  8.342936   3.1.1  host   8219  Data In LUN: 0x01 (Read(10) Response Data) 
10254  8.343017   3.1.1  host   40    Response LUN: 0x01 (Read(10))
10255  8.343102   host   3.1.2  58    Read(10) LUN: 0x01
10258  8.351780   3.1.1  host   8219  Data In LUN: 0x01 (Read(10) Response Data)

LUN 1 drive spins up for 8 seconds and then returns 8219 bytes of data. This is when Windows logs that disk error in the system event log.

All four drives in HF2-SU3S2 are in a Storage Spaces drive pool, so Windows got confused which specific drive in that drive pool it was supposed to power up and which one it was supposed to read, and when the read operation took 8 seconds to complete, Windows reported this as an error.

Compare the sequence of commands above with the USB traffic below against another enclosure with a single drive.

Frame  Time       Src    Dst    Len   Info (SCSI)
533    *REF*      host   2.3.2  58    Start Stop Unit LUN: 0x00
536    8.585314   2.3.1  host   40    Response LUN: 0x00 (Start Stop Unit)
537    8.585420   host   2.3.2  58    Read(10) LUN: 0x00
540    8.598698   2.3.1  host   539   Data In LUN: 0x00 (Read(10) Response Data)

Notice that it is the START STOP UNIT command that waits for 8 seconds until the drives spins up, not the read command. Windows expects this operation to take time and does not report this delay as an error to the system event log.

All in all, there is a couple of Windows bugs in this configuration.

1. The Storage Spaces drive pool confuses the drive it wakes up with the drive it reads, which causes 8-second reads reported as alarming errors in the system event log. In my case these delayed reads and writes didn't cause actual read/write errors, but it may be just these specific tests and it is unsettling to know that this is happening under the covers.

2. There is an undocumented hard-coded 45-second idle timeout that powers down USB-attached drives repeatedly and causes very unpleasant 8-second delays while working with documents, images and other contents on those drives.

Both bugs are quite annoying and I'm not sure at this point how to resolve either of them. Microsoft certainly doesn't make it easy to report bugs against Windows.

I ran all these tests on a couple of different Windows 10 computers with the same version 20H2 19042.685. Drives in HF2-SU3S2 are configured as a Storage Spaces drive pool and a single-drive enclosure was a USB3 hard drive dock.

Comments:
Posted Sun Dec 20 01:33:51 EST 2020 by Chris

I've just started to have the same errors pop up. Did you ever figure out a solution? I pulled the problem drive out of the Mediasonic Probox to copy data to a new drive before testing. This was done using a dual drive usb dock. I was able to copy all the files except for the corrupt ones. I then ran some tests on the the problem drive only to get the same errors after the Seatools longtest passed. No I'm really confused. I'm wondering if windows changed something during an update. I notice just before all this started that I had some defrag errors in event viewer on all the drives in the probox. I had never seen these errors before. I am suspecting a widows or driver issue.

Posted Sun Dec 20 12:01:19 EST 2020 by Andre

You may be seeing unrelated problems. The error described in this post wasn't causing data loss in my setup. Mind you, though, my enclosure is set up not to power off automatically - that may indeed cause data loss.

I mostly use Linux tools to test drives and don't know what Seatools' long test does, but if it tests for bad sectors and didn't find anything, it would indicate that you are looking at data corruption caused by unfinished writes. For example, if the enclosure was powered off without using Windows safe remove.

Unfinished writes can be further subdivided onto file system corruption vs. file corruption. The former may report errors when trying to read/write files, while the latter will be evident trying to open files in their app (e.g. corrupt images, spreadsheets, etc).

Double-check bad sectors. If you find any, the best recovery tool is to use ddrescue on Linux. It's a command line tool, and it does a very intelligent recovery that maximizes chances of saving data.

If there are no bad sectors, look for what could cause unfinished writes - see if there are any events in the event log for NTFS and disk event types and look for general patterns when the drive could be detached without flushing buffers or just disappears (e.​g. after sleep, enclosure auto shut-off, etc).

Name:

Comment: