Some time ago I noticed a bunch of disk controller error messages in Windows' Event Viewer and got alarmed that one of my disks or disk controllers might be corrupting data.
The actual messages looked like this:
The driver detected a controller error on \Device\Harddisk1\DR4. The driver detected a controller error on \Device\Harddisk2\DR5.
, and physical drive numbers corresponded to hard drives in the Mediasonic ProBox enclosure HF2-SU3S2. I contacted Mediasonic Support and went through the usual troubleshooting hoops replacing drivers, checking SMART output for affected drives, changing USB sleep power-off features, etc. None of these steps eliminated error messages or explained them in any way.
Updated 2021-01-09
The original blog post erroneously described that HF2-SU3S2 was powering down drives on its own and causing disk read errors in the system event log. Capturing and analyzing more USB traces revealed more complex interaction between Windows and HF2-SU3S2 leading to those system event log errors.
The part of the trace below shows USB traffic after the last successful operation against HF2-SU3S2, which is shown as frame 10212 at the top and is set as a zero time point.
After HF2-SU3S2 idled for 45 seconds, Windows (host) sent a START STOP UNIT command (START bit set to 0), to each drive (LUN) in HF2-SU3S2, and I could see the LED for each drive flash momentarily and hear the sound of drives spinning down one by one.
Frame Time Src Dst Len Info (SCSI) 10212 *REF* 3.1.1 host 40 Response LUN: 0x00 (Write(10)) 10213 45.495501 host 3.1.2 58 Synchronize Cache(10) LUN: 0x01 10216 45.495963 3.1.1 host 40 Response LUN: 0x01 (Synchronize Cache(10)) 10217 45.495981 host 3.1.2 58 Start Stop Unit LUN: 0x01 10220 45.496087 3.1.1 host 40 Response LUN: 0x01 (Start Stop Unit) 10221 45.711346 host 3.1.2 58 Synchronize Cache(10) LUN: 0x02 10224 45.711618 3.1.1 host 40 Response LUN: 0x02 (Synchronize Cache(10)) 10225 45.711624 host 3.1.2 58 Start Stop Unit LUN: 0x02 10228 45.711717 3.1.1 host 40 Response LUN: 0x02 (Start Stop Unit) 10229 45.927148 host 3.1.2 58 Synchronize Cache(10) LUN: 0x00 10232 45.927570 3.1.1 host 40 Response LUN: 0x00 (Synchronize Cache(10)) 10233 45.927597 host 3.1.2 58 Start Stop Unit LUN: 0x00 10236 45.927710 3.1.1 host 40 Response LUN: 0x00 (Start Stop Unit) 10237 46.143562 host 3.1.2 58 Synchronize Cache(10) LUN: 0x03 10240 46.143860 3.1.1 host 40 Response LUN: 0x03 (Synchronize Cache(10)) 10241 46.143879 host 3.1.2 58 Start Stop Unit LUN: 0x03 10244 46.143974 3.1.1 host 40 Response LUN: 0x03 (Start Stop Unit)
Windows hard drive timeout in Windows Power Options is set to 15 minutes, so this 45-second timeout originates in some other Windows layer that is clearly ignoring the system power options settings.
The following part of the trace shows USB traffic after a subsequent read operation against the idling HF2-SU3S2. Notice that the START STOP UNIT command (START bit set to 1) is issued against LUN 0 drive and the read operation is issued against LUN 1 drive, which is still in the powered down state.
Frame Time Src Dst Len Info (SCSI) 10245 *REF* host 3.1.2 58 Start Stop Unit LUN: 0x00 10248 0.000083 3.1.1 host 40 Response LUN: 0x00 (Start Stop Unit) 10249 0.000086 host 3.1.2 58 Read(10) LUN: 0x01 10252 8.342936 3.1.1 host 8219 Data In LUN: 0x01 (Read(10) Response Data) 10254 8.343017 3.1.1 host 40 Response LUN: 0x01 (Read(10)) 10255 8.343102 host 3.1.2 58 Read(10) LUN: 0x01 10258 8.351780 3.1.1 host 8219 Data In LUN: 0x01 (Read(10) Response Data)
LUN 1 drive spins up for 8 seconds and then returns 8219 bytes of data. This is when Windows logs that disk error in the system event log.
All four drives in HF2-SU3S2 are in a Storage Spaces drive pool, so Windows got confused which specific drive in that drive pool it was supposed to power up and which one it was supposed to read, and when the read operation took 8 seconds to complete, Windows reported this as an error.
Compare the sequence of commands above with the USB traffic below against another enclosure with a single drive.
Frame Time Src Dst Len Info (SCSI) 533 *REF* host 2.3.2 58 Start Stop Unit LUN: 0x00 536 8.585314 2.3.1 host 40 Response LUN: 0x00 (Start Stop Unit) 537 8.585420 host 2.3.2 58 Read(10) LUN: 0x00 540 8.598698 2.3.1 host 539 Data In LUN: 0x00 (Read(10) Response Data)
Notice that it is the START STOP UNIT command that waits for 8 seconds until the drives spins up, not the read command. Windows expects this operation to take time and does not report this delay as an error to the system event log.
All in all, there is a couple of Windows bugs in this configuration.
1. The Storage Spaces drive pool confuses the drive it wakes up with the drive it reads, which causes 8-second reads reported as alarming errors in the system event log. In my case these delayed reads and writes didn't cause actual read/write errors, but it may be just these specific tests and it is unsettling to know that this is happening under the covers.
2. There is an undocumented hard-coded 45-second idle timeout that powers down USB-attached drives repeatedly and causes very unpleasant 8-second delays while working with documents, images and other contents on those drives.
Both bugs are quite annoying and I'm not sure at this point how to resolve either of them. Microsoft certainly doesn't make it easy to report bugs against Windows.
I ran all these tests on a couple of different Windows 10 computers with the same version 20H2 19042.685. Drives in HF2-SU3S2 are configured as a Storage Spaces drive pool and a single-drive enclosure was a USB3 hard drive dock.
I've just started to have the same errors pop up. Did you ever figure out a solution? I pulled the problem drive out of the Mediasonic Probox to copy data to a new drive before testing. This was done using a dual drive usb dock. I was able to copy all the files except for the corrupt ones. I then ran some tests on the the problem drive only to get the same errors after the Seatools longtest passed. No I'm really confused. I'm wondering if windows changed something during an update. I notice just before all this started that I had some defrag errors in event viewer on all the drives in the probox. I had never seen these errors before. I am suspecting a widows or driver issue.
You may be seeing unrelated problems. The error described in this post wasn't causing data loss in my setup. Mind you, though, my enclosure is set up not to power off automatically - that may indeed cause data loss.
I mostly use Linux tools to test drives and don't know what Seatools' long test does, but if it tests for bad sectors and didn't find anything, it would indicate that you are looking at data corruption caused by unfinished writes. For example, if the enclosure was powered off without using Windows safe remove.
Unfinished writes can be further subdivided onto file system corruption vs. file corruption. The former may report errors when trying to read/write files, while the latter will be evident trying to open files in their app (e.g. corrupt images, spreadsheets, etc).
Double-check bad sectors. If you find any, the best recovery tool is to use ddrescue on Linux. It's a command line tool, and it does a very intelligent recovery that maximizes chances of saving data.
If there are no bad sectors, look for what could cause unfinished writes - see if there are any events in the event log for NTFS and disk event types and look for general patterns when the drive could be detached without flushing buffers or just disappears (e.g. after sleep, enclosure auto shut-off, etc).