Resilient, until it's not

I have been a big proponent of Storage Spaces in Windows 10 for many years and while redundant storage provided by Storage Spaces is not a replacement for a proper backup, it does provide good protection against individual drive failures and some forms of enclosure failures.

When Windows 10 was first released, in addition to drive redundancy, it also allowed formatting Storage Space volumes as ReFS (Resilient File System), which added a layer of protection against bit rot and sudden power loss because of the way it performs disk writes. Later on, Microsoft removed the ability to format new volumes as ReFS from Windows 10, but existing ReFS volumes remained usable, and I assumed that Microsoft would be respectful of terabytes of data and would warn me before ReFS stopped being maintained on Windows 10.

That turned out to be a bad assumption and what followed felt like a gut punch.

RAW File System

My storage solution included two mirrored Storage Space volumes formatted as ReFS, each containing just under 4 TB of data, mostly pictures in Canon RAW format and videos taken by myself and family members.

During the pandemic I didn't travel much and kept both enclosures turned off, sometimes for weeks at a time. On February 26th, 2022, I turned on one of them to copy a few files and, much to my surprise, the drive appeared in Explorer but was labeled as if it had a RAW file system. I thought it was a glitch and rebooted the computer, with the same result. Thinking that maybe something was wrong with the enclosure, I turned on the second one and it showed the same. I felt sick to my stomach - years of memories, travels and photography were gone.

Configuration Rollback

My first reaction was that one of the Windows Updates had installed something that caused this, so I tried to roll back the system configuration. Windows offered a few choices, and the oldest one Windows considered "safe" would take me a couple of weeks back. This rollback went pretty smoothly, but didn't help.

My second choice was to go as far back as December 31st, 2021, based on a System Image backup. Windows warned me that going this far back was not safe. I acknowledged the warning and it started the process. It took hours, and when it started rebooting at some point, it got stuck on a boot screen. I let it sit for a couple of hours, but then forced the computer to power off. The computer was no longer bootable.

I booted the computer from a recovery USB drive and Windows allowed me to select a different date to recover system configuration. It took some time, but in the end I had the same RAW volumes. The date for this configuration was somewhere in the second half of January.

Partition Recovery

At this point I had a couple of choices - use partition recovery tools or create a support ticket with Microsoft. The latter was about $300 to just create a support case, which would be refunded if it proved to be a bug, but in this case I knew that they would say ReFS was not supported on Windows 10 and I didn't feel like just giving Microsoft $300 for the trouble they caused me.

So, I turned to various partition recovery tools, which included ReclaiMe, UFS, R-Studio, RS Partition Recovery and Active Partition Recovery. R-Studio seemed promising after their initial scan showed some of my pictures, so I purchased a license.

R-Studio

R-Studio found my ReFS volume and took some time to scan it, and it was nice to be able to see progress on a volume block map. However, when it was done, nothing was listed as recoverable. Spending more time with the tool, I noticed that if I interrupted the scan close to the end, it showed me usable results. I contacted R-Studio Support and got a prompt response asking for various logs. After a bit of back and forth, the support person didn't seem interested in my feedback and offered nothing that would remedy the problem, so I decided to try another tool.

Active Partition Recovery

I had a very good experience with Active Undelete in the past, so I was very hopeful that they would do as good a job with partition recovery. However, Active Partition Recovery just froze on the initial splash screen when either of the enclosures was turned on.

I contacted Active Support via their Contact Us form and sent another email to their sales address saying that I would buy the whole suite of their recovery products if they got it going on my computer, but received no response to either message.

RS Partition Recovery

RS Partition Recovery was more expensive than other tools, but their evaluation version showed some results, so I bought a copy and started a scan. The first unpleasant surprise was that RS Partition Recovery must have been written for Windows 3.1 because the scan dialog would freeze for a few minutes, then would come back to life for a few seconds and then would become unresponsive for a few minutes again.

Their remaining time estimate was all over the place - ranging from a few minutes to days between those frozen states. After working on this scan for over a day, it got confused by the timer rolling over 24 hours and started predicting remaining time in minutes again.

Eventually, it froze in the way that even drive LEDs stopped blinking. I let it sit for a couple of hours and then waited for a few minutes for one of those responsive few seconds, which allowed me to cancel the scan and save the results.

The results were promising - I could see some of the pictures and videos, which was better than nothing. I started recovering, but was surprised to see that for my 3.7 TB of data RS Partition Recovery said I would need a 9 TB drive to save recovered files. I started looking closer at what they would recover and realized that they didn't try to recover a working ReFS volume, but rather scanned it for file signatures, which included all files deleted over the years, some of which weren't even valid.

I contacted RS Partition Recovery support a few times on a couple of their channels, but they didn't bother responding even once.

Note on Partition Recovery Tools

All these tools are a poor fit for recovering a file system that was healthy right up until the problem hit and has not been modified since (no accidentally deleted partitions and so on). They don't attempt to validate the file system and repair whatever is broken. Instead, they scan drives for file signatures, which usually recovers a mix of current and long-deleted files, and it takes significant effort to sort out which ones to keep and which to delete again.

I also ruled out data recovery companies for the same reason. They promise a best effort to recover something, not to find out what happened, and that best effort most likely amounts to hitting the scan button in one of these recovery tools and handing over whatever the default scan found. Many people would still welcome that, because something beats nothing.

Digging Deeper into ReFS

While RS Partition Recovery was scanning and recovering some of my files, I dove deeper into ReFS, hoping to find out what happened to my volumes. Microsoft does not publish ReFS internals, so information was scarce, and while I dug up a few bits here and there, this was the best reverse engineering paper on ReFS that I found:

https://www.sciencedirect.com/science/article/pii/S266628172030010X

Initially, I entertained the idea of writing a simple tool to copy ReFS files, but after reading about its structure, it was clear that ReFS is too complex and too poorly documented to write a meaningful tool in a reasonable amount of time, so I abandoned this idea. However, from these bits and pieces I was able to interpret the boot sector and establish that the ReFS version of my volumes was 1.2.
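For anyone who wants to repeat the boot sector check, here is a minimal sketch. It assumes the field offsets described in community reverse-engineering write-ups such as the paper linked above (Microsoft does not document them), and it works on a sector that has already been dumped into a byte array. The function name and the file path are hypothetical.

```powershell
# Hypothetical helper: reads the ReFS version from a boot sector dump.
# ASSUMPTION: all offsets below come from reverse-engineering write-ups,
# not from Microsoft documentation, and may not hold for every ReFS version.
function Get-RefsVersionFromBootSector {
    param(
        [Parameter(Mandatory)]
        [byte[]]$Sector
    )

    if ($Sector.Length -lt 0x2A) {
        throw "Expected at least 42 bytes, got $($Sector.Length)."
    }

    # Assumed layout: "ReFS" at offset 3 and the "FSRS" tag at offset 0x10.
    $name = [System.Text.Encoding]::ASCII.GetString($Sector, 3, 4)
    $tag  = [System.Text.Encoding]::ASCII.GetString($Sector, 0x10, 4)
    if ($name -cne 'ReFS' -or $tag -cne 'FSRS') {
        throw 'This does not look like a ReFS boot sector.'
    }

    # Assumed layout: major version at 0x28, minor version at 0x29.
    '{0}.{1}' -f $Sector[0x28], $Sector[0x29]
}

# Usage with a sector image saved to a file (placeholder path):
# $bytes = [System.IO.File]::ReadAllBytes('C:\temp\refs-boot-sector.bin')
# Get-RefsVersionFromBootSector -Sector $bytes
```

The function only inspects bytes it is given, so it cannot modify the volume. How the sector gets dumped to a file is left out on purpose.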

I recalled that I had a couple of drives formatted with ReFS on Windows Server 2012, so I found them and, much to my surprise, discovered that this ReFS volume mounted without problems on my Windows 10 machine. The ReFS version of that volume was 3.7.

This prompted me to look not for ways to recover ReFS partitions, but rather for issues with ReFS v1.2 on Windows and following this hunch I found a couple of threads that shed some light on what happened with my volumes and on possible ways to recover files. This thread provided some timeline on the problem and its underlying causes.

https://borncity.com/win/2022/02/08/microsoft-wird-refs-bug-in-windows-vermutlich-nicht-komplett-fixen/

It led me to this thread, where people experienced the same problem on Windows Server instances:

https://techcommunity.microsoft.com/t5/windows-server-for-it-pro/refs-volume-appears-raw-version-doesn-t-match-expected-value/m-p/3248761

The cause for Windows Server seemed to be that Windows upgrades ReFS to the latest version when one is available, unless the drive is considered removable, in which case they leave it alone because it may become unusable when it is attached to computers running an older version of Windows Server.

Hard to say what the cause was in my case - both enclosures are attached via eSATA and are not removable in any practical sense, but maybe Windows 10 viewed them otherwise. It is also possible that Windows 10 never bothered upgrading ReFS on Home and Pro editions. One would expect Microsoft to warn people that they may lose terabytes of data unless they move it to NTFS volumes, but I guess it is too much to ask of Microsoft to worry about Windows 10 Home/Pro users.

From these threads I got January 11th as the date when Windows Update turned my ReFS volumes RAW, so I decided to try to roll back my configuration again.

Configuration Rollback, Take 2

This time I decided to leave my current configuration alone and restored one of the System Image backups I had from 2018 on a new drive. This way I could move through updates without worrying about current data.

I detached my computer from the network, so Windows Update wouldn't immediately start updating on its own, and restored Windows to 2018, version 1803. I didn't expect this to work on its own because the Storage Pool also has a version, and an old version of Windows is not likely to read a newer version of the storage pool. I was a bit nervous that Windows might try to change something in the newer version, but it just didn't show me the Storage Space drives. So far so good.

I allowed Windows Update to go one update at a time and it got me to version 20H2. I had high hopes for this update, but instead got the same RAW file system I had before. The next, and only, step on this path was to uninstall some of the updates individually. In my case, there were only two - KB5010342 and KB5007401. This took me to Windows version 2004.
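A quick way to see whether those two updates are present on a given machine is the built-in Get-HotFix cmdlet. This is only a sketch for checking. It does not uninstall anything, and it assumes both updates are reported through the hotfix list on the system in question.

```powershell
# Check for the two KB numbers named above. Updates that are not installed
# are simply skipped instead of raising an error.
Get-HotFix -Id KB5010342, KB5007401 -ErrorAction SilentlyContinue |
    Select-Object HotFixID, Description, InstalledOn
```

An empty result means neither update was found in the list.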

I must admit, I was quite anxious while waiting for the computer to boot after the last step. This was the last hope to get the volumes recovered without trying to find a Windows Server instance I could borrow for a few days. As the computer was booting, even before the Desktop was completely rendered, I noticed that the LEDs on the attached enclosure lit up in a familiar sequence, one after another, as if Windows was trying to mount my ReFS volume. With that hope, I hit Win+E and there it was - my beautiful ReFS volume.

File Recovery

My plan was to reuse the same hard drives, create another Storage Space volume in the same Storage Pool and move files from the ReFS volume to the new NTFS volume. I have done this numerous times in the past and was never worried, but after this experience I didn't feel like just trusting Storage Spaces to move my files safely, so for each enclosure I decided to make a copy first.

I scrambled together a few spare drives in another enclosure and used robocopy to copy files. I used /TEE and /LOG+ to see progress and keep a log, in case I needed to review what went on.
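A copy along those lines might look like the sketch below. Only /TEE and /LOG+ come from my actual run. The paths, the retry settings and the exit code check are assumptions added for illustration, and the log folder has to exist before robocopy starts.

```powershell
# Placeholder paths: E:\ is the recovered ReFS volume, F:\ is the spare-drive copy.
$source = 'E:\Photos'
$target = 'F:\Copy_A\Photos'
$log    = 'C:\Logs\robocopy_copy_A.log'

# /E      copy subfolders, including empty ones
# /R:2    retry a failed file twice
# /W:5    wait 5 seconds between retries
# /TEE    write output to the console as well as to the log
# /LOG+:  append to the log file instead of overwriting it
robocopy $source $target /E /R:2 /W:5 /TEE "/LOG+:$log"

# robocopy reports status through its exit code. Values of 8 and above mean
# that at least one file or folder could not be copied.
if ($LASTEXITCODE -ge 8) {
    Write-Warning "robocopy reported failures (exit code $LASTEXITCODE); check $log"
}
```

The short retry settings matter on a multi-day copy, because robocopy's defaults retry a failing file for a very long time before moving on.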

Both enclosures were attached via eSATA and it still took over a day to copy all files in each case. Having all components plugged into a UPS to handle minor power interruptions was certainly a good idea.

After making copies, I felt a little better, but as it turned out, I wasn't out of the woods yet. When I tried to create a new Storage Space in the same Storage Pool in Control Panel, it would create a drive, flash it in Computer Management, and then immediately drop it with a cryptic "incorrect parameter" error message.

Fortunately, I could create a storage space from PowerShell with this command (line breaks are for display purposes). The size of this Storage Space was exactly the same as that of the existing ReFS volume, adjusted for redundancy.

New-VirtualDisk -FriendlyName NEW_NTFS_A `
    -StoragePoolFriendlyName POOL_A `
    -ResiliencySettingName Mirror `
    -Size 7.2TB `
    -ProvisioningType Thin
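Before and after creating the disk, it may help to look at what the pool contains. The following is a sketch using the standard Storage cmdlets, with POOL_A being the pool name from the command above.

```powershell
# List pools (skipping the built-in primordial pool) with capacity figures.
Get-StoragePool -IsPrimordial $false |
    Select-Object FriendlyName, HealthStatus, Size, AllocatedSize

# List the storage spaces in POOL_A with their resiliency, provisioning type,
# logical size and the physical space they currently occupy in the pool.
Get-StoragePool -FriendlyName POOL_A |
    Get-VirtualDisk |
    Select-Object FriendlyName, ResiliencySettingName, ProvisioningType, Size, FootprintOnPool
```

For a thinly provisioned mirror, FootprintOnPool grows as data is written, while Size stays at the logical size requested at creation.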

This command created a new uninitialized disk, which I initialized in Computer Management and created a new NTFS volume.
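I did this step in Computer Management, but the same thing can presumably be done from PowerShell. The sketch below is an untested alternative to what I actually did. It assumes a GPT partition style and a single partition spanning the whole disk.

```powershell
# Find the disk behind the new storage space, initialize it, create one
# partition across the whole disk and format it as NTFS.
Get-VirtualDisk -FriendlyName NEW_NTFS_A |
    Get-Disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel NEW_NTFS_A
```

The volume label matches the one used for flushing the volume cache later on.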

The existing ReFS volume and the new NTFS volume shared the same Storage Pool, so files could only be moved, not copied. Once moved, the source files would not be recoverable via scan tools, because the thinly provisioned NTFS volume acquires storage released from the ReFS volume as files are being moved.

I used robocopy with /MOVE to move files. After all files have been moved, it is a good idea to flush the target volume before you start examining files. Rebooting the computer or using this command to flush volume cache does the trick.

Write-VolumeCache -FileSystemLabel NEW_NTFS_A
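Put together, the move and the flush could be sketched as follows. The paths, retry settings and log file name are placeholders, not the ones I used.

```powershell
# Placeholder paths: E:\ is the old ReFS volume, G:\ is the new NTFS volume.
# /MOVE deletes each file and folder from the source once it has been copied
# (/MOV would do that for files only and leave the folder tree behind).
robocopy 'E:\Photos' 'G:\Photos' /E /MOVE /R:2 /W:5 /TEE '/LOG+:C:\Logs\robocopy_move_A.log'

if ($LASTEXITCODE -ge 8) {
    Write-Warning "robocopy reported failures (exit code $LASTEXITCODE)."
} else {
    # Flush cached writes before opening any of the moved files.
    Write-VolumeCache -FileSystemLabel NEW_NTFS_A
}
```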

I did this for almost all operations, but after the final move, I got impatient and launched one video in an older program and it force-rebooted Windows, so now I need to run some file integrity tool against my intermediate copy to make sure nothing was lost.
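One possible shape for such an integrity check is a hash comparison between the intermediate copy and the new volume. The sketch below is an assumption about how that could be done, not a tool I have settled on, and the function name is hypothetical.

```powershell
# Hypothetical helper: reports files that are missing from, or differ in,
# the target tree when compared with the reference tree.
function Compare-TreeHashes {
    param(
        [Parameter(Mandatory)][string]$ReferenceRoot,
        [Parameter(Mandatory)][string]$TargetRoot
    )

    $refRoot = (Resolve-Path -LiteralPath $ReferenceRoot).ProviderPath
    $tgtRoot = (Resolve-Path -LiteralPath $TargetRoot).ProviderPath

    foreach ($file in Get-ChildItem -LiteralPath $refRoot -Recurse -File) {
        # Path of the file relative to the reference root.
        $relative = $file.FullName.Substring($refRoot.Length).TrimStart('\', '/')
        $other    = Join-Path $tgtRoot $relative

        if (-not (Test-Path -LiteralPath $other -PathType Leaf)) {
            [pscustomobject]@{ Path = $relative; Problem = 'Missing' }
            continue
        }

        # SHA-256 of both copies; any difference means the content diverged.
        $refHash = (Get-FileHash -LiteralPath $file.FullName -Algorithm SHA256).Hash
        $tgtHash = (Get-FileHash -LiteralPath $other -Algorithm SHA256).Hash
        if ($refHash -ne $tgtHash) {
            [pscustomobject]@{ Path = $relative; Problem = 'Different' }
        }
    }
}

# Usage (placeholder paths): the intermediate copy is the reference.
# Compare-TreeHashes -ReferenceRoot 'F:\Copy_A\Photos' -TargetRoot 'G:\Photos'
```

Hashing several terabytes twice takes a long time, so this is something to leave running overnight. An empty result means every file in the reference tree has an identical counterpart in the target.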

After all files had been moved, I deleted the empty ReFS volumes, copied all robocopy logs onto the new NTFS volumes, and reattached the original system drive to boot normally.

Redundancy vs. Backup

Having two copies of data on different physical disks provides protection against hard drive failures, but not against all forms of enclosure failures or Microsoft's whims to drop support for older ReFS versions without any heads-up that would allow people to prepare for these situations.

A backup is a persistent copy of protected resources at a specific point in time and is fundamentally different from having redundancy in the storage solution. In fact, a backup should be stored with redundancy of its own.

This all sounds nice and this is what Enterprise users should be doing, but in practical terms having a redundant backup for 2 x 3.7 TB of data is quite expensive for local storage and is prohibitively expensive for cloud storage in terms of bandwidth, access and storage.

After some thinking and reviewing choices, I decided to stay with redundancy, but considering how all my current redundancy is at the mercy of Microsoft Updates, I decided to set up a Linux storage box with the cheapest low-grade drives, just to have another copy of all data in a different format. I reviewed a few solutions, but it will take me some time to finalize one and that will be a topic for another post.

Final Thoughts

Microsoft traditionally was fairly conservative with their backward compatibility and while they did drop products in the past, which is inevitable, their strategy for long-term technologies was to maintain backward compatibility for a reasonable amount of time and to let people know when things become unsupported. I'm sure people can attest to the contrary, but that has been my experience. Well, until now.

I do wonder if the Product Owner for ReFS at Microsoft made the decision to drop ReFS v1.2 based on poor information about how many ReFS volumes are being actively used, or because they have so little respect for their users that they just dismissed the smaller group of users who ended up with ReFS v1.2. In either case, common sense would be to let people know that they have to move their data, but this common courtesy was not on Microsoft's roadmap. That's just disappointing.
