File Integrity Tracker (fit)

Posted Sun, 17 Apr 2022 23:13:56 GMT in Computing by Andre

Last month I ended up copying thousands upon thousands of files while recovering my data from ReFS volumes that had turned RAW because Microsoft quietly dropped support for ReFS v1.2 on Windows 10. During the recovery I tried to be careful and flushed the volume cache after every significant copy operation, but a couple of times Windows just restarted on its own, which left me uncertain whether the data in all files had safely reached the drive platters.

I had used a couple of file integrity verification tools in the past and thought that it would take some time to read all the files, but would otherwise be a fairly simple exercise. However, it turns out that everyday file tools don't work quite as well against a couple of hundred thousand files.

fciv

My first choice was Microsoft's fciv utility, which is quite well designed for tracking file integrity across multiple directories and maintains integrity information in an XML file, so files can be verified against the same location or at a new base path. This was exactly what I was looking for.

The download link on the fciv page was broken and I had to dig up fciv from my archives. While scanning some of the music files I noticed that fciv mangles characters outside of the Windows-1252 character set, which made it unusable for file names like hello,🌎︎.txt. This didn't deter me because most of my files had ASCII names, so I decided to work around this problem with file exclusion lists.

I kicked off a scan of one of the largest directory trees, which had about 200K files across many directories, and fciv worked for hours, until I noticed that the disk LEDs had stopped blinking. I don't remember exactly how the failure manifested, and I wish I had documented it better. Even though fciv had recorded some data in the XML file, the file was so large that fciv failed to load it back into memory on subsequent runs, so it became apparent that fciv was not intended to work with so many files.

for loop & certutil

My second choice was a good old batch script that would run certutil against each file and append the results to a text file. If files are traversed in the same order in both locations, I could use regular text comparison tools to compare the text files produced for each location. I estimated that the resulting text file would be about 100 MB in size, which text comparison tools could certainly handle.

I started running the for /R command, with certutil -hashfile against each file. The command kept going for hours and the resulting file looked exactly as I expected when I sample-checked it a few times, but then I started seeing some console errors and the computer started behaving erratically. After some troubleshooting I realized that the for loop kept consuming memory as it traversed directories, so after a few hours my computer ran out of the 8 GB of memory I have on this desktop.
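For comparison, the one-shot text file idea can be sketched in a few lines of Python. This is not what the batch script did and not how fit works; it is a minimal sketch that assumes Python 3.8 or later, and the names sha256_of and write_manifest are made up for this example. It hashes files in fixed-size chunks and sorts directory and file names, so that two copies of the same tree should produce identical text files.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so memory use does not grow with file size."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(base_dir, manifest_path):
    """Write one '<hash>  <relative path>' line per file and return the file count."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(base_dir):
            dirs.sort()  # sorting in place fixes the order in which os.walk descends
            for name in sorted(files):
                full = os.path.join(root, name)
                # Relative paths with forward slashes keep manifests comparable
                # between different base paths.
                rel = os.path.relpath(full, base_dir).replace(os.sep, "/")
                out.write(f"{sha256_of(full)}  {rel}\n")
                count += 1
    return count
```

Running write_manifest against both locations and comparing the two outputs with any text comparison tool would then point at the files that differ. The manifest should be written outside of the scanned tree, or it will be hashed as well.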

The thought of running the same in PowerShell crossed my mind, but each of these experiments was taking too much time. Instead of spending more of it on finding out whether PowerShell could produce one-shot text files for me, I decided to write a small C++ utility that could compute and verify hash values for hundreds of thousands of files with a limited amount of memory.

File Integrity Tracker (fit)

I gave some thought to whether to reuse the Berkeley DB database layer I wrote for Stone Steps Webalizer or to write something new, and decided that this utility should use a database that could be used on its own, which led me to the SQLite library.

I spent a couple of weekends contemplating the design, creating a SQLite package and choosing a SHA-256 library, and a couple more weekends and some weeknights writing and testing the utility against a variety of directory trees, and here it is - meet File Integrity Tracker (fit).

https://github.com/StoneStepsInc/fit

fit can compute and verify SHA-256 hashes for hundreds of thousands of files with a limited amount of memory, while processing several files simultaneously.
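The general technique behind that description, a fixed number of workers reading files in fixed-size chunks, can be sketched in Python as follows. This is an illustration only and not fit's actual C++ implementation; the function names, the chunk size, the queue size and the on_result callback are all assumptions made for this example.

```python
import hashlib
import queue
import threading

CHUNK_SIZE = 512 * 1024  # assumed read buffer size, not taken from fit


def sha256_file(path):
    """Hash one file chunk by chunk; only one chunk is held in memory at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(paths, on_result, workers=4):
    """Hash up to `workers` files at a time and call on_result(path, digest) for each.

    `paths` may be a generator. The bounded queue makes the producer wait for
    the workers, so memory use does not depend on the number of files.
    """
    todo = queue.Queue(maxsize=workers * 2)
    lock = threading.Lock()

    def worker():
        # None is used as a stop marker, one per worker thread.
        while (path := todo.get()) is not None:
            try:
                digest = sha256_file(path)
            except OSError:
                digest = None  # file could not be read; reported as None
            with lock:  # on_result is called by one thread at a time
                on_result(path, digest)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for path in paths:
        todo.put(path)  # blocks while the queue is full
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
```

In this sketch the caller decides what on_result does with each hash, for example inserting it into a database or comparing it against a previously recorded value. Threads are a reasonable fit here because CPython's hashlib releases the global interpreter lock while hashing larger buffers, so hashing and disk reads can overlap.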

All file integrity data is stored in a SQLite database that can be queried directly in the SQLite shell. This can be fun on its own, as it allows one to find large files, duplicate files, when files were modified, and so on.
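To give an idea of what such queries may look like, here is a small sketch that uses Python's built-in sqlite3 module. The files table and its columns (path, size, mod_time, sha256) are hypothetical and were made up for this example, as were the sample rows and the placeholder hash values. The actual schema is defined in the fit project and will differ, so the queries would need to be adapted to it.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a HYPOTHETICAL schema (not fit's schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files ("
        " path TEXT PRIMARY KEY, size INTEGER, mod_time INTEGER, sha256 TEXT)"
    )
    rows = [  # made-up sample data with placeholder hash values
        ("music/a.flac", 30_000_000, 1_600_000_000, "hash-1"),
        ("backup/a.flac", 30_000_000, 1_600_000_000, "hash-1"),
        ("docs/notes.txt", 2_000, 1_650_000_000, "hash-2"),
    ]
    db.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
    return db


def largest_files(db, limit=10):
    """Return (path, size) pairs for the largest files."""
    sql = "SELECT path, size FROM files ORDER BY size DESC, path LIMIT ?"
    return db.execute(sql, (limit,)).fetchall()


def duplicate_hashes(db):
    """Return (sha256, copies) pairs for hashes that occur more than once."""
    sql = (
        "SELECT sha256, COUNT(*) FROM files"
        " GROUP BY sha256 HAVING COUNT(*) > 1 ORDER BY sha256"
    )
    return db.execute(sql).fetchall()


def modified_since(db, unix_time):
    """Return paths of files modified at or after the given Unix time."""
    sql = "SELECT path FROM files WHERE mod_time >= ? ORDER BY path"
    return [row[0] for row in db.execute(sql, (unix_time,))]
```

The same SELECT statements can be typed into the SQLite shell once table and column names are replaced with the real ones.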

The project above only builds a Windows utility, but the code is written in a portable way and can be compiled on Linux. I gave it a try and it seems to work, but I didn't have the time to test it well, so I decided to limit builds to Windows at this point.

My final test for fit consisted of two scans. The first one scanned all files I had copied from a temporarily mounted ReFS volume to an NTFS volume, which was carefully flushed in the process and served as a good source of reliable file integrity information. The second one was a verification scan of the files moved from a ReFS volume to an NTFS volume on the same Storage Spaces virtual drive, which had a limited amount of free space and could accommodate only one set of files.

The verification scan ran for 10 hours and covered about 4 TB in 205K files, which translates into 114 MB/s. This NTFS volume is formatted on top of a Storage Spaces virtual drive with mirror resiliency across four 4 TB WD Red NAS drives.
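The throughput figure is easy to sanity-check. The exact byte count and duration are not given above, so the sketch below works with the rounded numbers and assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes); the function names are made up for this example.

```python
def throughput_mb_per_s(total_bytes, hours):
    """Average throughput in decimal megabytes per second."""
    return total_bytes / (hours * 3600) / 1e6


def terabytes_read(mb_per_s, hours):
    """Decimal terabytes read at a given average rate over a given time."""
    return mb_per_s * 1e6 * hours * 3600 / 1e12


print(round(throughput_mb_per_s(4e12, 10), 1))  # 111.1
print(round(terabytes_read(114, 10), 2))        # 4.1
```

Exactly 4 TB in exactly 10 hours would be about 111 MB/s, while 114 MB/s sustained for 10 hours corresponds to roughly 4.1 TB, so the quoted rate is consistent with "about 4 TB".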

This final file integrity verification scan marked the end of my data recovery ordeal that started on February 26th. Whew.
