Dupfinder

October 11, 2016

Dupfinder is a small tool I wrote that provides just one command: find-dups. It finds duplicate files by content; it ignores the filenames.

One day I found some old backup drives and wondered which of their files I had already transferred to my new computer. There were thousands of files on those drives, so there was no way I could check them all by hand. Obviously I needed a program to help me with this. I found several existing solutions that would start off by calculating a hash of every single file on my hard drive, undoubtedly to create an index that guarantees quick retrieval. Unfortunately, the indexing process itself is very time-consuming. I thought I could do better.

My tool postpones expensive calculations: it only computes a hash when two files have exactly the same size, down to the last byte. We want to avoid streaming the contents of several multi-gigabyte videos or ISO files through a hashing algorithm when we’re searching for a 4 KB JPEG image, right? This simple idea results in a huge speed increase compared to similar solutions, because fetching the size of a file is such a cheap operation for your computer. It is negligible compared to calculating a hash of all the bytes in a file.
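
To make the size-first idea concrete, here is a minimal sketch in Python. It is not the actual find-dups source: the function names `file_hash` and `find_dups`, the choice of SHA-256, and the 1 MiB chunk size are all assumptions made for illustration. It indexes the needles by size, walks the haystack, and only hashes a haystack file when at least one needle has exactly the same size.

```python
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    SHA-256 and the chunk size are assumptions; any content hash would do.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_dups(haystack, needles):
    """Yield (needle, hay) pairs of files with identical content.

    Hypothetical helper: `haystack` is a directory, `needles` a list of files.
    """
    # Cheap step: index the needles by size. No file contents are read here.
    needles_by_size = defaultdict(list)
    for needle in needles:
        needles_by_size[os.path.getsize(needle)].append(needle)

    needle_hashes = {}  # cache so each needle is hashed at most once

    for root, _dirs, files in os.walk(haystack):
        for name in files:
            hay = os.path.join(root, name)
            try:
                candidates = needles_by_size.get(os.path.getsize(hay))
            except OSError:
                continue  # broken link or vanished file; skip it
            if not candidates:
                continue  # no needle has this size, so nothing is hashed
            # Expensive step, reached only when the sizes match exactly.
            hay_hash = file_hash(hay)
            for needle in candidates:
                if needle not in needle_hashes:
                    needle_hashes[needle] = file_hash(needle)
                if needle_hashes[needle] == hay_hash:
                    yield needle, hay
```

Calling `list(find_dups("/path/to/haystack", ["/path/to/needle.jpg"]))` would return the matching pairs. In this sketch a multi-gigabyte file in the haystack is never opened unless some needle happens to have the same size.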

That is all it does; it has no other functionality besides some options to format the output. You can instruct it to print several variables, so that whenever a match is found it displays the filename of the needle and/or the matching file found in the haystack. The options reflect the analogy of searching for needles (certain files on the backup drive) in a haystack (a directory on my new hard drive).

find-dups [options] haystack needle...

-h, --help
-v, --[no-]verbose
-p, --print         Print 'needle', 'hay' or 'separator';
                    should be a comma-separated list (no spaces!)

Low on RAM?

The program keeps a big list in memory of all the files you tell it to look for. It probably wouldn’t have worked as well in the 90s, but now, in the twenty-first century, we have gigabytes of memory, and for me it proved not to be a problem at all.