Dupfinder


– tags: dups, duplicates

Dupfinder is a small tool I wrote that provides just one command: find-dups. It can find duplicate files by their content; it ignores the filenames.

One day I found some old backup drives and wondered which files I had already transferred to the new computer. There were thousands of files on those drives, so there was no way I could check them all by hand; obviously I needed a program to help me with this. I found several existing solutions, but they would start off by calculating a hash of every single file on my hard drive, undoubtedly to create an index that guarantees quick retrieval. Unfortunately, the indexing process itself is very time-consuming. I thought I could do better.

My tool postpones the hash calculation: it only hashes a file when its size matches the size of a file I'm searching for. After all, why would I want to stream the contents of several multi-gigabyte MP4s or ISO files through a hash algorithm when I know I’m searching for a 4 KB JPG? This results in a huge speed increase, because fetching the size of a file is a very cheap operation for your computer compared to calculating a hash of what’s in the file.
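The size-before-hash idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual code: the function names `file_digest` and `same_content` are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()  # assumed hash; any content hash would do
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(needle, hay):
    """Return True if both files have identical content, ignoring their names."""
    # Cheap check first: a size lookup only reads file metadata.
    if os.path.getsize(needle) != os.path.getsize(hay):
        return False
    # Sizes match, so the expensive hash is now worth computing.
    return file_digest(needle) == file_digest(hay)
```

With this ordering, a multi-gigabyte file is only ever read when some needle happens to have exactly the same size.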

That is all it does; it has no other functionality besides some output formatting options. You can instruct it to display several variables, so that whenever a match is found it can output the filename of the needle and/or the file found in the haystack. I use the analogy of searching for needles (certain files on the backup drive) in a haystack (a directory on my new hard drive).

find-dups [options] haystack needle...

-h, --help
-v, --[no-]verbose
-p, --print         Print 'needle', 'hay' or 'separator'; 
                    should be a comma-separated list (no spaces!)

The program keeps a big list in memory of all the files you tell it to look for. It probably wouldn’t have worked as well in the 90s, but we have gigabytes of memory these days, and for me it proved not to be a problem at all.
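One way to picture such an in-memory list is a dictionary keyed by file size, so that each haystack file needs a single lookup to learn whether any needle could possibly match. The sketch below is a guess at the general shape, not the tool's real data structure; the names `index_needles`, `sha256_of` and `find_dups` are hypothetical, and SHA-256 is an assumed hash.

```python
import hashlib
import os
from collections import defaultdict


def index_needles(needle_paths):
    """Group needle paths by file size; only paths and sizes are held in memory."""
    by_size = defaultdict(list)
    for path in needle_paths:
        by_size[os.path.getsize(path)].append(path)
    return by_size


def sha256_of(path):
    """Hash file content in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_dups(haystack_dir, needle_paths):
    """Yield (needle, hay) pairs whose contents match."""
    by_size = index_needles(needle_paths)
    needle_hashes = {}  # cache so each needle is hashed at most once
    for root, _dirs, files in os.walk(haystack_dir):
        for name in files:
            hay = os.path.join(root, name)
            candidates = by_size.get(os.path.getsize(hay))
            if not candidates:
                continue  # no needle has this size, so skip hashing entirely
            hay_hash = sha256_of(hay)
            for needle in candidates:
                if needle not in needle_hashes:
                    needle_hashes[needle] = sha256_of(needle)
                if needle_hashes[needle] == hay_hash:
                    yield needle, hay
```

The index stores one path and one integer per needle, so even tens of thousands of needles take up only a few megabytes.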