Abstract:
Designing an effective algorithm for finding identical files.

Created by Peter Kankowski
Filed under Algorithms


Searching for duplicate files

Here is an interesting programming task: write a program that finds all identical files on your hard disk. Files are considered identical if they have the same content; their names may or may not be the same. (By this definition, all zero-length files are identical.)

Exploring possible solutions

Obviously, you cannot compare each file with every other file on your hard disk in a reasonable time. So some kind of hashing should be used: the program calculates a hash for each file and stores the file name in a hash table. After scanning the folder tree, it compares the files with equal hashes byte by byte to find which of them are truly identical (or simply presumes that they are identical if the hash function provides almost unique values). A hash table is an O(N) solution, which is faster than a sorted list for this task.

Which hash function should we use? One idea is to hash the file content with an MD5 or CRC function. However, this requires reading all files on the disk (several gigabytes of data).

The best solution

The ideal hash function would read only the file entries in a directory, without opening the files and reading their content. Unfortunately, popular file systems do not store CRCs of files (archive formats such as ZIP or RAR do, so you can search for duplicates inside an archive without unpacking it).
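
For example, Python's zipfile module exposes each member's stored size and CRC-32, so duplicates inside an archive can be grouped without extraction. A minimal sketch (the function name and archive path are illustrative):

import zipfile

def zip_duplicates(zip_path):
    # Group archive members by (uncompressed size, stored CRC-32)
    groups = {}
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if not info.filename.endswith('/'):      # skip directory entries
                key = (info.file_size, info.CRC)
                groups.setdefault(key, []).append(info.filename)
    # Members sharing both size and CRC-32 are very likely identical
    return [names for names in groups.values() if len(names) > 1]

print(zip_duplicates('archive.zip'))                 # placeholder path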

Let's take a look at a directory entry (e.g., the WIN32_FIND_DATA structure) and see what can be used for file comparison. It contains the file name, modification/creation dates, attributes, and file size. Obviously, identical files have the same size.

The program can scan directories and put the files with equal size into the same hash table entry. Then it compares the found files (skipping hash table entries that contain only one file). For all files with the same size, a SHA-512 hash is calculated. The files are considered identical if their SHA-512 hashes are equal. Typically, the program has to read only a few files and calculate their SHA-512 hashes, because files with different sizes are skipped.

import os
from os.path import join, getsize

# Find files with the same size
samesize = {}
for path, _, filenames in os.walk(DIRECTORY, onerror=raise_err):
    for filename in filenames:
        fullname = join(path, filename)
        samesize.setdefault(getsize(fullname), []).append(fullname)

# Calculate SHA-512 hashes for these files
for filenames in samesize.values():
    if len(filenames) > 1:
        hashes = {}
        for filename in filenames:
            hashes.setdefault(hash_file(filename), []).append(filename)

        # Print files with the same hash value
        for identicalfiles in hashes.values():
            if len(identicalfiles) > 1:
                print('\n'.join(identicalfiles) + '\n')
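
The snippet relies on DIRECTORY, raise_err, and hash_file, which come from the complete downloadable code. A minimal sketch of what they could look like (the directory, block size, and error handling are assumptions):

import hashlib

DIRECTORY = '.'                  # root of the folder tree to scan (assumed)

def raise_err(err):
    # Assumed behavior: do not silently skip unreadable directories
    raise err

def hash_file(filename, blocksize=65536):
    # SHA-512 of the file content, read in blocks (block size is an assumption)
    h = hashlib.sha512()
    with open(filename, 'rb') as f:
        block = f.read(blocksize)
        while block:
            h.update(block)
            block = f.read(blocksize)
    return h.digest()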

Download the complete Python 3.0 code

Improvements

  • When searching the whole NTFS disk, you can scan the MFT instead of walking directories, as NDFF and TFind do. The MFT is more or less contiguous, while directories are scattered around the disk, so scanning them takes more time.
  • Certainly, the program will be faster if you reimplement it in C.

Conclusion

With a duplicate search program like this, you can free your disk from unnecessary files. Such utilities are included in Total Commander and various disk cleanup tools.

The design process used in this article can be applied to many problems: explore possible solutions, formulate the ideal solution, and see how close you can get to it. Analyze the problem at a low level (in this case, by looking at directory entries and the MFT).

Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, as well as in x86 assembly language, Python, and PHP.


16 comments

ace,

IMHO, SHA-512 (512 bits!) is overkill. As far as I know, the probability for any two different files to have the same SHA-512 is 9e-78 (see the birthday paradox), and the same MD5 (128 bits): 5e-20. For perspective: the probability of winning the premium (15 mega euro) in the EuroMillions lottery at once is "only" 1e-8. So if you play the lottery each time you get two equal hashes, you can expect to win 200 giga times before you once have the "bad luck" that two files have the same MD5 but are not the same. And if I remember correctly, for most countries the probability that a citizen will die in a traffic accident is 1e-2.

Peter Kankowski,

Thanks :). Yes, MD5 should be enough. At the same time, I would not use CRC64, because in the near future many people will have 10^8–10^9 files on their HDDs (I have 10^5 now). For estimating the probability, see the table in the Wikipedia article on the birthday paradox.
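
As a rough sketch of that estimate: the birthday bound for n files and a b-bit hash is approximately 1 - exp(-n(n-1)/2^(b+1)); the file counts below are only illustrative:

import math

def collision_probability(n, bits):
    # Probability that at least two of n files share the same hash value
    return -math.expm1(-n * (n - 1) / 2.0 ** (bits + 1))

print(collision_probability(10 ** 9, 128))   # 10^9 files, a 128-bit hash
print(collision_probability(10 ** 9, 64))    # 10^9 files, a 64-bit hash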

ace,

On the other hand, why make a program that you know will sometimes fail, if the alternative implementation doesn't have to be worse? For the given problem (the file is deleted and lost if you decide wrong), I would not calculate the checksum for the whole file if the file is big; I'd use only, for example, the first 500 KB, the 500 KB at the start of the second tenth of the whole file (I hope it's clear why?), and the last 500 KB. That way, if something is obviously not the same, I avoid even reading such big files. The files that remain I'd actually fully compare. The actual comparisons keep the CPU less busy than calculating checksums of these big files. I think such an approach would come closest to the minimum number of reads and minimal CPU use.
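
A minimal sketch of this sampling idea (the function name, hash choice, and chunk size are illustrative, not the commenter's code):

import os
import hashlib

CHUNK = 500 * 1024               # 500 KB samples, as suggested above

def sample_fingerprint(filename):
    # Hash the start, the start of the second tenth, and the end of the file;
    # files no larger than three chunks are hashed whole.
    size = os.path.getsize(filename)
    h = hashlib.md5()
    with open(filename, 'rb') as f:
        if size <= 3 * CHUNK:
            h.update(f.read())
        else:
            for offset in (0, size // 10, size - CHUNK):
                f.seek(offset)
                h.update(f.read(CHUNK))
    return size, h.digest()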

Peter Kankowski,

In the worst case (when a lot of files have the same length), you will have to make N²/2 comparisons. Using MD5/SHA hashing guarantees O(N) performance.

ace,
In the worst case (when a lot of files have the same length), you will have to make N²/2 comparisons.

Did you mean "when there are N files with the same length and the same sample checksum, you'd have to compare every one with all the others, giving N²/2 compares"? I don't see why. The goal was not listing where they differ from each other; the goal was only to discover those that are certainly the same in order to keep only one of them. That means we only need to find the sets of identical files, starting from the list of N files with the same length and the same sample checksum. Which should still mean that each of the N files can be read from the disk only once (using blocks, of course). Now look at the "optimal native word" (the one most convenient to process; "oword" from now on) i of every such file. I need only N compares as long as the files are the same (and not N²/2). I'm just checking whether all have the same value. Whenever I have N compares which do not all give the same value, it's true, I'd have to reorganize the files into new sets, but if the files are really the same that doesn't happen. If the files are really not the same, then as soon as one of those N files has a unique sequence of bytes starting from position i, I don't have to process that whole file anymore. So I claim that even then the process doesn't take longer if we count net disk reads, and I already won earlier by not even reading some big files of the same size except for small parts of them. So I'm never worse by disk reads. For CPU, I have in your case (the files which are actually the same) exactly one compare per oword per file, and you have the cost of updating a separate file checksum per oword of each file, which is always much more than one "compare oword i". So I think I still win. Or do you see something I'm missing here?

Peter Kankowski,

I cannot say which algorithm will be faster on real files: there are too many factors. You could implement your algorithm and compare its speed with mine. Python has a filecmp module, which is ideal for comparing small files. The module is written in Python (see the source in Python30\Lib\filecmp.py), so you can easily adapt it to your 500 KB scheme.

ace,

I don't have "real files" where I need such program, that is, I don't have "duplicates where any one can be kept."

An example "on real files": I create "disk in file" of 30 GB, install the OS on it. Then I make a copy and then continue to mount the original and work on it. So there are two big files of the same size, each 30 GB, but different.

Second "real files" example: "virtual encrypted disks." If you have more than one, most often they'll be the same size.

Third "real files" example: More databases on the computer. Every is preallocated to some fixed and same size, in order to be as fast as possible. They are always different.

Most real big files of the same size will often be updated or modified in blocks either directly at the file start or at the file end. Most that aren't, but have some content in them, will tend to have that content close to the beginning. And even if the changes are not inside the three blocks that are to be checksummed, with my algorithm these files will be read only up to the first difference.

With two different 30 GB files as part of the test set, my algorithm will always be faster than yours, most often up to 4000 times faster: mine will read and checksum 1.5 MB, yours 60 GB. Your algorithm is simple but certainly not optimal when really big files of the same size are present, and when they are, most of the time they will not be the same. And when they are the same, I still claim one compare per byte is cheaper than one checksum update per byte. I admit that when there are more than two such files, they must be read in parallel blocks in my case in order to still have the same total number of reads; that was a very good point of yours to mention such a case.

By the way, as soon as you think about "real files" as actually being of some specific format, there's a chance to find examples where you can know with 100% certainty that reading the whole files is not needed: for a lot of encrypted file formats, you can know that either they differ in the first few bytes or they are certainly the same (the same probability discussion as with checksums applies).

Which kind of files do you "clean up" anyway? I must admit I almost never had a case where I just wanted to remove "any duplicate of any file, no matter where." I think I did it once over a bunch of e-books I got, then a year later over a new bunch where a lot of them were the same but unhelpfully not always with the same name. It was a one-off task: I wanted to delete only the "newly received but identical" files (as the old ones had already gone to backup), so I calculated the MD5 of each file of the same size and then wrote a short program to list only the duplicates in the second directory. Then I checked whether the list "looks OK" and only then deleted every file from that list. I admit that if it were something more important, I could imagine using SHA-1 (160 bits) or even SHA-256 for the same task, but certainly not SHA-512. Anyway, I never wanted to make a "universal" solution (in one single program): I considered it too much work, and also that it can be important to decide which of several copies should be kept and by which logic.

(P.S. Thanks for fixing formatting.)

Peter Kankowski,

Well, if you don't want to implement your algorithm, I will do it anyway ;)

I searched in a large folder with documentation: saved web pages, manuals in PDF, help files, RFCs, etc. (3.2 GB in total). The largest identical files were 4 MB each; all other files were below 800 KB (99.8% were smaller than 100 KB). So I used exact comparison instead of your "500 KB" idea.

In my program, I replaced SHA-512 with SHA-256, because the Python docs say SHA-512 is slower on 32-bit machines (though I did not run any tests).

                      First run (after reboot)   Second and following runs
SHA-256 (Peter)       4 min 06 sec               13 sec
Comparison (Ace)      5 min 42 sec               20 sec
Total Commander       5 min 23 sec               6 sec

Source code

ace,

I don't have Python 3, but on my installation, as far as I can see, the file compare is plain Python which reads 8 KB at a time(!), while the hashes are in a native DLL. You can investigate the reason for the speed difference further.

BTW, how many files of the same size that are actually different exist in your sample set? Your numbers don't mean anything without at least some usable statistics about the properties of the test set, whereas I gave you exact examples where comparing must be orders of magnitude faster.

Peter Kankowski,

The test set included 19,000 files with equal sizes (of which 9600 were identical). The results don't change if you use an equal-sized buffer (64 KB instead of 8 KB) in both programs.

You certainly can show that an O(N²) algorithm is better than an O(N) one in some special cases. For example, insertion sort is better than quicksort for a 5-element array. However, using insertion sort as a general-purpose sorting algorithm is a bad idea. The same can be said about your algorithm.

ace,

I still don't understand how many of which files you had. How many different sizes, and how many unique files for a given size? Does your N squared come from having 9000 copies of the same file and comparing all combinations of them? But that comparison can be done in O(N) for such a set, as I replied after you first mentioned N squared (my third post here).

Peter Kankowski,

4170 different sizes; on average 2.6 unique files for each size (std. deviation = 2.9).

Does your N squared come from having 9000 copies of the same file and comparing all combinations of them?

No. Read the source code: it makes 8999 comparisons in this case.

I found a possible optimization. The comparison function reopened and reread the same files. If you are comparing 5 files, the program for your algorithm opened files 8-20 times, while my program does it just 5 times. Now it keeps the buffer for one of the compared files, so it opens them 8-15 times instead of 8-20.

                               First run (after reboot)   Second and following runs
SHA-256 (Peter)                4 min 05 sec               13 sec
Comparison, 2nd version (Ace)  4 min 21 sec               13 sec
Total Commander                5 min 23 sec               6 sec

Source code

Generally, instead of theorizing, please write better code yourself. This applies to your discussion with Dmitry as well.

ace,

It is possible to discuss algorithms without actually writing the code. I'm not paid to write that code, and I never claimed it's a matter of a few lines or a few minutes of coding to implement the algorithm. Still, I believe it's useful to consider it, discuss its properties, and compare it to others.

Note that I already mentioned all the elements of my algorithm; you just somehow ignored most of it. I already explained all that follows in a more informal way, and compared it to the others informally too. But if that's not enough, here is all of it again, formally:

- This algorithm (from now on, let's call it A) is applied only to a set of N files which have exactly the same size M.

- Determine the available "parallelization level," e.g. Pm = 200, and let P = min(N, Pm). The files will be processed by opening P of them at once and reading the same block i of size B from each.

- Define the "fingerprint" of the file to be the checksum (where checksum can be even less complex than MD5. i.e. MD4 is OK, or even proper combination of two CRC-32 could be enough) of the first small part, one small part near begin and the last small part of the file. For M < = 3 B the fingerprint is the checksum of the whole file.

- Stop observing files with a different fingerprint.

- Note that already at this point there will be files which the "naive checksum the whole file" approach reads and checksums in full, but which A has already rejected as different. A potentially needs far fewer file reads at this very point.

- The remaining files, let's say R of them, have the same fingerprint. With extremely high probability, files with the same fingerprint are actually the same. Still, it has to be confirmed, and it is going to be confirmed without further use of checksums.

- The next step is to attempt to partition the R files into smaller sets. Any file which has a unique byte or word at position p (where position p is the distance from the beginning of the file) can be immediately discarded from further analysis. In the case where R = 2 and a mismatch occurs, the process is already finished.

- The comparing of the files is, like before, performed using blocks of the same size B. (Note that we still have in memory R blocks of size B from the end of the R files, so we will process them first and then, if needed, observe only the first M − B bytes of any remaining file.)

- Comparing can be optimally done in two passes over the same set of blocks for the same oword at distance pp. The first pass is to load into the accumulator the biggest native word at position pp of file 0 and then just compare whether the word of every other file is the same. With very high probability it will be the same, and we simply increment pp to the next oword position. With very high probability, this way we'll decide that the remaining files are all the same. This is why I claim that for bigger files my algorithm typically has one compare per oword per file at the same position pp, where the "naive checksum" does a checksum update per word.

- As soon as we get a mismatch on the first pass, we need to do a proper partitioning to split the current set of R files into smaller sets, on which we can then continue the oword compare for each set independently (a sketch follows below). Note that the partitioned files can be processed directly from the same position: this step doesn't increase the total number of file reads! The partitioning should use, for example, bytes. I don't have to define it exactly at the moment to be able to estimate how the algorithm performs, because all the behaviour up to now is clear, including that partitioning into new sets doesn't increase the total number of block reads.
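
A minimal sketch of this block-wise compare-and-partition step (the block size and names are assumptions; sequential reads are used instead of the parallelization level P):

BLOCK = 64 * 1024   # assumed block size B

def partition_identical(filenames):
    # All files are assumed to have the same size and the same fingerprint.
    # Returns groups of file names whose content is byte-for-byte identical.
    groups = [list(filenames)]                 # sets that may still be equal
    handles = {name: open(name, 'rb') for name in filenames}
    try:
        result = []
        while groups:
            active = [name for group in groups for name in group]
            blocks = {name: handles[name].read(BLOCK) for name in active}
            if not blocks[active[0]]:          # equal sizes: EOF is reached together
                result.extend(groups)          # whatever stayed grouped is identical
                break
            next_groups = []
            for group in groups:
                by_block = {}
                for name in group:
                    by_block.setdefault(blocks[name], []).append(name)
                # files with a unique block are dropped from further reading
                next_groups.extend(g for g in by_block.values() if len(g) > 1)
            groups = next_groups
        return result
    finally:
        for f in handles.values():
            f.close()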

Now we have enough to estimate the behaviour and performance of A:

Advantage 1: A doesn't depend on probability for the final result. If you need such assurance (no probability), "naive checksum" is disqualified, and "naive compare" is much slower.

Advantage 2: A potentially avoids reading close to 100% of big files with the same size but different content. A universal algorithm shouldn't assume it processes only small files.

Advantage 3: Less CPU use in total.

Advantage 4: For files of size ≤ B, the number of file reads is the same as with the "naive checksum." CPU use per file is still lower, because a simpler checksum is allowed.

Potential weak point: compared with the "naive checksum," you can have at most 2 extra block reads per file for files bigger than B. However, as soon as Advantage 2 applies, the total number of block reads for A can still be lower, and therefore A faster.

Weak point of my formalism: I also ignored the case when N > Pm; that's left as an exercise for the reader. Still, I claim handling it doesn't change the numbers significantly.

I also already wrote major parts of these conclusions. I hope that now that you see the formal variant (I really expected the informal one to be enough for you), you understand why I wasn't more formal at the start and why I don't want to develop code for it. Still, I think it's a useful algorithm.

ace,

P.S. Re Dmitry: my last example there can be made complete by adding one if, one assignment, and two returns. I wouldn't want to insult the intelligence of anybody who reads it by assuming they are not obvious.

P.P.S. You never implemented my algorithm, only your "naive compare." I said, the first time you mentioned N squared, that each "of N files can be read from the disk only once." See also "I need only N compares" (where the subject was word compares, not block compares) "as long as the files are the same (and not 0.5 * N squared)." That's not the function you have available in Python.

P.P.P.S. Wouldn't a fair measurement include at least two big files of the same size and different content?

Peter Kankowski,

Sorry for the late reply. If I understand correctly, your algorithm calculates checksums for small files, so its performance would not be different on my test set. However, I agree that a universal algorithm should effectively process large files, too.

The first pass is to load into the accumulator the biggest native word at position pp of file 0 and then just compare whether the word of every other file is the same.

I like the idea, because it discards different files early, without processing the whole file. When implementing the comparison, a programmer should ensure that the buffers start at different addresses to avoid cache conflicts.

For example, if the buffer addresses are:

0x80000000
0x80080000
0x80100000
0x80180000
etc.

you will have cache conflicts when the number of compared files is larger than the cache associativity. The addresses should be randomized, like this:

0x800057F0
0x801015A0
etc.

Thanks for the detailed description of your ideas.

fish72,

IdenticalFiles" - small and smart utility finds duplicate files

on your computer: https://yadi.sk/d/7HFSNfN-3P7xnU
