
Fastest algorithm - 100x performance increase compared to the accepted answer (really :))

The approaches in the other solutions are very cool, but they forget about an important property of duplicate files: they have the same file size. Calculating the expensive hash only on files with the same size saves a tremendous amount of CPU; performance comparisons are at the end, here's the explanation.

Iterating on the solid answers given in the other replies, and borrowing the idea of taking a fast hash of just the beginning of each file (the full hash is calculated only on collisions in the fast hash), here are the steps (a sketch of the code follows the list):

1. Build up a hash table of the files, where the file size is the key.
2. For files with the same size, create a hash table keyed by the hash of their first 1024 bytes; non-colliding elements are unique.
3. For files with the same hash on the first 1k bytes, calculate the hash of the full contents - files with matching ones are NOT unique.
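The original code block did not survive extraction intact; only a few identifiers are recoverable (`check_for_duplicates`, `get_hash`, the `hashes_by_size` / `hashes_on_1k` dictionaries, a chunk-reading generator, and an `os.stat` call). The listing below is therefore a minimal, runnable sketch of the three steps using those names; `chunk_reader`, `hashes_full`, the `os.walk` traversal, the error handling, and the command-line wrapper are filled in by me and are not necessarily the author's exact code.

```python
import hashlib
import os
from collections import defaultdict


def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes."""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    """Hash the first 1024 bytes (cheap) or the whole file (expensive)."""
    hashobj = hash()
    with open(filename, "rb") as f:
        if first_chunk_only:
            hashobj.update(f.read(1024))
        else:
            for chunk in chunk_reader(f):
                hashobj.update(chunk)
    return hashobj.digest()


def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # size_in_bytes -> [file paths]
    hashes_on_1k = defaultdict(list)    # (hash of first 1k, size_in_bytes) -> [file paths]
    hashes_full = {}                    # full-content hash -> first path seen with that hash

    # Step 1: group files by size; a unique size cannot have duplicates.
    for path in paths:
        for dirpath, _dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # Resolve symlinks so the same file is not counted twice.
                    full_path = os.path.realpath(full_path)
                    file_size = os.stat(full_path).st_size
                except OSError:
                    continue  # unreadable or vanished file
                hashes_by_size[file_size].append(full_path)

    # Step 2: for same-size files, hash only the first 1024 bytes.
    for file_size, files in hashes_by_size.items():
        if len(files) < 2:
            continue
        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True, hash=hash)
            except OSError:
                continue
            hashes_on_1k[(small_hash, file_size)].append(filename)

    # Step 3: for files colliding on the small hash, compare full-content hashes.
    for files in hashes_on_1k.values():
        if len(files) < 2:
            continue
        for filename in files:
            try:
                full_hash = get_hash(filename, first_chunk_only=False, hash=hash)
            except OSError:
                continue
            duplicate = hashes_full.get(full_hash)
            if duplicate:
                print("Duplicate found: {} and {}".format(filename, duplicate))
            else:
                hashes_full[full_hash] = filename


if __name__ == "__main__":
    import sys
    check_for_duplicates(sys.argv[1:])
```

Run it as, e.g., `python find_dupes.py /path/one /path/two` (the CLI wrapper at the bottom is an added convenience). Only files that collide on both size and the 1 KiB hash are ever read in full, which is where the claimed speedup over hash-everything approaches comes from.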

This approach is also convenient for leaving some paths out of the scan - .svn paths, for instance, which would surely trigger colliding files in find_duplicates. Feedbacks are welcome.

Another answer has a nice solution here as well: adapted to only compute the md5sum of files with the same size, it is very efficient because it checks for duplicates based on the file size first.
