finddup – Finds duplicated files fast and efficiently.
finddup [-aehiqrx0] [-p | -t] [-2 | -4 | -8] [-B | -T] [-H | -L | -P]
[-d | -l | -o | -O | -s | -S | -c | -C | -m | -M | -v | -V | -n]
[-I glob] [-X glob] [file ...]
This utility compares the contents of files to check if any of them match. What is considered a match depends on the chosen method.
By default, files are compared heuristically: two files are considered duplicates if they are the same size and if a few bytes taken from different parts of each file (samples) are identical at the same positions in both files.
This method is very fast and accurate enough for most use cases, but it can produce false positives (or false negatives when invoked with -n). The number of samples that are compared can be increased with the -2, -4, and -8 options, which can reduce the number of false matches but also increase the run time, especially for the trim method. The sample size can be increased with the -x option.
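The idea behind heuristic comparison can be sketched in Python as follows. This is only an illustration under stated assumptions, not finddup's actual implementation: the function name `probably_equal`, the evenly spaced offsets, and the default values for `samples` and `sample_size` are all hypothetical.

```python
import os

def probably_equal(path_a, path_b, samples=4, sample_size=16):
    """Heuristic check: same size and identical samples at the same offsets.

    Assumption: samples are spread evenly from the start to the end of the
    file. finddup may choose its sample positions differently.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False  # files of different sizes are never a match here
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        if size <= samples * sample_size:
            # Small files: the samples would cover everything anyway.
            return fa.read() == fb.read()
        step = (size - sample_size) // max(samples - 1, 1)
        for i in range(samples):
            fa.seek(i * step)
            fb.seek(i * step)
            if fa.read(sample_size) != fb.read(sample_size):
                return False
    return True
```

Two files that differ only in a region that no sample covers pass this check, which is the kind of false positive described above. Raising `samples` or `sample_size` shrinks the unsampled regions at the cost of more reads.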
The trim method (-t) also employs heuristic comparison as described above, but it ignores repeating characters at the start and end of file contents. This is especially useful for text files, which often end with blank lines, and video files, which might have a varying number of NUL characters at the end of their contents.
However, this method is slower because it needs to open every file to compare their contents to each other, whereas the default method only has to compare files of the same size.
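A minimal Python sketch of the trimming idea follows. It rests on an assumption about what "ignoring repeating characters" means: here, a run of one repeated byte at either end of the contents is collapsed to a single byte. The helper names `trim_repeats` and `trimmed_equal` are hypothetical, and the sketch works on whole byte strings for clarity rather than on samples.

```python
def trim_repeats(data: bytes) -> bytes:
    """Collapse the run of a repeated byte at each end to a single byte."""
    if len(data) < 2:
        return data
    start = 0
    # Skip all but the last byte of the leading run.
    while start + 1 < len(data) and data[start + 1] == data[start]:
        start += 1
    end = len(data)
    # Skip all but the first byte of the trailing run.
    while end - 1 > start and data[end - 2] == data[end - 1]:
        end -= 1
    return data[start:end]

def trimmed_equal(a: bytes, b: bytes) -> bool:
    """Compare two byte strings after trimming repeats at both ends."""
    return trim_repeats(a) == trim_repeats(b)
```

Because trimming can make files of different sizes compare equal, size alone cannot be used to rule out candidates, which is why every file has to be opened.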
With precise comparison (-p), file contents are compared byte for byte, so it can be guaranteed that only perfect duplicates are found.
This method is the slowest one unless all files are different sizes, in which case it is actually faster than the trim method.
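For comparison, a byte-for-byte check can be sketched with Python's standard `filecmp` module. This is an analogue for illustration only; it is not how finddup itself is implemented, and the function name `identical` is hypothetical.

```python
import filecmp

def identical(path_a, path_b):
    """Return True only if both files have exactly the same contents.

    shallow=False makes filecmp read the contents instead of trusting
    matching os.stat() signatures.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

A check like this can reject files of different sizes without reading them, which is consistent with the remark above that precise comparison is cheap when all files differ in size.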
Note that multiple hard links to the same file are considered duplicates unless the -h option is specified.
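Whether two paths are hard links to the same file can be checked by comparing device and inode numbers, as in this Python sketch. The function name `same_inode` is hypothetical, and the sketch assumes a filesystem that reports meaningful inode numbers.

```python
import os

def same_inode(path_a, path_b):
    """True if both paths refer to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

A copy of a file has its own inode, so `same_inode` returns False for it even though the contents are identical.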
There are various output modes that are mostly useful for subsequent processing of the results.
By default, duplicates and their originals are shown in pairs, separated by one of the following equality signs: ~~ means that the files are probably duplicates; == indicates that the file contents are identical; === means that their inode numbers are identical.
The format of this output mode might change in the future and is therefore not suited for automatic processing or piping. finddup prevents output redirection in this mode.
As for non-option arguments, finddup differentiates between files and directories; files passed as arguments are checked and compared first, and directories are traversed after. Hence, while it does not matter whether files or directories appear first on the command line, the order of multiple files and the order of multiple directories might affect the results, depending on the output mode.
When invoked without non-option arguments, finddup looks for duplicates in the working directory. When files are passed as arguments, finddup only looks for duplicates of these files.
This manual contains a tutorial.
-p
Compare the entire contents of files. This is slower but only considers files to be duplicates if they are perfect matches.
-t
Trim repeating characters from the beginning and end of file contents before comparing them.
-2
Use twice as many samples for heuristic comparison.
-4
Use four times as many samples for heuristic comparison.
-8
Use eight times as many samples for heuristic comparison.
-x
Use three times as many bytes per sample for heuristic comparison.
-d
Print path of each file with a list of paths of its duplicates.
If combined with the -0 (zero) option, each path is terminated with a NUL character, while the last path in the list of duplicates is terminated with two NUL characters.
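The NUL-delimited format described above could be parsed along these lines. This Python sketch assumes the stream looks exactly as described (each path followed by one NUL, each group closed by an extra NUL); the function name `parse_groups` is hypothetical.

```python
def parse_groups(raw: bytes):
    """Split NUL-delimited output into (file, [duplicates]) tuples.

    Paths cannot contain NUL bytes and are never empty, so two NULs in a
    row can only mark the end of a group.
    """
    groups = []
    for chunk in raw.split(b"\0\0"):
        if not chunk:
            continue  # the empty piece after the final terminator
        paths = chunk.split(b"\0")
        groups.append((paths[0], paths[1:]))
    return groups
```

In a script, `raw` would typically come from `sys.stdin.buffer.read()` so that paths are handled as bytes and no decoding errors can occur.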
-l
Print paths of each file and its duplicate on separate lines.
-o
Only print paths of files that are duplicates of other files. This corresponds to the path on the left in the default output mode.
-O
Only print paths of files that have at least one duplicate. This corresponds to the path on the right in the default output mode.
-s
Only print paths of files whose size is smaller than or equal to the size of their respective duplicates.
Implies -t.
-S
Only print paths of files whose size is larger than or equal to the size of their respective duplicates.
Implies -t.
-c
Only print paths of files whose inode change time is older than or equal to the time of their respective duplicates.
-C
Only print paths of files whose inode change time is newer than or equal to the time of their respective duplicates.
-m
Only print paths of files whose modification time is older than or equal to the time of their respective duplicates.
-M
Only print paths of files whose modification time is newer than or equal to the time of their respective duplicates.
-v
Only print paths of files whose access time is older than or equal to the time of their respective duplicates.
-V
Only print paths of files whose access time is newer than or equal to the time of their respective duplicates.
-n
Only print paths of files that have no duplicates.
-a
Compare all files, including hidden files, such as Thumbs.db and Icon?. Also look for files in hidden directories.
-e
Ignore empty files.
-r, -R
Look for duplicates in subdirectories as well.
-B
Only compare binary files.
-T
Only compare text files.
-h
Do not compare files whose inode numbers are identical.
-H
Follow symbolic links on the command line.
This option has no effect on Microsoft Windows.
-L
Follow all symbolic links.
This option has no effect on Microsoft Windows.
-P
Do not follow symbolic links. This is the default.
-I glob
Only compare files matching the pattern glob.
-X glob
Do not compare files matching the pattern glob.
-i
Ignore the case of glob patterns.
-q
Do not print the number of duplicated or unique files. Hide the progress indicator.
-0
Print paths separated by NUL characters; useful for xargs -0.
Implies -o unless an output mode is specified.
--help
Print a synopsis of the command and its options.
--version
Print version information.
The finddup command accepts the -- option, which causes it to stop processing flag options. This allows you to pass file or directory names that begin with a dash (-).
The finddup utility exits 0 on success, 1 if no duplicates were found, and greater than 1 if an error occurs.
For all of these examples, bear in mind that, unless -p is specified, this utility might identify duplicates that are not, in fact, identical; you have to trade off precision against speed of operation.
In this tutorial, the words duplicates and copies are used interchangeably.
Let’s start by looking for duplicates in the working directory.
finddup
You can also check whether a directory contains duplicates of files in another directory (or vice versa). Note that this command will also find copies of files that are both located in the same directory.
finddup dir1 dir2
To simply get a list of duplicates (without the corresponding original file), call finddup -o dir1 dir2 instead. Provided that dir2 contains copies of files from dir1, this command will print the paths of the duplicated files in dir2.
You might want to find out which files are copies of other files.
finddup file1.xyz file2.xyz file3.xyz
The next example shows how to determine which of two files is the original, i.e., the older one of the duplicates, provided that they are perfectly identical.
finddup -pm file1.xyz file2.xyz
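The "older than or equal" rule used by -m can be illustrated with a small Python sketch. It only looks at modification times and does not compare contents; the function name `older_or_equal` is hypothetical, and the assumption that both paths qualify when the times are equal is inferred from the option description, not verified against finddup.

```python
import os

def older_or_equal(path_a, path_b):
    """Return the paths whose mtime is older than or equal to the other's."""
    ma = os.stat(path_a).st_mtime_ns
    mb = os.stat(path_b).st_mtime_ns
    result = []
    if ma <= mb:
        result.append(path_a)
    if mb <= ma:
        result.append(path_b)
    return result
```

Keep in mind that copying a file may or may not preserve its modification time, depending on the tool used, so the older file is not always the original.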
It’s easy to pipe the results to another utility, e.g., to delete duplicated files. (The -0 (zero) option implies -o unless another output mode is specified, which comes in handy for a simple operation like this.)
finddup -0 | xargs -0 rm
However, maybe you only want to delete specific files that already exist
somewhere else and leave all other duplicates untouched, if there are any.
This command searches dir recursively, and either does nothing or removes file.xyz if a duplicate of it exists anywhere in dir. (It will also try to delete the file more than once if dir contains multiple copies of it.)
finddup -rO0 file.xyz dir | xargs -0 rm
You could also delete text files that are almost identical but end (or begin) with unnecessary blank lines.
finddup -TS0 | xargs -0 rm
Caution: In the examples above, heuristic comparison was used, which could lead to the removal of files that were not exact copies of any other file but that the utility still regarded as duplicates. Only the precise comparison method can rule out false positives.
You might find yourself in a situation where two or more directories contain the same files except for a few that have been changed (or corrupted). To get a list of these unique files, you can negate the results.
finddup -n dir1 dir2
Similarly, to make sure that the working directory does not contain a copy of a specific file, you can use a command like this.
finddup -n . file.xyz
You can specify which files should be compared or skipped during directory traversal. Let’s say you don’t want backup files to be compared.
finddup -X "*.bak"
You could also, e.g., look for duplicated video and audio files in the working directory and all its subdirectories recursively. (The pattern in the command below matches filenames with the extensions mp3, mp4, m4a, m4v, mkv, etc. The -i option makes patterns case-insensitive.)
finddup -ri -I "*.{mp[34],m?[av]}"
You can even combine inclusion and exclusion patterns. This command compares all JPEG files except the ones whose filenames contain _thumb followed by a dot (for example, photo_thumb.jpg).
finddup -ri -I "*.{jpg,jpeg}" -X "*_thumb.*"
Consult the documentation of Text::Glob for a detailed explanation of pattern syntax.
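To get a feel for which filenames the patterns in the examples above select, they can be approximated in Python. This sketch is not Text::Glob: it expands brace groups by hand and then uses the standard `fnmatch` module, whose handling of leading dots and slashes may differ. The helper names `expand_braces`, `matches`, and `selected` are hypothetical.

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of plain patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)  # innermost group first
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for alternative in m.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded

def matches(name, pattern, ignore_case=False):
    """True if name matches any brace expansion of pattern."""
    if ignore_case:  # rough equivalent of the -i option
        name, pattern = name.lower(), pattern.lower()
    return any(fnmatch.fnmatchcase(name, p) for p in expand_braces(pattern))

def selected(name, include, exclude, ignore_case=False):
    """Combine an inclusion pattern (-I) with an exclusion pattern (-X)."""
    return (matches(name, include, ignore_case)
            and not matches(name, exclude, ignore_case))
```

With these helpers, `matches("Song.MP3", "*.{mp[34],m?[av]}", ignore_case=True)` is true, while `selected("cat_thumb.jpg", "*.{jpg,jpeg}", "*_thumb.*")` is false because the exclusion pattern also matches.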
Although finddup should work on any platform, it has so far only been tested on macOS.
diff(1), xargs(1), File::Compare, Text::Glob
Bernhard Waldbrunner (github.com/vbwx)