finddup – Finds duplicated files fast and efficiently.
finddup [-aehiqrx0] [-p | -t] [-2 | -4 | -8] [-B | -T] [-H | -L | -P]
[-d | -l | -o | -O | -s | -S | -c | -C | -m | -M | -v | -V | -n]
[-I glob] [-X glob] [file ...]
findlink [-aeiqr0] [-B | -T] [-H | -L | -P] [-d | -l | -o | -O | -n]
[-I glob] [-X glob] [file ...]
finddup compares the contents of files to check if any of them match. What is considered a match depends on the chosen method.
By default, files are compared heuristically, which means that files are considered duplicates if they are the same size, and if a few bytes of different parts of the file contents (samples) are identical to their counterparts.
This method is very fast and accurate enough for most use cases, but it can produce false positives (or false negatives when invoked with -n). The number of samples that are compared can be increased with the -2, -4, and -8 options, which can reduce the number of false matches but also increase the run time, especially for the trim method. The sample size can be increased with the -x option.
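The heuristic described above can be illustrated with a minimal Python sketch. It is not the code used by finddup: the function name, the number of samples, the sample size, and the evenly spaced sample positions are all assumptions made for illustration.

```python
import os

def probably_equal(path_a, path_b, samples=4, sample_size=16):
    """Heuristic check: same size and identical bytes at a few sampled offsets.

    The defaults and the evenly spaced offsets are hypothetical choices.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    samples = max(samples, 2)
    if size <= samples * sample_size:
        # Small file: a single read covers the whole contents.
        offsets, length = [0], size
    else:
        # Spread the samples from the first to the last byte of the file.
        step = (size - sample_size) // (samples - 1)
        offsets, length = [i * step for i in range(samples)], sample_size
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for offset in offsets:
            fa.seek(offset)
            fb.seek(offset)
            if fa.read(length) != fb.read(length):
                return False
    return True
```

Two files of equal size that differ only between the sampled positions pass this check, which is the kind of false positive mentioned above. Taking more or larger samples narrows those gaps at the cost of more reads.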
The trim method (-t) also employs heuristic comparison as described above, but it ignores repeating characters at the start and end of file contents. This is especially useful for text files, which often end with blank lines, and video files, which might have a varying number of NUL characters at the end of their contents.
However, this method is slower because it needs to open every file to compare its contents, whereas the default method only has to compare files of the same size.
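As a rough illustration of trimming, the Python sketch below collapses a run of one repeated byte at the start and at the end of the data to a single byte. This is only one possible reading of "repeating characters"; the function name and the exact rule are assumptions, and the sketch works on data already in memory rather than on sampled file contents.

```python
def trim_repeats(data: bytes) -> bytes:
    """Collapse a leading and a trailing run of one repeated byte to a single byte.

    Hypothetical interpretation of the trim method, for illustration only.
    """
    if len(data) < 2:
        return data
    start = 0
    # Skip all but the last byte of the leading run.
    while start + 1 < len(data) and data[start + 1] == data[0]:
        start += 1
    end = len(data)
    # Skip all but the first byte of the trailing run.
    while end - 2 >= start and data[end - 2] == data[-1]:
        end -= 1
    return data[start:end]
```

Under this rule, b"hello\n" and b"hello\n\n\n" trim to the same bytes even though their sizes differ, which is why the size cannot serve as a quick first filter for this method.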
With precise comparison (-p), file contents are compared byte for byte, so it can be guaranteed that only perfect duplicates are found.
This method is the slowest one unless all files have different sizes, in which case it is actually faster than the trim method.
Note that multiple hard links to the same file are considered duplicates unless the -h option is specified.
findlink, on the other hand, finds hard links, i.e., files whose inode numbers are identical.
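The idea behind findlink can be sketched in a few lines of Python by grouping paths by device and inode number. The function name is hypothetical, and including the device number is an assumption, made here because inode numbers are only unique within one file system.

```python
import os
from collections import defaultdict

def hard_link_groups(paths):
    """Group paths that refer to the same inode on the same device."""
    groups = defaultdict(list)
    for path in paths:
        st = os.lstat(path)  # lstat: do not follow symbolic links
        groups[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes that are reachable through more than one path.
    return [group for group in groups.values() if len(group) > 1]
```

Within each group, the first path seen comes first, so the result depends on the order in which the paths are supplied.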
For both utilities, there are various output modes that are mostly useful for subsequent processing of the results.
By default, duplicates and their originals are shown in pairs, separated by
one of the following equality signs: ~~ means that the files are probably
duplicates; == indicates that the file contents are identical; ===
means that their inode numbers are identical.
The format of this output mode might change in the future and is therefore not suited for automatic processing or piping. finddup and findlink prevent output redirection in this mode.
As for non-option arguments, these utilities differentiate between files and directories; files passed as arguments are checked and compared first, and directories are traversed after. Hence, while it does not matter whether files or directories appear first on the command line, the order of multiple files and the order of multiple directories might affect the results, depending on the output mode.
When invoked without non-option arguments, these utilities look for duplicates or identical hard links in the working directory. When files are passed as arguments, they only look for duplicates or identical hard links of these files.
This manual contains a tutorial.
-p
Compare the entire contents of files. This is slower but makes sure that only perfectly matching files are considered duplicates.
-t
Trim repeating characters from the beginning and end of file contents before comparing them.
-2
Use twice as many samples for heuristic comparison.
-4
Use four times as many samples for heuristic comparison.
-8
Use eight times as many samples for heuristic comparison.
-x
Use three times as many bytes per sample for heuristic comparison.
-d
Print path of each file with a tab-indented list of paths of its duplicates.
If combined with the -0 (zero) option, the list of duplicates is not tab-indented, and each path is terminated with a NUL character; the last path in the list of duplicates is terminated with two NUL characters.
-l
Print paths of each file and its duplicate on separate lines.
-o
Only print paths of files that are duplicates of other files. This corresponds to the path on the left in the default output mode.
-O
Only print paths of files that have at least one duplicate. This corresponds to the path on the right in the default output mode.
-s
Only print paths of files whose size is smaller than or equal to the size of their respective duplicates.
Implies -t.
-S
Only print paths of files whose size is larger than or equal to the size of their respective duplicates.
Implies -t.
-c
Only print paths of files whose inode change time is older than or equal to the time of their respective duplicates.
-C
Only print paths of files whose inode change time is newer than or equal to the time of their respective duplicates.
-m
Only print paths of files whose modification time is older than or equal to the time of their respective duplicates.
-M
Only print paths of files whose modification time is newer than or equal to the time of their respective duplicates.
-v
Only print paths of files whose access time is older than or equal to the time of their respective duplicates.
-V
Only print paths of files whose access time is newer than or equal to the time of their respective duplicates.
-n
Only print paths of files that have no duplicates.
When invoked as findlink, print paths of files whose inode numbers are unique (among the specified files or the files within the traversed directories).
-a
Compare all files, including hidden files, such as Thumbs.db
and Icon?. Also look for files in hidden directories.
-e
Ignore empty files.
-r, -R
Look for duplicates in subdirectories as well.
-B
Only compare binary files.
-T
Only compare text files.
-h
Do not compare files whose inode numbers are identical.
-H
Follow symbolic links on the command line.
This option has no effect on Microsoft Windows.
-L
Follow all symbolic links.
This option has no effect on Microsoft Windows.
-P
Do not follow symbolic links. This is the default.
-I glob
Only compare files matching the pattern glob.
-X glob
Do not compare files matching the pattern glob.
-i
Ignore the case of glob patterns.
-q
Do not print the number of duplicated/unique files or identical hard links. Hide the progress indicator.
-0
Print paths separated by NUL characters; useful for xargs -0.
Implies -o unless an output mode is specified.
--help
Print a synopsis of the command and its options.
--version
Print version information.
The finddup and findlink utilities accept the -- option, which will
cause them to stop processing flag options. This allows you to pass file or
directory names that begin with a dash (-).
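For scripts that consume the output of -d combined with -0, a parser might look like the following Python sketch. It relies only on the format described for the -d option (every path ends with a NUL character, and the last duplicate of each file ends with two); the function name and the sample bytes are hand-written assumptions, not captured output.

```python
def parse_d0(output: bytes):
    """Split -d -0 style output into (path, [duplicates]) records."""
    records = []
    for chunk in output.split(b"\0\0"):
        if not chunk:
            continue  # nothing follows the final double NUL
        path, *duplicates = chunk.split(b"\0")
        records.append((path, duplicates))
    return records

# Hypothetical sample: a.txt has two duplicates, d.txt has one.
sample = b"a.txt\0b.txt\0c.txt\0\0d.txt\0e.txt\0\0"
records = parse_d0(sample)
```

Paths are kept as bytes because file names are not guaranteed to be valid text in any particular encoding.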
These utilities exit 0 on success, 1 if no duplicates were found, and greater than 0 if an error occurs.
For all of these examples, bear in mind that, unless -p is specified, finddup might identify duplicates that are not, in fact, perfect copies, but it will do so much faster than with precise comparison.
In this tutorial, the words duplicates and copies are used interchangeably.
Let’s start by looking for duplicates in the working directory.
finddup
You can also check whether a directory contains duplicates of files in another directory (or vice versa). Note that this command will also find duplicates that are located in the same directory as their original.
finddup dir1 dir2
To simply get a list of duplicates (without the corresponding original file),
call finddup -o dir1 dir2 instead. Provided that dir2 contains copies
of files from dir1, this command will print the paths of the duplicated
files in dir2.
You might want to find out which files are copies of other files.
finddup file1.xyz file2.xyz file3.xyz
The next example shows how to determine which of two files is the original, i.e., the older one of the duplicates, provided that they are perfectly identical.
finddup -pm file1.xyz file2.xyz
It’s easy to pipe the results to another utility, e.g., to delete duplicated files. (The -0 (zero) option implies -o unless another output mode is specified, which comes in handy for a simple operation like this.)
finddup -0 | xargs -0 rm
However, maybe you only want to delete specific files that already exist
somewhere else and leave all other duplicates untouched, if there are any.
This command searches dir recursively, and either does nothing or
removes file.xyz if a duplicate of it exists anywhere in dir. (It
will also try to delete the file more than once if dir contains multiple
copies of it.)
finddup -rO0 file.xyz dir | xargs -0 rm
You could also delete text files that are almost identical but end (or begin) with unnecessary blank lines.
finddup -TS0 | xargs -0 rm
Caution: In the examples above, heuristic comparison was used, which could lead to the removal of files that were not exact copies of any other file but that the utility still regarded as duplicates. Only the precise comparison method can rule out false positives.
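If the deletion is scripted rather than piped to xargs, each candidate can be verified byte for byte before it is removed. The Python sketch below assumes a list of (duplicate, original) pairs obtained in some way; the function name, the pair format, and the dry_run parameter are hypothetical.

```python
import filecmp
import os

def remove_verified(pairs, dry_run=True):
    """Remove each duplicate only if it is byte-identical to its original.

    Returns the paths that were removed (or would be, when dry_run is True).
    """
    removed = []
    for duplicate, original in pairs:
        # Never delete a path that refers to the same file as the original.
        # This also skips hard links, which is deliberately conservative.
        if os.path.samefile(duplicate, original):
            continue
        # shallow=False forces a comparison of the actual contents.
        if not filecmp.cmp(duplicate, original, shallow=False):
            continue
        if not dry_run:
            os.remove(duplicate)
        removed.append(duplicate)
    return removed
```

Running finddup with -p gives the same guarantee without extra scripting; the sketch is mainly useful when the candidate list was produced by a heuristic run.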
You might find yourself in a situation where two or more directories contain the same files except for a few that have been changed (or corrupted). To get a list of these unique files, you can negate the results.
finddup -n dir1 dir2
Similarly, to make sure that the working directory does not contain a copy of a specific file, you can use a command like this.
finddup -n . file.xyz
You can specify which files should be compared or skipped during directory traversal. Let’s say you don’t want backup files to be compared.
finddup -X "*.bak"
You could also, e.g., look for duplicated video and audio files in the working
directory and all its subdirectories recursively. (The pattern in the command
below matches filenames with the extensions mp3, mp4, m4a, m4v,
mkv, etc. The -i option makes patterns case-insensitive.)
finddup -ri -I "*.{mp[34],m?[av]}"
You can even combine inclusion and exclusion patterns. This command compares
all JPEG files except the ones whose filenames contain _thumb.
finddup -ri -I "*.{jpg,jpeg}" -X "*_thumb.*"
Consult the documentation of Text::Glob for a detailed explanation of pattern syntax.
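To get a feel for which filenames the patterns above select, they can be approximated in Python with brace expansion followed by fnmatch. This is only an approximation: Text::Glob may apply rules that the sketch does not reproduce (for example, for leading dots or slashes), and the helper names are assumptions.

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand {a,b,...} groups into plain patterns (no escape handling)."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if not match:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded

def matches(name, pattern, ignore_case=False):
    """Return True if name matches any brace-expanded form of pattern."""
    if ignore_case:
        name, pattern = name.lower(), pattern.lower()
    return any(fnmatch.fnmatchcase(name, p) for p in expand_braces(pattern))

media = "*.{mp[34],m?[av]}"
names = ["song.mp3", "clip.MKV", "notes.txt", "voice.m4a"]
selected = [n for n in names if matches(n, media, ignore_case=True)]
```

Here the selection keeps song.mp3, clip.MKV, and voice.m4a; without ignore_case, clip.MKV would be skipped as well.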
You can find all files that have multiple hard links pointing to them. This command prints the path to each file along with a list of the file’s other hard links.
findlink -rd dir
Since hard links to the same inode are all identical, there cannot be an original hard link, so findlink interprets the first file path it sees as the original.
Although finddup should work on any platform, it has so far only been tested on macOS.
findlink does not work on file systems that don’t support hard links, such as FAT.
diff(1), ln(1), xargs(1), File::Compare, Text::Glob
vbwx (github.com/vbwx)