finddup

NAME

finddup – Finds duplicated files fast and efficiently.

SYNOPSIS

finddup [-aehiqrx0] [-p | -t] [-2 | -4 | -8] [-B | -T] [-H | -L | -P]
        [-d | -l | -o | -O | -s | -S | -c | -C | -m | -M | -v | -V | -n]
        [-I glob] [-X glob] [file ...]

findlink [-aeiqr0] [-B | -T] [-H | -L | -P] [-d | -l | -o | -O | -n]
         [-I glob] [-X glob] [file ...]

DESCRIPTION

finddup compares the contents of files to check if any of them match. What is considered a match depends on the chosen method.

Note that multiple hard links to the same file are considered duplicates unless the -h option is specified.

findlink, on the other hand, finds files that are hard links to the same inode, i.e., files whose inode numbers are identical.

For both utilities, there are various output modes that are mostly useful for subsequent processing of the results.

Non-option arguments may be files or directories, and these utilities treat the two differently: files passed as arguments are checked and compared first, and directories are traversed afterwards. Hence, it does not matter whether files or directories appear first on the command line, but the order of multiple files and the order of multiple directories might affect the results, depending on the output mode.

When invoked without non-option arguments, these utilities look for duplicates or identical hard links in the working directory. When files are passed as arguments, they only look for duplicates or identical hard links of these files.

This manual contains a tutorial.

OPTIONS

Comparison Methods

Comparison Accuracy

Output Modes

Directory Traversal Options

Miscellaneous Options

NOTES

The finddup and findlink utilities accept the -- option, which will cause them to stop processing flag options. This allows you to pass file or directory names that begin with a dash (-).
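
As a minimal illustration of this convention, the sketch below creates a file whose name begins with a dash and reads it back using --. Here cat stands in for finddup so that the snippet runs without finddup installed; the file name -q.txt and the scratch directory are made up for this example.

```shell
# Create a scratch directory holding a file whose name starts with a dash.
# (The name "-q.txt" is hypothetical, chosen only for illustration.)
scratch=$(mktemp -d)
printf 'hello\n' > "$scratch/-q.txt"

# Without "--", "-q.txt" would be parsed as a bundle of flags.
# cat follows the same convention and stands in for finddup here.
content=$(cd "$scratch" && cat -- -q.txt)
echo "$content"

# With finddup installed, the analogous call would be:
#   finddup -- -q.txt
rm -r "$scratch"
```

Prefixing the name with ./ (as in ./-q.txt) is a common alternative in shell scripts, since the argument then no longer begins with a dash.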

EXIT STATUS

These utilities exit 0 on success, 1 if no duplicates were found, and greater than 0 if an error occurs.
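
A script can branch on these values. The following is a sketch, not part of finddup itself: the helper describe_status is hypothetical, and it assumes that 0 means duplicates were found, 1 means none were found, and any other value signals an error. The directory names dir1 and dir2 are placeholders.

```shell
# Hypothetical helper: turn an exit status into a readable message.
# Assumption: 0 = duplicates found, 1 = none found, anything else = error.
describe_status() {
  case "$1" in
    0) echo "duplicates found" ;;
    1) echo "no duplicates found" ;;
    *) echo "error (status $1)" ;;
  esac
}

# Only call finddup if it is actually installed.
if command -v finddup >/dev/null 2>&1; then
  finddup dir1 dir2 >/dev/null
  describe_status "$?"
fi
```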

TUTORIAL

For all of these examples, bear in mind that, unless -p is specified, finddup might report files as duplicates that are not, in fact, perfect copies. In exchange, it runs much faster than it does with precise comparison.

In this tutorial, the words duplicates and copies are used interchangeably.

Finding Duplicates

Let’s start by looking for duplicates in the working directory.

finddup

You can also check whether a directory contains duplicates of files in another directory (or vice versa). Note that this command will also report duplicates where both files are located in the same directory.

finddup dir1 dir2

To simply get a list of duplicates (without the corresponding original file), call finddup -o dir1 dir2 instead. Provided that dir2 contains copies of files from dir1, this command will print the paths of the duplicated files in dir2.

Comparing Files

You might want to find out which files are copies of other files.

finddup file1.xyz file2.xyz file3.xyz

The next example shows how to determine which of two files is the original, i.e., the older one of the duplicates, provided that they are perfectly identical.

finddup -pm file1.xyz file2.xyz

Removing Duplicates

It’s easy to pipe the results to another utility, e.g., to delete duplicated files. (The -0 (zero) option implies -o unless another output mode is specified, which comes in handy for a simple operation like this.)

finddup -0 | xargs -0 rm

However, maybe you only want to delete specific files that already exist somewhere else and leave all other duplicates untouched, if there are any. This command searches dir recursively, and either does nothing or removes file.xyz if a duplicate of it exists anywhere in dir. (It will also try to delete the file more than once if dir contains multiple copies of it.)

finddup -rO0 file.xyz dir | xargs -0 rm

You could also delete text files that are almost identical but end (or begin) with unnecessary blank lines.

finddup -TS0 | xargs -0 rm

Caution: In the examples above, heuristic comparison was used, which could lead to the removal of files that were not exact copies of any other file but that the utility still regarded as duplicates. Only the precise comparison method can rule out false positives.
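
Before deleting anything, it can help to preview what xargs would run by substituting echo rm for rm. The sketch below shows the idea with a stand-in for finddup -0 so that it runs anywhere; the function list_paths and both file names are hypothetical.

```shell
# Stand-in for "finddup -0": emit two NUL-terminated paths.
# (Both file names are hypothetical.)
list_paths() {
  printf '%s\0' 'holiday photo.jpg' 'notes.txt'
}

# Dry run: print the rm command for each path instead of executing it.
# NUL separators keep the name containing a space in one piece.
preview=$(list_paths | xargs -0 -n 1 echo rm)
echo "$preview"
```

With finddup installed, the same preview would be finddup -0 | xargs -0 -n 1 echo rm. Assuming -p can be bundled with -0 in the same way as the other flags in this tutorial, finddup -p0 | xargs -0 rm would restrict the deletion to files found by precise comparison.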

Finding Unique Files

You might find yourself in a situation where two or more directories contain the same files except for a few that have been changed (or corrupted). To get a list of these unique files, you can negate the results.

finddup -n dir1 dir2

Similarly, to make sure that the working directory does not contain a copy of a specific file, you can use a command like this.

finddup -n . file.xyz

Including and Excluding Files

You can specify which files should be compared or skipped during directory traversal. Let’s say you don’t want backup files to be compared.

finddup -X "*.bak"

As another example, you could look for duplicated video and audio files in the working directory and all of its subdirectories. (The pattern in the command below matches filenames with the extensions mp3, mp4, m4a, m4v, mkv, etc. The -i option makes patterns case-insensitive.)

finddup -ri -I "*.{mp[34],m?[av]}"
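
To get a feel for which filenames the pattern above selects, the following sketch approximates it with shell case patterns. The helper is_media is hypothetical, and shell pattern matching is only a stand-in here: Text::Glob may treat some edge cases (such as leading dots or slashes) differently.

```shell
# Hypothetical helper: report whether a file name would match the pattern
# "*.{mp[34],m?[av]}" case-insensitively, using shell "case" patterns.
# The two alternatives in braces become two patterns joined by "|".
is_media() {
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *.mp[34] | *.m?[av]) return 0 ;;
    *) return 1 ;;
  esac
}

for name in song.MP3 clip.m4v film.mkv notes.txt; do
  if is_media "$name"; then
    echo "$name: match"
  else
    echo "$name: no match"
  fi
done
```

Note that m?[av] accepts any single character in the middle position, so extensions such as mov or mka are matched as well.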

You can even combine inclusion and exclusion patterns. This command compares all JPEG files except the ones whose filenames contain _thumb.

finddup -ri -I "*.{jpg,jpeg}" -X "*_thumb.*"

Consult the documentation of Text::Glob for a detailed explanation of pattern syntax.

Finding Hard Links

You can find all files that have multiple hard links pointing to them. This command prints the path to each file along with a list of the file’s other hard links.

findlink -rd dir

Since hard links to the same inode are all identical, there cannot be an original hard link, so findlink interprets the first file path it sees as the original.
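
The difference between a hard link and a copy can be checked with standard tools. The sketch below does not use findlink; it relies on the widely supported -ef test operator, and the helper same_file as well as all file names are hypothetical.

```shell
# Hypothetical helper: succeed if both paths refer to the same file
# (same device and inode). Note that -ef also follows symbolic links.
same_file() {
  [ "$1" -ef "$2" ]
}

scratch=$(mktemp -d)
printf 'data\n' > "$scratch/original.txt"
ln "$scratch/original.txt" "$scratch/link.txt"   # hard link: same inode
cp "$scratch/original.txt" "$scratch/copy.txt"   # copy: new inode

if same_file "$scratch/original.txt" "$scratch/link.txt"; then
  link_result=same
else
  link_result=different
fi
if same_file "$scratch/original.txt" "$scratch/copy.txt"; then
  copy_result=same
else
  copy_result=different
fi
echo "link: $link_result, copy: $copy_result"
rm -r "$scratch"
```

The copy has identical contents, so finddup would treat it as a duplicate, but it occupies its own inode and is therefore not a hard link to the original.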

CAVEATS

Although finddup should work on any platform, it has so far only been tested on macOS.

findlink does not work on file systems that don’t support hard links, such as FAT.

SEE ALSO

diff(1), ln(1), xargs(1), File::Compare, Text::Glob

AUTHORS

vbwx (github.com/vbwx)