finddup

NAME

finddup – Finds duplicated files quickly and efficiently.

SYNOPSIS

finddup [-d | -l | -o | -O | -s | -S | -c | -C | -m | -M | -v | -V | -n] [-aehiqr0] [-p | -t] [-B | -T] [-H | -L | -P] [-I glob] [-X glob] [file …]

DESCRIPTION

This utility compares the contents of files to check if any of them match. What is considered a match depends on the chosen method.

Note that multiple hard links to the same file are considered duplicates unless the -h option is specified.
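To make the distinction between a hard link and a copy concrete, here is a minimal sketch in plain shell that does not call finddup. It assumes a shell whose test builtin supports the -ef operator (bash and dash do); the helper name same_file and the file names are made up for illustration.

```shell
# Create a file, a hard link to it, and an independent copy.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/original"
ln "$tmp/original" "$tmp/hardlink"   # second name for the same file
cp "$tmp/original" "$tmp/copy"       # separate file with equal contents

# same_file A B (hypothetical helper): succeeds if both paths
# refer to the same underlying file.
same_file() { [ "$1" -ef "$2" ]; }

if same_file "$tmp/original" "$tmp/hardlink"; then
    echo "hardlink: same file as original"
fi
if ! same_file "$tmp/original" "$tmp/copy"; then
    echo "copy: separate file with the same contents"
fi
```

Both hardlink and copy have the same contents as original; only the hard link shares its storage, which is the case the -h option addresses.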

There are various output modes that are mostly useful for subsequent processing of the results.

finddup treats file and directory arguments differently: files passed as arguments are checked and compared first, and directories are traversed afterwards. Hence, while it does not matter whether files or directories appear first on the command line, the order of multiple files and the order of multiple directories might affect the results, depending on the output mode.

When invoked without non-option arguments, finddup looks for duplicates in the working directory. When files are passed as arguments, finddup only looks for duplicates of these files.

This manual contains a tutorial.

OPTIONS

Comparison Methods

Output Modes

Directory Traversal Options

Miscellaneous Options

TUTORIAL

Bear in mind for all of these examples that, unless -p is specified, finddup might report files as duplicates that are not, in fact, identical. This is the trade-off between precision and speed of operation.

In this tutorial, the words duplicates and copies are used interchangeably.

Finding Duplicates

Let’s start by looking for duplicates in the working directory.

finddup

You can also check whether a directory contains duplicates of files in another directory (or vice versa). Note that this command will also report duplicates whose copies both reside in the same directory.

finddup dir1 dir2

To simply get a list of duplicates (without the corresponding original file), call finddup -o dir1 dir2 instead. Provided that dir2 contains copies of files from dir1, this command will print the paths of the duplicated files in dir2.

Comparing Files

You might want to find out which files are copies of other files.

finddup file1.xyz file2.xyz file3.xyz

The next example shows how to determine which of two files is the original, i.e., the older one of the duplicates, provided that they are perfectly identical.

finddup -pm file1.xyz file2.xyz
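For comparison, the same question can be answered by hand for a single pair. The sketch below does not call finddup and assumes that "older" refers to the modification time, which the text above does not state explicitly. It relies on the -ot operator of the test builtin (available in bash and dash); the helper name older_of is hypothetical.

```shell
tmp=$(mktemp -d)
printf 'same\n' > "$tmp/first"
printf 'same\n' > "$tmp/second"
touch -t 202001010000 "$tmp/first"    # older modification time
touch -t 202101010000 "$tmp/second"   # newer modification time

# older_of A B (hypothetical helper): if A and B are byte-for-byte
# identical, print the one with the older modification time.
# When neither is older than the other, B is printed.
older_of() {
    if ! cmp -s "$1" "$2"; then
        echo "not identical" >&2
        return 1
    fi
    if [ "$1" -ot "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

older_of "$tmp/first" "$tmp/second"
```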

Removing Duplicates

It’s easy to pipe the results to another utility, e.g., to delete duplicated files. (The -0 (zero) option implies -o unless another output mode is specified, which comes in handy for a simple operation like this.)

finddup -0 | xargs -0 rm
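NUL-separated output matters because file names may contain spaces or newlines, which would otherwise be split into several arguments. The sketch below shows the pipeline pattern with a dry run before the actual removal. It does not call finddup: the function list_duplicates is a hypothetical stand-in that prints NUL-terminated paths, which is what the pairing with xargs -0 suggests finddup -0 emits.

```shell
tmp=$(mktemp -d)
touch "$tmp/plain.txt" "$tmp/with space.txt" "$tmp/keep.txt"

# list_duplicates: hypothetical stand-in for `finddup -0`;
# prints each path followed by a NUL byte.
list_duplicates() {
    printf '%s\0' "$tmp/plain.txt" "$tmp/with space.txt"
}

# Dry run: show what would be removed without deleting anything.
list_duplicates | xargs -0 -n 1 echo "would remove:"

# Actual removal; "--" keeps rm from treating odd names as options.
list_duplicates | xargs -0 rm --
```

Running the dry-run line first is a cheap way to review the list before anything is deleted.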

However, you may only want to delete specific files that already exist somewhere else and leave any other duplicates untouched. This command searches dir recursively and removes file.xyz if a duplicate of it exists anywhere in dir; otherwise it does nothing. (It will also try to delete the file more than once if dir contains multiple copies of it.)

finddup -rO0 file.xyz dir | xargs -0 rm

You could also delete text files that are almost identical but end (or begin) with unnecessary blank lines.

finddup -TS0 | xargs -0 rm

Caution: In the examples above, heuristic comparison was used, which could lead to the removal of files that were not exact copies of any other file but that the utility still regarded as duplicates. Only the precise comparison method can rule out false positives.
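When a candidate pair is already known, one way to guard against such false positives is to confirm byte-for-byte equality with cmp(1) before deleting. This is only a sketch that does not call finddup; the function name remove_if_identical and the file names are hypothetical.

```shell
# remove_if_identical ORIGINAL CANDIDATE (hypothetical helper):
# delete CANDIDATE only if it is byte-for-byte identical to ORIGINAL
# and is not merely another name for the same file.
remove_if_identical() {
    if [ "$1" -ef "$2" ]; then
        echo "skipped (same file): $2"
    elif cmp -s "$1" "$2"; then
        rm -- "$2"
        echo "removed: $2"
    else
        echo "kept (contents differ): $2"
    fi
}

tmp=$(mktemp -d)
printf 'data\n'   > "$tmp/original"
printf 'data\n'   > "$tmp/exact_copy"
printf 'data\n\n' > "$tmp/near_copy"     # extra trailing blank line

remove_if_identical "$tmp/original" "$tmp/exact_copy"
remove_if_identical "$tmp/original" "$tmp/near_copy"
```

The near copy survives because its contents differ by one trailing blank line, which is exactly the kind of file a non-precise comparison may treat as a duplicate.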

Finding Unique Files

You might find yourself in a situation where two or more directories contain the same files except for a few that have been changed (or corrupted). To get a list of these unique files, you can negate the results.

finddup -n dir1 dir2

Similarly, to make sure that the working directory does not contain a copy of a specific file, you can use a command like this.

finddup -n . file.xyz

Including and Excluding Files

You can specify which files should be compared or skipped during directory traversal. Let’s say you don’t want backup files to be compared.

finddup -X "*.bak"

You could also look for duplicated video and audio files in the working directory and, recursively, all its subdirectories. (The pattern in the command below matches filenames with the extensions mp3, mp4, m4a, m4v, mkv, etc. The -i option makes patterns case-insensitive.)

finddup -ri -I "*.{mp[34],m?[av]}"
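To get a feel for which names the pattern above accepts, here is a rough approximation in plain shell. It assumes that the braces act as alternation between *.mp[34] and *.m?[av] and that -i amounts to ignoring letter case. The function name is_media is made up, and shell case patterns are only a stand-in for Text::Glob, whose treatment of leading dots and slashes may differ.

```shell
# is_media NAME (hypothetical helper): succeed if NAME matches
# *.mp[34] or *.m?[av], ignoring letter case.
is_media() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $name in
        *.mp[34] | *.m?[av]) return 0 ;;
        *) return 1 ;;
    esac
}

for f in song.mp3 clip.MP4 voice.m4a film.mkv trailer.mov notes.txt cover.mpeg; do
    if is_media "$f"; then
        echo "included: $f"
    else
        echo "skipped:  $f"
    fi
done
```

Note that the second alternative is fairly broad: any three-character extension that starts with m and ends in a or v qualifies, so mov is included as well.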

You can even combine inclusion and exclusion patterns. This command compares all JPEG files except the ones whose filenames contain _thumb immediately before a dot (e.g., photo_thumb.jpg).

finddup -ri -I "*.{jpg,jpeg}" -X "*_thumb.*"

Consult the documentation of Text::Glob for a detailed explanation of pattern syntax.

CAVEATS

Although finddup should work on any platform, it has so far only been tested on macOS.

SEE ALSO

diff(1), xargs(1), File::Compare, Text::Glob

AUTHORS

Bernhard Waldbrunner (github.com/vbwx)