


===> Dupseek <===


An interactive command-line Perl program to find and remove duplicate files.


--- Algorithm ---

A few strategies are possible for finding duplicate files in a big set, such as
a heavily populated directory.
One of the most widely used consists of grouping files by size (because files
of different sizes can't be identical) and then computing a short digital
fingerprint (such as an MD5 checksum) for the files. Files with a different
fingerprint are different, and files with the same digital fingerprint are very
probably the same. Just to be sure, one can then compare the possible
duplicates directly.
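
For comparison, the size-plus-fingerprint strategy described above can be
sketched as follows. This is a minimal illustration in Python, not code taken
from any particular tool; the function names and the block size are
assumptions chosen for the example.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, block_size=65536):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def candidates_by_checksum(paths):
    """Group paths by size, then by MD5; keep groups with two or more files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[md5_of(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

Note that every file sharing its size with another one is read in full, even
if it differs from the others in its first few bytes.
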
Dupseek does something different:
    * It starts by grouping files by size.
    * Then it starts reading small chunks of the files of the same size and
      comparing them. It creates smaller groups depending on these comparisons.
    * It goes on with bigger and bigger chunks (of size up to a hard-coded
      limit).
    * It stops reading from files as soon as they form a single-element group
      or they are read completely (which only happens when they have a very
      high probability of having duplicates).
This approach can be much more efficient than checksum-based tools when dealing
with large files of the same size: when files differ, reading usually stops
after very few reads, whereas a checksum requires reading each file in full.
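
The steps above can be sketched roughly as follows. This is an illustrative
Python sketch, not dupseek's Perl source: the names refine and
find_duplicates, the starting chunk size and the chunk limit are all
assumptions, and the sketch reads every byte in order and keeps all files of
a group open at once, which is a simplification.

```python
import os
from collections import defaultdict

FIRST_CHUNK = 1024        # assumed starting chunk size, in bytes
MAX_CHUNK = 1024 * 1024   # assumed upper limit for chunk growth


def refine(paths, size):
    """Split same-size files into groups whose contents are identical."""
    handles = {path: open(path, "rb") for path in paths}
    try:
        pending = [list(paths)]  # groups that still need comparing
        offset, chunk = 0, FIRST_CHUNK
        while pending and offset < size:
            next_pending = []
            for group in pending:
                # Files whose next chunk matches stay in the same group.
                buckets = defaultdict(list)
                for path in group:
                    buckets[handles[path].read(chunk)].append(path)
                # Single-element groups are dropped: they are not read again.
                next_pending.extend(g for g in buckets.values() if len(g) > 1)
            pending = next_pending
            offset += chunk
            chunk = min(chunk * 2, MAX_CHUNK)  # bigger and bigger chunks
        return pending
    finally:
        for handle in handles.values():
            handle.close()


def find_duplicates(paths):
    """Group files by size, then refine each group by chunked comparison."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    duplicates = []
    for size, group in by_size.items():
        if len(group) > 1:
            duplicates.extend(refine(group, size))
    return duplicates
```

Only groups that survive every round are read to the end, so a file that
differs from all the others early on costs a single small read.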


--- Partial execution ---

Dupseek (and destroy) can be interrupted at any moment. The user is then
presented with partial results and can either intervene manually or go on with
the reading and computation, on a group-by-group basis. Since subsequent reads
happen sparsely in the file, if some files are still in the same group after
many iterations, they are most probably identical, unless the differences are
very small.
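
The general idea of stopping early and keeping partial results can be
sketched like this. It is a hypothetical Python illustration, not dupseek's
own interrupt handling: run_interruptible and the refine_once callback are
names invented for the example.

```python
def run_interruptible(groups, refine_once):
    """Apply refine_once repeatedly; on Ctrl-C return the groups so far.

    refine_once is a hypothetical callback that takes the current list of
    groups and returns (new_groups, done).
    """
    done = False
    try:
        while not done:
            groups, done = refine_once(groups)
    except KeyboardInterrupt:
        # Keep the last fully computed grouping as the partial result.
        pass
    return groups, done
```

When done comes back False, the returned groups are only candidates: their
members matched on everything read so far, but not necessarily in full.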


--- Platforms ---

Dupseek was reported to run on the following platforms:
    * Debian GNU/Linux "Woody" and "Sarge"
    * Mac OS X v10.2.6
    * FreeBSD 4.7


--- Dependencies ---

Dupseek was developed with Perl 5.6.1 and was also tested with Perl 5.8.4. It
relies on the following modules:
    * File::Find (directory recursion)
    * IO::File (object-oriented file handles)
    * Getopt::Std (option parsing)


--- License ---

Dupseek (and destroy) is Copyright Antonio Bellezza 2003-2005. It is released
under the GPL v2. Here is the license notice:

  ---------------------------------------------------------------------------

      This program is free software; you can redistribute it and/or modify
       it under the terms of version 2 of the GNU General Public License
                 as published by the Free Software Foundation;
        This program is distributed in the hope that it will be useful,
         but WITHOUT ANY WARRANTY; without even the implied warranty of
         MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
                  GNU General Public License for more details.

  ---------------------------------------------------------------------------



--- Beware ---

The program destroys files. Starting from version 1.1, it can also do so
automatically, and mistakes can happen, on the user's part or the
programmer's. So be warned!


--- Usage ---

Running dupseek -h prints a help page.
Hit Ctrl-C to interrupt interactive execution and be presented with partial
results.


--- Credits ---

I would like to thank:
    * Mike Depot for a patch implementing minimum size for files.
    * Glenn Powers for extensive testing on Mac OS X and pointing out the
      problem with changing files/directories.
    * Henry Laxen for sending me his patch implementing batch processing and
      option parsing (see credits.txt).


--- Download ---

The latest version is
                      Dupseek version 1.2 (March 7, 2005)
The file is a tgz archive, containing development files and the stand-alone
program dupseek (which is the only file you need as a user).
You can also download the older releases
                      Dupseek version 1.1 (June 27, 2003)
                      Dupseek version 1.0 (June 6, 2003)


--- Bugs ---

    * If a directory is entered twice, or is contained in another one, then all
      its files are found twice and identified as duplicates. This can be a
      VERY DANGEROUS SITUATION, since removing such a "duplicate" removes the
      only copy of the file.
    * Dupseek gets confused if files are modified/moved while it's working.
      Starting from version 1.1, you should avoid making any changes to the
      folders you are checking while dupseek is running.
    * No testing was carried out on platforms other than those listed above.
      Please send me some feedback if you are brave enough to use dupseek on
      a different OS.
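
As an illustration of the first item, one possible guard is to drop paths that
refer to a file already seen before any comparison takes place. The Python
sketch below is an assumption about how this could be done on POSIX systems,
where the device and inode numbers identify a file; nothing in this document
says dupseek does it, and unique_files is a name invented for the example.

```python
import os


def unique_files(paths):
    """Drop paths that refer to a file already seen (same device and inode)."""
    seen = set()
    unique = []
    for path in paths:
        info = os.stat(path)
        key = (info.st_dev, info.st_ino)  # identifies the file on POSIX
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

With this check, hard links to the same file also count as a single file
rather than as duplicates of each other.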


--- Further work ---

If I had more spare time, I would like to add a graphical user interface,
possibly managed by a different process or thread, allowing interaction while
the program is running without the need to interrupt the main loop. Since the
program works well enough for my needs now, I will probably leave it as-is, but
any contribution is welcome.
