APLawrence.com -  Resources for Unix and Linux Systems, Bloggers and the self-employed

Removing duplicate files

This is a Perl script to remove duplicate files. It considers a file to be a duplicate if it has the same name and the same number of bytes. THAT COULD BE A VERY BAD ASSUMPTION.

You could do an MD5 sum to be sure the files really are dupes, or even compare byte by byte, but THIS SCRIPT doesn't do any of that. It's designed to be quick, simple, and easy for you to modify if you need something a little different.
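If you do want that extra safety, the standard Digest::MD5 module makes a checksum comparison short. Here's a sketch of a hypothetical helper - "same_md5" is my name for it, nothing in the script below - that you could call on two candidate files before queueing one for deletion:

```perl
# Sketch only: confirm that two candidate files really have identical
# contents before treating one as a duplicate.
use Digest::MD5;

sub md5_of {
    my ($file) = @_;
    open(my $fh, '<', $file) or return undef;
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

sub same_md5 {
    my ($a, $b) = @_;
    my ($da, $db) = (md5_of($a), md5_of($b));
    return defined $da && defined $db && $da eq $db;
}
```

One wrinkle: File::Find changes directories as it walks, so the easiest place to do this comparison is after find() returns, when the paths you saved from $File::Find::name are valid again relative to your starting directory.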

This script keeps the OLDEST instance of the files. It's simple enough to change it to keep the newest; just reverse the sense of the test.
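Concretely, the "sense of the test" is the comparison against the saved mtime. A tiny standalone sketch (the key and the ages are made up for illustration):

```perl
# Three mtimes seen for the same size+name key, in arbitrary order.
my (%oldest, %newest);
for my $age (300, 100, 200) {
    my $key = "samefile";
    $oldest{$key} = $age if not defined $oldest{$key};
    $oldest{$key} = $age if $oldest{$key} > $age;   # the script's test: keep oldest
    $newest{$key} = $age if not defined $newest{$key};
    $newest{$key} = $age if $newest{$key} < $age;   # reversed test: keep newest
}
# %oldest now remembers 100, %newest remembers 300
```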

It's possible for two files to have the same age. In that case, the script won't delete either of them unless some other matching file is older.

I just dashed this off quickly this morning so CHECK THE RESULTS carefully before you uncomment the "unlink" line.

Or leave the "unlink" commented out and just redirect the output to a file (dupekill > list.txt) for manual review and removal. That's the safest way - if you see anything odd in the list, don't remove it until you are sure it's OK to do so.

It would be easy enough to modify this to have it ignore files older than a certain age; you might do that to avoid deleting important system files. You could exclude files owned by root, or add any other conditions that make sense to you.
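For example, here's one way you might write such a guard. The name "too_risky" and the 30-day cutoff are my inventions for illustration, not part of the script:

```perl
# Sketch of an extra safety guard; the name and the 30-day cutoff are
# arbitrary examples.  Call "return if too_risky($_);" at the top of
# the walking and killing subs to skip these files entirely.
my $cutoff = time() - 30 * 24 * 60 * 60;   # 30 days ago

sub too_risky {
    my ($file) = @_;
    my @st = stat($file) or return 1;   # can't stat it? leave it alone
    return 1 if $st[4] == 0;            # owned by root
    return 1 if $st[9] < $cutoff;       # older than the cutoff
    return 0;
}
```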

This works from the current directory down. I'd warn you to be very careful running this from "/" without adding extra restrictions.

A SCRIPT LIKE THIS COULD BE VERY DANGEROUS. Don't use it without understanding it or without testing it.

Note: the "sprintf" zero-pads the size to a fixed twelve digits before appending the name, so a file name that happens to begin with digits can't run together with the size and accidentally produce the same key as another file. That's extremely unlikely, but that's why it's there. You could also generate the key from size, MD5 hash, name and anything else you wanted to add (owner, date range, whatever you need).
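To see what the padding buys you, consider two different files whose size and name happen to run together the same way (the sizes and names here are made up for the demonstration):

```perl
# Without padding, size 12 + name "34file" and size 123 + name "4file"
# both collapse to the same key "1234file":
my $unpadded_a = 12  . "34file";
my $unpadded_b = 123 . "4file";
# With a fixed 12-digit size field the keys stay distinct, because the
# first 12 characters are always the size:
my $padded_a = sprintf("%.12d%s", 12,  "34file");
my $padded_b = sprintf("%.12d%s", 123, "4file");
print "unpadded collide: ", ($unpadded_a eq $unpadded_b ? "yes" : "no"), "\n";
print "padded collide:   ", ($padded_a   eq $padded_b   ? "yes" : "no"), "\n";
```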

If you Google for "delete duplicate files" you will find lots of scripts and programs for this purpose. One of them may be exactly what you need. See Remove Duplicate Files for a version that does use Digest::MD5.


See My Technical Support and Service Rates if you need help modifying this for a specific need.

#!/usr/bin/perl
# dupekill - Tony Lawrence, http://aplawrence.com/Unixart/remove_duplicate_files.html
# Purpose: kill duplicate files
# Keep oldest version, check name and file size
#
# Feel free to copy this, modify it, use it - with or without credit
# WARNING:  this is potentially very dangerous.  Test, understand, test.
# See the web page above for enhancements and more warnings

use strict;
use warnings;
use File::Find;
use File::Basename;

my (%myfiles, %counts, @killem);

# First pass: record the oldest mtime seen for each size+name key
sub walking {
    return unless -f $_;             # plain files only; skip directories
    my $size = -s _;
    my $age  = (stat(_))[9];         # mtime
    my $name = basename($File::Find::name);
    my $key  = sprintf("%.12d%s", $size, $name);
    $myfiles{$key} = $age if not defined $myfiles{$key};
    $myfiles{$key} = $age if $myfiles{$key} > $age;
    $counts{$key}++;
}

# Second pass: queue every copy that is newer than the oldest
sub killing {
    return unless -f $_;
    my $size     = -s _;
    my $age      = (stat(_))[9];
    my $realname = $File::Find::name;
    my $name     = basename($realname);
    my $key      = sprintf("%.12d%s", $size, $name);
    return if $counts{$key} < 2;     # nothing else matched this key
    if ($myfiles{$key} == $age) {
        # print "Keeping $realname $age " . scalar(localtime($age)) . "\n";
        # uncomment above for testing
        return;
    }
    # you could do more tests here like Digest::MD5 or even
    # a byte by byte comparison,
    # or exclude files owned by root, over a certain age, whatever
    push @killem, $realname;
    # if you want to open the file here rather than just record it, use $_
    # instead of $realname - File::Find changes directories as it walks
}

find(\&walking, '.');
find(\&killing, '.');
foreach (@killem) {
    print "$_\n";
    # unlink($_);
    # uncomment the line above to actually remove files
}
 

A free GUI dupe checker (with source) that runs on Mac, Linux and Windows is dupeGuru.



Got something to add? Send me email.







12 comments






© Anthony Lawrence







Mon Apr 13 11:54:33 2009: 6151   TonyLawrence

See (link) and (link) also - those posts describe a more common need and solution.





Tue Apr 14 05:07:13 2009: 6168   anonymous

What a lame script. Why bother?

(link)



Tue Apr 14 10:23:54 2009: 6169   TonyLawrence

Why bother with anything?

These things come up, and some people need a boost getting started with scripting a solution.

That's why.

Is that OK with you?



Fri Apr 17 21:56:52 2009: 6218   badanov

Why bother?

Try the simplicity yet elegance of perl in performing routine and rote tasks, even in a Windows environment.

But back to unix. Couldn't you do this a little faster (by a few milliseconds) using just md5 (available in BSD unix as a compiled executable, or as a compilable executable any number of places on the web), piping the result into a searchable text list?

Just askin'

Anyway, cool script.



Fri Apr 17 22:36:24 2009: 6219   TonyLawrence

Yes - that's why I mentioned it. But remember that MD5 isn't absolute either.



Sun Apr 19 07:08:11 2009: 6228   badanov

But remember that MD5 isn't absolute either.

And yet, even BSD websites which feature ISOs of various BSD distributions still use md5 hashes as a checksum. I realize the ISOs are so large that a collision, and therefore an attack, is a lot less likely, but what is the likelihood that using md5 in an internal (as in never being available to a network) script will cause a security breach? Such an application is surely a secure method of encoding material.

I routinely use md5 hashes (as an aside) in some security programming. (I use md5 to provide a unique id for cookies, for example.) I am fairly certain, given the wide variation of the inputs, that the hash is unbreakable unless the server is subject to attacks in a lab setting.

Not arguing against your method. I just think md5, given its ubiquity and its continued use and availability in FreeBSD, still has a place in security programming, even if it may not be useful for things like financial and military applications or in Internet applications where finances may be involved.

Another thing: Isn't md5 used as the security basis for the Digest logins in the Apache web server?

My only beef with the output from SHA-2 is that its output is only 40 characters long.



Sun Apr 19 10:43:19 2009: 6230   TonyLawrence

I don't disagree at all. Again, that's why I mentioned it. The purpose here was to solve a specific problem for someone where we knew what caused the dupes and knew that we didn't need MD5. The result is simple code that can be adapted to any other need easily. Think of it as a framework for someone new to Perl rather than a prescription for removing files.



Fri Dec 18 21:19:15 2009: 7766   TonyLawrence



A Windows user recommended AcuteFinder


(link)

Their problem was thousands of duplicate email messages:

see (link)



Fri Mar 26 13:52:04 2010: 8276   Jade



I use Duplicate Finder from Ashisoft to remove duplicate files from drive.



Sat May 22 11:40:22 2010: 8624   binaryman



Directory Report is a good general purpose program for finding duplicate files
(link) (link)



Fri Jun 28 16:32:36 2013: 12170   dupes



I use this free tool to (link) Find Similar Files
Give it a try...it provides impressively good results.






Fri Jun 28 16:48:16 2013: 12171   Tony



Yet another person who doesn't bother to read other folks' comments.



















