


Removing duplicate files


2009/04/13



This is a Perl script to remove duplicate files. It considers a file to be a duplicate if it has the same name and the same number of bytes. THAT COULD BE A VERY BAD ASSUMPTION.

You could compute an MD5 sum to be more confident that the files really are dupes, or even compare them byte by byte, but THIS SCRIPT doesn't do any of that. It's designed to be quick, simple and easy for you to modify if you need something a little different.
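If you did want that extra certainty, a minimal sketch might look like this. It assumes the core Digest::MD5 and File::Compare modules are installed; the helper names md5_of and same_bytes are made up for illustration and are not part of the script below.

```perl
#!/usr/bin/perl
# Sketch only: md5_of() and same_bytes() are hypothetical helper names.
use strict;
use warnings;
use Digest::MD5;
use File::Compare;

# Return the MD5 hex digest of a file, or undef if it can't be opened.
sub md5_of {
    my ($path) = @_;
    open(my $fh, '<', $path) or return undef;
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# True only if both files could be read and are identical byte for byte.
# File::Compare's compare() returns 0 for equal, 1 for different, -1 on error.
sub same_bytes {
    my ($first, $second) = @_;
    return compare($first, $second) == 0;
}
```

A cheap way to combine them is to compare digests first and only fall back to same_bytes when the digests match.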

This script keeps the OLDEST instance of the files. It's simple enough to change it to keep the newest; just reverse the sense of the test.
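As a rough illustration of reversing that test, here is a small standalone sketch. The remember function and its $keep_newest switch are assumptions added for the example; the script itself has no such switch and always keeps the oldest.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: record in %$kept the age we intend to keep for $key.
# With $keep_newest false this mirrors the script (oldest wins);
# with it true the comparison is simply reversed (newest wins).
sub remember {
    my ($kept, $key, $age, $keep_newest) = @_;
    if (not exists $kept->{$key}) {
        $kept->{$key} = $age;
    } elsif ($keep_newest ? $kept->{$key} < $age : $kept->{$key} > $age) {
        $kept->{$key} = $age;
    }
    return $kept->{$key};
}

my (%oldest, %newest);
for my $mtime (200, 100, 300) {
    remember(\%oldest, 'k', $mtime, 0);
    remember(\%newest, 'k', $mtime, 1);
}
print "oldest kept: $oldest{k}, newest kept: $newest{k}\n";
# prints "oldest kept: 100, newest kept: 300"
```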

It's possible for two files to have the same age. In that case, the script won't delete either of them unless some other matching file is older.

I just dashed this off quickly this morning so CHECK THE RESULTS carefully before you uncomment the "unlink" line.

Or leave it commented out and just redirect the output to a list for manual review and removal. That's the safest way - if you see anything odd in the list, don't remove it until you are sure it's OK to do so.
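Once you have reviewed such a list, something along these lines could do the removal. This is only a sketch: delete_listed is a hypothetical helper, and it assumes the list holds one path per line exactly as the script printed it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: delete every path named in an already-reviewed list
# file, one path per line. Returns the number of files actually removed.
sub delete_listed {
    my ($listfile) = @_;
    open(my $list, '<', $listfile) or die "Can't read $listfile: $!";
    my $removed = 0;
    while (my $path = <$list>) {
        chomp $path;
        next if $path eq '';                 # skip blank lines
        if (unlink $path) {
            $removed++;
        } else {
            warn "Could not remove $path: $!\n";
        }
    }
    close($list);
    return $removed;
}
```

The paths the script prints are relative to the directory it was started from, so a call such as delete_listed('dupes.txt') (file name assumed) has to run from that same directory.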

It would be easy enough to modify this to have it ignore files older than a certain age; you might do that to avoid deleting important system files. You could exclude files owned by root, or add any other conditions that make sense to you.
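One way to express conditions like those is a small predicate that looks at stat() fields. The sketch below is an assumption about how you might structure it; deletable and its 30-day style cutoff parameter are illustrative names, not something the script defines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical predicate: may this file be deleted?
# $uid and $mtime are fields 4 and 9 of stat(); $max_age_days is an
# assumed cutoff; $now is optional and defaults to the current time.
sub deletable {
    my ($uid, $mtime, $max_age_days, $now) = @_;
    $now = time unless defined $now;
    return 0 if $uid == 0;                                 # never touch root's files
    return 0 if ($now - $mtime) > $max_age_days * 86400;   # too old: leave it alone
    return 1;
}
```

Inside the killing sub you could then bail out early with something like "return unless deletable((stat(_))[4,9], 30);" before pushing the name onto the list.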

This works from the current directory down. I'd warn you to be very careful running this from "/" without adding extra restrictions.

A SCRIPT LIKE THIS COULD BE VERY DANGEROUS. Don't use it without understanding it or without testing it.

Note: the "sprintf" pads the size to a fixed width so that a file name that begins with digits can't combine with a different size to produce the same key as another file. That's extremely unlikely, but that's why it's there. You could also generate the key from size, MD5 hash, name and anything else you wanted to add (owner, date range, whatever you need).
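To see the kind of collision the padding prevents, consider this small sketch. The two files are invented for the example, and naive_key and padded_key are hypothetical names; only the sprintf format is taken from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Plain concatenation of size and name: ambiguous when the name starts with digits.
sub naive_key {
    my ($size, $name) = @_;
    return $size . $name;
}

# Same format string as the script: size zero-padded to 12 digits, then the name.
sub padded_key {
    my ($size, $name) = @_;
    return sprintf("%.12d%s", $size, $name);
}

# Two different files: 12 bytes named "3file" and 123 bytes named "file".
for my $file ([12, '3file'], [123, 'file']) {
    my ($size, $name) = @$file;
    printf "%-8s %s\n", naive_key($size, $name), padded_key($size, $name);
}
# prints:
# 123file  0000000000123file
# 123file  000000000123file
```

The naive keys are identical, so the two files would be treated as duplicates; the padded keys differ.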

If you Google for "delete duplicate files" you will find lots of scripts and programs for this purpose. One of them may be exactly what you need. See Remove Duplicate Files for a version that does use MD5 Digest.


#!/usr/bin/perl
# dupekill - Tony Lawrence, http://aplawrence.com/Unixart/remove_duplicate_files.html
# Purpose: kill duplicate files
# Keep oldest version, check name and file size
# 
# Feel free to copy this, modify it, use it - with or without credit
# WARNING:  this is potentially very dangerous.  Test, understand, test.
# See the web page above for enhancements and more warnings

use File::Find;
use File::Basename;

sub walking {
  # Stat the file ourselves; File::Find only promises a valid "_" with its follow options.
  return if -l $_ or not -f _;   # plain files only: skip symlinks, directories, etc.
  $size= -s _;
  $age=(stat(_))[9];             # modification time
  $name=basename($File::Find::name);
  $key=sprintf("%.12d%s",$size,$name);
  # remember the oldest modification time seen for this name and size
  $myfiles{$key}=$age if not exists $myfiles{$key} or $myfiles{$key} > $age;
  $counts{$key}++;
}

sub killing {
  return if -l $_ or not -f _;   # same test as in walking
  $size= -s _;
  $age=(stat(_))[9];
  $name=basename($File::Find::name);
  $realname=$File::Find::name;
  $key=sprintf("%.12d%s",$size,$name);
  return if $counts{$key} < 2;
  $date=scalar localtime($age);
  if ($myfiles{$key} == $age) {
    #print "Keeping $realname $age  $date\n";
    # uncomment above for testing
    return;
  }
  # you could do more tests here like Digest::MD5 or even
  # a byte by byte comparison
  # or exclude files owned by root, over a certain age, whatever
  #
  push @killem, $realname;
  # if you want to actually touch the file here, use $_ rather than $realname
  # because File::Find changes directories as it walks
}


find (\&walking, '.');
find (\&killing, '.');
foreach(@killem) {
  print "$_\n";
  #unlink($_);
# uncomment above line to actually remove files.
}
 








Mon Apr 13 11:54:33 2009: Subject:   TonyLawrence

See http://aplawrence.com/Basics/undo-bad-archive.html and http://aplawrence.com/foo-mac/duplicate-files.html also - those posts describe a more common need and solution.





Tue Apr 14 05:07:13 2009: Subject:   anonymous

What a lame script. Why bother?

http://www.mindgems.com/products/Fast-Duplicate-File-Finder/images/screenshots/main.png



Tue Apr 14 10:23:54 2009: Subject:   TonyLawrence

Why bother with anything?

These things come up, and some people need a boost getting started with scripting a solution.

That's why.

Is that OK with you?



Fri Apr 17 21:56:52 2009: Subject: perl   badanov
http://www.freefirezone.org
Why bother?

Try the simplicity yet eloquence of perl in performing routine and rote tasks even in a Windows environment.

But back to unix. Couldn't you do this a little faster (by a few milliseconds) using just md5 (available in BSD unix as a compiled executable, or as compilable source any number of places on the web), piping the result into a searchable text list?

Just askin'

Anyway, cool script.



Fri Apr 17 22:36:24 2009: Subject:   TonyLawrence

Yes - that's why I mentioned it. But remember that MD5 isn't absolute either.



Sun Apr 19 07:08:11 2009: Subject: md5   badanov
http://www.freefirezone.org
But remember that MD5 isn't absolute either.

And yet, even BSD websites which feature ISOs of various BSD distributions still use md5 hashes as a checksum. I realize that the ISOs are so large that a collision, and therefore an attack, is a lot less likely, but what is the likelihood that using md5 in an internal (as in never being available to a network) script will cause a security breach? Such an application is surely a secure method of encoding material.

I routinely use md5 hashes (as an aside) in some security programming (I use md5 to provide a unique id for cookies, for example), and I am fairly certain that, given the wide variation of the inputs, the hash is unbreakable unless the server is subject to attacks in a lab setting.

Not arguing against your method. I just think md5, given its ubiquity and its continued use and availability in FreeBSD, still has a place in security programming, even though it may not be useful for things like financial and military applications or Internet applications where finances may be involved.

Another thing: Isn't md5 used as the security basis for the Digest logins in the Apache web server?

My only beef with the output from SHA-2 is that its output is only 40 characters long.



Sun Apr 19 10:43:19 2009: Subject:   TonyLawrence

I don't disagree at all. Again, that's why I mentioned it. The purpose here was to solve a specific problem for someone where we knew what caused the dupes and knew that we didn't need MD5. The result is simple code that can be adapted to any other need easily. Think of it as a framework for someone new to Perl rather than a prescription for removing files.
