Gentoo Forums
Duplicate Files - Finding and removing via CLI.
vectox
n00b
Joined: 29 Oct 2004
Posts: 21
Location: Luxembourg

Posted: Sun Oct 14, 2007 3:38 am    Post subject: Duplicate Files - Finding and removing via CLI.

This is a tip I found on another forum (http://ubuntuforums.org/showthread.php?t=97701&page=2) and found quite handy, so I thought I'd share it. If you have tons of media, image, or music files, this command will find the duplicates among them and list them.

Code:
find . ! -empty -type f -printf "%s '%p'\n" | sort -n | uniq -D -w 1 | cut -d" " -f2- | xargs md5sum | sort | uniq -w32 -d --all-repeated=separate | cut -c35- > duplicates


So in a nutshell, this recursively goes through all the child directories and lists all duplicates in a file called "duplicates", one quoted path per line, with a blank line between each group of identical files.
I think it's easier to manage the duplicates by putting them in a file; this way you can examine them before pooching the dupes.
So I threw together a one-liner to delete the duplicate entries.

Code:
grep -B1 '^$' duplicates | sed '/^--$/ d ; /^$/ d' | xargs -I {} rm {}


The above command works fine, but there is a small bug: it doesn't delete the last duplicate unless you add a blank line to the bottom of the duplicates file. Also, if you have 2 duplicates of the same file (3 identical files), you're still going to be left with 2 after this.
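
One way around both limitations is to read the file group by group instead of grepping for blank lines. This is only a minimal sketch, assuming the "duplicates" file has the layout produced by the first command (one single-quoted path per line, groups separated by a blank line); the function name list_extra_copies is made up for this example.

```shell
# Print every path except the first one of each blank-line-separated group.
# RS="" puts awk in paragraph mode, so a missing trailing blank line does not
# matter and groups of three or more files are handled as well.
list_extra_copies() {
    awk 'BEGIN { RS = ""; FS = "\n" } { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Review the list first. To actually delete the extra copies:
#   list_extra_copies duplicates | xargs -I {} rm {}
```

The first file of each group is kept simply because it comes first in the list; reorder the groups by hand if you want to keep a different copy.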

There is also a program called fdupes, but I believe this is quicker, since it only md5sums files that have the same size as another file.

I like to stick with common bash commands rather than creating perl scripts for this sort of stuff, though awk is still pretty quick :).
Suggestions and optimizations are welcome, as I'm sure the code can be made more efficient.
ok
Guru
Joined: 11 Jul 2006
Posts: 390
Location: germany

Posted: Sun Oct 14, 2007 10:11 am

With that
Quote:
... | uniq -D -w 1 | ...
only the first character of each line is compared, so e.g. files of 67 and 68 bytes are treated as the same size.
From man uniq:
Quote:
-w, --check-chars=N
compare no more than N characters in lines
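
One possible fix for the size filter, sketched here under the assumption that GNU findutils and coreutils are in use: -printf accepts a field width, so the size can be right-justified to a fixed number of columns and uniq told to compare exactly that many characters. The width of 20 and the function name same_size_files are arbitrary choices for this example, and filenames are assumed not to contain newlines.

```shell
# List files that share their exact size with at least one other file.
# The size is padded to 20 columns so that uniq -w 20 compares the whole
# number instead of only its first digit.
same_size_files() {
    find "$1" ! -empty -type f -printf "%20s %p\n" |
        sort -n |
        uniq -D -w 20 |
        cut -c22-    # drop the 20-column size and the separating space
}
```

With this version a 67-byte file and a 68-byte file no longer end up in the same bucket, and nothing has to be adjusted by hand for larger files.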
lagalopex
Guru
Joined: 16 Oct 2004
Posts: 562

Posted: Sun Oct 14, 2007 10:35 am

Ever looked at fdupes? ;)
truc
Advocate
Joined: 25 Jul 2005
Posts: 3199

Posted: Sun Oct 14, 2007 10:59 pm

Starting from this idea and the first comment ( ;) ), I've written a script that also finds duplicate files but doesn't suffer from the uniq -w 1 limitation.

You can find it here (see it here).
In short, it first finds files of the same size, then compares their MD5 checksums.
Here is a sample of the output:
Code:
find-duplicate-files music/
--- same MD5 ---
'music/Requiem For A Dream/Requiem For A Dream - Requiem For A Dream - Main Theme.mp3'
'music/Requiem For A Dream/Requiem_For_A_Dream--Theme.mp3'
--- same MD5 ---
'music/temp/Madeleine peyrou/08 Piste 8.wma'
'music/temp/Madeleine peyrou/Copie de 08 Piste 8.wma'
--- same MD5 ---
'music/Rock/Deftones/Deftones - Adrenaline - Fist.mp3'
'music/trliala'
--- same MD5 ---
'music/temp/Madeleine peyrou/06 Piste 6.wma'
'music/temp/Madeleine peyrou/Copie de 06 Piste 6.wma'
--- same MD5 ---
'music/.rag[gna]gna'
'music/.truc'
--- same MD5 ---
'music/.bibabeloula'
'music/.bloup'
'music/R&B/Kelis/Kelis - Tasty - Glow Feat Raphael Saadiq.mp3'
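
The script itself is not reproduced in this thread. The following is only a sketch of the approach described (group by size first, checksum only the candidates), assuming GNU find, xargs, and md5sum, and filenames without newlines or backslashes. The function name find_dupes_sketch and its output format (plain paths, groups separated by a blank line) are chosen for this example and do not match the sample above.

```shell
# Print groups of files with identical content under directory $1.
find_dupes_sketch() {
    find "$1" ! -empty -type f -printf '%s %p\n' |
        sort -n |
        # Pass 1: keep only paths whose size equals that of a neighbouring line.
        awk '{ size = $1 ""; path = substr($0, length($1) + 2) }
             size == prev_size { if (!printed) print prev_path; print path; printed = 1 }
             size != prev_size { printed = 0 }
             { prev_size = size; prev_path = path }' |
        # Pass 2: checksum the candidates and group identical hashes.
        tr '\n' '\0' |
        xargs -0 -r md5sum -- |
        sort |
        uniq -w 32 --all-repeated=separate |
        cut -c35-    # strip the 32-character hash and the two separator columns
}
```

Because the whole size field is compared as a string in awk, there is no character count to tune, and only files with a same-sized neighbour are ever read by md5sum.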


EDIT: 20071015: some small modifications in the script, some cleanups :)
_________________
The End of the Internet!


Last edited by truc on Mon Oct 15, 2007 9:40 pm; edited 2 times in total
vectox
n00b
Joined: 29 Oct 2004
Posts: 21
Location: Luxembourg

Posted: Mon Oct 15, 2007 7:38 pm    Post subject: Good point

ok wrote:
With that
Quote:
... | uniq -D -w 1 | ...
e.g. 67 and 68 is the same size.
man uniq:
Quote:

...-w, --check-chars=N
compare no more than N characters in lines
...



Yep, totally, good point. I think this is something you'd adjust manually depending on the size of the files you're working with, though that doesn't make it very dynamic.
truc
Advocate
Joined: 25 Jul 2005
Posts: 3199

Posted: Wed Oct 17, 2007 9:22 pm    Post subject: Re: Good point

vectox wrote:
Yep, totally, good point. I think this is something you'd adjust manually depending on the size of the files you're working with, though that doesn't make it very dynamic.

Well the script I wrote *is* dynamic, no need to adjust anything.
slycordinator
Advocate
Joined: 31 Jan 2004
Posts: 3065
Location: Korea

Posted: Thu Oct 18, 2007 3:50 am

app-misc/fdupes
truc
Advocate
Joined: 25 Jul 2005
Posts: 3199

Posted: Thu Oct 18, 2007 10:45 am

slycordinator wrote:
app-misc/fdupes

wow!
Code:
echo 3 > /proc/sys/vm/drop_caches
time find-duplicate-files prog/ | wc -l
1472

real    0m33.255s
user    0m10.978s
sys     0m4.890s

then
Code:
echo 3 > /proc/sys/vm/drop_caches
time fdupes -r prog/ | wc -l
1472

real    0m7.961s
user    0m0.351s
sys     0m0.394s

(total number of files: 1617)

Thanks! And even better: fdupes has interesting options :)
All times are GMT