vectox n00b
Joined: 29 Oct 2004 Posts: 21 Location: Luxembourg
Posted: Sun Oct 14, 2007 3:38 am    Post subject: Duplicate Files - Finding and removing via CLI
This is a tip from another forum (http://ubuntuforums.org/showthread.php?t=97701&page=2) that I found quite handy, so I thought I'd share it. If you have tons of media, image, or music files, this command will find the duplicates among them.
Code:
find . ! -empty -type f -printf "%s '%p'\n" | sort -n | uniq -D -w 1 | cut -d" " -f2- | xargs md5sum | sort | uniq -w32 -d --all-repeated=separate | cut -c35- > duplicates
In a nutshell, this recursively walks all child directories and writes every duplicate it finds to a file called "duplicates", one path per line, with each group of identical files separated by a blank line.
I think it's easier to manage the duplicates by putting them in a file; this way you can examine them before pooching the dupes.
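For readability, here is the same pipeline split across lines, with a comment on what each stage does (the shell keeps reading a pipeline after a trailing |, so this behaves identically):
Code:
find . ! -empty -type f -printf "%s '%p'\n" |  # print "size 'path'" for every non-empty file
    sort -n |                                  # sort numerically by size
    uniq -D -w 1 |                             # print all lines whose first character repeats
    cut -d" " -f2- |                           # drop the size column, keep the quoted path
    xargs md5sum |                             # checksum only those same-size candidates
    sort |                                     # bring identical checksums together
    uniq -w32 -d --all-repeated=separate |     # keep repeated checksums, blank line between groups
    cut -c35- > duplicates                     # strip the 32-char hash and the two spaces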
So I threw together a one-liner to delete the duplicate entries.
Code:
grep -B1 '^$' duplicates | sed '/--/ d ; /^$/ d' | xargs -i rm {}
The above command works fine, but there is a small bug: it doesn't delete the last duplicate unless you add a blank line to the bottom of the duplicates file. Also, if you have two duplicates of the same file (three identical files), you'll still be left with two of them afterwards.
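As a possible fix for both issues, here is a sketch using awk's paragraph mode: with RS set to the empty string, each blank-line-separated group becomes one record, so we can keep the first path of every group and delete the rest, with or without a trailing blank line (this assumes GNU xargs and will still break on filenames containing newlines):
Code:
# print every path in each group except the first, then remove them
awk 'BEGIN { RS = "" ; FS = "\n" } { for (i = 2; i <= NF; i++) print $i }' duplicates |
    xargs -d '\n' rm --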
There is also a program called fdupes, but I believe this approach is quicker, since it only md5sums files that have the same file size.
I like to stick with common bash commands rather than writing Perl scripts for this sort of thing, though awk is still pretty quick.
Suggestions and optimizations are welcome, as I'm sure the code can be made more efficient.
ok Guru
Joined: 11 Jul 2006 Posts: 390 Location: germany
Posted: Sun Oct 14, 2007 10:11 am
With that
Quote:
... | uniq -D -w 1 | ...
e.g. 67 and 68 count as the same size, because only the first character is compared.
man uniq:
Quote:
-w, --check-chars=N
    compare no more than N characters in lines
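A quick way to see it (my example, not from the man page):
Code:
# with -w 1, uniq compares only the first character, so 67 and 68
# (and 6, 600, ...) all count as duplicates of one another:
printf '67 a\n68 b\n' | uniq -D -w 1
# output:
# 67 a
# 68 b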
lagalopex Guru
Joined: 16 Oct 2004 Posts: 562
Posted: Sun Oct 14, 2007 10:35 am
Ever looked at fdupes?
truc Advocate
Joined: 25 Jul 2005 Posts: 3199
Posted: Sun Oct 14, 2007 10:59 pm
Starting from this idea, and from the first comment, I've written a script which also finds duplicate files but doesn't suffer from this limitation.
You can find it here.
In short, it first finds files of the same size, then compares their MD5 checksums.
Here is a sample of the output:
Code:
find-duplicate-files music/
--- same MD5 ---
'music/Requiem For A Dream/Requiem For A Dream - Requiem For A Dream - Main Theme.mp3'
'music/Requiem For A Dream/Requiem_For_A_Dream--Theme.mp3'
--- same MD5 ---
'music/temp/Madeleine peyrou/08 Piste 8.wma'
'music/temp/Madeleine peyrou/Copie de 08 Piste 8.wma'
--- same MD5 ---
'music/Rock/Deftones/Deftones - Adrenaline - Fist.mp3'
'music/trliala'
--- same MD5 ---
'music/temp/Madeleine peyrou/06 Piste 6.wma'
'music/temp/Madeleine peyrou/Copie de 06 Piste 6.wma'
--- same MD5 ---
'music/.rag[gna]gna'
'music/.truc'
--- same MD5 ---
'music/.bibabeloula'
'music/.bloup'
'music/R&B/Kelis/Kelis - Tasty - Glow Feat Raphael Saadiq.mp3'
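Since the script itself isn't reproduced in this post, here is a rough sketch of the same two-stage idea (mine, not truc's actual code): collect sizes first, then checksum only files whose size occurs more than once. It assumes GNU find/xargs and breaks on filenames containing tabs or newlines.
Code:
#!/bin/sh
# usage: find-duplicate-files [dir]   (name borrowed from the post above)
find "${1:-.}" -type f ! -empty -printf '%s\t%p\n' |
    awk -F'\t' '{ paths[$1] = paths[$1] $2 "\n"; n[$1]++ }
                END { for (s in n) if (n[s] > 1) printf "%s", paths[s] }' |
    xargs -d '\n' md5sum |            # hash only files whose size repeats
    sort |                            # group identical hashes together
    uniq -w32 --all-repeated=separate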
EDIT 2007-10-15: some small modifications in the script, some cleanups.
_________________
The End of the Internet!
Last edited by truc on Mon Oct 15, 2007 9:40 pm; edited 2 times in total
vectox n00b
Joined: 29 Oct 2004 Posts: 21 Location: Luxembourg
Posted: Mon Oct 15, 2007 7:38 pm    Post subject: Good point
ok wrote:
With that
Quote:
... | uniq -D -w 1 | ...
e.g. 67 and 68 count as the same size, because only the first character is compared.
man uniq:
Quote:
-w, --check-chars=N
    compare no more than N characters in lines
Yep, totally. Good point. I think this is something you'd adjust manually depending on the size of the files you're working with, though that doesn't make it very dynamic.
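For what it's worth, here is one way to sidestep the manual adjustment (a sketch along the lines of the original command, not something I've tested at scale): zero-pad the size to a fixed width, so uniq always compares the whole size field:
Code:
# pad sizes to 12 digits so "uniq -w 12" compares the full size;
# xargs -d '\n' copes with spaces (but not newlines) in filenames
find . ! -empty -type f -printf '%s\t%p\n' |
    awk -F'\t' '{ printf "%012d\t%s\n", $1, $2 }' |
    sort |
    uniq -D -w 12 |
    cut -f2- |
    xargs -d '\n' md5sum |
    sort |
    uniq -w32 -d --all-repeated=separate |
    cut -c35- > duplicates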
truc Advocate
Joined: 25 Jul 2005 Posts: 3199
Posted: Wed Oct 17, 2007 9:22 pm    Post subject: Re: Good point
vectox wrote:
Yep, totally. Good point. I think this is something you'd adjust manually depending on the size of the files you're working with, though that doesn't make it very dynamic.
Well, the script I wrote *is* dynamic; there's no need to adjust anything.
_________________
The End of the Internet!
slycordinator Advocate
Joined: 31 Jan 2004 Posts: 3065 Location: Korea
Posted: Thu Oct 18, 2007 3:50 am
app-misc/fdupes
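For anyone finding this thread, typical usage might look like this (flags as in the fdupes man page: -r recurses, -d prompts for which copies to delete):
Code:
# install on Gentoo, then list all duplicate sets under a directory
emerge app-misc/fdupes
fdupes -r ~/music

# or interactively choose which copy of each set to keep
fdupes -rd ~/music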
truc Advocate
Joined: 25 Jul 2005 Posts: 3199
Posted: Thu Oct 18, 2007 10:45 am
slycordinator wrote:
app-misc/fdupes
wow!
Code:
echo 3 > /proc/sys/vm/drop_caches
time find-duplicate-files prog/ | wc -l
1472

real    0m33.255s
user    0m10.978s
sys     0m4.890s
then
Code:
echo 3 > /proc/sys/vm/drop_caches
time fdupes -r prog/ | wc -l
1472

real    0m7.961s
user    0m0.351s
sys     0m0.394s
(Total number of files: 1617.)
Thanks! And what's more, fdupes has interesting options.
_________________
The End of the Internet!