barrymac Tux's lil' helper
Joined: 15 Jul 2004 Posts: 87
Posted: Mon May 29, 2006 1:33 am Post subject: merging directories, replacing duplicates with links
Hello all,
I would like to find a quick way to scan for duplicate files using md5sums and replace the duplicates with symlinks, keeping the relative paths meaningful without duplicating the actual file content. Perhaps rsync's hashing algorithm would be more efficient and still adequate?
I'm sure this situation has come up for many people before. I just got a new file server with enough capacity to put everything in one place. So I'd like to merge directories from several machines into the new server but I have lots of duplication across them.
I was wondering if anyone had a quick strategy or even knows of a package that does the job. I think that using the output of fdupes would be one approach, but my scripting is basic. Is this a good excuse to learn some Python?
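Something like this rough Python sketch is the kind of thing I have in mind (untested, all names made up by me): walk a tree, md5 every regular file, and turn later copies into symlinks pointing at the first one seen.

```python
import hashlib
import os

def md5sum(path, chunk=1 << 20):
    """Hash file contents incrementally so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def link_duplicates(root, dry_run=True):
    """Replace duplicate files under root with symlinks to the first copy."""
    seen = {}          # md5 digest -> first path seen with that content
    replaced = []      # (duplicate path, surviving path) pairs
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue               # never rewrite existing links
            digest = md5sum(path)
            if digest in seen:
                if not dry_run:
                    os.remove(path)
                    os.symlink(os.path.abspath(seen[digest]), path)
                replaced.append((path, seen[digest]))
            else:
                seen[digest] = path
    return replaced
```

Run with dry_run=True first to see what it would touch before letting it delete anything.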
Somehow I think a filesystem plugin for reiser4 might be a nice way to achieve this. I imagine it would keep a database of hashes, so if you ever tried saving the same file somewhere else on the system it would only create a link rather than saving the content again. This would be useful on a system with many users who may each have their own copies of files but where you want to minimise storage. I would imagine Google might use something like this.
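A toy user-space version of that hash-database idea might look like the sketch below: a store directory holding one physical blob per unique content, with repeat saves turned into hardlinks. Purely illustrative; the class and its names are invented here, not part of any real filesystem plugin.

```python
import hashlib
import os

class HashStore:
    """Toy content-addressed store: one blob per unique content."""

    def __init__(self, pool):
        self.pool = pool               # directory holding one blob per digest
        os.makedirs(pool, exist_ok=True)

    def save(self, path):
        """Register a file; duplicates become hardlinks to the stored blob."""
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        blob = os.path.join(self.pool, digest)
        if not os.path.exists(blob):
            os.link(path, blob)        # first copy: keep its content in the pool
        else:
            os.remove(path)
            os.link(blob, path)        # repeat: relink the name to the blob
        return digest
```

A real implementation would hash incrementally and worry about concurrent writers, but the principle is just a digest-keyed lookup before allocating new blocks.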
Thanks in advance for any help.
barrymac
Posted: Mon May 29, 2006 3:22 am Post subject: Solved
I was looking in the wrong places.
The tool to use is called fslint, a very handy program for removing what it calls 'lint' from a filesystem; among other things it will replace duplicate files with hardlinks.
I think that'll solve my problem!
http://www.pixelbeat.org/fslint/
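For anyone who still wants to script it themselves, the hardlinking step could also be driven from fdupes output along these lines. Rough sketch only: it assumes one duplicate set per line with space-separated paths (fdupes' -1/--sameline format), so it breaks on filenames containing spaces.

```python
import os

def hardlink_sets(lines):
    """Each line is one set of identical files; keep the first, relink the rest."""
    relinked = 0
    for line in lines:
        paths = line.split()
        if len(paths) < 2:
            continue
        keep, dupes = paths[0], paths[1:]
        for dup in dupes:
            os.remove(dup)
            os.link(keep, dup)         # same inode, so no extra space used
            relinked += 1
    return relinked

# Feeding it would look something like:
#   out = subprocess.run(["fdupes", "-r", "-1", root],
#                        capture_output=True, text=True).stdout
#   hardlink_sets(out.splitlines())
```

fslint does this for you, of course; this is just the same idea spelled out.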