Gentoo Forums
Synchronization software is not smart enough (unison)!
kyron
Apprentice

Joined: 26 Aug 2002
Posts: 198
Location: Montreal, Qc.

Posted: Tue Sep 24, 2002 4:15 am    Post subject: Synchronization software is not smart enough (unison)!

Here is the context. I synchronize two servers (one in Montreal and one in Vancouver) using Unison which is a great tool. Now the problem is that, today, a shitload of files were MOVED around on one of the servers. In theory, unison is supposed to be able to easily detect these "moves" and simply execute the same move on the remote server ...well, it ain't happening. Unison would rather delete and re-transfer the files (which is a few Gigs!!).

So my attempted approach is to use a script with md5sum to find the matching files on each server with their new whereabouts and then use the md5sum and some grep and some sed to move the files on the remote server to their new location...
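A minimal sketch of how the per-server checksum lists for such an approach could be produced. The helper name make_sums and the example paths are assumptions for illustration, not something from this thread; GNU find and md5sum are assumed.

```shell
#!/bin/sh
# Sketch only: make_sums is a hypothetical helper name.
# Writes one "checksum  ./relative/path" line per regular file.
make_sums() {
    # $1 = directory to scan, $2 = output list (keep it outside $1)
    # Paths are relative to $1 so that lists from two servers compare cleanly.
    ( cd "$1" && find . -type f -exec md5sum {} + ) | sort > "$2"
}
```

Called as, say, `make_sums /srv/share SumsServer1` on each machine (the path is made up), this gives two lists that can be compared by checksum. Using -exec ... + hands the names straight to md5sum, so spaces in file names never pass through the shell.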

The idea doesn't seem too bad to me except that:

1- how do I automatically delete empty directories
2- is md5sum guaranteed to give UNIQUE checksums for all files (dunno about that)
[GRUMBLE!....well, apparently, md5sum does not necessarily generate a unique file sum... ]
3- The move command doesn't work well with spaces and special characters...how the hell do I convert a "non-quoted" path to a backslash-quoted path??!

Now, of course, if there is a better way to do this, don't hesitate to tell me!!!!
_________________
M$ Windows: When in doubt, REBOOT
Linux: When in doubt, RTFM ;-)
rac
Bodhisattva

Joined: 30 May 2002
Posts: 6553
Location: Japanifornia

Posted: Tue Sep 24, 2002 6:27 am    Post subject: Re: Synchronization software is not smart enough (unison)!

kyron wrote:
1- how do I automatically delete empty directories

Code:
$ find /somewhere/ -type d -empty -print0 | xargs -0 -r rmdir

Quote:
2- is md5sum guaranteed to give UNIQUE checksums for all files (dunno about that)

From an information theory point of view, a message digest that is guaranteed to be unique for any file would have to be equivalent to a lossless compression of that file. In other words, the best you could do with current tools would be to bzip the file and use that as the digest, which (given your file sizes) is totally impractical.

I would recommend 160-bit SHA1 message digests over 128-bit MD5, but either should be acceptable for normal use. The likelihood of an edit to a file causing it to generate the same SHA1 or MD5 sum as before is small enough to ignore, IMO. When you're using a message digest function as a key in a database or something is when you have to worry about collisions, when the domain of potential documents is large. When you have a filename to make partial identification, the chance of collision is much less.

Quote:
3- The move command doesn't work well with spaces and special characters...how the hell do I convert a "non-quoted" path to a backslash-quoted path??!

This might just be a shell quoting problem. Can you enclose the file names in quotes before passing it to the shell? Or, if you're already in Perl, for example, you can pass the arguments to the system() or exec() functions, and it will call execve for you, bypassing the shell entirely.
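To illustrate the quoting suggestion, here is a small sketch. The helper name move_file is hypothetical, and the mkdir step is an added assumption (the target directory may not exist yet on the remote side).

```shell
#!/bin/sh
# Sketch only: move_file is a hypothetical helper name.
move_file() {
    src=$1
    dst=$2
    # Double quotes keep each variable as exactly one argument, whatever
    # spaces, ampersands or apostrophes it contains. "--" ends option
    # parsing in case a name starts with a dash.
    mkdir -p -- "$(dirname -- "$dst")" && mv -- "$src" "$dst"
}
```

As long as the path sits in a variable and the variable is expanded inside double quotes, no backslash-escaping of individual characters should be needed. Escaping only matters when a command line is written out as text and fed back to a shell later.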
_________________
For every higher wall, there is a taller ladder
kyron
Apprentice

Joined: 26 Aug 2002
Posts: 198
Location: Montreal, Qc.

Posted: Tue Sep 24, 2002 11:16 am    Post subject: Re: Synchronization software is not smart enough (unison)!

rac wrote:
kyron wrote:
1- how do I automatically delete empty directories

Code:
$ find /somewhere/ -type d -empty -print0 | xargs -0 -r rmdir



Cool, thanks.... was a bit too tired to rtfm yesterday....

rac wrote:
From an information theory point of view, a message digest that is guaranteed to be unique for any file would have to be equivalent to a lossless compression of that file. In other words, the best you could do with current tools would be to bzip the file and use that as the digest, which (given your file sizes) is totally impractical.

I would recommend 160-bit SHA1 message digests over 128-bit MD5, but either should be acceptable for normal use. The likelihood of an edit to a file causing it to generate the same SHA1 or MD5 sum as before is small enough to ignore, IMO. When you're using a message digest function as a key in a database or something is when you have to worry about collisions, when the domain of potential documents is large. When you have a filename to make partial identification, the chance of collision is much less.


I actually hit a few cases of identical md5 sums across different files. But as you stated, the file name combined with the md5 sum is a damned good shot at identifying a unique file.

rac wrote:
This might just be a shell quoting problem. Can you enclose the file names in quotes before passing it to the shell? Or, if you're already in Perl, for example, you can pass the arguments to the system() or exec() functions, and it will call execve for you, bypassing the shell entirely.


Yeah...well..if I had been in Perl it would have been something like \Q [stuff] \E...but I'm not doing this in Perl (using a combination of grep, cut and sed for the moment)

Thanks for the pointers!...will try to fiddle a script that uses filenames AND md5sums to identify unique files and then strip the file name, add the quotes to non-alpha characters and perform the move on the file :)
rac
Bodhisattva

Joined: 30 May 2002
Posts: 6553
Location: Japanifornia

Posted: Tue Sep 24, 2002 11:28 am    Post subject: Re: Synchronization software is not smart enough (unison)!

kyron wrote:
add the quotes to non-alpha characters

I was just suggesting quoting the entire filename always. Seems easier. Would that approach cause problems?
klieber
Administrator

Joined: 17 Apr 2002
Posts: 3657
Location: San Francisco, CA

Posted: Tue Sep 24, 2002 1:18 pm    Post subject: Re: Synchronization software is not smart enough (unison)!

kyron wrote:
I actually hit a few identical cases of md5 sums that are not unique across files.

You may want to seriously consider how you're calculating md5 hashes, then, because the odds of hitting a dupe are 2^64, which is one hell of a big number. So big, in fact, that I would say that the odds of some glitch in your script/system/whatever are astronomically higher than the odds of you actually having a "few" identical md5 hashes on your system.

Not trying to start an argument -- just pointing out that you may have a problem that you're not aware of.

[EDIT]To put things further in perspective, if you had 18,446,744,073,709,551,616 files on your system, then you'd probably have to start worrying about two of those files having duplicate md5 hashes[/EDIT]

--kurt
_________________
The problem with political jokes is that they get elected
kyron
Apprentice

Joined: 26 Aug 2002
Posts: 198
Location: Montreal, Qc.

Posted: Tue Sep 24, 2002 3:46 pm    Post subject: Re: Synchronization software is not smart enough (unison)!

rac wrote:
I was just suggesting quoting the entire filename always. Seems easier. Would that approach cause problems?


No worky... "" won't work...and even to make this more fun, I have many filenames that contain ' !

klieber: Yeah, I agree with you...but for some odd reason I REALLY did encounter identical md5 sums... I haven't had time yet, but I will do a diff on the files to see if they are indeed identical. I would expect the md5 to be identical on empty files...unless the filename is part of the sum calculation....
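A sketch of how that check could be done without eyeballing diffs. Both helper names are hypothetical, and the list is assumed to be md5sum output with the 32-character sum at the start of each line.

```shell
#!/bin/sh
# Sketch only: dup_sums and same_content are hypothetical helper names.

# Print every checksum that occurs more than once in an md5sum-style list.
dup_sums() {
    cut -c1-32 "$1" | sort | uniq -d
}

# Succeed only if the two files are byte-for-byte identical.
same_content() {
    cmp -s -- "$1" "$2"
}
```

Running `dup_sums SumsServer1` lists the repeated sums. For any pair of files behind one of them, `same_content fileA fileB && echo identical` tells whether they are genuine copies (the expected case) rather than different files that happen to collide.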
ebichu
Apprentice

Joined: 03 Jul 2002
Posts: 231
Location: Manchester, England

Posted: Tue Sep 24, 2002 5:51 pm    Post subject: Re: Synchronization software is not smart enough (unison)!

klieber wrote:
[EDIT]To put things further in perspective, if you had 18,446,744,073,709,551,616 files on your system, then you'd probably have to start worrying about two of those files having duplicate md5 hashes[/EDIT]

The number of files on your system would be a lot lower than your 2^64, but still quite large, depending on what probability of checksums clashing you are willing to accept before you start worrying about it! Let's make a big assumption that the files contain random data so that there's less to worry about, and that the resulting MD5 checksums for all possible inputs are uniformly distributed. The probability p(k) of at least two out of k files generating the same checksum would be given by
Code:
p(k) = 1 - ((2^18)! / (((2^18)-k)! * (2^18)^k))

<EDIT 2 (after reading klieber's post below)>
I don't know how those 2^18's got in there. The above should have been
Code:
p(k) = 1 - ((2^64)! / (((2^64)-k)! * (2^64)^k))

Also, I was confused enough to be thinking that the number 2^64 was due to the length of the MD5 checksum being 64 bits, when it is in fact 128 bits, so the above formula should actually have been
Code:
p(k) = 1 - ((2^128)! / (((2^128)-k)! * (2^128)^k))

The probability of 2^64 random files' MD5 checksums clashing would be
Code:
p(2^64) = 1 - ((2^128)! / (((2^128)-(2^64))! * (2^128)^(2^64)))

Does anyone have a pocket calculator handy?
</EDIT>

Conversely, for a maximum acceptable probability r you could find the maximum number of files k before this probability is exceeded by the above formula, by trying different values of k in the above formula using a binary search for the correct value.

<EDIT 1>
The big assumption here of course is that the files under consideration are random. MD5 is pretty good at differentiating similar files, but no good at differentiating identical files. So if there's a high probability of there being identical files on your system, that's going to increase the probability of identical MD5 checksums by a similar amount. All zero-length files will of course be identical as kyron mentioned, and will have identical MD5 checksums - and in fact this MD5 checksum will be d41d8cd98f00b204e9800998ecf8427e.
</EDIT>
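The empty-file checksum is easy to confirm from a shell. This is a small sketch with a hypothetical helper name (empty_md5), assuming GNU md5sum:

```shell
#!/bin/sh
# Sketch only: empty_md5 is a hypothetical helper name.
# md5sum hashes the bytes it reads; the file name is not part of the input,
# so every zero-length file produces this same digest.
empty_md5() {
    md5sum < /dev/null | cut -d' ' -f1
}
```

`empty_md5` prints d41d8cd98f00b204e9800998ecf8427e, which also settles the question of whether the file name takes part in the sum: it does not.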

The classic example of this problem is that of finding how large a group of random people you need to get at least a 50% chance of at least two people in that group sharing a birthday (assuming 365 possible birthdays), which turns out to be 23.
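For anyone without that pocket calculator: the factorial formula above can be rewritten as a product and approximated, under the same assumption of uniformly distributed checksums. Here N is the number of possible values (365 birthdays, or 2^128 MD5 sums); the symbol N is introduced only for this note.

```latex
p(k) \;=\; 1 - \frac{N!}{(N-k)!\,N^{k}}
     \;=\; 1 - \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)
     \;\approx\; 1 - e^{-k(k-1)/(2N)}

% Birthday case: N = 365, k = 23
p(23) \approx 0.507

% MD5 case: N = 2^{128}, k = 2^{64}, so k(k-1)/(2N) \approx 1/2
p(2^{64}) \approx 1 - e^{-1/2} \approx 0.39
```

So with 2^64 random files the chance of at least one clash comes out at roughly 39%, and the 50% mark sits a little above 2^64 files.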
_________________
Ebichu wa chiizu ga daisuki dechu!


Last edited by ebichu on Thu Nov 21, 2002 6:33 pm; edited 2 times in total
kyron
Apprentice

Joined: 26 Aug 2002
Posts: 198
Location: Montreal, Qc.

Posted: Tue Sep 24, 2002 6:30 pm

Well...here is what I have up to now...
Code:
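#! /bin/ksh
# $1, $2: md5sum listings ("<32-char sum>  <path>"), one per server
: > moveit      # start from an empty move list
for INFILE in `cat $1`; do      # intended: one listing line per pass
        echo "$INFILE"
        MD5SUM=`echo "$INFILE"|cut -c-32`
        FILENAME=`echo "$INFILE"|sed -e "s/^.*\///g"`
        for REFFILE in `grep "$MD5SUM" $2`; do
                # take the base name of the candidate, not of INFILE again
                FILENAME2=`echo "$REFFILE"|sed -e "s/^.*\///g"`
                if [ "$FILENAME" = "$FILENAME2" ]; then
                        # same sum and name but a different path: a move
                        if [ "$INFILE" != "$REFFILE" ]; then
                                echo "mv $INFILE $REFFILE" >> moveit
                        fi
                fi
        done
done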


You simply call
Code:
callfindmatch.sh SumsServer1 SumsServer2

and it spits out moveit which I would execute as a second script (after approving the moves....)

Now my problem is that the "for" command splits each line on spaces and treats every resulting word as a new item...for example:
Code:
c5adf5f7e45512c6a96aaff10a4ab6b8  ./KB/KTH/Door & Frame Schedule - sample.xls

would come out as :
Code:
c5adf5f7e45512c6a96aaff10a4ab6b8 
./KB/KTH/Door
&
Frame
Schedule
-
sample.xls

from the for assignment.... in other words, for is not going through this line by line as I thought it would.... anyone got any idea???...

Note that I still haven't resolved the backslash quoting issue....
klieber
Administrator

Joined: 17 Apr 2002
Posts: 3657
Location: San Francisco, CA

Posted: Tue Sep 24, 2002 7:38 pm    Post subject: Re: Synchronization software is not smart enough (unison)!

ebichu wrote:
The number of files on your system would be a lot lower than your 2^64,

Well, Rivest seems to think 2^64 is the right number, and I'm inclined to go with his opinion. :)


ebichu wrote:
MD5 is pretty good at differentiating similar files, but no good at differentiating identical files.

How would you differentiate identical files? By definition, that's an oxymoron.

The whole purpose of md5 is to ensure the integrity of the data being sent. If the data are altered in any way, an md5 hash is designed to detect that. If two files are identical, it simply means that the data within those files are identical, and file integrity has been maintained.

ebichu wrote:
So if there's a high probability of there being identical files on your system, that's going to increase the probability of identical MD5 checksums by a similar amount.

OK, what I should have said originally was the chances of two different files generating the same md5 hash are 2^64. My mistake.

--kurt
mikaelu
n00b

Joined: 04 May 2002
Posts: 10
Location: Sweden/Hjo

Posted: Tue Sep 24, 2002 9:06 pm    Post subject: Spaces in filenames

To solve your space problem you have to change the $IFS like this:

IFS=$'\n'

the for loop will then use newlines as separator.
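An alternative that avoids touching IFS globally is a while/read loop, sketched below. The helper name list_paths is hypothetical, and the input is assumed to be md5sum output ("sum, two spaces, path"). Note that the $'\n' form is understood by bash and ksh93 but, as far as I know, not by every sh.

```shell
#!/bin/sh
# Sketch only: list_paths is a hypothetical helper name.
# Reads an md5sum-style list one whole line at a time.
list_paths() {
    # IFS= keeps leading/trailing blanks, -r keeps backslashes literal.
    while IFS= read -r line; do
        sum=${line%%" "*}       # text before the first space: the checksum
        path=${line#*"  "}      # text after the first double space: the path
        printf '%s\t%s\n' "$sum" "$path"
    done < "$1"
}
```

With the sample line from earlier in the thread, this yields the checksum and the full path "./KB/KTH/Door & Frame Schedule - sample.xls" as two tab-separated fields, instead of seven separate words.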
kyron
Apprentice

Joined: 26 Aug 2002
Posts: 198
Location: Montreal, Qc.

Posted: Tue Sep 24, 2002 9:29 pm    Post subject: Re: Spaces in filenames

mikaelu wrote:
To solve your space problem you have to change the $IFS like this:

IFS=$'\n'

the for loop will then use newlines as separator.


DOOOOOOOOOH! Now that I re-wrote and re-learned Perl just to get that damned thing done with... I had it almost complete with bash scripts... easier than perl to my astonishment!...then again, I only found the backslash quoting issue fixed in perl (simply "\Q$VariableNameForPath\E" and voilà!!)
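For the record, something similar can be sketched in plain sh when a path has to be written into a generated script such as moveit. The helper name shell_quote is hypothetical. It uses single quotes rather than backslashes, which is a different technique from Perl's \Q...\E but serves the same purpose here.

```shell
#!/bin/sh
# Sketch only: shell_quote is a hypothetical helper name.
# Wraps a string in single quotes and rewrites each embedded ' as '\''
# so that a shell reading the result gets the original string back.
shell_quote() {
    printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}
```

A line for the move script could then be built with something like `echo "mv -- $(shell_quote "$old") $(shell_quote "$new")" >> moveit`, where old and new are assumed to hold the two paths. This copes with spaces, ampersands and the apostrophes mentioned earlier; names ending in a newline would still need extra care.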
All times are GMT