Deltup will update packages and reduce download time
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 21, 2003 10:36 pm    Post subject: Re: Different stuff Reply with quote

Quote:
If you're concatting the headers together into one file, and the data into a separate one, the headers file would be structured as such- creating a more efficient delta for it shouldn't be too hard. I've basically implemented a 2-byte indicator of modified fields, then null-delimited strings/nums of the new entries. From there my attempt gets a bit different, but you get the idea for doing tar header updates.

That sounds like a good experiment if you have the infrastructure to find matching tar headers. But don't you need an extra integer to find the old header when you rebuild the new one?
Quote:
One could probably try and do a copy/insert setup on the header changes, but that seems like overkill for the most part- the data changes are too small.

I don't know exactly what you mean here.
Quote:
As for attempting to align related fields so they're adjacent (first section is all names, second is linkname... etc), I'd wonder what you'd hope to accomplish via that. Compression might be better, but I'd be curious how xdelta would be more behaved- it only includes changes, regardless of where they are.

It would allow several fields that are identical in the old and new tar headers to be matched in a single larger block, also leaving the differing fields all together in a block to be compressed.
example:
    ABCDE -> ABCDE
    this -> that
    FGHIJK -> FGHIJK
    oh -> my
    LM -> LM
    data -> differs

could be arranged like this:
    ABCDEFGHIJKLM -> ABCDEFGHIJKLM
    thisohdata -> thatmydiffers
Now there's only one matching block to include in the delta control struct.
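Here's a minimal sketch of how that regrouping could look (the stable/volatile split and the helper below are just my illustration, not deltup code): the rarely-changing fields of each tar header get appended to one stream, the rest to a second stream, and the delta runs over the two streams instead of over interleaved headers.
Code:
/* Sketch: pull the rarely-changing ustar header fields into one stream
   and leave the volatile ones (size, mtime, chksum...) for a second
   stream.  The stable/volatile split is just a guess for illustration;
   offsets are from the POSIX ustar layout. */
#include <string.h>
#include <stddef.h>

/* append the "stable" fields of one 512-byte header to buf,
   return the number of bytes written */
size_t stable_fields(const unsigned char *h, unsigned char *buf)
{
    size_t n = 0;
    memcpy(buf + n, h + 0,   100); n += 100;   /* name               */
    memcpy(buf + n, h + 100,  24); n += 24;    /* mode, uid, gid     */
    memcpy(buf + n, h + 156, 101); n += 101;   /* typeflag, linkname */
    memcpy(buf + n, h + 257,   8); n += 8;     /* magic, version     */
    memcpy(buf + n, h + 265,  64); n += 64;    /* uname, gname       */
    memcpy(buf + n, h + 329,  16); n += 16;    /* devmajor, devminor */
    return n;
}
/* size (124), mtime (136) and chksum (148) would be appended to the
   second stream the same way. */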
Quote:
Assuming I'm reading it correctly, from the sounds of it, you're proposing the ability to identify w/in a stream an archive/compressed section? You could attempt to identify it via the appropriate magic/id, but that would require the ability to identify how long the data is, which would be a pain unless you were reading the length from the tarball/archive header. Unless you're planning on attempting to figure out the specific compressed/archive file's length automatically, the alg would have to be aware of the tar headers...

Yes, I think it should be aware of the tar headers. After all, I'm going to be extracting them anyway so it's a cinch to check the file length.
Quote:
Assuming you made the alg tar-header aware, you're basically half way to diffing per file anyways

Maybe you're right that diffing per file would be a cleaner solution - it seems to give us more control. It would be more practical if I write a new delta algorithm with less overhead - I'm pretty sure I could do it. I read that article you suggested about the rsync algorithm and I was surprised that not only was it pretty simple to implement, but it's very much like what I had in mind already! Even the rolling checksum part. Maybe I'm mistaken, but I think I know a better way to calculate the checksum. I bet that the xdelta algorithm doesn't take advantage of the fact that the source files are present for matching variable-sized blocks either.
Quote:
In terms of an individual dataset thing being more work, I'd disagree- it's the equivalent of doing multiple diffs rather than one overarching one. As long as the delta compression code is set up right/cleanly, it's really not that much different from doing the whole file.
There still is the issue of data that jumps from file to file, but I'd think there is an efficient solution for that (possibly a control kludge).

That's the issue I'm worried about. Maybe a second global diff could be done on all the data that is left unmatched. An alternative would be to just ignore the problem.
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Thu May 22, 2003 3:06 am    Post subject: Re: Different stuff Reply with quote

jjw wrote:
That sounds like a good experiment if you have the infrastructure to find matching tar headers. But don't you need an extra integer to find the old header when you rebuild the new one?

At the moment, I have what I'd call a working setup of that. There are still issues (files being moved) which will require some form of id'ing to be done in the removals/additions for optimization's sake, but it's coming along.
As for the extra int requirement, yes, there is (currently) data stored as to which entry to match the patch against for the delta- I do have a few ideas how to curtail those requirements though- basically a combination of forward patching and specifying offsets from the current tar entry number rather than the absolute entry number. I haven't yet nailed that down, although I consider it doable...
jjw wrote:
ferringb wrote:
One could probably try and do a copy/insert setup on the header changes, but that seems like overkill for the most part- the data changes are too small.
I don't know exactly what you mean here.

Pardon, I was referring to greedy matching- diff for instance (at least unified) generates a delta of the insert/delete sort; rsync and xdelta appear to be of the copy/insert persuasion. Below I mention some work by Randal Burns- it's explicitly explained within his papers (it basically is a way of classifying the control syntax).
jjw wrote:
... snip ... could be arranged like this:
ABCDEFGHIJKLM -> ABCDEFGHIJKLM
thisohdata -> thatmydiffers
Now there's only one matching block to include in the delta control struct.

First off, just in case I sound way the heck out in left field, I'm not sure what you're intending there. I'm guessing that you mean a way of basically identifying identical fields (devmajor, devminor come to mind) that serve as a standard template/update for the tar headers? If that's what you're after, to be honest I don't really see how grouping it as such has any benefit in space efficiency.
In dealing w/ version-wide changes, or instances where it's cheaper to store the general settings in one place and add a field where it differs, I handled it by having basically a template definition- the template is merged/applied to the original source header, then the further modifications are applied. Thus far, aside from the cost of identifying common/identical values throughout the tarball, I've found it to be a simple and easy way to handle global changes. Guessing that is what you were getting at?
jjw wrote:
Yes, I think it should be aware of the tar headers. After all, I'm going to be extracting them anyway so it's a cinch to check the file length.

'k. I was wondering if you were doing that, mainly since (personally) I couldn't see any way of reconstructing the tarball w/out knowing the length after applying the xdelta to the headers and data.
jjw wrote:
Maybe you're right that diffing per file would be a cleaner solution - it seems to give us more control.

Well, aside from control, there are space considerations- figure it thus: you have a 2 mb file, and a block (assuming you're doing checkpoints) is transposed from one end of the file to the other (front to tail)- the maximum relative addressing needed/allowed for the control syntax must only cover 2 mb. Now if that was but a file in a tarball (say size 20 mb), the addressing size needed/allowed would have to cover 20 mb rather than 2 mb. Yes, the address for transposing the block would be only a certain number of bits, but for reconstructing you need to be able to know how many bits you're using for addressing. Not the greatest example (I know a few ways to minimize the addressing size already), but you get the gist.
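As a rough worked example of that cost (my numbers, nothing exact): an offset within a 2 mb file fits in 21 bits (2^21 = 2,097,152), while an offset within a 20 mb tarball needs 25 bits (2^25 is about 33.5 million). That's 4 extra bits per address, so a copy command carrying a source and a target address grows by about a byte- small per command, but it adds up over thousands of commands.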
jjw wrote:
It would be more practical if I write a new delta algorithm with less overhead - I'm pretty sure I could do it. I read that article you suggested about the rsync algorithm and I was surprised that not only was it pretty simple to implement, but it's very much like what I had in mind already!

Prior to attempting this, I would read into the works of Randal Burns- specifically check out his master's thesis "Differential Compression: A Generalized Solution For Binary Files." Title just rolls off the tongue, doesn't it? The downside to the rsync alg is that it uses checkpoints (rather than doing true byte-by-byte modification/transposing, it tries to work w/ blocks)- this is much easier on memory, although the d. compression isn't necessarily guaranteed to be optimal (pretty close though).
To be frank, I'd be quite curious how you're planning on improving on the existing delta algs for generalized data. Structured data, yes, there is room for improvement (with the differ and the applier knowing the structure, the control overhead can be simplified/optimized).
jjw wrote:
Even the rolling checksum part. Maybe I'm mistaken, but I think I know a better way to calculate the checksum. I bet that the xdelta algorithm doesn't take advantage of the fact that the source files are present for matching variable-sized blocks either.

Last I looked in xdelta's source, I believe it was using a variant of the adler32 chksum for the rolling chksum- pretty efficient, and fairly unlikely to collide. I don't recall what the strong chksum was, possibly md4 (I think that might've been rsync though...)
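For reference, a minimal sketch of that flavor of weak rolling checksum (a simplification in the adler32/rsync spirit- not xdelta's or rsync's actual code):
Code:
#include <stdint.h>
#include <stddef.h>

struct rollsum { uint32_t a, b; };

/* checksum the initial window of 'len' bytes:
   a = sum of bytes, b = sum of the running a values */
void rollsum_init(struct rollsum *s, const unsigned char *p, size_t len)
{
    s->a = s->b = 0;
    for (size_t i = 0; i < len; i++) {
        s->a += p[i];
        s->b += s->a;
    }
}

/* slide the window one byte: 'out' leaves the front, 'in' enters the back */
void rollsum_roll(struct rollsum *s, size_t len, unsigned char out, unsigned char in)
{
    s->a += (uint32_t)in - out;
    s->b += s->a - (uint32_t)len * out;
}

/* combine into one 32-bit value for the hash table */
uint32_t rollsum_digest(const struct rollsum *s)
{
    return (s->b << 16) | (s->a & 0xffff);
}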
jjw wrote:
[data transposed between files is] the issue I'm worried about. Maybe a second global diff could be done on all the data that is left unmatched. An alternative would be to just ignore the problem.

That'd seem inefficient, although I lack any data to prove it (heh, gotta love conjecture). Offhand, this is what I've been thinking about for it- I haven't yet done any considerations about space needed for this, so who knows how crazy it is. Either way, onto it- A) see what bzip2 can do with it- assuming it's a largely unmodified block of data, bzip2'ing/gzipping the sucker ought to make short work of it. Or, B) it's a data block that is say 95% similar, but does have modifications here and there- I'd thought of attempting to pull the block out, and outside of the file specify basically a pseudo substitution section- eg, block x belongs here and here (based off of entry num and intra-file location), and possibly, if data exists there or there is some particular control syntax there, merge the two.
First off, like I said, I have yet to even consider the efficiency of such a setup- mostly I've been running it through the skull figuring how easiest/most efficiently to do it. Second, the extra overhead it induces in the control language, assuming one could keep it outside of the file control syntax, would likely be quite hefty.
It's a bit out there, and probably moot considering bzip2 can likely handle it quite readily.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Thu May 22, 2003 5:20 am    Post subject: new delta format Reply with quote

Quote:
First off, just in case I sound way the heck out in left field, I'm not sure what you're intending there. I'm guessing that you mean a way of basically identifying identical fields (devmajor, devminor come to mind) that serve as a standard template/update for the tar headers? If that's what you're after, to be honest I don't really see how grouping it as such has any benefit in space efficiency.
In dealing w/ version-wide changes, or instances where it's cheaper to store the general settings in one place and add a field where it differs, I handled it by having basically a template definition- the template is merged/applied to the original source header, then the further modifications are applied. Thus far, aside from the cost of identifying common/identical values throughout the tarball, I've found it to be a simple and easy way to handle global changes. Guessing that is what you were getting at?

No. What I'm suggesting is to move the fields in the tar header around so that the ones that are unlikely to change are adjacent. When the delta is performed the matching blocks will be much larger, and the control structure can be smaller. This certainly would not be useful if you can eliminate the header entirely by using special structured diff techniques as you have been describing.
Quote:
Well, aside from control, there are space considerations- figure it thus: you have a 2 mb file, and a block (assuming you're doing checkpoints) is transposed from one end of the file to the other (front to tail)- the maximum relative addressing needed/allowed for the control syntax must only cover 2 mb. Now if that was but a file in a tarball (say size 20 mb), the addressing size needed/allowed would have to cover 20 mb rather than 2 mb. Yes, the address for transposing the block would be only a certain number of bits, but for reconstructing you need to be able to know how many bits you're using for addressing. Not the greatest example (I know a few ways to minimize the addressing size already), but you get the gist.

Yes, the bits used for addressing could be cut down that way, but an even more effective way would be to use relative addressing and keep the control structure separate from the data. Then bzip2 will wipe up the control codes like a hungry garbage disposal. But that will have to wait until I write the new delta alg.
Quote:
To be frank, I'd be quite curious how you're planning on improving on the existing delta algs for generalized data. Structured data, yes, there is room for improvement (with the differ and the applier knowing the structure, the control overhead can be simplified/optimized).

#1, after finding a matching block the algorithm can do a running comparison backwards and forwards, because it will have access to both datasets (see the sketch after this list).
#2, I plan to use a better control structure which will work effectively with compression and have no extra fluff (that includes md5sums).
#3, possibly use a multi-pass algorithm.
#4, add features like two-way patches.
#5, possibly better checksum calculations. And remove the strong checksum (compare data instead), resulting in large reductions in the hash table, not to mention 100% accuracy, which I don't think the xdelta algorithm can guarantee.
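To illustrate #1, a minimal sketch (purely illustrative, not an actual implementation): once the checksums line up at a pair of offsets, just walk outward byte by byte in both files, since the differ has both of them on disk:
Code:
#include <stddef.h>

struct match { size_t src_off, tgt_off, len; };

/* Given a candidate match at src[si] / tgt[ti] of 'block' bytes,
   extend it backwards and forwards by direct comparison. */
struct match extend_match(const unsigned char *src, size_t slen,
                          const unsigned char *tgt, size_t tlen,
                          size_t si, size_t ti, size_t block)
{
    size_t start_s = si, start_t = ti, end_s = si + block, end_t = ti + block;

    /* grow backwards */
    while (start_s > 0 && start_t > 0 && src[start_s - 1] == tgt[start_t - 1]) {
        start_s--; start_t--;
    }
    /* grow forwards */
    while (end_s < slen && end_t < tlen && src[end_s] == tgt[end_t]) {
        end_s++; end_t++;
    }

    struct match m = { start_s, start_t, end_s - start_s };
    return m;
}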
Quote:
A) see what bzip2 can do with it- assuming it's a largely unmodified block of data, bzip2'ing/gziping the sucker ought to make short work of it.

Sure it will, but that's just ignoring the problem. In other words that data was compressed before we touched it, and our algorithm isn't doing anything with it.
Quote:
Or, B) it's a data block that is say 95% similar, but does have modifications here and there- I'd thought of attempting to pull the block out, and outside of the file specify basically a pseudo substitution section- eg, block x belongs here and here (based off of entry num and intra-file location), and possibly, if data exists there or there is some particular control syntax there, merge the two.

That doesn't solve the problem of how to find the matching blocks...
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Thu May 22, 2003 3:33 pm    Post subject: Re: new delta format Reply with quote

jjw wrote:
No. What I'm suggesting is to move the fields in the tar header around so that the ones that are unlikely to change are adjacent. When the delta is performed the matching blocks will be much larger, and the control structure can be smaller. This certainly would not be useful if you can eliminate the header entirely by using special structured diff techniques as you have been describing.

Viable- one could just run a generic diff alg against it rather than writing something specific to handle it.
jjw wrote:
Yes, the bits used for addressing could be cut down that way, but an even more effective way would be to use relative addressing and keep the control structure separate from the data. Then bzip2 will wipe up the control codes like a hungry garbage disposal. But that will have to wait until I write the new delta alg.

That was already assuming relative addressing.
I'd wonder what gains you'd hope for from separating the control syntax from the data- first off though, clarifying: when I say data in this instance I mean the bytes that are to be appended/inserted here or there.
Either way, as I was saying- if you break the control syntax away from the data, then you have to have some way of addressing which data belongs to which control/command in the patch. Seems like overkill; keeping things inline removes such a need. Course I may be missing what you mean...
jjw wrote:
#1, after finding a block that matches the algorithm can do a running comparison backwards and forwards because it will have access to both datasets.

Fairly sure xdelta does this (kind of a common property of delta algs)- if it doesn't, it would likely handle it via the correction alg.
jjw wrote:
#2, I plan to use a better control structure which will work effectively with compression and have no extra fluff (that includes md5sums).

I can understand the desire to eject fluff fields like 'this was the original file's name'- that seemed a bit off to me. But a better control structure that is more behaved w/ compression?
jjw wrote:
#3, possibly use a multi-pass algorithm.

What would you hope to gain from it? Assuming you were doing a one-half correction pass, about all one could hope for is to go over the generated commands and see if there was some overarching correction/optimization that could be done- as is, these algs are already designed to correct their mistakes in a near-optimal way.
jjw wrote:
#4, add features like two way patches.

Downside to reversible patches is that they will be significantly longer- you can use the offset trick I'd mentioned for replacements, but one is screwed either way on deletes (unless that block shows up elsewhere). Either way, it's a feature that's mostly needed.
jjw wrote:
#5, possibly better checksum calculations. And remove the strong checksum (compare data instead), resulting in large reductions in the hash table

There are a couple of issues with what you're proposing. Offhand- say you do implement a better (stronger) rolling checksum alg; the downside is that in doing so each rolling checksum calculation becomes more expensive. The weak chksum is chosen mainly for its speed, and the fact it doesn't produce that many false positives. From there, obviously, the strong chksum is used to verify things match (compensating for the weak chksum's false positives).
If one were to implement a stronger chksum for the rolling calc, A) the prog likely becomes slower- the weak chk is a large portion of the proc time, B) the required memory goes up by a good portion- by definition, creating a better, more accurate rolling chksum would require more bits. That's assuming that is what you meant.
As for doing away with strong chksumming and doing string comparison, there are two methods I could see-
A) rather than storing the strong chksum, store the string in memory. Again, this hoses memory usage fairly quickly.
B) doing lookup on the specific string from disk- easier on memory, slower (i/o becomes a *major* limiting factor).
I think in attempting to do away with the strong chksum, you're missing its benefits (ultimately why it's used). The strong chksum is stored for the target file, so you only have to compute the strong chksum once per block: computing the chksums for the target in one pass, then working through the source doing the same.
Further, the strong chksum is efficient from a memory standpoint due to the fact that it serves as a way of efficiently pegging blocks of varying size and composition into a specific key (which can then be compared).
Dunno, I could easily see pulling the entire file into memory and just doing a strncmp, but that's dependent on the file size- w/ my setup (individual files) I could probably get away with it for the majority of files, just not something like kde/openoffice without thrashing the heck out of swap...

jjw wrote:
not to mention 100% accuracy which I don't think xdelta algorithm can guarantee.

Well, the strong chksum is md4 based (at least for rsync), so I'd think it's basically five-nines accurate. Highly unlikely that it isn't exact- alternatively, you could use two moderate chksum algs and verify both match.
Course I haven't done the math on it, so again, fairly sure it's not *guaranteed* 100%. Dunno, I don't think via a chksum method one can guarantee 100% for any scenario where there are more bytes being chksummed than the chksum has bits. Of course, that doesn't mean that they aren't remarkably accurate though...
jjw wrote:
Sure it will, but that's just ignoring the problem. In other words that data was compressed before we touched it, and our algorithm isn't doing anything with it.

eh? Not talking about compressed data at all- specifically I was talking of data that is transposed between files. The problem is basically creating a way to say yank from here, stick it there, w/out getting grossly inefficient. Unless you're saying the data is already 'delta compressed'...
jjw wrote:
That doesn't solve the problem of how to find the matching blocks...

I'm not worried about finding matching blocks between removals and additions- that isn't particularly hard- I've been working on a similar rolling chksum setup to identify essentially how alike the two files are. Plus side, I can handle the entire dataset of removals/additions, and basically watch for file/data bounds.
Downside being that files that are quite similar (CVS Entries files for instance) will trip the alg out quite a bit. Dunno, something can be worked out for it though.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Thu May 22, 2003 5:09 pm    Post subject: Re: new delta format Reply with quote

Quote:
I'd wonder what gains you'd hope for from separating the control syntax from the data- first off though, clarifying: when I say data in this instance I mean the bytes that are to be appended/inserted here or there.

A very large gain actually... the control codes will be very repetitive; if they are bunched together, bzip2 can achieve a very high compression rate.
Quote:
Either way, as I was saying- if you break the control syntax away from the data, then you have to have some way of addressing which data belongs to which control/command in the patch. Seems like overkill; keeping things inline removes such a need. Course I may be missing what you mean...

Surprisingly, it can be done without ANY data addressing. Every time a control code needs data, the patcher just grabs the next bytes from the data stream. You don't need random data access.
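A minimal sketch of what I mean (made-up command format, nothing like a final deltup format): the patcher reads commands from the control stream, and insert commands consume the data stream strictly in order, so no data offsets are ever stored:
Code:
/* Apply a delta whose control stream and data stream are kept separate.
   COPY commands reference the old file by offset; INSERT commands just
   consume the next 'len' bytes from the data stream, in order. */
#include <stdio.h>
#include <stdint.h>

enum { CMD_COPY = 'C', CMD_INSERT = 'I', CMD_END = 'E' };

static uint32_t read_u32(FILE *f)              /* 4-byte little-endian */
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)fgetc(f) << (8 * i);
    return v;
}

void apply_delta(FILE *ctrl, FILE *data, FILE *oldf, FILE *newf)
{
    int cmd;
    while ((cmd = fgetc(ctrl)) != CMD_END && cmd != EOF) {
        uint32_t len = read_u32(ctrl);
        if (cmd == CMD_COPY) {
            uint32_t off = read_u32(ctrl);     /* offset into the old file */
            fseek(oldf, (long)off, SEEK_SET);
            for (uint32_t i = 0; i < len; i++)
                fputc(fgetc(oldf), newf);
        } else {                               /* CMD_INSERT */
            for (uint32_t i = 0; i < len; i++)
                fputc(fgetc(data), newf);      /* next bytes, in order */
        }
    }
}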
Quote:
What would you hope to gain from it? Assuming you were doing a one-half correction pass, about all one could hope for is to go over the generated commands and see if there was some overarching correction/optimization that could be done- as is, these algs are already designed to correct their mistakes in a near-optimal way.

On the first pass the alg could use large checksum block sizes, the remaining passes could parse through the leftover data with increasingly smaller block sizes. That would produce better results with less memory used. It would take longer in most cases though...

I do not propose to make a stronger rolling checksum, in fact, I think a slightly weaker one might be more appropriate!
Quote:
B) doing lookup on the specific string from disk- easier on memory, slower (i/o becomes a *major* limiting factor).

This is what I was thinking of. Your point is valid, but any modern OS will cache most of the data that needs to be accessed. In return for this inefficiency we get the following three benefits:
#1 removes the need for the strong checksum
#2 enables variable-sized blocks (on average matches "checksum block size" more data per matching block)
#3 produces 100% accurate results.
Is there another way to accomplish #2?
Quote:
jjw wrote:
#1, after finding a block that matches the algorithm can do a running comparison backwards and forwards because it will have access to both datasets.

Fairly sure xdelta does this (kind of a common property of delta algs)- if it doesn't, it would likely handle it via the correction alg.

This is the issue here. Does it? If so, then why the need for a strong checksum to verify the block?
neuron
Advocate


Joined: 28 May 2002
Posts: 2371

PostPosted: Thu May 22, 2003 7:23 pm    Post subject: Reply with quote

discuss this faster, guys - I like the idea ;)

also, why not bittorrent the downloads? It's an idea for big packages anyway. Could take a lot of load off the main mirrors.
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Thu May 22, 2003 9:08 pm    Post subject: Re: new delta format Reply with quote

jjw wrote:
A very large gain actually... the control codes will be very repetitive, if they are bunched together bzip2 can achieve a very high compression rate.

I'd find this debatable, contingent upon the developer not writing the control syntax in an optimized manner to begin with. Eg: say you're doing a copy starting at source 100, length 50, to target 125; it could be written as c100,50,125 (akin to an ed diff), or, assuming the program was smart enough to discern the maximum address, it could be in binary- using 4 bits for some enumerated symbol for copy, 7 bits for the source start, 7 bits for the length, then 7 bits for the target start address.
From there, the developer could either start another command, or more likely for their sanity pad the existing command w/ 7 bits to align it to a byte boundary.
Alternatively, they could break the control syntax down into sections- copies here (so as to have just one symbol needed for the copy command), adds here (same)....
As for compression, yes, repetitive data is lunch for a good alg- but the major bonus emanates from larger blocks of repetitive data, due to the whole control-to-data overhead issue.
My personal opinion is that the control syntax would already be optimized to begin with, hence compression would be able to take a chunk out of it, but not what I'd define as a very large gain: I'd think at best 5%. As always, I could be wrong.
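Just to make that packing concrete, a wee sketch (my own made-up layout, nothing canonical):
Code:
/* Pack the copy command from the example above: 4-bit opcode, then
   7-bit source start, length and target start (enough here, since all
   three values fit in 7 bits).  25 bits used, 7 bits of padding. */
#include <stdint.h>

#define OP_COPY 0x1u

uint32_t pack_copy(uint32_t src, uint32_t len, uint32_t tgt)
{
    return (OP_COPY << 28) | ((src & 0x7f) << 21)
         | ((len & 0x7f) << 14) | ((tgt & 0x7f) << 7);
    /* low 7 bits are padding to the next byte boundary */
}

/* pack_copy(100, 50, 125) -> one 4-byte command instead of "c100,50,125" */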
jjw wrote:
Surprisingly, it can be done without ANY data addressing. Every time data is needed by the control codes, it grabs more data from the sequence. You don't need random data access.

It's of course contingent on each control command having a length argument, but yeah... it hadn't dawned on me that it doesn't matter whether the control/data are inline or separated.
jjw wrote:
On the first pass the alg could use large checksum block sizes, the remaining passes could parse through the leftover data with increasingly smaller block sizes. That would produce better results with less memory used. It would take longer in most cases though...

I'd say all cases would take longer using such a model. It would be faster than doing both backwards and forwards searching, due to the fact that backwards and forwards searching goes quadratic in the worst-case scenario. Also, as you proceed to smaller block sizes the memory usage goes up, even if you're wiping the hash after each run w/ a certain block size. Ultimately it still would be less memory intensive than not using block chksums- one thing I would wonder about is how the heck you'd deal w/ the delta data generated from each run, specifically integrating them into one coherent delta.
I do think that there is something of that vein that could be accomplished, although I'd think the optimization would come from an analysis of the diffed blocks- as in, w/ the larger block size, do some analysis on the block to basically move it to a byte-sized quanta. Doing continually smaller block sizes would have issues, mainly less and less gain (if any) w/ additional processing being required.

jjw wrote:
I do not propose to make a stronger rolling checksum, in fact, I think a slightly weaker one might be more appropriate!

I'm guessing you'd do a weaker checksum to gain speed in another section of the alg to offset the wrath of extra overhead associated w/ extra checks for false positives (requiring full block comparison)?

jjw wrote:
This is what I was thinking of. Your point is valid, but any modern OS will cache most of the data that needs to be accessed.

Quite aware- a while back I took it upon myself to try and update the old e2compr patches to work w/ kernel v2.4.20- required a lot of screwing w/ the buffering/swap system... One thing that I did learn out of it is that while the cache system is quite neat, one cannot rely on it to necessarily have the needed data cached. Basically, I'd think that the cache may save the prog often enough, but when it doesn't, the i/o hit becomes quite expensive. With small files you could probably skate on by without an issue- I'd personally not rely on it though.
Consider the openoffice tarballs- if you were just diffing the two files (not breaking it down into individual entries), the first pass would be building the chksum hash for the target version. The next pass is going through the source version, which pretty quickly starts pushing the cached sections of the target version out of memory.
Of course, this is assuming that there is enough memory for the file to be cached in memory to begin with.
W/ my system, this would happen pretty much immediately- heck, w/ yours, given the overhead from the OS and various daemons alongside the required memory usage for the app, you'd likely start losing cached portions of the target version before you reached the end of it. Openoffice is an extreme, but not everyone has half a gig of memory- there are the 256 and 128 mb people (w/ pity for those at 64 and below). The cache serves as an attempt to improve performance by holding recent blocks in memory, and just that. It cannot and should not be counted on to keep desired data in memory; one mallocs and stores that which one wants kept.
jjw wrote:
In return for this inefficiency we get the following three benefits:
#1 removes the need for the strong checksum

Granted, string comparisons are mucho better than md4/5 chksumming in terms of proc cost.
jjw wrote:
#2 enables variable-sized blocks (on average matches "checksum block size" more data per matching block)

You're describing basically a non-checkpoint alg, which is significantly costlier in terms of memory and proc.
jjw wrote:
#3 produces 100% accurate results.

If you could find an instance where two different strings of a set block size produce the same strong checksum, I'd consider this an issue. Only when you start pushing the block size to the extremes could I see this being an issue, which you wouldn't be doing anyways, due to the fact that too large a block size would result in a grossly inefficient delta.
jjw wrote:
Is there another way to accomplish #2?

Using a correcting algorithm, doing a pass over the output delta commands/data, inspecting each block for actual changes and looking for optimizations.
jjw wrote:
#1, after finding a matching block the algorithm can do a running comparison backwards and forwards because it will have access to both datasets.

jjw wrote:
This is the issue here. Does [xdelta]? If so, then why the need for a strong checksum to verify the block?

Xdelta being based off of rsync, as far as I know xdelta doesn't do back matching, nor does it particularly need to. That's why it uses chksum matching.
As for the strong chksum's reason for existence, simple- the computationally cheap weak checksum used for rolling *does* produce false positives. The strong chksum is used for identifying whether a block is the same as another.

Pardon for typos, running late and boss needs some frigging javascript...
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Thu May 22, 2003 10:38 pm    Post subject: Re: new delta format Reply with quote

Quote:
one thing I would wonder about is how the heck you'd deal w/ the delta data generated from each run, specifically integrating them into one coherent delta.

It's pretty hard to know the best method at this point, but each run could produce data which is added to a table of matches for sorting and further processing. Then the data could be written out in any format desired.

Quote:
I'm guessing you'd do a weaker checksum to gain speed in another section of the alg to offset the wrath of extra overhead associated w/ extra checks for false positives (requiring full block comparison)?

Just a simple checksum algorithm that accounts for every byte in the block would give nearly the same uniqueness. We don't need to be cryptographically secure! The time spent on false positives is minimal.
Quote:
Granted, string comparisons are mucho better then md4/5 chksumming in terms of proc cost.

You can shrink the hash by 128 bits per entry too!
Quote:
You're describing basically a non-checkpoint alg, which is significantly costlier in terms of memory and proc.

Actually it's a combination of the two: use checkpoints, and verify the data directly when a checkpoint matches.
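Roughly like this (illustrative only, not real code): keep the per-block checkpoints, but when the weak checksum hits, confirm by comparing the actual bytes instead of computing a strong checksum:
Code:
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct checkpoint {
    uint32_t weak;      /* rolling checksum of one source block */
    size_t   src_off;   /* where that block starts in the source file */
};

/* called only when the weak checksums already match; false positives
   just cost one memcmp, and no strong checksum is computed or stored */
int confirm_match(const unsigned char *src, const struct checkpoint *cp,
                  const unsigned char *window, size_t block)
{
    return memcmp(src + cp->src_off, window, block) == 0;
}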
Quote:
jjw wrote:
#3 produces 100% accurate results.

If you could find an instance where two different strings of a set block size produce the same strong checksum, I'd consider this an issue.

I don't consider it an "issue". Only a small benefit.
jjw wrote:
Is there another way to accomplish #2?

Quote:
Using a correcting algorithm, doing a pass over the output delta commands/data, inspecting each block for actual changes and looking for optimizations.

You can't "inspect each block for actual changes" unless you re-access the previously checksummed data. How does this give you variable-sized matching blocks?
Quote:
Xdelta being based off of rsync, as far as I know xdelta doesn't do back matching, nor does it particularly need to. That's why it uses chksum matching.

What I'm trying to point out is that when it matches a block, the data in adjacent blocks is not matched if it contains any non-matching data. You can only do that if you reload the checksummed data for matching.
Quote:
As for strong chksum's reason for existance, simple- the computationaly cheap weak checksum used for rolling *does* produce false positives. The strong chksum is used for identifying if a block is the same as another.

Very few false positives... They would be a minimal overhead and would fit perfectly into the string algorithm.

There would be some extra overhead like you said, but the big advantage would be the variable-sized block matches. The reduction of the hash table would be a great memory booster since the entire hash must reside in memory. That's probably why your disk was thrashing on the OO delta. Disk IO can be slow, but it can also be very fast if accessed sequentially. Thrashing is the biggest problem, and it's caused by random access on massive amounts of data (like a large hash table).
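A rough back-of-the-envelope (my numbers, just to give a feel for it): with 1 KB checksum blocks, a 100 MB source has roughly 100,000 hash entries; dropping a 16-byte (128-bit) strong checksum from each entry saves about 1.6 MB of hash table that no longer has to sit in memory.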
jonner
n00b


Joined: 25 Jul 2002
Posts: 42

PostPosted: Fri May 23, 2003 1:20 am    Post subject: Why not rsync itself? Reply with quote

There has been much discussion on this thread about efficiency of diff algorithms, something which I know almost nothing about. One thing does seem clear, though: the algorithm from rsync is being used. Perhaps I'm missing something, but why not just use rsync itself?

If the gentoo mirrors were running rsync servers as well as http or ftp servers, updating only the differences between distfiles could be completely automated with no additional burden on users or developers, though there would be more burden on server CPUs. Portage could do something as simple as make a copy of an old distfile with the desired version's name. Then it would just run rsync on it.

For example, if foo-0.9.tar.gz exists locally, and Portage needs foo-1.0.tar.gz, it copies foo-0.9.tar.gz to foo-1.0.tar.gz, then rsyncs foo-1.0.tar.gz from the server. The only thing that might be a little tricky would be finding the best candidate old version of the distfile to copy.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Fri May 23, 2003 2:13 am    Post subject: Re: Why not rsync itself? Reply with quote

jonner wrote:
There has been much discussion on this thread about efficiency of diff algorithms, something which I know almost nothing about. One thing does seem clear, though: the algorithm from rsync is being used. Perhaps I'm missing something, but why not just use rsync itself?

If the gentoo mirrors were running rsync servers as well as http or ftp servers, updating only differences between distfiles could be completely automated with no additional burden on users or developers, though there would be more burden on server CPU's. Portage could do something as simple as make a copy of an old distfile with the desired version's name. Then, it would just run rsync on it.

There are several reasons we can't do that:
#1 The packages are compressed - they must be decompressed before rsyncing.
#2 There are various technical issues with recompressing the package.
#3 Diffs are much more efficient because the differences between the files are already known before any data has to pass over the network.
#4 With rsync the difference would have to be recalculated for each request.
#5 It would require mirrors to have rsync software

This project works now and it won't take a lot of work to integrate with Portage when the Gentoo developers are ready. If you'd like to see the feasibility of the current implementation, please see my post for upgrading kde & gcc: https://forums.gentoo.org/viewtopic.php?t=55210
jonner
n00b


Joined: 25 Jul 2002
Posts: 42

PostPosted: Fri May 23, 2003 2:45 am    Post subject: Reply with quote

First, I appreciate the effort greatly, as this is an obvious need; I've already thought about the problem a few times. You've obviously already had some success, so I can't dismiss that. I'm just wondering why such a complex solution is necessary.

As for #1 and #3, why? Are you sure? Have you tried rsync? #4 is the reason there would be more CPU usage on the servers. Why is #5 unreasonable? Portage already uses rsync all the time.

It seems that generating persistent diffs would either require additional effort from developers for each ebuild submission, use significantly more space on the server, or only work between two successive versions of distfiles. Am I wrong on all of these points? If so, I'll shut up.

I guess the question is whether the potential inconveniences of your system are less or more than those of using rsync. I'm just trying to invoke the KISS (Keep it Simple Stupid) principle.
BradB
Apprentice


Joined: 18 Jun 2002
Posts: 190
Location: Christchurch NZ

PostPosted: Fri May 23, 2003 2:54 am    Post subject: Reply with quote

Quote:
It seems that generating persistent diffs would either require additional effort from developers for each ebuild submission, use significantly more space on the server

Hmm, maybe - but the diffs for kde are 2.8Mb, and gcc's is 1.8Mb (IIRC).
You would actually save space if, instead of storing v1, v2, v3 of a whole package, you stored v1 & the diffs to get you to v2 & v3. So a small percentage of extra space to carry doesn't really hurt - diskspace is cheap.
Also, there isn't really too much effort required from the developer (as I see it) because they just submit the ebuild; you could easily autocreate the diffs between two packages with a script.
I think this idea has huge potential to save download time and bandwidth. I would personally push the bandwidth-saving point to the Gentoo devs, because somebody pays for that server bandwidth. Imagine if you only needed to download 3Mb of package instead of 30Mb when going from one release to another. That's an instant 10x saving in server bandwidth - a lot of Mb when you have 1000 people grabbing KDE from a server.

Brad
jonner
n00b


Joined: 25 Jul 2002
Posts: 42

PostPosted: Fri May 23, 2003 3:07 am    Post subject: The question is ease of use Reply with quote

I guess my main concern is that a system to reduce network usage needs to be easy to apply universally. It probably won't be very successful if developers or users need to do something special for each ebuild. Currently, this system does require user and developer intervention, since it hasn't been integrated into Portage yet. However, if that can be done in such a way that it just works for all ebuilds, then more power to it. Also, it would be nice if servers could get away with only storing a few entire tarballs and many diffs, but that's probably more trouble than it's worth.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Fri May 23, 2003 4:44 am    Post subject: Reply with quote

Quote:
First, I appreciate the effort greatly, as this is a obvious need; I've already thought about the problem a few times. You've obviously already had some success, so I can't dismiss that. I'm just wondering why such a complex solution is necessary.

It's NOT a complex solution! This thread makes it look complicated because we are discussing ways to squeeze every last byte out of the patches! In any case, it'll all be contained and abstracted in an easy-to-use program - just as it is now.
Quote:
As for #1 and #3, why? Are you sure? Have you tried rsync?

Very sure... I haven't used rsync recently, but logic tells you the answers.
Quote:
Why is #5 unreasonable?

Not unreasonable, but something a lot of heavy-duty mirrors won't offer.
Quote:
It seems that generating persistent diffs would either require additional effort from developers for each ebuild submission

It probably will end up being an automatic process.
Quote:
use significantly more space on the server

Usually only one patch per package. Might take 5-10% more space.
If you built an rsync system, you'd have to leave all the packages uncompressed on the server. That would take several times the space the packages are taking now.
Quote:
or only work between two successive versions of distfiles.

Not if multiple successive patching is implemented (deltup has a special optimized mode for this).
Quote:
I guess the question is whether the potential inconveniences of your system are less or more than those of using rsync.

Less. Need I say more?
Quote:
I guess my main concern is that a system to reduce network usage needs to be easy to apply universally.

A system of diffs requires no infrastructure other than the ability to download files. What could be more universal?

Quote:
It probably won't be very successful if developers or users need to do something special for each ebuild.

Users definitely won't have to do anything in the end.
Quote:
Currently, this system does require user and developer intervention, since it hasn't been integrated into Portage yet.

Yes, but it works with just a simple program and a sourceforge site. Try to do that with an rsync implementation (any takers?!).
Quote:
Also, it would be nice if servers could get away with only storing a few entire tarballs and many diffs, but that's probably more trouble than it's worth.

On a completely unrelated issue: wouldn't it be nice if Portage could manage the user's distfiles repository, replacing old tarballs with backpatches to save space?

There is one advantage to using rsync - it makes patches unnecessary. Disadvantages? It uses more bandwidth, is heavy on the processor, and is unfeasible with compressed files.

BradB wrote:
I would personally push the bandwidth saving point to the Gentoo devs, because somebody pays for that server bandwidth.
I love your comments Brad. Lots of people just don't care about dial-up users, but you can always get them on the server bandwidth issue!
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Fri May 23, 2003 4:47 am    Post subject: Reply with quote

jjw wrote:
#1 The packages are compressed - they must be decompressed before rsyncing. ...snip...
#3 Diffs are much more efficient because the differences between the files are already known before any data has to pass over the network.
jonner wrote:
As for #1 and #3, why? Are you sure? Have you tried rsync? #4 is the reason there would be more CPU usage on the servers. Why is #5 unreasonable? Portage already uses rsync all the time.

In order- I don't think it was explicitly said earlier in the forum, but diff'ing works best when it's able to identify the largest block of changes/transposed data possible- basically the larger the block, the more efficient the diff is (less control overhead vs actual data). Compression's intention is to basically cut all possible redundant information out, and store it as efficiently as possible.
Therein lies the problem- diffing, to be optimal, needs to match the largest possible block, and compression, to be efficient, essentially removes the largest block (well, it inserts a symbol instead of the block). Basically, if you diff a compressed file, you get a huge diff- this is because
1) for most compressions (gzip/bzip2), to get byte 1024 you must basically first decompress the prior 1023 bytes- in other words, one minor change results in everything following being different. Think of a crypt alg: it's designed so that one minor change (a lower-case rather than upper-case char) results in a completely different result. Diffing a compressed file suffers from the same effect.
3) Rsync works by A) your system creating a rolling checksum and, every x bytes, a checkpoint checksum, and B) the server doing the same. Assuming you were rsyncing an uncompressed file (due to the reasons listed in 1), for every user who is trying to get a diff for a file, the aforementioned checksums must be computed on the server side. There are progs/ways to basically store that chksum information (presuming an rsync server does some caching of that sort), but you get the general gist.
Basically, as you said w/ the KISS principle: which is simpler, a dev uploading a diff, or a server solution that is caching the checksums and, for each user request, computing the differing blocks and uploading them to the user? Part of the reason there was talk of a move away from rsync to cvsup is that rsync doesn't scale as well as people would like.
5) True, portage uses rsync- but purely against gentoo mirrors that are running rsync server software. What about sites that are serving as distfile mirrors? Unless they happen to be a portage/rsync mirror (unlikely), that would require them to run additional software rather than just letting people download files. Also, the servers would suffer from the issues mentioned in 3 and 1.

responding to 3 posts in one...
BradB wrote:
Also, there isn't really too much effort required from the developer (as I see it) because they just submit the ebuild, you could easily autocreate the diffs between two packages with a script.
I think this idea has huge potential to save download time and bandwidth. I would personally push the bandwidth saving point to the Gentoo devs, because somebody pays for that server bandwidth. Image if you only needed to download 3Mb of package instead of 30Mb when going from one release to another. Instant 10x saving in server bandwidth - that is a lot of Mb when you have 1000 people grabbing KDE from a server.

Personally, my view is that the devs are going to be the ones to get stuck w/ the extra work. At least for the near future, assuming gentoo were to run w/ the distfile diff setup, they'd basically be the only ones using such a setup... so, unless the system was set up to handle unified diffs (seems to be the choice of diff formats), the devs would have to be creating their own (nobody else would have 'em). As for using unified diffs, that has issues besides not being the most efficient delta- mainly md5sum correctness. Nonetheless, creation of the diff is pretty much an automated process; it just will require the dev to do it.

jonner wrote:
Currently, this system does require user and developer intervention, since it hasn't been integrated into Portage yet. However, if that can be done in such a way that it just works for all ebuilds, then more power to it.

Yep, user intervention is required. Plus someone w/ high bandwidth to be a nice soul and gen the diff, and serve it somewhere.
To utter the 'famous last words', integration into portage isn't that hard. What's hard is getting the correct infrastructure set up, and getting the devs to actually create the diffs. Oh yeah, that and having a good diffing alg. XDelta has issues, and plain diff just doesn't cut it... heh heh.
jonner wrote:
Also, it would be nice if servers could get away with only storing a few entire tarballs and many diffs, but that's probably more trouble than it's worth.

Dunno, maybe those who want to contribute bandwidth but don't want to be transferring 20+ mb files could just serve as a diff server. Who knows. As for the main servers/mirrors, it would be best to keep all files (both full and diffs) on them, due to the fact that a user may not have any full source on their system (obviously a requirement if you're attempting to use diffs for patching).
I kind of like the diff server idea actually, although I can't see it getting too far- it is easier to have one class of mirrors rather than maintaining and updating two.
jonner
n00b


Joined: 25 Jul 2002
Posts: 42

PostPosted: Fri May 23, 2003 5:25 am    Post subject: Good work Reply with quote

Well, it sounds like there has already been plenty of thought put into this, so I'll shut up now. I will try out deltup. Also, I'll try rsyncing between different versions of tarballs to see for myself how effective it is.

The idea of diff servers is quite interesting, though having distfile servers with different capabilities will add complexity (maybe not much).
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Fri May 23, 2003 6:18 am    Post subject: Re: Good work Reply with quote

jonner wrote:
Well, it sounds like there has already been plenty of thought put into this, so I'll shut up now. I will try out deltup. Also, I'll try rsyncing between different versions of tarballs to see for myself how effective it is.

Input is always welcome... if rsync had a way to dump the delta it generates (and subsequently uploads), that could be used.
I think there was such a program, but I'm fairly sure it's disappeared into the sands of time.

jonner wrote:
The idea of diff servers is quite interesting, though having distfile servers with different capabilities will add complexity (maybe not much).

It depends on whether the files are pushed or pulled- if the central server is pushing the files, then I'd think having a diff-only server would be an uphill battle, getting those in control to do it.
I think it's most likely a pull, which would allow a mirror admin to say they were only doing a diff mirror.
Don't know why- I realize it adds complexity to the setup, but I could see people being willing to be a diff mirror rather than a full-fledged mirror.
BradB
Apprentice


Joined: 18 Jun 2002
Posts: 190
Location: Christchurch NZ

PostPosted: Tue Jun 17, 2003 3:52 am    Post subject: Reply with quote

Just bumping the thread to see where deltup is at - haven't heard much for a bit.

How's it going guys?

Brad
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Tue Jun 17, 2003 5:34 am    Post subject: Reply with quote

BradB wrote:
Just bumping the thread to see where deltup is at - haven't heard much for a bit.

How's it going guys?
Brad

Actually, I started another thread on a separate (but still related) issue, and we're running amok over at https://forums.gentoo.org/viewtopic.php?p=370030#370030. If I recall correctly, I believe jjw has deltup up to v3.5/3.6, the main changes being the addition of a couple of batch scripts for fetching/patching.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Tue Jun 17, 2003 5:48 am    Post subject: Automatic updates Reply with quote

BradB wrote:
Just bumping the thread to see where deltup is at - haven't heard much for a bit.

How's it going guys?

Brad


I'm really glad you posted. A lot of great things have been happening with deltup! On Monday I got a temporary mirror from sunsite.dk for distributing patches. This means that you can finally automate deltup by adding a simple line to your make.conf file. The current version is 0.3.6.1 and you can install it by following these steps:
Code:
wget http://osdn.dl.sourceforge.net/sourceforge/deltup/ebuild-0.3.6.1.tar
tar -xvf ebuild-0.3.6.1.tar -C /usr/portage
rm ebuild-0.3.6.1.tar
emerge deltup

Then you need to edit /etc/make.conf and add:
Code:
FETCHCOMMAND='efetch ${URI} ${DISTDIR} ftp://sunsite.dk/projects/deltup/patchfiles'

That's all there is to it! Emerge will use patches whenever they are available. I'm still working toward full Gentoo acceptance so we can have local mirrors and put some necessary information into the portage tree.
I hope it works for you. Make sure you let me know if anything goes wrong.
God bless,
---JJW
neuron
Advocate


Joined: 28 May 2002
Posts: 2371

PostPosted: Tue Jun 17, 2003 12:18 pm    Post subject: Reply with quote

sweeet! definitely gonna do that :)
BradB
Apprentice


Joined: 18 Jun 2002
Posts: 190
Location: Christchurch NZ

PostPosted: Tue Jun 17, 2003 8:13 pm    Post subject: Reply with quote

Oh, yeah. If you want that to get seriously hammered, submit a HOWTO in tips & tricks, and ask if it can go in the GWN Letter.

Thanks for all the hard work guys - I will set this up tonight.

Brad
Death Valley Pete
n00b


Joined: 25 Mar 2003
Posts: 49
Location: The Inland Empire

PostPosted: Tue Jun 17, 2003 9:59 pm    Post subject: Reply with quote

Hi jjw,

I've been following your progress in the forums (this thread and the other one) for about a month, and I'd like to say again that what you're doing here is incredible. When I suggested sunsite.dk I figured it got buried in the conversation between two people who clearly knew what they were doing, but apparently not. I'm glad I was able to contribute something, even if it was something pretty insignificant.

Anyway, I hope I can be forgiven for asking this again, but now that you've got a real non-SF mirror going, is there any way for regular users to submit patches? For example, I made a patch with deltup for gcc-3.2.3-3.3 (it's about 8 MB) that I'd be willing to spend an hour uploading for the cause. :wink: I know this is a newbie delusion of grandeur, but making it automatic (something like: "Patch not found, downloading full tarball... Make and submit patch? (y/n)") would be a nifty feature (and maybe once this is fully integrated there can be an automatic system run on the mirrors to do that). Looking at the file list on the server it seems that the naming convention has changed, but I won't ask any dumb questions about that until I've read the new manual. I haven't installed the masked deltup yet but I'll do so now if there's Portage integration involved - in fact I'll do that right now.

Anyway, thanks for all the effort, keep us posted, and vaya con dios.
_________________
<instert pithy statement here>
BradB
Apprentice


Joined: 18 Jun 2002
Posts: 190
Location: Christchurch NZ

PostPosted: Tue Jun 17, 2003 10:06 pm    Post subject: Reply with quote

Quote:
I know this is a newbie delusion of grandeur but making it automatic (something like: "Patch not found, downloading full tarball... Make and submit patch? (y/n)")


I think that is brilliant. Though you would need a "confirm upload" step in case you decided that you don't actually have time to upload 8Mb.

Brad
Death Valley Pete
n00b


Joined: 25 Mar 2003
Posts: 49
Location: The Inland Empire

PostPosted: Wed Jun 18, 2003 4:19 pm    Post subject: Reply with quote

Hi BradB,

I'm flattered, but since the extent of my programming capabilities involves:
Code:
printf("Hello world!");

I think we'll just have to wait and see what the man who knows what's going on thinks. :wink: It sure does sound nice though.

Everybody:
Portage integration works beautifully. All you need to do is:
Code:
ACCEPT_KEYWORDS="~x86" emerge -u deltup

(The tools are still a little light on documentation but it's not hard to figure out at all. It took me about a minute to get up to speed.)
Then put
jjw wrote:
FETCHCOMMAND='efetch ${URI} ${DISTDIR} ftp://sunsite.dk/projects/deltup/patchfiles'

in your make.conf. If all you want to do is reap the deltup goodness, then that's it. I for one did a qpkg -l deltup and experimented with the programs it lists; as far as I know you can't really break anything.

For the record this is incredibly cool!

Edit: whoops, BBCode typo fixed. Preview first!
_________________
<instert pithy statement here>