Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
Deltup will update packages and reduce download time
View unanswered posts
View posts from last 24 hours

Goto page 1, 2, 3, 4, 5  Next  
Reply to topic    Gentoo Forums Forum Index Portage & Programming
View previous topic :: View next topic  
Author Message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Mon May 12, 2003 8:20 am    Post subject: Deltup will update packages and reduce download time Reply with quote

Hello everyone,

I'd like you to try my new application for patching packages. It produces correct MD5sums and I've seen cases where it has reduced download times by a factor of 80x. You can find it at: http://sourceforge.net/projects/deltup/

I have also included a few patches for you to try out. This application has great potential, and I think we might be able to eventually merge it with portage to automate the patch process. I have submitted an ebuild which you can find at: https://bugs.gentoo.org/show_bug.cgi?id=18354

It's simple to use: type
Code:
edelta -m <oldpackage> <newpackage> <patchname>
to make a patch and
Code:
edelta -p <patchname>
to apply it. Edelta will automatically look for packages in the portage distfiles directory, so don't include the package directory on the command line. Please try it out and tell me what you think.
---JJW

-------------------------------
Do all for the Glory of God!


Last edited by jjw on Mon May 12, 2003 6:36 pm; edited 1 time in total
Back to top
View user's profile Send private message
Lovechild
Advocate
Advocate


Joined: 17 May 2002
Posts: 2858
Location: Århus, Denmark

PostPosted: Mon May 12, 2003 9:22 am    Post subject: Reply with quote

a factor 80, damn that has to be for Xfree or something big like that..

but my real concern is that the sourceforge page is unavailable, is this your doing or just faith?
Back to top
View user's profile Send private message
Genone
Retired Dev
Retired Dev


Joined: 14 Mar 2003
Posts: 9507
Location: beyond the rim

PostPosted: Mon May 12, 2003 9:55 am    Post subject: Reply with quote

Could you please add a comment about the dependencies of your program, so I don't have to search through your source code :( And BTW, a better error handling would be nice (it just segfaulted because I don't have xdelta installed)
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Mon May 12, 2003 4:03 pm    Post subject: Answering your posts Reply with quote

Quote:
but my real concern is that the sourceforge page is unavailable, is this your doing or just faith?

Actually, I'm not responsible for the SourceForge website. It's working now, and it was working great just 30min. before you posted.

Genone,
Sure. My program requires xdelta, tar, gzip, and bzip2. If you use my ebuild it should automatically install dependencies. I do need some error checking though, especially checks for write errors (like disk full). I'll try to upload a new version with xdelta checking today.
---JJW

-----------------------------
Do all for the Glory of God
Back to top
View user's profile Send private message
Genone
Retired Dev
Retired Dev


Joined: 14 Mar 2003
Posts: 9507
Location: beyond the rim

PostPosted: Mon May 12, 2003 5:29 pm    Post subject: Reply with quote

Hmm, maybe I'm just blind but where is this ebuild. I found it neither on the SF page or on bugzilla (or do you mean the gcc ebuild, is 11470 the correct bug ?)
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Mon May 12, 2003 6:49 pm    Post subject: Reply with quote

At first I thought you were blind, but then I tried my ebuild link... Sorry. That was a stupid mistake. It should have pointed to Bug 18354. I have corrected it. I have also released Ver. 0.2.7 with more error checking. Instead of using the bugzilla ebuild you can download the accompanying ebuild.tar and type
Code:
tar -xf ebuild.tar -C /usr/portage
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Mon May 12, 2003 8:25 pm    Post subject: Reply with quote

Curious, why did you write this in c++? From what I've gathered going through the source, this is essentially a wrapper around gzip/bzip2/other compressions (noticed a couple were commented out) and xdelta. Particular reason you choose compiled over interpreted?

Other then that, I'd be curious about any anecdotal stat's you may have for xdelta vs diff when dealing with text- I already know 99.999% of the time xdelta beats the crap out of a unified diff, but it's something of a toss up when it's xdelta vs an ed diff (ed has much lower overhead).
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Mon May 12, 2003 9:42 pm    Post subject: Compiled or Interpreted? Reply with quote

ferringb wrote:
Curious, why did you write this in c++? From what I've gathered going through the source, this is essentially a wrapper around gzip/bzip2/other compressions (noticed a couple were commented out) and xdelta. Particular reason you choose compiled over interpreted?

Good question. I have a lot of C/C++ experience. I wanted to make a binary patch format to reduce patch sizes (can Python work with bitfields?). I usually try to minimize dependencies (this program doesn't even link to libstdc++). It's basicly C code with a few C++ features. Python would handle strings and system calls better, but deltup is really a basic utility that non-Gentoo users should be able to use without having Python on their system.

Quote:
Other then that, I'd be curious about any anecdotal stat's you may have for xdelta vs diff when dealing with text- I already know 99.999% of the time xdelta beats the crap out of a unified diff, but it's something of a toss up when it's xdelta vs an ed diff (ed has much lower overhead).

Yeah, in my limited tests I've seen a gain with xdelta over diff. I haven't collected any stats though because I don't know of a way to use diff to produce correct MD5sums.
I might try to implement an alternative to xdelta which would output raw differences which could then be bzip2'd. That would be more efficient than diff and xdelta.
Back to top
View user's profile Send private message
dreamer3
Guru
Guru


Joined: 24 Sep 2002
Posts: 553

PostPosted: Wed May 14, 2003 3:29 am    Post subject: Reply with quote

Your saying the updated files that it produces always have the same MD5 sum as the original updated archives? That means it can't possibly be diffing the files inside the archives, but rather the entire archive as a whole? Or maybe I'm just lost... don't have time to read the code right now... maybe a quick explination?
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 14, 2003 5:07 am    Post subject: Method for obtaining delta Reply with quote

Quote:
Your saying the updated files that it produces always have the same MD5 sum as the original updated archives? That means it can't possibly be diffing the files inside the archives, but rather the entire archive as a whole? Or maybe I'm just lost... don't have time to read the code right now... maybe a quick explination?

Your right on! Yes, I'm decompressing the archive without untarring it. Then with a few tricks (gzip requires special handling) it always produces the correct MD5sum (AFAIK). Isn't that neat? It currently supports gz, bz2, and uncompressed archives.

I hope you developers will get interested and interface deltup from portage. Think of the reduced load on your mirrors :wink:. You could start with a simple system without multiple patch support. I'd be glad to help in any way I can - just send an email to jjw at linuxmail.org

In the meantime, is there anyone who will donate DTUs (delta updates) for OpenOffice? I can post them on the deltup site. Even broadband users get frustrated re-downloading the whole suite. I would also appreciate any suggestions.

Keep posting to keep this thread alive!
---JJW
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 14, 2003 5:55 am    Post subject: Re: Compiled or Interpreted? Reply with quote

jjw wrote:
can Python work with bitfields?

yep. http://merd.net/pixel/language-study/syntax-across-languages-per-language/Python.html

jjw wrote:
Yeah, in my limited tests I've seen a gain with xdelta over diff.

Well, I wouldn't say xdelta automatically has a smaller size- as I said, I've yet to see (For text at least) an xdelta diff beat down a diff in ed format. Restating, unified it kills: unified includes both the new text and the old. ed is not unified though...
jjw wrote:
...a way to use diff to produce correct MD5sums.

As for MD5sums, it dawned on me the other day why MD5sum correct ain't possible with just diff- A) file times change on the files (if a file wasn't patched, it has the old timestamp for the previous version), B) the tarball format itself has fields that are specific to the newly created tarball (username, groupname, it's own chksum, devmajor/devminor). Diff works purely on the files, not on the tarball...

Easiest way to explain/prove this/see it- untar a tarball, retar it. ain't md5sum correct...
jjw wrote:
I might try to implement an alternative to xdelta which would output raw differences which could then be bzip2'd. That would be more efficient than diff and xdelta.

If you're talking dumping the xdelta diff to stdout, it originally seemed to have this capability... until automatic gzip decompression was added (it broke it somehow...). If not, I'd be curious how you intend to improve on the control overhead?

dreamer3 wrote:
Your saying the updated files that it produces always have the same MD5 sum as the original updated archives?

deltup is a wrapper around xdelta- xdelta being a binary diff, it will produce an md5 correct target (exempting #2 on the list below).

That said and done, I think xdelta'ing tarball's may not be the best option. I have faith in the rsync alg (xdelta is based on it), but there are a few reasons I think xdelta would be bad for tar.
1) There are many, many minor changes in the tar header (512 byte header for each tar entry/file) between versions... control overhead involved for each of those gets annoying/large for 1/8/12 byte fields between the versions. Note that's for each file entry (Openoffice-1.0.3 has 59,141 files- lost of tar updates there).
2) md5 correctness. For gzipped files w/in the tarball (gzip's in general actually), xdelta can produce a data correct yet md5 incorrect patch- Take a look in the xdelta readme.
3) compressed files w/ in the tarball (similar to 2, yet not 2). Again, look in the readme- the xdelta produced is rather large (many, many minor changes here and there, plus all the control overhead).
4) minor aside, but is Josh MacDonald still developing/supporting xdelta? I poked around the sourceforge site a bit, and it doesn't look too alive... the last cvs commit was in dec. 2001. further, it doesn't look like any one but MacDonald are associated w/ xdelta anymore (hell, open bugs go back a ways too).
I realize I sound cynical/nit picking, but eh, I'm anal...

Don't get me wrong- xdelta kicks ass. I just think there are limitations w/ it's alg that make it not the perfect solution. Offhand for a better 'patcher', I'd think something able to descend into existing tarballs and do the patching at a file level (w/in the tarball)- reliant on diff and xdelta, recursively descend into tarballs/zip's w/in the tarball, and patching/modifying of the tarball headers to cut down on the control overhead diff/xdelta would produce for that section.
I've written something vaguely similar in perl prior to noticing the tarball md5 problem, and due to the need for tar header updates I'm attempting to rewrite it in c... that doesn't fair well due to me being rusty in c at the moment... :evil:

Offhand I'd be curious why the desire for a multiple patch format (as in multiple patches w/in one file)? Also, if you want to make the xdelta patch smaller, run it w/ disabled compression (xdelta delta -0), and compress it with gzip or bzip2 after... zlib compresion is good, but bzip2 is better :P .

Either way, this is a nifty project despite me being overly critical and anal...
Back to top
View user's profile Send private message
dreamer3
Guru
Guru


Joined: 24 Sep 2002
Posts: 553

PostPosted: Wed May 14, 2003 6:28 am    Post subject: Reply with quote

Boy, if this were a bash script I'd love to play with it... I think some of the minor issues can be resolved... but I don't get into C/C++ much anymore...
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 14, 2003 7:10 am    Post subject: Deltup innards Reply with quote

Quote:
1) There are many, many minor changes in the tar header (512 byte header for each tar entry/file) between versions... control overhead involved for each of those gets annoying/large for 1/8/12 byte fields between the versions. Note that's for each file entry (Openoffice-1.0.3 has 59,141 files- lost of tar updates there).

Yes, there are many minor changes, but they must be included anyway... How much do you think we could gain by special recognition of tar files? If it's worthwhile then I think the best way to handle it would be to filter out the tar headers before the file is fed to xdelta.
Quote:
2) md5 correctness. For gzipped files w/in the tarball (gzip's in general actually), xdelta can produce a data correct yet md5 incorrect patch- Take a look in the xdelta readme.

I think md5 is a must. My program disables automatic handling of compressed files by using the --pristine option, guaranteeing md5 correctness. xdelta doesn't deal with gzipped files within the tarball.
Quote:
3) compressed files w/ in the tarball (similar to 2, yet not 2). Again, look in the readme- the xdelta produced is rather large (many, many minor changes here and there, plus all the control overhead).

Maybe we can filter these out too, but people don't usually put compressed files into tarballs... I think binaries are a much bigger problem.
Quote:
4) minor aside, but is Josh MacDonald still developing/supporting xdelta? I poked around the sourceforge site a bit, and it doesn't look too alive... the last cvs commit was in dec. 2001. further, it doesn't look like any one but MacDonald are associated w/ xdelta anymore (hell, open bugs go back a ways too).

You're right... not much going on there... That's one of the reasons I'm working on an alternative.

Quote:
Don't get me wrong- xdelta kicks ass. I just think there are limitations w/ it's alg that make it not the perfect solution. Offhand for a better 'patcher', I'd think something able to descend into existing tarballs and do the patching at a file level (w/in the tarball)- reliant on diff and xdelta, recursively descend into tarballs/zip's w/in the tarball, and patching/modifying of the tarball headers to cut down on the control overhead diff/xdelta would produce for that section.

That sounds like a great idea. A lot of work though, and potentially dangerous.

Quote:
Offhand I'd be curious why the desire for a multiple patch format (as in multiple patches w/in one file)?

I did it because packages often come in multiple tarballs, for example, KDE, GNOME, NVIDIA drivers, MythTV, etc... if you aren't using an automated patcher it's a lot easier to have everything contained in a single patch file.
Quote:
Also, if you want to make the xdelta patch smaller, run it w/ disabled compression (xdelta delta -0), and compress it with gzip or bzip2 after... zlib compresion is good, but bzip2 is better :P .

Hmmm... I have to do some tests! Some of the packages came out larger with bz2 compression (xscreensaver-4.08 to 4.09), so bzip2 isn't always better, and it's slower too. But overall I think you're right. I probably will make it the default in the next version and give CL options to use zlib instead.

Thanks for your suggestions,
---JJW
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 14, 2003 8:10 am    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
Yes, there are many minor changes, but they must be included anyway... How much do you think we could gain by special recognition of tar files? If it's worthwhile then I think the best way to handle it would be to filter out the tar headers before the file is fed to xdelta.

Part of the reason I think (and this is conjecture) that tarball's are a fundamentally different beast is how well the rsync (well the xdelta alg) handles data that has moved a significant distance- assuming tar is just adding files by how the fs lists them (basically by inode order I'd think), there can be multiple files basically inserted/moving the original (basically unmodified) tar entry.
What I do know is that basically all files that are the data-same between versions have a different tarball header- mtime, uname, and gname are the usual culprits.
For that matter, uname and gname are typically different between versions- ergo, for distfile debianutils version 1.16.3 tarball: uname:willy, gname:willy, v1.16.7 has uname:tfheen, gname:tfheen. Why not just say, this tarball's uname/gname is tfheen rather then having a specific modification for *each* tar entry? That's a crap load of extra control/data generated that can be eliminated by getting specific to the tarball format.
jjw wrote:
My program disables automatic handling of compressed files by using the --pristine option, guaranteeing md5 correctness. xdelta doesn't deal with gzipped files within the tarball.

Yep- I'm an idiot, I noticed the pristine, but didn't work out the implications of it: since it's not recursive, and you're using pristine the gzip bug/problem doesn't exist... doh, pardon. I shouldn't post after midnight w/out sufficient caffeine levels...
jjw wrote:
Maybe we can filter these out too, but people don't usually put compressed files into tarballs... I think binaries are a much bigger problem.

True, but when they do it suffers (openoffice being a good example). In some basic tests I'd done w/ my aforementioned perl script that recursively broke down archives and generated xdeltas/diffs, it was something like a 2mb favoring for the generated tarball (a true tarball) and the xdelta. Mind you that lacked the tarball entry modifications (hadn't realized that issue yet), and I also lacked some common-sense checks- case in point, openoffice's moz/zipped/IRIXGCCMinc.zip has a very small delta (like under 500k compared to something like a best compressed 2.6+mb xdelta) once you realize that they moved the files out of a directory (easy enough to catch via the tar headers I'm thinking).
jjw wrote:
You're right... not much going on there... That's one of the reasons I'm working on an alternative.

Might as well start from the shoulders of the giants- http://samba.anu.edu.au/rsync/tech_report/tech_report.html is a decent intro to the rsync alg. I have the pdf of the actual paper he wrote about it somewhere, much more in detail floating around my hd somewhere, but I can't find it... I'll post a link/upload it somewhere probably tomorrow night assuming I can find it.

jjw wrote:
That sounds like a great idea. A lot of work though, and potentially dangerous.

Heh. Lots and lots of md5sum's to verify integrity/correct behavior (same as xdelta). So far, after spending way too much time dinking w/ type conversions (always my weak point in c), I've got a basic setup that's able to discern new files, changed files, and the structure setup to identify the changes in the tar headers. My intention is basically to write a kind-of-control-syntax thing for the tar headers- intention being the ability to specify in one place global changes to the tar headers (think uname/gname), and the ability to specify which fields have changed and update them.
Dunno. Originally I was intending on making it basically a true tarball, but due to the header overhead I'm doing this control thing. From there, for each different file it either appends a diff or xdelta (and a patch length to the header), then wash rinse repeat.

jjw wrote:
I did it because packages often come in multiple tarballs, for example, KDE, GNOME, NVIDIA drivers, MythTV, etc... if you aren't using an automated patcher it's a lot easier to have everything contained in a single patch file.

Curious, for a single patch, umm, well patch, how much have you modified the xdelta file format? I assume for the multiple patch setup you have some form of header control: would xdelta be able to use the (single patch) file on it's own, or is it specific to your prog?
jjw wrote:
Hmmm... I have to do some tests! Some of the packages came out larger with bz2 compression (xscreensaver-4.08 to 4.09), so bzip2 isn't always better, and it's slower too. But overall I think you're right. I probably will make it the default in the next version and give CL options to use zlib instead.

Better yet, I've personally been after some alg/prog that is able to determine which compression method (from a somewhat accurate sampling) to use, which ultimately would be best... know anything of the sort?
For deltup, at the very least leave it non-zlib compressed, and if bzip doesn't cut it you can just use a zlib type app to compress it. Also, have you checked into auto-generating the file bzipped/gzipped? I have a gzip->bzip2 conversion prog (in c) floating around somewhere that is a good indication of how to deal w/ their somewhat weird lib interfaces...
While you're doing tests, check out gzip- for binary data for whatever reason it's pretty well behaved (moreso then bzip2 imho). My own experience, bzip2 usually does better for textual data- tarballs, patches, etc. gzip lags behind, but matches/beats bzip2 on binary data.
Weird. One of these days I look at the underlying alg to find out why, but neh, it's 3:05 am, and I get up in 3 hours...

One final question- have you looked into using the xdelta lib? It'd be a nice way to eliminate having to rely on the actual program, that and integrate the xdelta generation more tightly into your program.
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 14, 2003 8:26 am    Post subject: Reply with quote

Thinking about it, I do have an xdelta diff of openoffice 1.0.2->1.0.3 at work, and a couple other xdeltas should you be curious.
If you have a slightly different format then the xdelta (header/footer), I can either A) add the info myself, or B) just run deltup against it and gen a new diff.
Back to top
View user's profile Send private message
quakz99
n00b
n00b


Joined: 09 Apr 2003
Posts: 33
Location: South Africa

PostPosted: Wed May 14, 2003 8:28 am    Post subject: Reply with quote

hey well i cant offer any good technical advice unfortunately :( but i must say that as a dialup user i am very excited about this program. it looks like a godsend. i will try it out asap

-luke
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 14, 2003 5:42 pm    Post subject: Re: Deltup innards Reply with quote

Quote:
Thinking about it, I do have an xdelta diff of openoffice 1.0.2->1.0.3 at work, and a couple other xdeltas should you be curious.

Thanks, but OO is the real biggie. Do you have untarnished source tarballs?
Quote:
If you have a slightly different format then the xdelta (header/footer), I can either A) add the info myself, or B) just run deltup against it and gen a new diff.

I think B would be better because I can't test the patchfile (I don't have the OpenOffice source tarball). I will download OO if I can perform easy upgrades.
Quote:
how well the rsync (well the xdelta alg) handles data that has moved a significant distance

It does an excellent job, even with matching data at opposite ends.
Quote:
Why not just say, this tarball's uname/gname is tfheen rather then having a specific modification for *each* tar entry?

Makes sense. Another pass through the file would create some extra overhead, but we could make several optimizations with one sweep.
Quote:
I've got a basic setup that's able to discern new files, changed files, and the structure setup to identify the changes in the tar headers.

What if we extracted the tar headers and everything else that needs extra processing and then do a diff on the remaining data? That way, xdelta/diff would be invoked only once, and renamed files or large chunks of text moved from one sourcefile to another would be handled with no extra work.
Quote:
Curious, for a single patch, umm, well patch, how much have you modified the xdelta file format?

I only added a header with a "magic" number, necessary filename information, compression options for package regeneration, and a special "pristine delta" footer for gzip files. xdelta wouldn't understand the format, but the delta could be extracted from the file quite easily. Multiple patches are just single patches concatenated with some of the header removed for every patch after the first one.
Quote:

Better yet, I've personally been after some alg/prog that is able to determine which compression method (from a somewhat accurate sampling) to use, which ultimately would be best... know anything of the sort?

I doubt whether this could be done accurately without using the actual gzip/bzip2 algorithm.
Quote:
Also, have you checked into auto-generating the file bzipped/gzipped?

I'll release a new version real soon (probably today)... with options -b, -g, and -z (bzip2, gzip, and internal zlib).
Quote:
One final question- have you looked into using the xdelta lib? It'd be a nice way to eliminate having to rely on the actual program, that and integrate the xdelta generation more tightly into your program.

I looked into it when I started this project, but invoking xdelta through a system call was just too easy :D! Documentation is a little scarce. I'll try to find out if this would be worthwhile.


What method do you use to obtain an diff in ed format? It should work faster because it can deal with line by line comparisons, but why does it produce more optimized patches? Do you have some examples?

There's a compression format made specifically for binaries you might want to look at: http://upx.sourceforge.net/ I wonder if this could be worked in somehow...

---JJW
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 14, 2003 6:05 pm    Post subject: Bright future Reply with quote

quakz99 wrote:
hey well i cant offer any good technical advice unfortunately :( but i must say that as a dialup user i am very excited about this program. it looks like a godsend. i will try it out asap

Gentoo was an instant hit with me, but the lack of package diffs really bothered me. Portage sync overhead could be reduced too, but that's tomorrow's job :idea::!:
---JJW
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 14, 2003 7:46 pm    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
Thanks, but OO is the real biggie. Do you have untarnished source tarballs?...

I can generate one specific to the gentoo distfile tarballs, using deltup if you're curious (basically abuse my comp at work which is on the uw network).
jjw wrote:
What if we extracted the tar headers and everything else that needs extra processing and then do a diff on the remaining data? That way, xdelta/diff would be invoked only once, and renamed files or large chunks of text moved from one sourcefile to another would be handled with no extra work.

Well, you'd have a few annoyances- first pass through each tarball would create two files, the tar entry/headers, and the tar/file data. From there xdelta the tar-info file, and then xdelta the tar-file data.
To actually patch the tarball you'd have to pull it part into the two-file setup (or do it in memory), patch it, then rebuild the tarball.
The latter part is doable, albeit annoying. Nit picking, you would still suffer from the tarball wide uname/gname change, but compression would help there.
jjw wrote:
I doubt whether this could be done accurately without using the actual gzip/bzip2 algorithm.

So far that's basically what must be done... Although I'd think there ought to be a way to do it w/out doing the actual compression- basically estimatation based on sampling how well each alg might work.
Of course to write this, you need to have a rather throrough understanding of the compression algs...
jjw wrote:
I looked into it when I started this project, but invoking xdelta through a system call was just too easy :D! Documentation is a little scarce. I'll try to find out if this would be worthwhile.

Yeah, the code/lib interface is a bit imposing at first glance...
jjw wrote:
What method do you use to obtain an diff in ed format? It should work faster because it can deal with line by line comparisons, but why does it produce more optimized patches? Do you have some examples?
diff -e typically does it. Basically, it seems like the data to control ratio is better diff -e compared to xdelta... an ed diff just states, (ex: 50,100d) delete theses lines, (0a then text)insert this, change that (5,3c then text.)- you get the idea.
jjw wrote:
There's a compression format made specifically for binaries you might want to look at: http://upx.sourceforge.net/ I wonder if this could be worked in somehow...

Unless I'm wrong, this only works on binary/elf32 files. Maybe not on the elf32 thing, but basically machine code/binary executable.
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 14, 2003 10:59 pm    Post subject: Re: Deltup innards Reply with quote

Quote:
I can generate one specific to the gentoo distfile tarballs

Gentoo always uses the official tarballs. You can download the tarball off the OO site and it should have the same md5. All the Gentoo-specific stuff is in the ebuild. That's one of the things I love about Gentoo.

Quote:
Well, you'd have a few annoyances- first pass through each tarball would create two files, the tar entry/headers, and the tar/file data. From there xdelta the tar-info file, and then xdelta the tar-file data.
To actually patch the tarball you'd have to pull it part into the two-file setup (or do it in memory), patch it, then rebuild the tarball.
The latter part is doable, albeit annoying.

The more I think about this the more I'm convinced it's the best way to solve the problem and that bzip2 would do a fantastic job compressing the tar headers if we consolidated them (but certainly not make a delta of them). Trying to improve on the bzip2 algorithm for this one case is like trying to disassemble a junked car before feeding it to the crusher (hey, I know how to take this car apart!). And for me to try to improve on the xdelta algorithm is like squeezing the blood out of a turnip! I think I can cut down the overhead though (like your ed diff example).

Quote:
Nit picking, you would still suffer from the tarball wide uname/gname change, but compression would help there.

Yes, since uname/gname is in the tar header bzip2 should be able to crunch it easily if they are all together.

Quote:
Unless I'm wrong, this only works on binary/elf32 files. Maybe not on the elf32 thing, but basically machine code/binary executable.

I was thinking it might be useful for binary files within the tarball...

Please don't be offended by my jovial comments!
---JJW
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Thu May 15, 2003 3:34 am    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
Gentoo always uses the official tarballs. You can download the tarball off the OO site and it should have the same md5.

Yep. If you're after it I have an oo deltup generated (nohupped it on the way out the door).

jjw wrote:
The more I think about this the more I'm convinced it's the best way to solve the problem and that bzip2 would do a fantastic job compressing the tar headers if we consolidated them (but certainly not make a delta of them). Trying to improve on the bzip2 algorithm for this one case is like trying to disassemble a junked car before feeding it to the crusher (hey, I know how to take this car apart!). And for me to try to improve on the xdelta algorithm is like squeezing the blood out of a turnip! I think I can cut down the overhead though (like your ed diff example).

I'd thought about just compressing the catted tar headers, which would work, but the thing is that no matter how good the compression alg is it takes less space to not have that data there in the first place. With a little elbow grease, the majority of the header (Exempting complete changes or new files in a version) can be seriously chopped.
ex: a header is 512 bytes; say only mtime has changed (12 byte field), and having a control of 2 bytes to specify what fields have changed (in this case it would only be mtime), that would reduce the tar header from 512 to 14. Further, mtime is stored such that each digit position is a octal char, so that could be converted to a more efficient radix (say 255), then null delimit the field to cut down on the bytes used.
While I realize that's getting anal, the intention is the smallest yet quickly applied/reconstructed patch possible... and compressing 512 bytes will lose out to compressing 14 bytes or less in this case. W/ the setup I'm planning, the only time my version is over 512 bytes (max being 514) is if basically all the fields are at their max (file path/name==255 chars). Unlikely, although a condition to watch for... further, the 514 max probably isn't accurate- it all depends on how much I can simplify the existing fields...
One thing to note though, is that I haven't yet integrated the patch length into my header... I have a pretty good idea how to do this, but it will require some (lossless) rebuilding of the tar header when patching a tarball.
jjw wrote:
I was thinking it might be useful for binary files within the tarball...

It's not applicable to binary files in general- purely executables (think winzip.exe or ls). I'm guessing it does some collapsing of the file specific to the proc commands (add, mov, ...). With a couple test runs w/ a few binaries here and there it seems to beat out gzip (which does the best of the traditionals).
Actually whats extremely nifty about that lil prog is the fact that the packed executable is still executable.
I wish I knew of that back when I had to create basically a full fledged mini-distro that ran on a 16 mb flash drive, and 16mb ram setup... the fact it doesn't abuse the crap out of the stack for running packed executables also is quite nice...

One thing I'm curious about, do you know much about the header info that xdelta tacks onto it's patches? Although I haven't decided if this is a bad idea yet, what I'm building includes the patch type (enumerated) in my specific patch control section due to the fact the applyer/reconstructing app needs to know how to apply the patch... Since I already have the info there, and I intend to include an xdelta version magic in the file, the xdelta header info isn't needed...
That and the extra chksum's w/in the xdelta diff- the whole patch will include a md5sum of the source, the patch, and the target/produced tarball... having extra md5sums littered throughout the file won't be needed.
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Thu May 15, 2003 5:35 am    Post subject: Re: Deltup innards Reply with quote

Quote:
Yep. If you're after it I have an oo deltup generated (nohupped it on the way out the door).

It would be great if you could upload it to SourceForge:
ftp upload.sourceforge.net
anonymous
(enter password)
put <local filename> /incoming/OO-1.02-1.03.dtu

Here's a script for you to try. It builds a tar with 1000 empty files which takes over 500K. After being compressed with bzip2 it hovers around 2k. I realize that the headers will typically have much more random data, but that's unavoidable. In this "best case" scenario it takes about two bytes per file. Now that's pretty impressive!

Code:
declare -i num
mkdir project
cd project
for i in a b c d e f g h i j
do
  mkdir $i
  cd $i
  for j in a b c d e f g h i j
  do
    mkdir $j
    cd $j
    touch 0 1 2 3 4 5 6 7 8 9
    cd ..
  done
  cd ..
done
cd ..
tar cf test.tar project
tar cjf test.tar.bz2 project


Quote:
One thing I'm curious about, do you know much about the header info that xdelta tacks onto it's patches?

Not really, except it includes some pretty useless fields like filenames.

Are you writing your parser in Perl?

I released V. 0.2.8 with bzip2/gzip/uncompressed patchfile support earlier today.
---JJW
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Thu May 15, 2003 2:49 pm    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
put <local filename> /incoming/OO-1.02-1.03.dtu

Pardon for the delay, ended up crashing earlier then intended. Either way, she's uploaded, roughly 24,401k. Bzip2 version is roughly 11,660k (not uploaded).
I'm guessing you would like to generate a fair selection of dtu's? Offhand, given a list of package-version to package-version and desired naming format I could fairly easily automate something to either A) deltup the tarballs, upload it to to sourceforge (downside you need to enable each upload I presume), or B) post it to some webspace I have on the side.
The webspace I have, I wouldn't want to use as a main download site (I doubt my friend would like that...). Regardless, I'm sure something can be worked out.
One question though, what naming scheme are you using for the deltups? I'd personally use the standard distfile naming scheme w/ a minor modification... eg: OOo_1.0.2-1.0.3.dtu (note the extra o). My own opinion though..

jjw wrote:
Here's a script for you to try. It builds a tar with 1000 empty files which takes over 500K. After being compressed with bzip2 it hovers around 2k. I realize that the headers will typically have much more random data, but that's unavoidable. In this "best case" scenario it takes about two bytes per file. Now that's pretty impressive!

Definitely a best case scenario- none the less I follow what you mean. I still think I'll attempt to either A) split the tar data and tar header info into two files and see how small the two can be made, and B) the tar header diff attempt. Mostly I'm being stubborn and think it can be pushed lower via this method...

jjw wrote:
Not really, except it includes some pretty useless fields like filenames.

I'll end up probably looking in the source this weekend to see what chaff can be seperated for multiple-xdelta patches in a file. I'll post what I find (if anything), since you could also use it to cut down on redundant bytes in the multiple-patch dtu format...

jjw wrote:
Are you writing your parser in Perl?

The original version I wrote was in perl- that didn't fair as well as I liked mainly due to A) non-md5 correct reconstructed tarballs, B) a lot of redundant info- I was attempting to store the various generated files w/in a tarball which added to the size.
The initial attempt was smaller then the equivalent xdelta, but I also was lacking the appropriate tar header modifications (leading to non-md5 correctness...)
What I'm writing now is a strange mixture of c and perl- the intention being going completely to c... I'm basically using system calls to perl scripts for sections I haven't either A) rewrote in C or B) don't want to write in c yet (a hash of available files in a tarball for instance).

Any thoughts on how to integrate this beast reliably into the portage system? I've had some experience screwing around w/ that setup (I've been writing a python curses based package viewer/manipulator), although I personally still wonder how to easily track what version the user has and grab the appropriate patch (dtu or otherwise) to do the upgrade.
Back to top
View user's profile Send private message
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Thu May 15, 2003 4:09 pm    Post subject: Re: Deltup innards Reply with quote

Quote:
she's uploaded, roughly 24,401k

Thanks a lot. I've started downloading OO-1.0.2 so that I can upgrade and have both versions!

Quote:
I'm guessing you would like to generate a fair selection of dtu's?

For the SourceForge site I only intended a few large packages (kernel, abiword, wine, OO, ...) because I do have to manually add the file to download.
The reason I am using SourceForge is because I don't have any other suitable webspace.

Quote:
One question though, what naming scheme are you using for the deltups? I'd personally use the standard distfile naming scheme w/ a minor modification... eg: OOo_1.0.2-1.0.3.dtu (note the extra o). My own opinion though..

That would be an excellent naming scheme. Note that some packages use - and _ in their version numbers so distinguishing versions would be complicated. Here's a real weird one from my system: fcpackage.2_1.tar.gz. And look at the official OO 1.0.2 tarball name: OOo_1.0.2_source.tar.bz2

Quote:
B) the tar header diff attempt.

Do you intend to find matching tar headers and neutralize the differance in the new package so they'll match?

Quote:
I'll end up probably looking in the source this weekend to see what chaff can be seperated for multiple-xdelta patches in a file. I'll post what I find (if anything), since you could also use it to cut down on redundant bytes in the multiple-patch dtu format...

I'm sure there's a lot of chaff to chop. Maybe it would be better to process the xdelta patch and write it out in a completely different format.

Quote:
Any thoughts on how to integrate this beast reliably into the portage system? I've had some experience screwing around w/ that setup (I've been writing a python curses based package viewer/manipulator), although I personally still wonder how to easily track what version the user has and grab the appropriate patch (dtu or otherwise) to do the upgrade.

Assuming we can obtain webspace somewhere:
How about building support directly into Portage? When Portage calls wget to download a file, simply look for a patch first! OK, that is going straight for the heart, but it would be really simple. It could be extended later to allow patch sequences.
Otherwise, how about interfacing with emerge? "emerge -up" would show which packages need to be updated.
Back to top
View user's profile Send private message
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Thu May 15, 2003 9:21 pm    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
Do you intend to find matching tar headers and neutralize the differance in the new package so they'll match?

Yep. The format basically be such that the first portion is the tar header-modifier/patch, with the patch size included (if any), then the patch (if any).
Plus side, I can use basically the same format for new files. Downside, it ought to add 2 bytes in worst case scenario.

jjw wrote:
I'm sure there's a lot of chaff to chop. Maybe it would be better to process the xdelta patch and write it out in a completely different format.

If it were that bad, I'd think just rewriting the the patch generator would be the best option... course to discern that, one needs to study the xdelta alg which probably won't be fun.

jjw wrote:
Assuming we can obtain webspace somewhere:
How about building support directly into Portage? When Portage calls wget to download a file, simply look for a patch first! OK, that is going straight for the heart, but it would be really simple. It could be extended later to allow patch sequences.
Otherwise, how about interfacing with emerge? "emerge -up" would show which packages need to be updated.

On the other list I'd mentioned some ideas/concerns for it- the upgrade path thing (which somebody else thought up).
Emerge would need to be modified (specifically the fetch), and the portage setup would require most likely an additional field, diff_uri or something.
More thoughts later... got to run.
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Portage & Programming All times are GMT
Goto page 1, 2, 3, 4, 5  Next
Page 1 of 5

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum