Gentoo Forums
Deltup will update packages and reduce download time
Author Message
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Thu May 15, 2003 11:31 pm    Post subject: Re: Deltup innards Reply with quote

Quote:
the portage setup would require most likely an additional field, diff_uri or something.

What would that be useful for? Every patch will be mirrored on the Gentoo repositories, so why not just grab it using the package name/version? The SRC_URI field shows the original package location, but where would a diff_uri point to? Making changes in the ebuild would be a very annoying task.

Am I missing something here?
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Fri May 16, 2003 3:37 pm    Post subject: OO deltup Reply with quote

ferringb,
Hmmm.... I finished downloading OO 1.0.2 and tried to update it, but the resulting tarball has the wrong md5sum and size. What's the md5 of your OO 1.0.3 tarball? Mine is: 0733dd85ed44d88d1eabed704d579721
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Fri May 16, 2003 7:11 pm    Post subject: Re: OO deltup Reply with quote

jjw wrote:
ferringb,
Hmmm.... I finished downloading OO 1.0.2 and tried to update it, but the resulting tarball has the wrong md5sum and size. What's the md5 of your OO 1.0.3 tarball? Mine is: 0733dd85ed44d88d1eabed704d579721

OOo_1.0.2_source.tar.bz2
8a82b4dbdd4e305b6f6db70ea65dce8c
OOo_1.0.3_source.tar.bz2
984146931906a7d53300b29f58f6a899
unless you're after the plain tarball-
OOo_1.0.3_source.tar
39a4aed9e60c509948ed147227493559
The md5sum on the dtu is ed9a50538c535bcbd52e787db02fa2e

Both of which I've verified on my home machine and a gentoo box at work.
A quick glance at the digest files shows that the files I used were md5-correct. One thing to note: I created it via deltup .27. That's the only thing that comes to mind, unless one of your files' md5sums doesn't match.
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Fri May 16, 2003 7:41 pm    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
Quote:
the portage setup would require most likely an additional field, diff_uri or something.

What would that be useful for? Every patch will be mirrored on the Gentoo repositories, so why not just grab it using the package name/version? the SRC_URI field shows the original package location, but where would a diff_uri point to? Making changes in the ebuild would be a very annoying task.

Am I missing something here?

You could get away with having the uri for the diff in the src_uri, but then you'd have to rewrite the src_uri parser/fetcher so that it was able to screen out just the diff (if it's after only that) or grab the full source (if the user chooses that or lacks a source to work from).
Further, with both my method and yours, stating for oo-1.0.3 grab diff oo-1.0.2-1.0.3 has issues- what if the source they have is only oo-1.0.1? The fetcher/src_uri parser would have to be smart enough to backtrack and find either A) oo-1.0.1-1.0.3 or more likely B) parse previous ebuilds, find the diff-uri for oo-1.0.1-1.0.2, then grab both 1.0.1-1.0.2 and 1.0.2-1.0.3.
The fallacy I find w/ A is that it expects the devs to create multiple diffs, rather than a single diff against the previous source version.
I doubt (imho) that any dev is going to be willing (especially w/ larger sources) to create/upload any patch/diff that isn't against just the previous version- yes it'd be more efficient having a diff for 1.0.1 to 1.0.3, but that's also a complete second run of the diffing program, plus another semi-hefty upload (they'd need to gen 1.0.2 to 1.0.3 also).
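In Portage's own Python, option B's backtracking could be sketched roughly as follows. `resolve_patch_chain` and the diff-availability set are hypothetical stand-ins, not anything that exists in emerge:

```python
def resolve_patch_chain(have, want, versions, available_diffs):
    """Walk the ordered version list from `have` toward `want`,
    collecting one single-step diff per hop.

    `available_diffs` is a set of (old, new) pairs the mirrors carry.
    Returns the list of diffs to download, or None if the chain
    breaks (fall back to fetching the full source).
    """
    chain = []
    idx = versions.index(have)
    for old, new in zip(versions[idx:], versions[idx + 1:]):
        if (old, new) not in available_diffs:
            return None  # gap in the chain: no delta route available
        chain.append((old, new))
        if new == want:
            return chain
    return None

# e.g. a user still on oo-1.0.1 who wants oo-1.0.3:
versions = ["1.0.1", "1.0.2", "1.0.3"]
diffs = {("1.0.1", "1.0.2"), ("1.0.2", "1.0.3")}
```

With only single-step diffs published, the user above would download two patches in sequence; if any link is missing, the whole chain is abandoned.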

Course, alternatively, and I think this might be what you were hinting at, make the fetcher able to intelligently check for an appropriate diff- I have source 1.0.1, let's see how to get to 1.0.3. If, as I posit, the devs will end up creating just a diff against the previous version, that's extra traffic against the servers that could be eliminated by an optional/extra field in the ebuild.
There also is the issue of the dev having created the diffs, but they haven't yet made their way out to the mirrors- an intelligent downloader would fail there because whatever mirror it checked wouldn't have the appropriate diff, hence the fetcher would just grab the full source (despite the diffs existing on the main server).
You could handle this by having the fetcher work against the main server for figuring out what diffs are available, then grabbing from mirror if it can... I doubt the gentoo heads/leaders would like this (defeats the purpose of the round robin mirrors, main server gets hammered). Of course this ain't an issue if once a file touches the main server it's immediately rsynced out though I'd think they do it in intervals personally.
Added, if you do an intelligent downloader there is also the problem of getting the md5 checksum- you could have it download it from the mirror, but that has issues. Via the normal emerge sync you get the md5sums of the files, so that you can download the source from anywhere, and as long as the checksum matches you can use it. If you're pulling the md5sum down at download time, you would also need to do an md5 checksum (based off gentoo's md5sums) on the final product to verify that no intrepid lil punks tried to pass off a trojanned patch. Kind of moot, although a user would be ticked if the patch download was wasted bandwidth due to the md5 checksum not matching...
Course that's being a very, very paranoid person... probably a moot issue in reality though.
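A sketch of that final-product check in Python (Portage's language): apply the patch, compare against the md5 that `emerge sync` already delivered in the digest, and only fall back to the full download on a mismatch. `apply_diff` and `fetch_full` are hypothetical stand-ins, not real Portage or deltup calls:

```python
import hashlib

def md5_of(path):
    """Stream a file through md5 in 64K blocks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def update_source(old_tarball, new_tarball, expected_md5, apply_diff, fetch_full):
    """Try the delta route first; on an md5 mismatch discard the
    result and fall back to downloading the full source."""
    apply_diff(old_tarball, new_tarball)      # stand-in for running deltup
    if md5_of(new_tarball) == expected_md5:   # digest md5 from `emerge sync`
        return "patched"
    fetch_full(new_tarball)                   # the wasted-bandwidth case above
    return "full-download"
```

Because the expected md5 comes from the rsynced digest rather than the download server, a tampered patch can never produce an accepted tarball; the only cost is the wasted patch download.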
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Fri May 16, 2003 9:20 pm    Post subject: Re: Deltup innards Reply with quote

Quote:
You could get away with having the uri for the diff in the src_uri, but then you'd have to rewrite the src_uri parser/fetcher so that it was able to screen out just the diff (if it's after only that) or grab the full source (if the user chooses that or lacks a source to work from).

What's wrong with that? It could be implemented with a few lines of code.

Quote:
Further, with both my method and yours, stating for oo-1.0.3 grab diff oo-1.0.2-1.0.3 has issues- what if the source they have is only oo-1.0.1? The fetcher/src_uri parser would have to be smart enough to backtrack and find either A) oo-1.0.1-1.0.3 or more likely B) parse previous ebuilds, find the diff-uri for oo-1.0.1-1.0.2, then grab both 1.0.1-1.0.2 and 1.0.2-1.0.3.

I definitely prefer B, or both methods combined. A patch spanning several versions would be much larger anyway.

Quote:
Course, alternatively, and I think this might be what you were hinting at, make the fetcher able to intelligently check for an appropriate diff- I have source 1.0.1, let's see how to get to 1.0.3. If, as I posit, the devs will end up creating just a diff against the previous version, that's extra traffic against the servers that could be eliminated by an optional/extra field in the ebuild.

Why not start out with a more limited approach and just support single patches? When multiple patches are implemented, portage can use ebuild information from the portage tree to determine what sequence of patches to install, querying the server isn't necessary.

Quote:
There also is the issue of the dev having created the diffs, but they haven't yet made their way out to the mirrors- an intelligent downloader would fail there because whatever mirror it checked wouldn't have the appropriate diff, hence the fetcher would just grab the full source (despite the diffs existing on the main server).

That's fine: I'm sure the gentoo devs wouldn't want their main server used, and in that window there is no other way to grab the patches. The load would be put on the original websites, not on the Gentoo mirrors, because the patches and the packages should hit the mirrors simultaneously. That's what happens in the current implementation before a package hits the mirrors.

Quote:
You could handle this by having the fetcher work against the main server for figuring out what diffs are available, then grabbing from mirror if it can...

Couldn't we assume all diffs are available and then fall back if they can't be found?

Quote:
I doubt the gentoo heads/leaders would like this (defeats the purpose of the round robin mirrors, main server gets hammered). Of course this ain't an issue if once a file touches the main server it's immediately rsynced out though I'd think they do it in intervals personally.

What information do we need from the main server that we don't have after doing an "emerge sync"? For that matter, what would diff_uri tell us that we can't figure out (assuming we name patches according to the package filename)?
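If patches really are named after the package filenames, the patch name can be computed with no extra ebuild field at all. The naming scheme below is a guess for illustration, not deltup's actual convention:

```python
import re

def diff_name(old_pkg, new_pkg):
    """Derive a delta-patch name from two distfile names, e.g.
    foo-1.0.2.tar.bz2 + foo-1.0.3.tar.bz2 -> foo-1.0.2-1.0.3.dtu.
    Assumes name-version.tar.(gz|bz2) distfiles (a simplification)."""
    pat = r"(.+)-([\d.]+)\.tar\.(?:gz|bz2)$"
    m_old = re.match(pat, old_pkg)
    m_new = re.match(pat, new_pkg)
    if not (m_old and m_new) or m_old.group(1) != m_new.group(1):
        raise ValueError("distfile names do not match")
    return "%s-%s-%s.dtu" % (m_old.group(1), m_old.group(2), m_new.group(2))
```

Given the old and new distfile names from the ebuilds, emerge would already know exactly which patch file to ask the mirror for.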

Quote:
Added, if you do an intelligent downloader there is also the problem of getting the md5 checksum- you could have it download it from the mirror, but that has issues. Via the normal emerge sync you get the md5sums of the files, so that you can download the source from anywhere, and as long as the checksum matches you can use it. If you're pulling the md5sum down at download time, you would also need to do an md5 checksum (based off gentoo's md5sums) on the final product to verify that no intrepid lil punks tried to pass off a trojanned patch. Kind of moot, although a user would be ticked if the patch download was wasted bandwidth due to the md5 checksum not matching...

Since the patches will be generated automatically, we won't have people submitting patches, so the trojanned patch would have to come from a developer. You have to download the patch to check the md5sum, so why not apply it and check the resulting package's md5sum? That's the important thing. In fact, it would even be OK if the developers uploaded new, more efficient patches over the old ones (except some people would be in the middle of downloading them!)
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Sat May 17, 2003 6:43 am    Post subject: bzip2 versions Reply with quote

Quote:
OOo_1.0.2_source.tar.bz2
8a82b4dbdd4e305b6f6db70ea65dce8c
OOo_1.0.3_source.tar.bz2
984146931906a7d53300b29f58f6a899
unless you're after the plain tarball-
OOo_1.0.3_source.tar
39a4aed9e60c509948ed147227493559
The md5sum on the dtu is ed9a50538c535bcbd52e787db02fa2e

Both of which I've verified on my home machine and a gentoo box at work.
A quick glance at the digest files shows that the files I used were md5-correct. One thing to note: I created it via deltup .27. That's the only thing that comes to mind, unless one of your files' md5sums doesn't match.


OK, I finally figured it out. Bzip2 1.0.2 doesn't always produce output consistent with version 0.9.0b. It must be a rare occurrence though, because the compressed data is identical up to byte 45279200 for OO 1.0.3 and byte 45133774 for OO 1.0.2. After these points the files are completely different. The OO tarballs are compressed with the old version. I downgraded bzip2 and was able to obtain the correct tarball.
This is going to take some careful consideration.
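Finding the exact divergence offset between two compressed files, as reported above, only takes a throwaway byte-compare; a quick sketch in Python:

```python
def first_difference(path_a, path_b, blocksize=65536):
    """Return the offset of the first differing byte between two
    files, or None if they are identical (including length)."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(blocksize)
            b = fb.read(blocksize)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # no differing byte in the overlap: one file is shorter
                return offset + min(len(a), len(b))
            if not a:
                return None  # both streams ended together
            offset += len(a)
```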
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Sun May 18, 2003 9:33 am    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
What's wrong with [having diff's uri in src_uri]? It could be implemented with a few lines of code.

Well, to me it doesn't seem like a clean solution. Yes it could be implemented in a few lines of code- but that would require modifying emerge such that it basically does conditional processing on the src_uri string. Take a look in blackdown_jdk's ebuild for an example of dealing with a variable src_uri (in the distfile-diff case it would be full source uri vs diff uri). Yes emerge could be modified to do the src_uri processing, but why do it this way? The full source uri is a different beast than the diff uri, and the ebuild variables (IMO) ought to reflect this.

jjw wrote:
Why not start out with a more limited approach and just support single patches? When multiple patches are implemented, portage can use ebuild information from the portage tree to determine what sequence of patches to install, querying the server isn't necessary.

Single patch is the intention. To be honest I can't see multiple patch support being useful/easy to implement for the gentoo setup- while rolling the kde and xfree diffs into one has the benefit of one file, having them as separate files incurs basically minimal cost, both fs and network traffic. Ultimately you still have to download the same data, the only difference being the extra overhead involved in initing another download.
My own opinion mind you. I do think the multiple could have uses- thinking pseudo release upgrade type thing, at least from the source standpoint.

jjw wrote:
Couldn't we assume all diffs are available and then fall back if they can't be found?

I suppose, although I personally would take the opposite of it and assume no diffs available (it handles devs not creating diffs a bit nicer), and via whatever method find out if diffs are available. This is part of the reason I say diff_uri is a needed variable- if it's absent (either through a dev's mistake or just that there isn't a diff), it's a simpler solution from the standpoint of emerge: fall back to the normal method. That, and it provides a method of concretely knowing what diffs should be available.

jjw wrote:
What information do we need from the main server that we don't have after doing an "emerge sync"? For that matter, what would diff_uri tell us that we can't figure out (assuming we name patches according to the package filename)?

As mentioned above, via diff_uri you can concretely know if there even exists a diff for the version.

jjw wrote:
Since the patches will be generated automatically, we won't have people submitting patches, so the trojanned patch would have to come from a developer. You have to download the patch to check the md5sum, so why not apply it and check the resulting package's md5sum? That's the important thing. In fact, it would even be OK if the developers uploaded new, more efficient patches over the old ones (except some people would be in the middle of downloading them!)

Heh, like I said, I'm paranoid. If either from a user's error, or a lovely tweaking of dns/routing, somebody directed a user's request for a diff (and the md5) to their specific server, this is possible. Extremely unlikely, and paranoid, yes. Basically figure it thus: the portage system includes the md5sum of a source for a reason- as I said, A) it gives you the ability to pull the source from anywhere as long as the md5sum matches, B) it serves as a sort of quality control- the source *must* match the md5sum for emerge to work. While you can get emerge to use a non-md5-correct source, it's something that takes effort, rather than being automatic- so users attempting a possibly dumb thing must do it themselves (think of it as emerge being unwilling to help the user slit their own throat...).
In terms of a dev uploading a new/more efficient patch, from my understanding of portage that appears to be at best what the various release versions would be equivalent to. I'd be inclined to say get the diff right prior to uploading everything though, mainly since you don't see devs uploading a new source for their package (along w/ an updated digest file to the rsync servers) if they find a smaller version. What's released is just that- things only get pulled/updated when there are problems/specific reasons it must be done.
In terms of a more efficient patch, I'd wonder how likely that scenario is- yes the underlying alg may be improved to produce a smaller delta, but it's unlikely (in my books) that a dev would even consider/take the time to upload a more efficient patch. An automated setup could handle that mundane task, but I doubt this will end up an automated setup...

Curious, don't suppose you know of any semi-standard libs that provide an md5 function? I'd toyed w/ using popen to use md5sum, but it seems like a waste considering I'm already basically going through every byte of the file. Currently, until either A) I roll something myself or B) I find a lib that has it, I'm just dumping the data out through a pipe into md5sum and piping the md5 back in. Not an elegant solution... course I'm doing something similar for the diffing till I write it to use the lib interfaces...
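For what it's worth, in Python (where Portage lives) the single-pass version is trivial with the standard library's `hashlib`; the C equivalent would be openssl's incremental MD5_Init/MD5_Update/MD5_Final calls. A sketch, where `handle_block` stands in for whatever the diff code does with each block:

```python
import hashlib

def process_and_checksum(path, handle_block):
    """Feed every block of the file to the diff logic *and* the md5
    state in the same pass, instead of piping a second copy of the
    data out to an external md5sum process."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            handle_block(block)  # the "going through every byte" work
            h.update(block)
    return h.hexdigest()
```

Since the file is read once anyway, the checksum comes essentially for free.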

Also, have you looked into adding a magic signature for deltup files to the 'file' command's db/list of magic sigs? Would be worthwhile for helping other apps be able to identify/deal with deltup data.
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Sun May 18, 2003 9:35 am    Post subject: Re: bzip2 versions Reply with quote

jjw wrote:
OK, I finally figured it out. Bzip2 1.0.2 doesn't always produce output consistent with version 0.9.0b. It must be a rare occurrence though, because the compressed data is identical up to byte 45279200 for OO 1.0.3 and byte 45133774 for OO 1.0.2. After these points the files are completely different. The OO tarballs are compressed with the old version. I downgraded bzip2 and was able to obtain the correct tarball.
This is going to take some careful consideration.

I'm not surprised- as I recall from using a perl module that was basically a wrapper around the bzip2 library, the .9x line had some annoying changes to the api (aside from function renaming, things behaved slightly differently).
It'd be worthwhile to check the bzip2 libs and see if there is a function for producing the older bzip2 format... something I find highly unlikely.
Were the old and new bzipped files the same size?
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Sun May 18, 2003 3:52 pm    Post subject: Re: Deltup innards Reply with quote

I'll be out of town for the day, and I'm in a big hurry. Pardon any errors.

Quote:
Well, to me it doesn't seem like a clean solution. Yes it could be implemented in a few lines of code- but that would require modifying emerge such that it basically does conditional processing on the src_uri string. Take a look in blackdown_jdk's ebuild for an example of dealing with a variable src_uri (in the distfile-diff case it would be full source uri vs diff uri). Yes emerge could be modified to do the src_uri processing, but why do it this way? The full source uri is a different beast than the diff uri, and the ebuild variables (IMO) ought to reflect this.

I'm not suggesting we use the src_uri field. We could use internal Portage variables already extracted from the ebuild to determine the filename/version.

Quote:
Single patch is the intention. To be honest I can't see multiple patch support being useful/easy to implement for the gentoo setup- while rolling the kde and xfree diffs into one has the benefit of one file, having them as separate files incurs basically minimal cost, both fs and network traffic. Ultimately you still have to download the same data, the only difference being the extra overhead involved in initing another download.
My own opinion mind you. I do think the multiple could have uses- thinking pseudo release upgrade type thing, at least from the source standpoint.

There is a misunderstanding here. By "multiple patch" I'm talking about patching a source several times to obtain the latest version. By "single patch" I mean the lack of this feature (look at what I'm responding to).

Quote:
I suppose, although I personally would take the opposite of it and assume no diffs available (it handles devs not creating diffs a bit nicer), and via whatever method find out if diffs are available. This is part of the reason I say diff_uri is a needed variable- if it's absent (either through a dev's mistake or just that there isn't a diff), it's a simpler solution from the standpoint of emerge: fall back to the normal method. That, and it provides a method of concretely knowing what diffs should be available.

But the diffs can be built automatically. They should almost always be present.

Quote:
Curious, don't suppose you know of any semi-standard lib's that provide an md5 function? I'd toyed w/ using popen to use md5sum, but it seems like a waste considering I'm already basically going through every byte of the file. Currently, until either A) I roll something myself or B) I find a lib that has it, I'm just dumping the data out through a pipe into md5sum and piping the md5 back in. Not an elegant solution... course I'm doing something similar for the diffing till I write it to use the lib interfaces...

I haven't tried it, but take a look at "openssl/md5.h". I get the man page by typing "man md5"

Quote:
Also, have you looked into adding a magic signature to the file's db/list for deltup files? Would be worthwhile for helping other apps be able to identify/deal with deltup data.

The magic signature has been "DTU" as long as the format has existed. I hope it doesn't conflict with some other format...

Quote:
I'm not surprised- as I recall from using a perl module that was basically a wrapper around the bzip2 library, the .9x line had some annoying changes to the api (aside from function renaming, things behaved slightly differently).
It'd be worthwhile to check the bzip2 libs and see if there is a function for producing the older bzip2 format... something I find highly unlikely.
Were the old and new bzipped files the same size?

No, they aren't the same size. I've made an ebuild for bzip2-0.9.0c which installs the binary as bzip2_old. That's the only way I know of solving the problem...

I'll respond to the md5 thing later...
BradB
Apprentice


Joined: 18 Jun 2002
Posts: 190
Location: Christchurch NZ

PostPosted: Mon May 19, 2003 4:57 am    Post subject: Reply with quote

Just wanted to offer some encouragement. This is a great idea & would go a long way toward making dial-up less of a hassle. It also has the nice effect of reducing bandwidth. Keep up the good work guys, I will be trying this out when I have a spare hour or three :)

Brad
ferringb
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Mon May 19, 2003 7:40 am    Post subject: Re: Deltup innards Reply with quote

jjw wrote:
I'm not suggesting we use the src_uri field. We could use internal Portage variables already extracted from the ebuild to determine the filename/version.

True- although that would take the optimistic view that there is a diff. Also, say a dev creating an ebuild for package foo v1.5 decides to be nice, and besides creating the diff for v1.4 to v1.5, creates a v1.0 to v1.5 for those lagging behind. Aside from poking the server and seeing what diffs exist, there wouldn't be any way to handle using the v1.0 to v1.5 diff (if applicable of course). Course I may be missing something, I've been staring at a terminal for the last 5 hours and the faculties are starting to degrade a bit...

jjw wrote:
There is a misunderstanding here. By "multiple patch" I'm talking about patching a source several times to obtain the latest version. By "single patch" I mean the lack of this feature (look at what I'm responding to).

Doh... heh, pardon. Actually, I'd think once the emerge code to handle dealing w/ patching the source is in place, multiple should be cake to add. Famous last words of course...

jjw wrote:
I haven't tried it, but take a look at "openssl/md5.h". I get the man page by typing "man md5"

Heh, nothing like being caught with your pants down. I'd read openssl had an md5 function, just never checked it...
What I ended up doing, which shifts the dependency, is just create some pipes, fork, make stdin and stdout use the pipes, then exec md5sum. Downside is now it's dependent on md5sum, but neh, it was a quick kludge and it works. I'll switch over to openssl probably about the time I add in automatic gzip/bzip2 compressed file handling.

jjw wrote:
The magic signature has been "DTU" as long as the format has existed. I hope it doesn't conflict with some other format...

Actually I was talking about adding the magic signature for your deltup-gen'ed files to the 'file' command's db of magic sigs. On gentoo, the magic sig listings are in /usr/share/misc/file/magic (it should normally be /usr/share/magic on most linux machines).
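For reference, a magic(5) entry for the DTU signature would be a one-liner along these lines, assuming the three signature bytes sit at offset 0 (the actual dtu header layout may differ):

```
# hypothetical entry for the file(1) magic database
0	string	DTU	deltup delta update (.dtu) data
```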

jjw wrote:
No they aren't the same size. I've made an ebuild for bzip2-0.9.0c which installs the binary as bzip2_old. That's the only way I know of solving the problem...

Lovely... thanks for the info, looks like when I get to handling compressed files I'll have to add a check for a file having the old bzip2 format on a system that uses the post-.9x bzip2 libs... that sounds like an annoying hornet's nest.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Mon May 19, 2003 7:46 am    Post subject: Reply with quote

BradB wrote:
Just wanted to offer some encouragement. This is a great idea & would go a long way toward making dial-up less of a hassle. It also has the nice effect of reducing bandwidth. Keep up the good work guys, I will be trying this out when I have a spare hour or three :)

Brad


Thanks for the encouragement Brad. It's great to know that people are interested.
I want to say a big Thank You to someone for including a note about this project in the GWN :wink:!
---JJW
Foolhardy
n00b


Joined: 12 Apr 2003
Posts: 28

PostPosted: Mon May 19, 2003 10:29 pm    Post subject: Reply with quote

Ok, I know nothing about coding or development so forgive my ignorance. However, I do understand the value of the ideas presented, and I must say...pretty damn cool!

Makes me wonder why this wasn't included in Portage in the first place :)

More power to ya!
_________________
If you were in my position, you would have written the exact same thing.
STEDevil
Apprentice


Joined: 24 Apr 2003
Posts: 156

PostPosted: Mon May 19, 2003 11:44 pm    Post subject: Reply with quote

As another nondeveloper (read: lamer ;)) all I can do is sit around drooling with the rest of the peanut gallery and chime in with a big WOW. :)

This is going to be a really great addition to Gentoo (and certainly should be a part of portage in the future) and really gives a huge edge vs binary distributions. The potential BW savings for mirrors and end users alike is just mind-boggling.

BTW, I'd like to add, it's really nice if you can implement this without messing up the md5sum of the files. But if this should turn out to be impossible down the line, don't give up, because the advantages of a "patch-source" system like this are great enough that a system that (serverside) automatically produces md5sums for a patched file could be made instead. For the vast majority of clients this alternate md5 will be good enough (and for the few who are paranoid enough there is always "get the full original unpatched source").
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Tue May 20, 2003 1:50 am    Post subject: Simple install instructions Reply with quote

Thanks for your enthusiasm about this project. Don't let all this talk about the md5sum bother you. The only package deltup currently doesn't work with is openoffice!
You can make some use of this program right now... I routinely upload patches for the largest packages onto deltup's sourceforge site. Most people will be able to install it with these three commands (when logged in as root or portage):

Code:
wget http://osdn.dl.sourceforge.net/sourceforge/deltup/ebuild.tar
tar -xvf ebuild.tar -C /usr/portage
emerge deltup


Then simply download the patches you want, and type:

Code:
edelta -p <packagename>


I hope y'all can make use of it even in this early stage of development!
There were a LOT of packages updated recently, including kde-3.1.2 and gcc-3.2.3. I'll be releasing patches for them as soon as I can download everything.
---JJW
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Tue May 20, 2003 3:22 am    Post subject: Different stuff Reply with quote

Quote:
Actually I was talking about adding the magic signature for your deltup gen'ed files to the command 'file' db of magic sigs. On gentoo, the magic sig listings are in /usr/share/misc/file/magic (it should normally be /usr/share/magic on most linux machines).


Oh. Sorry, I thought you might have meant something else, but I'm not yet familiar with these utilities. I guess adding the dtu format would be up to Carsten Klapp. This might be useful, especially if the DE developers want to enable "click-n-update"!

About patch md5sum:
Scenario "with patch md5sum checking":
    Portage downloads patch that has been compromised.
    Portage checks md5sum of patch
    md5sum fails - take appropriate measures

Scenario "without patch md5sum checking":
    Portage downloads patch that has been compromised.
    Portage applies the patch
    Portage checks md5sum of resulting package
    md5sum fails - take appropriate measures


Why is the first scenario better or safer than the second one? I'll answer my own question and say that maybe the patch contains other updates. If portage verifies that only the requested package is updated it should be quite secure.

I have decided to phase out the -c and -n options and all the combined patches stuff, and instead build in support for tarballed patches. This way it will be much simpler to add and remove patches from a patch set.
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Tue May 20, 2003 8:18 pm    Post subject: kde-3.1.2 and gcc 3.2.3 Reply with quote

If you haven't yet upgraded to kde-3.1.2 or gcc 3.2.3 you may want to take a look at this post: https://forums.gentoo.org/viewtopic.php?p=331221
Death Valley Pete
n00b


Joined: 25 Mar 2003
Posts: 49
Location: The Inland Empire

PostPosted: Tue May 20, 2003 9:21 pm    Post subject: Reply with quote

Is there any way for users "on the street" to submit patches? This seems like an incredibly cool project, especially for those of us on dialup. I don't know enough about anything to help on the programming end, but I'd be willing to help by submitting patches. Maybe there could even be a sticky forum with requests for patches and a grand unified site where this stuff could be found?

If there's nothing else I can contribute, let me add my voice to the crowd of people saying "good work!" :)
_________________
<insert pithy statement here>
jjw
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Tue May 20, 2003 10:05 pm    Post subject: Contributing Reply with quote

Death Valley Pete wrote:
Is there any way for users "on the street" to submit patches?

Unfortunately SourceForge only provides 100MB of space and a difficult upload mechanism. :cry: It's only useful for the deltup source and a few important packages. Maybe someone will donate some webspace.
Quote:
This seems like an incredibly cool project, especially for those of us on dialup. I don't know enough about anything to help on the programming end, but I'd be willing to help by submitting patches. Maybe there could even be a sticky forum with requests for patches and a grand unified site where this stuff could be found?

That might be a good idea, especially for non-gentoo users. This project will benefit everyone automatically when/if Portage starts using it.
Quote:
If there's nothing else I can contribute, let me add my voice to the crowd of people saying "good work!" :)

The best contribution you can give is your interest. Thanks a lot. 8)
Death Valley Pete
n00b
n00b


Joined: 25 Mar 2003
Posts: 49
Location: The Inland Empire

PostPosted: Wed May 21, 2003 12:27 am    Post subject: Reply with quote

I don't know much about anything, but I wonder if the folks at http://sunsite.dk/ would be willing to give us a hand. They're hosting some pretty huge projects, so they might be willing to host our .dtus. I can't take the initiative on this one (I'm a super-newbie :roll: and don't have much time), but I thought I'd suggest it for what it might be worth.

Also, if this ever gets integrated into Portage, it shouldn't be too hard to get the regular mirrors to host it, should it?
_________________
<instert pithy statement here>
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 21, 2003 2:17 pm    Post subject: Re: Different stuff Reply with quote

jjw wrote:
Why is the first scenario better or safer than the second one? I'll answer my own question and say that maybe the patch contains other updates. If portage verifies that only the requested package is updated it should be quite secure.

Heh, valid point that ultimately, if you're doing an md5sum check of the resulting tarball, it's a moot issue, although personally I'd rather verify the patch for speed considerations (smaller dataset). If one were doing a multiple-patches-in-one-file setup, an md5sum pulled from the server is viable, although I still posit that little is gained by rolling multiple patches into one file. Moot issue nonetheless; it's something you want/intend.

jjw wrote:
I have decided to phase out the -c and -n options and all the combined patches stuff, and

I must admit I'm not entirely familiar w/ deltup's options (I don't think there is a help option yet, or at least last I checked). -c=?, -n=?
jjw wrote:
...instead build in support for tarballed patches. This way it will be much simpler to add and remove patches from a patch set.

What's this equate to? In other words, are you intent on doing intra-tarball patches? I guess I don't follow exactly what you mean.

A couple of things from the last few days of dinking around w/ my lil project and reading up on the xdelta and rsync algorithms (along w/ a few papers by Randal Burns on deltas- those are a bit nastier)- a couple of concerns come up. Not trying to be combative, but I (obviously) have concerns w/ a purely xdelta-based setup.

First, and possibly kicking the dead horse again dependent on the answer to the paragraph above, xdelta like most delta algs sucks with structured data. While a tarball is loosely structured, it's still a structured dataset.
This is important in a few ways- first off, speed and memory. Xdelta builds a chksum representation of the entire file, then proceeds through the file doing lookups in a hash table for a matching chksum. A) The author may've corrected this, but the original xdelta alg trashed any false chksum matches, which results in a larger delta. B) Continuing w/ speed and memory, a tarball is a congregation of datasets, each unto their own. In other words, including the tarball header, say file x occupies bytes 512-1536 of a 262144-byte (512^2) tarball: xdelta runs through the entire range, rather than just the 1024-byte block that that file occupies. Yes, xdelta may (theoretically) be able to produce a smaller delta when considering it as one giant blob, but it's unlikely; further, it's much more hardware-intensive to consider it all as one file when in reality it is multiple files. Breaking it down into individual files likely increases the speed (possibly fewer collisions to deal with), but is guaranteed to improve memory usage. Try running xdelta against uncompressed openoffice tarballs (each around 560mb)- it thrashes the heck out of my system, which has 512mb of ram and a gig of swap.
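The per-file idea above rests on the fact that the 512-byte tar headers already bound each member's data exactly. A toy sketch (Python's tarfile, purely for illustration — neither deltup nor diffball works this way) of pulling out those extents so each member could be diffed on its own:

```python
import io
import tarfile

def member_extents(tar_bytes: bytes):
    """Yield (name, data_offset, size) for each regular file in a tarball,
    read from the 512-byte header records, so a delta alg could diff each
    member on its own instead of hashing the whole blob."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        for m in tf:
            if m.isfile():
                yield m.name, m.offset_data, m.size

# Hypothetical two-member tarball built in memory for illustration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"world!")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

extents = list(member_extents(buf.getvalue()))
# a.txt's data starts right after its 512-byte header; b.txt's header
# begins at the next 512-byte boundary after a.txt's padded data.
```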

Second, and this I realize isn't likely to happen too often, xdelta is a one-way modification. I.e. it's non-reversible; you need a separate delta to go from 1.0.3 back to 1.0.2. Personal opinion, but I'd tend to think any diff setup for portage would need at least the ability to regress a version via a diff.
Downside is that this obviously steps the size of the delta up, although I have a few ideas how to cut down on that- most delta algs, for a replacement, just store the new text... compute the offset between the original byte and the new byte, and distribute that offset instead of the final version. At the very least, using that method, data that is replaced/changed won't require the original and target data (2 * # bytes), just the offset between the two (1x for replacements + new inserts). This can be extended further, although it's partially dependent on a control syntax designed around it. Regardless, the ability to go in reverse is something to be considered (preferably w/out requiring another delta computation w/ the versions reversed).
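The offset idea can be seen in a toy sketch for the equal-length replacement case only (real deltas also need copy/insert control records, which is the "control syntax" part; nothing here is actual deltup or diffball code). Storing (new - old) mod 256 per byte means the same patch applies in both directions:

```python
def make_offset_patch(old: bytes, new: bytes) -> bytes:
    """For an equal-length replacement, store the byte-wise offset
    (new - old) mod 256 instead of the new bytes themselves."""
    assert len(old) == len(new)
    return bytes((n - o) % 256 for o, n in zip(old, new))

def apply_forward(old: bytes, patch: bytes) -> bytes:
    # old + offset -> new
    return bytes((o + p) % 256 for o, p in zip(old, patch))

def apply_reverse(new: bytes, patch: bytes) -> bytes:
    # new - offset -> old: one patch, both directions
    return bytes((n - p) % 256 for n, p in zip(new, patch))

old = b"version 1.0.2"
new = b"version 1.0.3"
patch = make_offset_patch(old, new)
```

Note the patch is the same size as the replaced region (1x), not old plus new (2x), which is the saving described above.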

Third, concatting version patches. Say, being the nice developer I am, I decide to release alongside my v1.4 to v1.5 delta a delta of v1.0 to v1.5 for those who are trailing. This would require two delta computations- 1.4 to 1.5, and 1.0 to 1.5. While you could just concat all the deltas of the version hops between 1.0 and 1.5, that would result in a large file.
Have you checked into the ability to basically process and sum up multiple deltas between versions? I had a semi-functional attempt at this w/ deltas produced by diff, but I've yet to try it w/ xdelta. I don't know the format well enough, but I think it may be non-sequential in its patch listings- this would make summing trickier.

Fourth, somebody ought to pester the author of xdelta (Joshua MacDonald) and see if he's still active w/ this or not- going by his papers, this seems to be something he released on the side while working towards cvs-type stuff (prcs, for instance). The last release of prcs was in April last year (open bugs, too). Just curious whether there is any active development on it or not; v2.0 has been beta for something like 3 years now.

Either way, continuing the giant post (and rampant misspellings), I realize those listed problems aren't mainstream issues, but they are ancillary and should be considered (true, 3 is somewhat wishful thinking). Then there comes the lovely part about portage integration- have you started coding anything up for this, or thought out exactly the steps (e.g. pseudocode) for doing it?
I've been working on diffball (couldn't think of a better name for the project) pretty much the whole time, and haven't yet started modifying portage/emerge to handle version patching.
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 21, 2003 4:02 pm    Post subject: Re: Different stuff Reply with quote

Quote:
Heh, valid point that ultimately if you're doing an md5sum check of the resulting tarball it's a moot issue, although personally I'd rather verify the patch for speed considerations (smaller dataset).

The resulting tarball is checked by portage when you emerge it. I thought this was sufficient.
Quote:
If one were doing a multiple-patches-in-one-file setup, an md5sum pulled from the server is viable, although I still posit that little is gained by rolling multiple patches into one file. Moot issue nonetheless; it's something you want/intend.

I never intended to have multiple patches in one file. In this post I was talking about a trojanned patch which contained other updates. The multiple patch thing is only useful if you're using patches manually - just one file to download (see how convenient my kde patches are).
Quote:
I must admit I'm not entirely familiar w/ deltup's options (I don't think there is a help option yet, or at least last I checked). -c=?, -n=?

You can get help with deltup CL options by using the command with invalid options (or without any options). -c is used to combine patches and -n is used to select a range of patches to apply. There's also a man page and README which get installed if you use the ebuild.
Quote:
What's this equate to? In other words intendant on doing intra-tarball patches? I guess I don't follow exactly what you mean.

I mean to be able to tarball patches together so that they can be downloaded and applied with a single command (same as if they were -c(ombined) together).
Quote:
First, and possibly kicking the dead horse again dependent on the answer to the paragraph above, xdelta like most delta algs sucks with structured data. While a tarball is loosely structured, it's still a structured dataset.
This is important in a few ways- first off, speed and memory. Xdelta builds a chksum representation of the entire file, then proceeds through the file doing lookups in a hash table for a matching chksum. A) The author may've corrected this, but the original xdelta alg trashed any false chksum matches, which results in a larger delta. B) Continuing w/ speed and memory, a tarball is a congregation of datasets, each unto their own. In other words, including the tarball header, say file x occupies bytes 512-1536 of a 262144-byte (512^2) tarball: xdelta runs through the entire range, rather than just the 1024-byte block that that file occupies. Yes, xdelta may (theoretically) be able to produce a smaller delta when considering it as one giant blob, but it's unlikely; further, it's much more hardware-intensive to consider it all as one file when in reality it is multiple files. Breaking it down into individual files likely increases the speed (possibly fewer collisions to deal with), but is guaranteed to improve memory usage. Try running xdelta against uncompressed openoffice tarballs (each around 560mb)- it thrashes the heck out of my system, which has 512mb of ram and a gig of swap.

I think you have a point there (although my system didn't thrash with 768MB and no swap). However, diffing file-by-file would create inefficient deltas in many cases because files often get moved around (renamed, put into different directories) between tarball versions. Secondly, sometimes large blocks of code are cut from one source file because they belong somewhere else. I need to look into it though...
I think it should either be done one way or the other (a "fast" CL option would be OK too...). I think people would prefer smaller patches to faster processing (a patch only needs to be made once, but can be applied many times), especially since delta algorithms can do such an efficient job (after all, a package as huge as OO is one out of thousands).
As for structured data and tarballs, I think it would be a good idea to extract the tar headers, perhaps make modifications based on their structure, and place them all in a big chunk for delta and bzip2 to chew on!
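The header/data split being proposed can be sketched in a few lines, assuming a plain ustar archive (GNU long-name entries and pax extended headers would need extra care) — this is illustrative only, not deltup's code. Each header is a 512-byte record, with the member's size stored in octal at bytes 124-136:

```python
import io
import tarfile

def split_tar(tar_bytes: bytes):
    """Split an uncompressed tarball into (headers, data): the 512-byte
    header records in one chunk, the padded file data in the other.
    Assumes plain ustar records; zero blocks mark the end of the archive."""
    headers, data = [], []
    off = 0
    while off + 512 <= len(tar_bytes):
        block = tar_bytes[off:off + 512]
        if block == b"\0" * 512:                    # end-of-archive marker
            break
        headers.append(block)
        size = int(block[124:136].rstrip(b"\0 ") or b"0", 8)
        padded = (size + 511) // 512 * 512          # data rounded up to a record
        data.append(tar_bytes[off + 512:off + 512 + padded])
        off += 512 + padded
    return b"".join(headers), b"".join(data)

# Tiny one-member ustar archive built in memory for illustration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    tf.addfile(info, io.BytesIO(b"abcde"))
headers, data = split_tar(buf.getvalue())
```

The two chunks can then be concatenated (or kept separate) before being handed to the delta alg and bzip2, as described above.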

Quote:
Second, and this I realize isn't likely to happen too often, xdelta is a one-way modification. I.e. it's non-reversible; you need a separate delta to go from 1.0.3 back to 1.0.2. Personal opinion, but I'd tend to think any diff setup for portage would need at least the ability to regress a version via a diff.
Downside is that this obviously steps the size of the delta up, although I have a few ideas how to cut down on that- most delta algs, for a replacement, just store the new text... compute the offset between the original byte and the new byte, and distribute that offset instead of the final version. At the very least, using that method, data that is replaced/changed won't require the original and target data (2 * # bytes), just the offset between the two (1x for replacements + new inserts). This can be extended further, although it's partially dependent on a control syntax designed around it. Regardless, the ability to go in reverse is something to be considered (preferably w/out requiring another delta computation w/ the versions reversed).

I think that's an excellent idea, it's one of the reasons I included the "combine patches" option. It certainly would be worthwhile to implement this. Your idea for calculating the offset is excellent too.

Quote:
Third, concatting version patches. Say, being the nice developer I am, I decide to release alongside my v1.4 to v1.5 delta a delta of v1.0 to v1.5 for those who are trailing. This would require two delta computations- 1.4 to 1.5, and 1.0 to 1.5. While you could just concat all the deltas of the version hops between 1.0 and 1.5, that would result in a large file.
Have you checked into the ability to basically process and sum up multiple deltas between versions? I had a semi-functional attempt at this w/ deltas produced by diff, but I've yet to try it w/ xdelta. I don't know the format well enough, but I think it may be non-sequential in its patch listings- this would make summing trickier.

That would be a useful option too, it doesn't look like a priority though...

Quote:
Fourth, somebody ought to pester the author of xdelta (Joshua MacDonald) and see if he's still active w/ this or not- going by his papers, this seems to be something he released on the side while working towards cvs-type stuff (prcs, for instance). The last release of prcs was in April last year (open bugs, too). Just curious whether there is any active development on it or not; v2.0 has been beta for something like 3 years now.

I still think the ultimate solution would be to make a customized delta application with a simple control structure that supports some of the features you mentioned above (2 and 3). Besides supporting the new features, it could be more efficient, and we could integrate it as a lib instead of using system calls. Xdelta works OK for the time being...
Quote:
Either way, continuing the giant post (and rampant misspellings), I realize those listed problems aren't mainstream issues, but they are ancillary and should be considered (true, 3 is somewhat wishful thinking). Then there comes the lovely part about portage integration- have you started coding anything up for this, or thought out exactly the steps (e.g. pseudocode) for doing it?
I've been working on diffball (couldn't think of a better name for the project) pretty much the whole time, and haven't yet started modifying portage/emerge to handle version patching.

I must confess that I've never written in Python before (I must learn, because it looks like a great scripting language). It looks like the code could be placed in the "fetch" function from "portage.py".
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 21, 2003 5:18 pm    Post subject: Re: Different stuff Reply with quote

jjw wrote:
I think you have a point there (although my system didn't thrash with 768MB and no swap). However, diffing file-by-file would create inefficient deltas in many cases because files often get moved around (renamed, put into different directories) between tarball versions. Secondly, sometimes large blocks of code are cut from one source file because they belong somewhere else. I need to look into it though...

True, although that's only applicable if one is attempting to match filename to filename. Nonetheless, the comment about data moving from file to file is a valid one- I have a few ideas, but nothing I'd label as a 'good' solution. I'm currently attempting to do checks against deletions/additions to the target by looking for moved files via chksum checks (a bit more complex than that, but it'll suffice).
jjw wrote:
As for structured data and tarballs, I think it would be a good idea to extract the tar headers, perhaps make modifications based on their structure, and place them all in a big chunk for delta and bzip2 to chew on!

What you're talking about in the latter is basically splitting the tarball into 2 files, header and data. It's actually pretty easy to do; I could pretty easily copy and paste what I've written into deltup to do this- the question being what format, and how to store it. I'd think keeping it in one file would be wise.
Note, the splitting is easy. Reconstructing would be a bit trickier, but w/ the structs/code I've written so far, doable. I could either dump the relevant code for you, or just mention what you want and I'll split a diff of it off when I get time (w/in a few days, I'd think). If you're curious about doing it yourself, take a look at /usr/include/tar.h and the perl module Tar. Aside from searching, and in general screwing around w/ tarballs, that's how I've written the code that manipulates tarballs.
One thing I would be curious about is the fact that by isolating the tarball headers into one file you're making it into a truly structured file (512-byte records)- I'd wonder how well-behaved xdelta would be with that, and with dealing w/ small headers (debianutils, for instance, is a tarball of roughly 30 files).
jjw wrote:
That would be a useful option too, it doesn't look like a priority though...

True, although at least with what I've got so far, it's not too hard to add. Since everything is broken down into files, one can isolate the specific patches for each version per file, and sum them that way. I haven't yet tried it under the new setup, but it should be doable.

jjw wrote:
I still think the ultimate solution would be to make a customized delta application with a simple control structure and support some of the features you mentioned above (2 and 3). Besides supporting the new features, it could be more efficient and it we could integrate it as a lib instead of using system calls. Xdelta works OK for the time being...

That's sort of the direction I'm heading- while I'm intent on breaking the tarball into individual datasets (and breaking down archives w/in it too), I've basically been working towards moving away from reliance on diff and xdelta to a different format/alg.

jjw wrote:
I must confess that I've never written in Python before (I must learn because it looks like a great scripting language). It looks like the code could be placed in the "fetch" function from "portage.py"

Well, given, but I'm curious about A) the proposed name format/versioning, and B) the proposed method/process of querying some server for available patches.
Specifically with B, I'd be curious how you intend to deal w/ the fact that in the beginning, there aren't going to be deltas for the majority of the packages.
jjw
n00b
n00b


Joined: 20 Mar 2003
Posts: 59
Location: Austin, TX, US

PostPosted: Wed May 21, 2003 7:32 pm    Post subject: Re: Different stuff Reply with quote

Quote:
What you're talking about in the latter is basically splitting the tarball into 2 files, header and data. It's actually pretty easy to do; I could pretty easily copy and paste what I've written into deltup to do this- the question being what format, and how to store it. I'd think keeping it in one file would be wise.
Note, the splitting is easy. Reconstructing would be a bit trickier, but w/ the structs/code I've written so far, doable. I could either dump the relevant code for you, or just mention what you want and I'll split a diff of it off when I get time (w/in a few days, I'd think). If you're curious about doing it yourself, take a look at /usr/include/tar.h and the perl module Tar. Aside from searching, and in general screwing around w/ tarballs, that's how I've written the code that manipulates tarballs.

I've already written the code to extract the headers (but it's going to the back burner until I finish some more important stuff). Reconstructing won't be hard- and it won't require any control structs. All I have to do is put the first header at the front, the second one at 1024+filesize-(filesize mod 512), and so on...
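That arithmetic checks out whenever filesize isn't already a multiple of 512; a quick sketch (Python, illustrative only) using the general round-up form, which also covers the exact-multiple case where no padding record is added:

```python
def next_header_offset(header_offset: int, filesize: int) -> int:
    """Offset of the next header in the tarball: skip this 512-byte
    header plus the file data rounded up to a whole 512-byte record."""
    padded = (filesize + 511) // 512 * 512
    return header_offset + 512 + padded

# For a first file of 1300 bytes, the quoted expression
# 1024 + filesize - (filesize mod 512) gives the same answer.
fs = 1300
a = next_header_offset(0, fs)       # 512 + 1536
b = 1024 + fs - (fs % 512)
# When filesize is an exact multiple of 512, the round-up form is the
# one to use: a 512-byte file puts the next header at 1024.
c = next_header_offset(0, 512)
```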
Quote:
One thing I would be curious about is the fact that by isolating the tarball headers into one file you're making it into a truly structured file (512-byte records)- I'd wonder how well-behaved xdelta would be with that, and with dealing w/ small headers (debianutils, for instance, is a tarball of roughly 30 files).

I don't think the headers have to be structured as you define it. The header delta could be made very efficiently if it were restructured so that related fields were adjacent. The number of headers doesn't matter because there's no control-struct overhead, and headers and data can be concatenated before using xdelta. But I might not concatenate them if a delta algorithm with less overhead could be made...

Quote:
That's sort of the direction I'm heading- while I'm intendant on breaking the tarball into individual datasets (and breaking archives w/in down too), I've basically been working towards moving away from reliance on diff and xdelta to a different format/alg.

I'm going to write a parser that will recursively descend into tarballs/compressed data to allow the delta algorithm to match more closely, but the individual dataset thing would be a lot of work...

Quote:
Well, given, but I'm curious about A) proposed name format/version, B) proposed method/process of querying some server for available patches.
Specifically with B, I'd be curious how you intend on dealing w/ the fact that in the beginning, there aren't going to be delta's for the majority of the packages.

Nothing is wrong with your proposition; it's pretty obvious actually: <packagename>_<oldversion>-<newversion>.dtu.
As for B, portage could start out by using patches only when specified, and once they become ubiquitous it could use them by default. I don't see why we'd want a special method of querying the server, because that's basically what we're doing when we attempt to download a package! What's wrong with good old ftp/http?
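A toy sketch of that fallback: the delta filename follows the <packagename>_<oldversion>-<newversion>.dtu convention from above, while everything else (the fetch callable, the mirror layout, the tarball name) is made up for illustration — this is not what portage actually does:

```python
def fetch_with_delta(fetch, pkg, old, new):
    """Hypothetical fetch logic: try the <pkg>_<old>-<new>.dtu patch
    first over plain http/ftp, and fall back to the full tarball when
    the mirror has no delta yet.  `fetch` is any callable that returns
    the file's bytes, or None on a 404-style miss."""
    delta = fetch(f"{pkg}_{old}-{new}.dtu")
    if delta is not None:
        return ("delta", delta)
    return ("full", fetch(f"{pkg}-{new}.tar.bz2"))

# Toy mirror: only gimp has a delta, so foo falls back to the tarball.
mirror = {"gimp_1.2.3-1.2.5.dtu": b"<patch>", "foo-2.0.tar.bz2": b"<tarball>"}
kind1, _ = fetch_with_delta(mirror.get, "gimp", "1.2.3", "1.2.5")
kind2, _ = fetch_with_delta(mirror.get, "foo", "1.9", "2.0")
```

Because a failed delta fetch just degrades to today's behaviour, nothing breaks while most packages still lack deltas.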
ferringb
Retired Dev
Retired Dev


Joined: 03 Apr 2003
Posts: 357

PostPosted: Wed May 21, 2003 9:10 pm    Post subject: Re: Different stuff Reply with quote

jjw wrote:
I don't think the headers have to be structured as you define it. The header delta could be made very efficiently if it were restructured so that related fields were adjacent. The number of headers doesn't matter because there's no control-struct overhead, and headers and data can be concatenated before using xdelta. But I might not concatenate them if a delta algorithm with less overhead could be made...

If you're concatting the headers together into one file, and the data into a separate one, the headers file would be structured as such- creating a more efficient delta for it shouldn't be too hard. I've basically implemented a 2-byte indicator of modified fields, then null-delimited strings/nums of the new entries. From there my attempt gets a bit different, but you get the idea for doing tar header updates.
One could probably try a copy/insert setup on the header changes, but that seems like overkill for the most part- the data changes are too small.
As for attempting to align related fields so they're adjacent (first section is all names, second all linknames... etc), I'd wonder what you'd hope to accomplish via that. Compression might be better, but I'd be curious how xdelta would be better behaved- it only includes changes, regardless of where they are.
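The "2-byte indicator of modified fields, then null-delimited new entries" scheme can be sketched like this — the field list is a simplification of the real ustar header, and the whole thing is illustrative, not the actual diffball encoding:

```python
import struct

# Simplified subset of ustar header fields, in a fixed order so each
# gets a stable bit position in the 2-byte mask.
FIELDS = ["name", "mode", "uid", "gid", "size", "mtime", "linkname"]

def encode_header_delta(old: dict, new: dict) -> bytes:
    """2-byte bitmask of changed fields, then the new values as
    null-delimited strings."""
    mask = 0
    payload = []
    for i, f in enumerate(FIELDS):
        if old.get(f) != new.get(f):
            mask |= 1 << i
            payload.append(str(new[f]).encode() + b"\0")
    return struct.pack(">H", mask) + b"".join(payload)

def apply_header_delta(old: dict, blob: bytes) -> dict:
    (mask,) = struct.unpack(">H", blob[:2])
    values = iter(blob[2:].split(b"\0")[:-1])
    out = dict(old)
    for i, f in enumerate(FIELDS):
        if mask & (1 << i):
            out[f] = next(values).decode()
    return out

# Only size and mtime change between versions, so only those two
# values travel in the delta.
old = {"name": "a.c", "mode": "0644", "size": "120", "mtime": "100"}
new = dict(old, size="130", mtime="200")
patched = apply_header_delta(old, encode_header_delta(old, new))
```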
jjw wrote:
I'm going to write a parser that will recursively descend into tarballs/compressed data to allow the delta algorithm to match more closely, but the individual dataset thing would be a lot of work...

Assuming I'm reading it correctly, from the sounds of it you're proposing the ability to identify an archive/compressed section w/in a stream? You could attempt to identify it via the appropriate magic/id, but that would require the ability to identify how long the data is, which would be a pain unless you were reading the length from the tarball/archive header. Unless you're planning on attempting to figure out the specific compressed/archive file's length automatically, the alg would have to be aware of the tar headers...
Assuming you made the alg tar-header aware, you're basically halfway to diffing per file anyway- the alternative being writing a diff generator that can A) identify a compressed/archive file in a stream, B) go recursive into that specific section, C) include the patch from the recursion fun in the total patch, and D) skip checksumming on that section. While it's doable, I'd think it's not the cleanest solution. What's your intention for how to do this?
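The tar-header-aware variant of A) is straightforward, since the header gives each member's exact length and the magic bytes identify what to recurse into. A toy sketch (Python, illustrative only — not diffball's code) of just the detection step:

```python
import gzip
import io
import tarfile

# Magic numbers for the compressed formats worth recursing into.
MAGIC = {b"\x1f\x8b": "gzip", b"BZh": "bzip2"}

def classify_members(tar_bytes: bytes):
    """The tar header already bounds each member's data, so peeking at
    the first bytes against known magic numbers tells us which members
    a recursive delta pass should descend into."""
    found = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        for m in tf:
            if not m.isfile():
                continue
            head = tf.extractfile(m).read(3)
            for magic, kind in MAGIC.items():
                if head.startswith(magic):
                    found.append((m.name, kind))
                    break
            else:
                found.append((m.name, "plain"))
    return found

# Toy archive with one gzip member and one plain member.
inner = gzip.compress(b"inner data")
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    for name, payload in [("x.gz", inner), ("y.txt", b"plain")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
members = classify_members(buf.getvalue())
```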

In terms of the individual-dataset thing being more work, I'd disagree- it's the equivalent of doing multiple diffs rather than one overarching one. As long as the delta compression code is set up right/cleanly, it's really not that much different from doing the whole file.
There still is the issue of data that jumps from file to file, but I'd think there is an efficient solution for that (possibly a control kludge).
Reply to topic    Gentoo Forums Forum Index Portage & Programming All times are GMT
Goto page Previous  1, 2, 3, 4, 5  Next
Page 2 of 5

 