Emerge - the future?

Tronic · Post by **Tronic** » Thu Aug 14, 2003 1:30 pm

Not going to start a flamewar, nor propose that all Gentoo tools must be rewritten right now. But really, sooner or later it will have to be rewritten. The purpose of this thread is to discuss what features it could have and how it could be implemented. Please don't post "funny" posts about coffee maker and kitchen sink integration, thanks.

I'm sorry if this thread is a dupe - didn't find one already existing with a quick look, so posted a new one.

So, here's some ideas I have:

Design / code:

-Written in C++.

-Using internal text parsing, etc. (not calling awk or other external tools, because that is just slow).

-Threaded - after figuring out the dependencies, it could probably begin downloading all the packages (there could be a limit for simultaneous downloads, of course) and after some downloads are finished, the building of those packages can begin (assuming of course that the deps required for building are met). This gives much improved performance, because network I/O and CPU-intensive compiling running side by side don't slow each other down.

-Caching of databases (so that it doesn't need to read thousands of files from the filesystem every time, but could just check the headers of all packages from a single file) (and better make the database binary, not XML: I can't understand why everyone wants to use XML for everything these days, even when it really doesn't fit).

-Probably no internal UI, but only an interface for making one (and then one for curses, one for gtk, etc)

User-oriented features:

-Instead of the regular 50 lines/second build dump, display good overall progress indicators (and optionally the detailed build process in some window). This is getting even more important with threads!

-When something is masked, some error occurs, etc., automatically display the _entire_ dependency tree leading to that unsolvable error. This makes solving and understanding the problem a lot easier.

-Some intelligent system for pretty much automatically sharing the downloaded packages (distfiles) in a LAN (especially useful in campuses or home networks with many Gentoo boxes behind a single network link). Should probably be P2P, and not require setting up a real server and configuring all boxes to use it (and this is big enough to be a separate software project).
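A minimal sketch of the "show the entire dependency tree leading to the error" idea from the list above. It assumes the dependency graph is already available as a plain dict; the package names and the `chains_to_blocker` helper are made up for illustration and have nothing to do with how Portage actually stores or resolves dependencies.

```python
def chains_to_blocker(deps, root, blocker):
    """Return every dependency path that leads from root to blocker."""
    paths = []

    def walk(pkg, trail):
        trail = trail + [pkg]
        if pkg == blocker:
            paths.append(trail)
            return
        for dep in deps.get(pkg, []):
            if dep not in trail:  # guard against circular dependencies
                walk(dep, trail)

    walk(root, [])
    return paths


# Hypothetical dependency data, not taken from the real Portage tree.
DEPS = {
    "gnome": ["gtk+", "gnome-panel"],
    "gnome-panel": ["gtk+", "libwnck"],
    "libwnck": ["gtk+"],
    "gtk+": ["pango"],
    "pango": [],
}

# Pretend "pango" is the masked package that blocks the merge.
for path in chains_to_blocker(DEPS, "gnome", "pango"):
    print(" -> ".join(path))
```

Printing every chain, rather than only the first one found, is what lets the user see which of the requested packages actually pull in the blocked one.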

So, go on, throw in your ideas and needs... I'd like to hear what others have had in mind on how to improve this.

Feel free to also tell where I made mistakes. I realize that some of the things in my list are very difficult to implement, but I also know that all are doable.

Tronic · Post by **Tronic** » Thu Aug 14, 2003 1:39 pm

One addition, for which I don't have a clue on how to actually implement:

-Intelligently figure out which mirror to use and download packages from the mirrors on which they are actually available (at the rate new packages are appearing, soon it will be impossible to keep all packages on all mirrors; I think we all also have noticed the problem of recent software versions not being mirrored yet, when we are trying to fetch).

Lovechild · Post by **Lovechild** » Thu Aug 14, 2003 1:55 pm

psstt...

www.zynot.org - the future of Portage

Senso · Post by **Senso** » Thu Aug 14, 2003 2:19 pm

Why do we see "Portage should be rewritten in C/C++" almost once a week? I'm a Python whore and so, I don't see the point in this statement.

Threading, caching and text parsing are all possible with Python. Sure, C/C++ is faster, but I really don't care about a speedup of 3% or 15%.
Python makes it easier for anyone to write ebuilds or fix a problem in an existing ebuild. Go try that with C. At least, C/C++ Portage could use some scripting language like Lua for the ebuilds to permit modularity but in this case, I don't see why we should drop Python.

I agree with a mid-to-long term rewrite, but I don't think it needs to be in another language.

Ox- · Post by **Ox-** » Thu Aug 14, 2003 2:49 pm

Lovechild wrote:psstt...

www.zynot.org - the future of Portage

http://www.joelonsoftware.com/articles/ ... 00069.html

Tronic wrote:Not going to start a flamewar, nor propose that all Gentoo tools must be rewritten right now. But really, sooner or later it will have to be rewritten.

Don't let the above essay stop you... scratch whatever itch is bothering you. You have some good ideas, although I suspect it's going to be a lot harder than you are expecting.

Anyway, the only feature I'll suggest is that this new tool should remain compatible with emerge as far as database formats, otherwise it'll never make it out of the gate unless you start your own new distribution as well

Ox- · Post by **Ox-** » Thu Aug 14, 2003 3:08 pm

Senso wrote:Threading, caching and text parsing are all possible with Python. Sure, C/C++ is faster, but I really don't care about a speedup of 3% or 15%.

To reinforce what you're saying, real speedups are going to come with improving the algorithms, not the language. If changing the algorithms to take advantage of threading will make a C++ version run faster, then it'll make a Python version run faster as well.

Also, I've only been using Gentoo a few months, but it looks to me like 99.9% of the time on an emerge is compiling with gcc (written in C) + rsync (yep, in C) + make (C) + wget (C). So, I'd be skeptical a C++ rewrite could even provide a 1% speedup of the overall emerge process.

far · Post by **far** » Thu Aug 14, 2003 3:10 pm

Tronic wrote:-Written in C++.

There is no point. The only thing about portage that is slow is searching for packages, but that has nothing to do with Python. A rewrite in C++ would merely introduce bugs and make the source code several times as big.

-Using internal text parsing, etc. (not calling awk or other external tools, because that is just slow).

Does Portage use awk now? This is also much easier to do in Python.

-Threaded

I know people are working on this, although I think they would rather use multiple processes.

-Caching of databases

I believe there is a portage implementation with a database back end. There is nothing wrong with using xml. Unlike binary data, it is readable (and debuggable) by human beings.

-Probably no internal UI, but only an interface for making one

There is a (low-level) Python api, but it is not well documented.

-Instead of the regular 50 lines/second build dump, display good overall progress indicators

And how would the program know how far the build process has come? Make does not provide such information.

All these things have been discussed many times in other threads.

aethyr · Post by **aethyr** » Thu Aug 14, 2003 3:12 pm

Senso wrote:Why do we see "Portage should be rewritten in C/C++" almost once a week?

For the same reason that we see people saying that getting rid of the client/server model of XFree will magically make their desktops faster, without any real understanding of the issues involved. Frankly, it wouldn't matter much (proportionally) if emerge was an order of magnitude slower. I'd say that when it comes to any decent sized package, maybe 5% of the total time is due to the emerge tool.

Say emerge takes 60 seconds to get something done (which is longer than it really does take). You spend another 60 seconds downloading the package; at 100 kB a second, that's a 6 MB package, something the size of mozilla-firebird maybe. You then spend an hour compiling the package (I'm not sure how long it really takes, but we're working on easy units here).

You've just spent 3720 seconds, 1.61% of which was spent in the emerge tool.

Let's say you make emerge 100 times faster (extraordinarily unreasonable, since it's doing a lot of disk access). You now have spent 0.6 seconds in "emerge", for a total of 3660.6 seconds, 0.016% of which was spent in "emerge". However, you've only saved yourself 59.4 seconds, or 1.6% of the total 3720 seconds originally spent emerging the package.

That's for a 6MB package. Even for a 2MB package, you still only save yourself 4.5% of the time spent. And that's if you make emerge 100 times faster (which will never happen, since most of the time is probably spent reading files off the disk).

If you make it 10 times faster, you see those numbers drop to 1.45% saved, and 4.1% saved. And that's starting off with the assumption that "emerge" took a full minute to do its job (which it doesn't).

If you think that coding emerge in C/C++ will suddenly make things better, you're really looking at the wrong bottlenecks.
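The arithmetic for the 6 MB case can be checked with a few lines of Python. This is only a sketch of the estimate above: the function names are made up, and the inputs are the post's round numbers (60 s in the tool, 60 s downloading, 3600 s compiling), not measurements. The 2 MB figures are left out because the post does not state the compile time assumed for them.

```python
def emerge_share(tool_s, download_s, compile_s):
    """Fraction of the total wall time spent inside the emerge tool itself."""
    return tool_s / (tool_s + download_s + compile_s)


def time_saved(tool_s, download_s, compile_s, speedup):
    """Fraction of the original total saved if only the tool gets faster."""
    original_total = tool_s + download_s + compile_s
    saved_s = tool_s - tool_s / speedup
    return saved_s / original_total


# Round numbers from the post: 60 s tool, 60 s download, 1 hour compile.
print("share of emerge: {:.2%}".format(emerge_share(60, 60, 3600)))
print("saved at 100x:   {:.2%}".format(time_saved(60, 60, 3600, 100)))
print("saved at 10x:    {:.2%}".format(time_saved(60, 60, 3600, 10)))
```

However large the speedup gets, the saving can never exceed the tool's own share of the total, which is the whole point of the estimate.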

Senso · Post by **Senso** » Thu Aug 14, 2003 3:15 pm

Ox- wrote: Also, I've only been using Gentoo a few months, but it looks to me like 99.9% of the time on an emerge is compiling with gcc (written in C) + rsync (yep, in C) + make (C) + wget (C). So, I'd be skeptical a C++ rewrite could even provide a 1% speedup of the overall emerge process.

True, it's an even better argument than what I wrote earlier. The Python stuff in Portage is used to call apps written in C. Python calls wget and tells it to get the source from $WEBSITE, etc. Most of the computing comes from C binaries.
So, I think there are many different ways to improve the Python code. Full threading like Tronic explained would greatly help. But since threading is "optional" in Python, not everyone could use it. In any case, it's a good idea.

axxackall · Post by **axxackall** » Thu Aug 14, 2003 3:19 pm

Senso wrote:Why do we see "Portage should be rewritten in C/C++" almost once a week? I'm a Python whore and so, I don't see the point in this statement.

Perhaps we should begin every week with a new thread like "let's rewrite all scripts in Python!", just to compensate for the C/C++ zealots

Hmm, that could be an interesting system, where ALL non-3rd-party (all that belongs to Gentoo itself) software is written in Python: Portage, initscripts, installation scripts, system (network, user, disk etc) management tools, various other tools and utilities ... Even UI for all of that must be written with Tkinter.

So, what are C/C++ zealots supposed to do? If they are really skilled in C/C++ then they should help vendors of gcc, mozilla etc. Otherwise they should learn Python

P.S. Forgot to mention: in future Gentoo should be no place for Perl, Ruby, Tcl, Java - anything that is not Python.

P.P.S. It was really a joke ... mostly

charlieg · Post by **charlieg** » Thu Aug 14, 2003 3:27 pm

The main new features of portage will be the use of a DB.

Usage of a basic DB (Berkeley, anybody? or Metakit?) to start with. Then things like dependencies (forward and reverse) can be established incredibly quickly.

The 'rewrite in C++' arguments are always naive. There is no real reason to do this.

Zynot is going nowhere fast.

Ox- · Post by **Ox-** » Thu Aug 14, 2003 3:35 pm

axxackall wrote:Perhaps we should begin every week with a new thread like "let's rewrite all scripts in Python!", just to compensate for the C/C++ zealots

I think we should change portage to use SCons instead of make!

Senso · Post by **Senso** » Thu Aug 14, 2003 3:40 pm

charlieg wrote:The main new features of portage will be the use of a DB.

Usage of a basic DB (Berkeley, anybody? or Metakit?) to start with. Then things like dependencies (forward and reverse) can be established incredibly quickly.

Metakit... It uses a HUGE amount of RAM, compared to other indexing/DB systems. Even Jakarta Lucene is maybe 10x better considering RAM usage. Metakit could be a problem to "low quality" hardware users (like me).

Senso · Post by **Senso** » Thu Aug 14, 2003 3:43 pm

axxackall wrote: Hmm, that could be an interesting system, where ALL non-3rd-party (all that belongs to Gentoo itself) software is written in Python: Portage, initscripts, installation scripts, system (network, user, disk etc) management tools, various other tools and utilities ... Even UI for all of that must be written with Tkinter.

Look at the GLIS thread. I'll soon (i.e. 2-3 days) start a Tkinter UI for this project.

Gentoo Linux Install Script project. The project is still in its infancy but it looks good to me.

MrPyro · Post by **MrPyro** » Thu Aug 14, 2003 4:05 pm

An idea that was mentioned in this thread: http://forums.gentoo.org/viewtopic.php?t=74143

Having Portage mark security updates as such, so that a server administrator can decide to just update security fixes while sticking with earlier versions of other code for stability.

The threading idea sounds good: download the first package, begin to compile it, while downloading the second package in the background.
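A rough sketch of that overlap using Python's standard `concurrent.futures` module. The `fetch` and `build` functions are stand-ins that only sleep; a real tool would call out to a downloader and to the ebuild machinery. The sketch also assumes the package list is already in a valid build order.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(pkg):
    """Stand-in for a network download (hypothetical; just sleeps)."""
    time.sleep(0.01)
    return pkg + ".tar.gz"


def build(tarball):
    """Stand-in for the compile step (hypothetical; just sleeps)."""
    time.sleep(0.01)
    return "built " + tarball


def emerge_pipeline(packages, max_downloads=2):
    """Download in background threads while building in the foreground."""
    results = []
    with ThreadPoolExecutor(max_workers=max_downloads) as pool:
        # Queue every download up front; at most max_downloads run at once.
        downloads = [pool.submit(fetch, pkg) for pkg in packages]
        # Build in list order. result() waits only for that one download,
        # so later downloads keep running while earlier packages compile.
        for future in downloads:
            results.append(build(future.result()))
    return results


print(emerge_pipeline(["first", "second", "third"]))
```

Threads are enough here because the download side is I/O bound; the compile itself would run in separate processes anyway.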

Also, one thing I find annoying about emerge at the moment is that some packages have warning or informational messages that come up during installations: not compiler warnings, things like the message that tells you an easy way to configure Apache to use mod_php during the mod_php build. I tend to start my "emerge -pU --deep world" process running then go out somewhere, rather than sit and stare at my monitor watching compiler output, so I miss these messages. Some system where these kinds of messages are logged, so that they can be read later, would be helpful.

Senso · Post by **Senso** » Thu Aug 14, 2003 4:42 pm

MrPyro wrote: Also, one thing I find annoying about emerge at the moment is that some packages have warning or informational messages that come up during installations: not compiler warnings, things like the message that tells you an easy way to configure Apache to use mod_php during the mod_php build. I tend to start my "emerge -pU --deep world" process running then go out somewhere, rather than sit and stare at my monitor watching compiler output, so I miss these messages. Some system where these kinds of messages are logged, so that they can be read later, would be helpful.

Just a quick idea, but you could log normal output (1) to a file, cat it and remove all lines starting with "gcc" with sed. If you add more similar rules, you would maybe still have a lot of crap but it would be easier to browse the log and find useful messages.
An eventual command-line option doing this automatically is a good idea.
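The same filtering idea as a small Python sketch instead of cat and sed. The list of noise prefixes and the sample log lines are made up for illustration; a real build log would need more rules than this.

```python
# Prefixes of lines treated as build noise (an assumption, extend as needed).
NOISE_PREFIXES = ("gcc", "g++", "make", "libtool")


def filter_log(lines, noise=NOISE_PREFIXES):
    """Keep only non-empty lines that do not start with a noise prefix."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(noise):
            kept.append(line.rstrip("\n"))
    return kept


# Made-up sample output, not a real emerge log.
SAMPLE_LOG = [
    "gcc -O2 -c foo.c -o foo.o\n",
    "make[1]: Entering directory 'src'\n",
    "\n",
    " * Remember to update your web server configuration.\n",
]

for line in filter_log(SAMPLE_LOG):
    print(line)
```

With a saved log file, `filter_log(open(path))` would work the same way, since a file object yields its lines one at a time.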

() · Post by () » Thu Aug 14, 2003 8:51 pm

Is there some lightweight (possibly object oriented) database system that could be worth looking at apart from metakit?

Senso · Post by **Senso** » Thu Aug 14, 2003 8:56 pm

() wrote:Is there some lightweight (possibly object oriented) database system that could be worth looking at apart from metakit?

I've been planning on playing with SQLite for a while... It's a SQL database, but embedded in your app directly (so the user doesn't have to download/install MySQL). The original version is for C/C++ but there are *many* wrappers for other languages, including Python, of course. I love the idea of an embedded SQL db but I've never really tried it.
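For a feel of what that could look like, here is a minimal sketch using the `sqlite3` module that ships with current Python versions. The table layout and the sample rows are assumptions made for the example (only the koules description comes from this thread); it is not a proposal for a real Portage schema.

```python
import sqlite3


def build_index(packages):
    """Create an in-memory database of (name, description) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE packages (name TEXT PRIMARY KEY, description TEXT)"
    )
    db.executemany("INSERT INTO packages VALUES (?, ?)", packages)
    db.commit()
    return db


def search(db, term):
    """Return names of packages whose name or description contains term."""
    pattern = "%" + term + "%"
    rows = db.execute(
        "SELECT name FROM packages "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY name",
        (pattern, pattern),
    )
    return [name for (name,) in rows]


# Sample rows; the descriptions other than koules are invented.
db = build_index([
    ("koules", "fast action arcade-style game w/sound and network support"),
    ("metakit", "embedded database library"),
    ("sqlite", "embedded SQL database engine"),
])
print(search(db, "embedded"))
```

Pointing `sqlite3.connect` at a file path instead of `":memory:"` would keep the index on disk in a single file, which is the part that avoids opening thousands of small files for every search.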

Tronic · Post by **Tronic** » Thu Aug 14, 2003 9:15 pm

Hmm. I was surprised that so many people responded to that C++ part, which I didn't think was a big thing anyway (or maybe my message was too long and they only read the first entry ;).

Okay, let's break the problem up a bit. We have following slow points:
-Searching for packages (-s is too slow, -S is absofraggin'lutelyDAMNIT too slow)
-Figuring deps

The dependency check isn't too slow at the moment, but I think that's something that is prone to suffer a lot from increased package numbers (6000 packages today, but really it should scale to much, much more than that). Haven't really thought about the algorithm and don't know what the current emerge uses for this, but if it is something that requires n^2 work (where n is the number of packages) or the like.. Well, it'll be real trouble.

Maybe it would be worth it to write these pieces in C/C++ and keep the rest in Python? Especially the searching is something I highly doubt could be fast enough in Python (except if you use some kind of word cache for that, but then you need to generate it and that brings new problems).

Mystilleef · Post by **Mystilleef** » Thu Aug 14, 2003 9:20 pm

C++!? Heck no! I'd rather portage was written in pure C with a Bash frontend. Python is less in the Unix/Linux spirit than Bash is.

Regards,

Mystilleef

far · Post by **far** » Thu Aug 14, 2003 9:33 pm

Tronic wrote:The dependency check isn't too slow at the moment, but I think that's something that is prone to suffer a lot from increased package numbers

What does the number of packages have to do with dependencies?
If ebuild foo says "I depend on packages bar and baz", that will not change when the number of packages in portage increases.

Tronic wrote:Maybe it would be worth it to write these pieces in C/C++ and keep the rest in Python? Especially the searching is something I highly doubt could be fast enough in Python (except if you use some kind of word cache for that, but then you need to generate it and that brings new problems).

The speed of Python is not a problem. The problem is that you need to open and read 5000+ files. That will be slow no matter which language you use. Using a database backend could solve this problem.

carambola5 · Post by **carambola5** » Thu Aug 14, 2003 9:54 pm

Here's a question: why does each ebuild have its own DESCRIPTION variable? Shouldn't each package have this instead of each ebuild? Sure, the ebuilds could have EBUILD_DESCRIPTION variables that distinguish it from the other ebuilds in the same package, but overall, one piece of software should have one description.

And while we're at it, why does the description have to be a one-liner? Can we get a little more descriptive please? I mean...

emerge -s koules
......
Description: fast action arcade-style game w/sound and network support

Not very descriptive in my book.

far · Post by **far** » Thu Aug 14, 2003 10:08 pm

carambola5 wrote:Here's a question: why does each ebuild have its own DESCRIPTION variable? Shouldn't each package have this instead of each ebuild?

Well, a "package" is really just a directory containing ebuild files ...

carambola5 wrote:And while we're at it, why does the description have to be a one-liner?

Good question. I think multi-lingual descriptions should be possible too.

Tronic · Post by **Tronic** » Thu Aug 14, 2003 10:24 pm

far wrote:What does the number of packages have to do with dependencies?
If ebuild foo says "I depend on packages bar and baz", that will not change when the number of packages in portage increases.

Well, of course you then have to follow the trail, ask what bar and baz want, later figuring out if you can automatically solve some conflicts, etc. Once you get into deps of XFree86 libs, you'll soon be effectively travelling via deps of all the graphical apps.. (of course this depends a lot on how "smart" the deps checking system is, because more features == more things to check for)
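"Following the trail" can be sketched as a depth-first walk that visits each package once, so the work grows with the number of packages and dependency edges actually reached rather than with n^2. This is only an illustration with an invented graph (foo, bar and baz from the quote above, plus a made-up `libx`); it says nothing about what the current emerge does, and it ignores versions, USE flags and conflicts entirely.

```python
def resolve(deps, target):
    """Return target and all its transitive deps, dependencies first."""
    order = []
    seen = set()

    def visit(pkg):
        if pkg in seen:
            return  # already handled, or part of a cycle being walked
        seen.add(pkg)
        for dep in deps.get(pkg, []):
            visit(dep)
        order.append(pkg)  # appended only after all of its deps

    visit(target)
    return order


# Invented example graph: foo needs bar and baz, both need libx.
DEPS = {
    "foo": ["bar", "baz"],
    "bar": ["libx"],
    "baz": ["libx"],
    "libx": [],
}

print(resolve(DEPS, "foo"))
```

The `seen` set is what keeps a widely shared library from being walked again for every package that depends on it.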

far wrote:The speed of Python is not a problem. The problem is that you need to open and read 5000+ files. That will be slow no matter which language you use. Using a database backend could solve this problem.

But if I had that database in one big file, with an average of 500 chars of information per package, you'd still have to scan through several megs of text and naturally also handle the database at the same time. Potentially you'd have to parse it in UTF-8 too. I don't have any benchmarks here on Python, but that might still be too slow (is it?)

Someone suggested SQL.. Dunno about its search performance either, could be fast too..

far wrote:There is nothing wrong with using xml. Unlike binary data, it is readable (and debuggable) by human beings.

But at the same time it is difficult to read for machines (and it's only software that should ever be reading or writing it anyway). The things we are talking about here are simplicity of implementation (escaping all data that isn't ASCII text, escaping XML reserved characters, parsing of tags that use loose syntax (the number of spaces between attributes and other such small things)), storage efficiency (big and ugly tags versus simple binary ones) and performance.

Tronic · Post by **Tronic** » Thu Aug 14, 2003 10:32 pm

Oh, about the progress indicators..

This box has been doing emerge gnome for around 15 hours now. The obvious problem with the current output is that I don't know what it is installing (no, didn't -p first), what it already has built (the scrollback can't get that far) nor how many packages there still are to go...

While the progress bars can't work when only building a single package, they'd surely be very useful in those operations which take the most time - those which recompile half of the entire system.

(and it doesn't really have to be a bar, many other ways of displaying the information might actually be better)
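One possible non-bar form is a plain "package N of M" counter printed before each build. A minimal sketch, assuming the full package list is known before building starts (the names below are placeholders, not the real gnome dependency list):

```python
def progress_lines(packages):
    """Yield one status line per package, e.g. '[ 3/12] name'."""
    total = len(packages)
    width = len(str(total))  # pad the counter so the lines stay aligned
    for done, pkg in enumerate(packages, start=1):
        yield "[{:>{w}}/{}] {}".format(done, total, pkg, w=width)


# Placeholder package names for illustration only.
for line in progress_lines(["glib", "pango", "gtk+"]):
    print(line)
```

Because each line is written before the corresponding build starts, it also answers "what is it installing right now" even when the scrollback is long gone.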
