Gentoo Forums
Why has portage become so slow
yu.cy
n00b

Joined: 09 Nov 2013
Posts: 3

Posted: Sat Nov 09, 2013 6:13 pm

First of all, I have to say that I only want to share my ideas about portage, not final implementation details. The core of my idea is to make portage more modular and to make the CPU-intensive dependency calculation loosely coupled with portage, so that it is replaceable and (I hope) can be improved independently of the development of portage itself (at least for the dependency resolution algorithm). So some of the details I describe here are not important and might not even be implementable at all.

mv wrote:
yu.cy wrote:
Translating all config files into a standard format (maybe a formal description of the constraints) makes the external dependency calculation program (let's call it the dependency calculation backend, or backend) simple and portable.

So instead of the simple and portable
/etc/portage/package.use/russian wrote:
sys-apps/portage linguas_ru
you want to translate it on the run into some "standard" format like
Quote:
<use-flag-change type=user-config where="/etc/portage/package.use/russian" line=1><use=1>linguas_ru</use></use-flag-change>
which in turn is parsed with some bulky library like libxml2. Sounds like something which speeds things up enormously and makes writing the resolver a one-liner. Seriously, except for a very few files like /etc/portage/sets.conf, practically everything is already in a standard format which is simpler to parse than any other format.



Both simple and portable are relative concepts. By simple I mean easy for a program to process, not easy for human beings to read and write. A good intermediate language should be selected so that it is easy for portage to generate and easy for the backend to read and process; being easy for humans to read is an additional bonus. Personally I prefer JSON and am against XML (which is hard to understand for both humans and machines); S-expressions might also be a good candidate. The same goes for portable: I mean language neutral. We should not limit the implementation language of the dependency calculation backend, which might be written in C, Lisp, Haskell or even JavaScript, so we should select some widespread data exchange format. That is why I list JSON and XML (again, I don't like XML).
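To make the JSON idea concrete, here is a minimal sketch of how one package.use-style line could be turned into a language-neutral record. The record schema (atom, flag, enabled, source, line) and the function name are my own assumptions for illustration, not an existing portage format, and the parser handles only the simplest "atom flag flag ..." lines.

```python
import json


def package_use_to_records(text, source):
    """Translate simple package.use-style lines into plain dicts.

    Hypothetical schema, illustration only: one record per flag, with the
    file and line number kept so that messages can point back to the origin.
    """
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        atom, *flags = line.split()
        for flag in flags:
            records.append({
                "atom": atom,
                "flag": flag.lstrip("-"),
                "enabled": not flag.startswith("-"),
                "source": source,
                "line": lineno,
            })
    return records


records = package_use_to_records(
    "sys-apps/portage linguas_ru\n",
    "/etc/portage/package.use/russian",
)
print(json.dumps(records, indent=2))
```

Any backend that can read JSON could consume this, whatever language it is written in; whether the extra translation step pays off is exactly the point under discussion.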


Quote:

Not to mention that in practice the whole portage tree almost never needs to be parsed for resolving, while your suggestion would mean that every time all information has to be parsed, translated, and parsed again. On systems with, say, 5MB memory, this alone is a process of some minutes every time (because the file cache will not be sufficient), and then you have not even started the actual algorithm.



I think Gentoo is always about choice. I really like the USE flag system of portage. So if one day we implement dependency calculation offload support for portage, the good old built-in dependency calculation must remain in portage as a fallback. This should fit memory-limited systems perfectly (by the way, building gcc needs A LOT of memory these days; I have to use nbd as swap for my ARM embedded board). And I think the offload support should be a USE flag of portage: when it is enabled, additional support packages are pulled in as dependencies, and we can use an environment variable or a command line switch to control which backend to use.

The portage tree can be parsed just once and translated into some form of database that the backend can query (or an adapter layer can let the backend query portage directly), so this is not a problem. The config might be translated in real time, but it can also be translated lazily (updated only when a file changes), or even manually on slow systems. But again, once offload support is implemented, the best option for a slow machine might be to offload the calculation to a remote machine, just like distcc. For a fast machine, I don't think it is a problem at all.
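The lazy variant could look roughly like the sketch below: a file is re-translated only when its modification time or size differs from what was seen last time. The names translate_lazily and _cache are hypothetical, and an in-memory dict stands in for whatever on-disk database a real implementation would use.

```python
import os

# path -> ((mtime_ns, size), translated result); in-memory stand-in for a
# real on-disk cache (assumption for illustration).
_cache = {}


def translate_lazily(path, translate):
    """Return translate(file contents), redoing the work only on change."""
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    hit = _cache.get(path)
    if hit is not None and hit[0] == key:
        return hit[1]  # file unchanged since last translation
    with open(path, encoding="utf-8") as f:
        result = translate(f.read())
    _cache[path] = (key, result)
    return result
```

Checking mtime and size is cheap but not bulletproof (a same-size edit within the timestamp resolution would be missed); a content hash would be safer at the cost of reading the file every time.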

Quote:

Quote:
And I think there are two parts of information that need to be transferred to the backend: the first is the environment, including the dependency relations of the packages in the whole portage tree and the user-configured use, mask, keywords, etc., and the second is the current package dependency constraints that need to be resolved.

You forgot the profile and all the defaults specified there. And remember that you have to keep track of where every tiny bit of information came from, since you have to report it back to the user in case of problems.
Quote:
When the backend returns the result to portage, whether success, failure to meet some constraint, or a suggestion (drop some constraints), portage should be in charge of interpreting the result, translating it into a user-friendly message and printing it to the terminal.

Ah, so suddenly portage should be able to do the magic of a reasonable output without having access to the full tree to get all information.


This is a problem. I think it should be the task of portage to remember the source of each constraint (this function might already be implemented in portage), and the less the backend program needs to know about portage, the better.
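One possible way to split that responsibility, as a rough sketch: portage hands the backend only opaque ids plus the constraint text, keeps the id-to-origin table to itself, and maps any ids the backend reports back into messages. ConstraintRegistry and its methods are hypothetical names; I have not checked how portage tracks origins internally.

```python
class ConstraintRegistry:
    """Hypothetical bookkeeping on the portage side (illustration only)."""

    def __init__(self):
        self._entries = []  # index == constraint id

    def register(self, constraint, origin):
        """Store a constraint with its origin and return its opaque id."""
        self._entries.append((constraint, origin))
        return len(self._entries) - 1

    def export(self):
        """What the backend gets to see: ids and constraints, no origins."""
        return [{"id": i, "constraint": c}
                for i, (c, _origin) in enumerate(self._entries)]

    def explain(self, ids):
        """Turn ids reported by the backend into user-facing messages."""
        return ["%s (from %s)" % self._entries[i] for i in ids]


registry = ConstraintRegistry()
use_id = registry.register("sys-apps/portage linguas_ru",
                           "/etc/portage/package.use/russian:1")
print(registry.export())
print(registry.explain([use_id]))
```

With this split the backend never needs to know what a profile or a package.use file is; it only has to report which ids were involved in a failure.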

Finally, some scenarios I think might happen in the future:
1. There is an 'offload' (or similarly named) USE flag for portage; when it is enabled, an additional package 'portage-support' is installed.
2. When 'emerge' is run with some command line switch or environment variable, it checks whether 'portage-support' (or some better name) is installed. If so, the method in this package is used for the dependency calculation; if not, the fallback method is used.
3. The 'portage-support' package is probably written in Python. When the dependency calculation method is called, it invokes the backend program or communicates with a running backend service, translates the current package constraints into a language the backend understands and sends them to the backend. It then waits for the backend to return, emits the necessary messages and returns the result to portage.
4. For a disk-limited machine, disabling 'offload' might be a good option.
5. For a slow machine, a remote backend can be used; the remote backend support might be built into the 'portage-support' package.
6. For a machine with large memory and many cores, a high-performance, highly parallelized backend written in Haskell (Haskell binaries are very large due to static linking) might be a good option.
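The selection logic in scenarios 1, 2 and 4 could be sketched like this. Everything here is hypothetical: the environment variable PORTAGE_DEPCALC_BACKEND, the backend names and the resolver signatures are invented for illustration, and the resolvers are dummies that only show which path was taken.

```python
def builtin_resolver(constraints):
    """Stand-in for the good old built-in calculation (dummy result)."""
    return ("builtin", sorted(constraints))


def remote_resolver(constraints):
    """Stand-in for a backend shipped by 'portage-support' (dummy result)."""
    return ("remote", sorted(constraints))


def select_backend(environ, installed_backends):
    """Pick the requested backend if it is installed, else fall back."""
    name = environ.get("PORTAGE_DEPCALC_BACKEND")  # hypothetical variable
    if name and name in installed_backends:
        return installed_backends[name]
    return builtin_resolver


# With 'offload' disabled nothing is installed, so the fallback is used.
resolver = select_backend({"PORTAGE_DEPCALC_BACKEND": "remote"}, {})
print(resolver(["sys-apps/portage"]))

# With the support package installed, the requested backend is picked.
resolver = select_backend({"PORTAGE_DEPCALC_BACKEND": "remote"},
                          {"remote": remote_resolver})
print(resolver(["sys-apps/portage"]))
```

The important property is that a missing or unknown backend never breaks emerge; it just degrades to the built-in path.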

PS: ArneBab, thank you for your information about pkgcore, I will have a look when I have time
PS2: TomWij, thank you for clarifying the last problem.
_______0
Guru

Joined: 15 Oct 2012
Posts: 521

Posted: Sat Nov 09, 2013 8:34 pm    Post subject: ideas

I was thinking about a couple of ideas myself.

First, I understand that certain operations are not doable on a GPU. I am not sure which sorts of algorithms are unsuitable for a GPU, but portage's calculations (algorithms) aren't usable on the GPU.

Portage basically uses four variables:

1- USE flags (including those for sound/vid cards, python etc.)
2- Stable/Testing
3- Package Version
4- Dependencies

So here goes my idea: why not transform all those variables into a mathematical model suitable for GPU calculation? Currently it must be using logic or something like that. Anyway, once the portage tree is mapped into a mathematical model, offload it to the GPU and, on top of that, use fast Fourier transforms to come up with a result. I think this would be neat.

My second thought, unrelated to the GPU, is about portage doing the complete calculation on the fly. Why do this? I think portage could profile the user's tree so that results are pre-calculated and only new differences need to be computed.

Let's say between two syncs only a few USE flags, versions and whatnot change; then do some diff of portage.old and portage.new and calculate only the new changes instead of traversing the entire portage tree.
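A rough sketch of that diff step, assuming each snapshot can be reduced to a mapping from package to some fingerprint of its metadata (version, USE flags, dependency string and so on). The snapshot layout, the package names and the function name are made up for illustration.

```python
def changed_packages(old, new):
    """Compare two snapshots (package -> metadata fingerprint).

    Returns three sets: packages added, removed, and changed in place.
    """
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {pkg for pkg in set(old) & set(new) if old[pkg] != new[pkg]}
    return added, removed, changed


# Made-up snapshots standing in for portage.old and portage.new.
portage_old = {"app-misc/foo": ("1.0", "ssl"), "dev-libs/bar": ("2.1", "")}
portage_new = {"app-misc/foo": ("1.1", "ssl"), "dev-libs/baz": ("0.3", "")}

print(changed_packages(portage_old, portage_new))
```

Note that this only finds what changed directly. Anything that depends on a changed package may need recalculating too, so the set has to be propagated through reverse dependencies before the savings are real.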

This option could have variations, such as having portage quietly do calculations for installed packages in the background. I think this is the most practical implementation, as on many systems it is a matter of maintaining the same install base.

Yet another option that I've mentioned on this thread is to have portage use a database like eix. Some operations could be pre-calculated, and give results in an instant.

How big would portage 'rainbow tables' be? I also fancied this: if WPA and some other stuff are crackable, portage dependency calculation is crackable as well.

On a final note, this isn't part of portage but the make part should be parallelized already.
TomWij
Retired Dev

Joined: 04 Jul 2012
Posts: 1553

Posted: Sat Nov 09, 2013 9:02 pm    Post subject: Re: ideas

_______0 wrote:
1- USE flags (including those for sound/vid cards, python etc.)
2- Stable/Testing
3- Package Version
4- Dependencies


5- Masks
6- Licenses
7- Blockers
8- Overlays

_______0 wrote:
So here goes my idea: why not transform all those variables into a mathematical model suitable for GPU calculation? Currently it must be using logic or something like that. Anyway, once the portage tree is mapped into a mathematical model, offload it to the GPU and, on top of that, use fast Fourier transforms to come up with a result. I think this would be neat.


A mathematical model would be some form of tree and/or graph, which has its own algorithms, some of which are already used in Portage. As for Fourier transforms, those convert between the time and frequency domains, so I am not sure how they apply here. Perhaps if you express the tree or graph as matrices you might be able to benefit from matrix calculations on the GPU; that seems an interesting study of its own. But we're not even using the CPU in parallel, so I think the GPU might be a bit overkill for now; we would probably benefit sooner by getting Portage to work in parallel instead.
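For what expressing the graph as matrices could mean, here is a small sketch: dependencies become a boolean adjacency matrix, and everything a package pulls in transitively falls out of repeated boolean matrix multiplication, which is the kind of operation a GPU handles well. This is plain Python on a made-up three-package graph, only meant to show the idea; it covers reachability and nothing else (no versions, USE flags, masks or blockers).

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is True if some k links i->k->j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Repeat R = R or R*R until nothing changes.

    adj[i][j] is True when package i depends directly on package j; the
    result marks every package reachable through any chain of dependencies.
    """
    reach = [row[:] for row in adj]
    while True:
        step = bool_matmul(reach, reach)
        merged = [[r or s for r, s in zip(rrow, srow)]
                  for rrow, srow in zip(reach, step)]
        if merged == reach:
            return reach
        reach = merged


# Made-up graph: package 0 depends on 1, package 1 depends on 2.
adj = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closure = transitive_closure(adj)
print(closure[0])  # which packages package 0 pulls in, directly or not
```

Each pass doubles the path length that is covered, so the loop needs only about log2(n) matrix products for n packages.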

_______0 wrote:
My second thought, unrelated to the GPU, is about portage doing the complete calculation on the fly. Why do this? I think portage could profile the user's tree so that results are pre-calculated and only new differences need to be computed.


See our earlier discussion in this thread as to why caching isn't the right "solution"; it can help shave off a bit of time if you select what you want to cache very well, but it's not a silver bullet.

_______0 wrote:
Let's say between two syncs only a few USE flags, versions and whatnot change; then do some diff of portage.old and portage.new and calculate only the new changes instead of traversing the entire portage tree.


Yes, that's the idea I mentioned earlier: by caching the dependencies for the current USE flag state of each package, next time you only have to USE reduce the dependencies if the actual USE flags changed. The upside is that the whole USE reduce algorithm is barely used; the downside is that, because a cache file is needed, it might not be so applicable to embedded systems, so to some extent a way to scale it needs to be looked into. We might also want to see whether we can take the state of masks into account here...

_______0 wrote:
This option could have variations, such as having portage quietly do calculations for installed packages in the background.


Do you mean like a cron job that runs every so often? (e.g. what "updatedb" does for locate)

_______0 wrote:
I think this is the most practical implementation, as on many systems it is a matter of maintaining the same install base.


Yes, maybe we can prefill the cache, especially on stage3; perhaps even past that since indeed there are a lot of people running the same set of packages.

_______0 wrote:
Yet another option that I've mentioned on this thread is to have portage use a database like eix. Some operations could be pre-calculated, and give results in an instant.


Same caching concept as above.

_______0 wrote:
How big would portage 'rainbow tables' be? I also fancied this: if WPA and some other stuff are crackable, portage dependency calculation is crackable as well.


Portage's size is smaller, but I think that is not the only thing that comes into play; I think Portage's complexity is high enough to make rainbow tables impossible. Imagine the number of possible combinations of packages. And even if you could, I guess it would be a bit overkill to keep quite a few gigabytes around.
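To get a feeling for the number of combinations, a back-of-the-envelope sketch: if every USE flag can be on or off independently, a package with n flags has 2**n states, and the states of different packages multiply. The flag counts below are invented, and real flags are not fully independent, so this only shows how fast the number grows.

```python
def use_states(flag_counts):
    """Number of on/off combinations for packages with the given flag counts."""
    total = 1
    for n in flag_counts:
        total *= 2 ** n
    return total


# Three hypothetical packages with 5, 8 and 10 USE flags.
print(use_states([5, 8, 10]))  # 2**23 = 8388608
```

That is already millions of states for three packages, before versions, masks, keywords or the choice of which packages are installed enter the picture.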

_______0 wrote:
On a final note, this isn't part of portage but the make part should be parallelized already.


Yes, it's quite good already, but I would still love to see more improvement there as well. More clang support, faster configure checks, etc...
ArneBab
Guru

Joined: 24 Jan 2006
Posts: 429
Location: Graben-Neudorf, Germany

Posted: Sat Nov 09, 2013 11:05 pm

yu.cy wrote:
PS: ArneBab, thank you for your information about pkgcore, I will have a look when I have time


Glad to ☺

pkgcore can be so fast that it seems almost unreal (at least compared to any other package manager I know - not only in Gentoo), and with 2 or 3 devs I’m pretty sure that it would quickly become a viable portage replacement for most people. But it seems it is hard finding people who want to just write a good and efficient program instead of hacking on their personal toy idea how package managers should behave - as paludis shows…

To me pkgcore is a project which chose the technically and logically best way: Build a drop-in replacement with simply good engineering. But they were humble and friendly and did not shout around - and so only few people saw them. Great engineering, weak PR. And the fans were pragmatic folks who did not want to spend their free time in the evenings on promoting the obvious way to go… or at least that was the reason why I did not get more involved: There were other projects which seemed to need my help much more…
_________________
Being unpolitical means being political without realizing it. - Arne Babenhauserheide ( http://draketo.de )

pkgcore: So fast that it feels unreal - by doing only what is needed.
xaviermiller
Bodhisattva

Joined: 23 Jul 2004
Posts: 8706
Location: ~Brussels - Belgique

Posted: Mon Nov 18, 2013 10:05 am

Hello,

I have 2 questions :
- will Portage be faster on Python 3.3 or 2.7?
- for now python2 and python3 USE flags are not set. Would Portage be faster if I define them?
_________________
Kind regards,
Xavier Miller
TomWij
Retired Dev

Joined: 04 Jul 2012
Posts: 1553

Posted: Mon Nov 18, 2013 11:48 am

You might or might not see improvements or regressions on the benchmark scale, but I'm not sure that is even worth it when you are talking about a change of seconds to a few minutes. Portage needs changes to its algorithmic complexity in multiple parts of the code (one such change, which makes a regular expression work differently while appearing to behave the same, has caused a 3% drop in run time). Running things in parallel is one such example; it can be reached with PyPy, but requires quite some work, as the algorithm needs to be rewritten with its parallel nature in mind. For Python 2.7 and Python 3.3 I believe this is still not fully possible due to the global interpreter lock (see https://wiki.python.org/moin/GlobalInterpreterLock for more details).
_______0
Guru

Joined: 15 Oct 2012
Posts: 521

Posted: Mon Nov 18, 2013 3:42 pm

XavierMiller wrote:
Hello,

I have 2 questions :
- will Portage be faster on Python 3.3 or 2.7?
- for now python2 and python3 USE flags are not set. Would Portage be faster if I define them?


It was mentioned that it'd be 'slightly slower' due to UTF8. But only slightly, an unavoidable regression for migrating to python 3.3.
xaviermiller
Bodhisattva

Joined: 23 Jul 2004
Posts: 8706
Location: ~Brussels - Belgique

Posted: Mon Nov 18, 2013 5:58 pm

In my case, since my first bad observations, the timings are better : 2min vs 3-4 min. It is now bearable.

And more: the Portage messages are really useful when conflicts are detected (I had Python 3 masked, and switched to 3.3 with 3.3 as default single target, except for ~10 ebuilds that depend on 2.7).
_________________
Kind regards,
Xavier Miller
TomWij
Retired Dev

Joined: 04 Jul 2012
Posts: 1553

Posted: Mon Nov 18, 2013 6:13 pm

On the contrary, it might be a bit faster, because a memory saving like the one the __slots__ feature (http://docs.python.org/2/reference/datamodel.html#slots) gives, as seen in http://tech.oyster.com/save-ram-with-python-slots/, becomes more standard with http://www.python.org/dev/peps/pep-0412/ (key-sharing dictionaries), which was implemented in Python 3.3. I'm not sure which scale we are talking about, though; given that Portage already uses __slots__ in quite a few places, I guess the benefit might end up being rather small. So I think we need to benchmark to see what end result we get from the combined regressions and improvements.
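For anyone who has not used __slots__: a small sketch of what it changes. The two classes are made up for this example and are not Portage classes; the point is only that a slotted instance carries no per-instance __dict__, which is where the memory saving comes from.

```python
class PlainAtom:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, category, name):
        self.category = category
        self.name = name


class SlottedAtom:
    """Same data, but attributes live in fixed slots, with no __dict__."""

    __slots__ = ("category", "name")

    def __init__(self, category, name):
        self.category = category
        self.name = name


plain = PlainAtom("sys-apps", "portage")
slotted = SlottedAtom("sys-apps", "portage")

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False

try:
    slotted.version = "2.2.8"  # not declared in __slots__
except AttributeError as exc:
    print("rejected:", exc)
```

The trade-off is that slotted instances cannot grow new attributes at run time. Key-sharing dictionaries narrow the memory gap for ordinary classes, which is why the gain for code that already uses __slots__ is probably small.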
ulenrich
Veteran

Joined: 10 Oct 2010
Posts: 1480

Posted: Thu Jan 23, 2014 12:12 am    Post subject: solved portage performance problem by selecting no-multilib

I solved my portage performance problem by selecting the no-multilib profile!

If you think about it this is simple logic:
The cost of emerge's solving attempts doesn't grow linearly with additional ebuilds; it surely quadruples!
Think of every ebuild that is multiarch enabled as an additional ebuild!

Two things just happened to coincide:
- Gentoo's multiarch capabilities widened
- Portage made a version bump

When in winter the moon shines the night is cold:
The moon freezes the earth :)
TomWij
Retired Dev

Joined: 04 Jul 2012
Posts: 1553

Posted: Thu Jan 23, 2014 3:33 am

Switched to a --backtrack=0 approach here myself and it takes only half a minute each time; on a side note, there is definitely a regression in the last version ( 2.2.8 ) as we found a bad commit in there.
rudregues
Apprentice

Joined: 29 Jan 2013
Posts: 231
Location: Brazil

Posted: Fri Jan 24, 2014 8:44 pm    Post subject: Re: solved portage performance problem by selecting no-multi

ulenrich wrote:
I solved my portage performance problem by selecting the no-multilib profile!
If I change to no-multilib, will portage try to remove my 32-bit stuff with --depclean? (I have Skype)
EDIT: tested it myself and yes!

TomWij wrote:
Switched to a --backtrack=0 approach here myself and it takes only half a minute each time; on a side note, there is definitely a regression in the last version ( 2.2.8 ) as we found a bad commit in there.
What's the point of this feature? If the dependency calculation fails the first time, why would it work the next time?
_________________
Emerging en gentoo
TomWij
Retired Dev

Joined: 04 Jul 2012
Posts: 1553

Posted: Fri Jan 24, 2014 11:03 pm

Because it backtracks, which means that it considers more possible solutions by going up in a binary tree and trying out branches that have not been tried yet.

Instead of doing that, I just force that tree down certain branches that fit each other; so I'm replacing a lot of time spent on backtracking with something I spend barely any time on.

The downside of no backtracking is that you need an even more solid understanding of what is going on; mostly being able to understand the emerge output, but sometimes a further look might be needed...
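A toy illustration of that trade-off, not Portage's actual resolver: a depth-first search over version choices with a cap on how many rejected choices it may recover from. With a budget of 0 the first conflict is fatal unless the user pins the branch that fits; with a larger budget the search finds it by itself, at the cost of more work. All names and the two-package conflict are invented for the example.

```python
def resolve(candidates, compatible, max_backtracks, pinned=None):
    """Pick one version per package, best first, within a backtrack budget.

    candidates: dict name -> list of versions, preferred version first.
    compatible: function(partial assignment dict) -> bool.
    pinned: optional dict name -> version forced by the user.
    Returns a complete assignment dict, or None if none was found in budget.
    """
    names = list(candidates)
    pinned = pinned or {}
    backtracks = 0

    def search(i, chosen):
        nonlocal backtracks
        if i == len(names):
            return dict(chosen)
        name = names[i]
        options = [pinned[name]] if name in pinned else candidates[name]
        for version in options:
            chosen[name] = version
            if compatible(chosen):
                result = search(i + 1, chosen)
                if result is not None:
                    return result
            del chosen[name]  # undo the rejected choice
            if backtracks >= max_backtracks:
                return None  # budget used up: give up instead of retrying
            backtracks += 1
        return None

    return search(0, {})


# Invented conflict: app-2 does not work together with lib-2.
candidates = {"app": ["2", "1"], "lib": ["2", "1"]}


def compatible(chosen):
    return not (chosen.get("app") == "2" and chosen.get("lib") == "2")


print(resolve(candidates, compatible, max_backtracks=0))
print(resolve(candidates, compatible, max_backtracks=5))
print(resolve(candidates, compatible, max_backtracks=0, pinned={"lib": "1"}))
```

The third call is the "force the tree down branches that fit" approach: no budget is spent, but the user had to know that lib needs to stay at version 1.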