Gentoo Forums

distccd: preventing accidental helper deaths?
eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9675
Location: almost Mile High in the USA

Posted: Wed Nov 24, 2021 10:25 pm    Post subject: distccd: preventing accidental helper deaths?

Well, I've posted before that it's possible to kill distccd helper machines by submitting multiple huge C++ distcc jobs to a specific machine. I'm just wondering if there's a way for a helper machine to reject or give up on a job if it notices the job is using too much memory?

The helper machines have swap, but a swap storm could happen before they run out of swap. I'd like them to terminate jobs well before swap is exhausted.

It would be nice if there were a global limit covering all the jobs a helper is running, but per-job is an acceptable workaround... anyone know? Note the helper could be a multicore machine, so there's some finagling that needs to go on.

Note that this is more about preventing accidental DoS than preventing unscrupulous users from abusing machines. Then again, you could call qtwebengine builds an unscrupulous user of CPU resources...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
pingtoo
l33t
Joined: 10 Sep 2021
Posts: 912
Location: Richmond Hill, Canada

Posted: Wed Nov 24, 2021 10:57 pm

It is possible to limit the number of jobs distccd will run concurrently; see man distccd.

Most likely you want to modify /etc/conf.d/distccd so that DISTCCD_OPTS includes -j N, where N follows the usual advice for setting MAKEOPTS=-jN. Assume a C++ job can use up to 2 GB each; then N = (total memory in GB) / 2.
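For example, a rough sketch of /etc/conf.d/distccd for a helper with 16 GB of RAM might look like this (the numbers and the allowed subnet are only placeholders, adjust for your own boxes):

Code:
# /etc/conf.d/distccd -- example for a helper with 16 GB RAM
# Worst-case C++ compile assumed to take ~2 GB, so 16 / 2 = 8 jobs.
DISTCCD_OPTS="--jobs 8"

# Keep whatever options you already use, e.g. port, logging and the
# subnet that is allowed to submit jobs:
DISTCCD_OPTS="${DISTCCD_OPTS} --port 3632"
DISTCCD_OPTS="${DISTCCD_OPTS} --log-level notice"
DISTCCD_OPTS="${DISTCCD_OPTS} --allow 192.168.1.0/24"

Keep in mind this only caps the number of concurrent compiles, not how much memory each one takes.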
eccerr0r
Posted: Wed Nov 24, 2021 11:21 pm

Actually, I specifically want to limit memory consumption. I'm fine with running 8 x 512 MB jobs, but 2 x 4 GB jobs will sink the ship, so in this case a job count is not a good proxy for memory consumption, and capping it would just limit the throughput of the smaller jobs.
Hu
Moderator
Joined: 06 Mar 2007
Posts: 21586

Posted: Thu Nov 25, 2021 2:02 am

Perhaps this is a job for the cgroups memory controller, so you can constrain how much memory distccd + its descendants can consume?
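Roughly, with a cgroup v2 hierarchy mounted at /sys/fs/cgroup, something along these lines would throttle the helper at one threshold and OOM-kill within the group at another, instead of taking down the whole box (the group name and the 6G/7G figures are only placeholders):

Code:
# Enable the memory controller for child groups, then make one for distcc.
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/distcc

# Start reclaiming/throttling the group at 6 GiB; hard limit at 7 GiB,
# beyond which the kernel OOM-kills inside this group only.
echo 6G > /sys/fs/cgroup/distcc/memory.high
echo 7G > /sys/fs/cgroup/distcc/memory.max

# Move the running distccd processes (and thus their future gcc children)
# into the group; cgroup.procs takes one PID per write.
for pid in $(pidof distccd); do
    echo "$pid" > /sys/fs/cgroup/distcc/cgroup.procs
done

I believe newer OpenRC can also apply cgroup limits per service from the conf.d file, if you would rather not poke /sys/fs/cgroup by hand.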
eccerr0r
Posted: Thu Nov 25, 2021 3:41 am

Yeah, I was thinking about that as a stopgap solution.
Wish distccd would keep track of how much RAM it and its workers are taking... and if it's over a limit, simply reject new jobs instead of partially running one and then killing it arbitrarily. Then again, cgroups would save the machine if the memory estimate turns out to be grossly underestimated at the time of the decision...

Better than killing the whole box, I suppose? Hm... now I need to figure out cgroups, which is another issue...
pingtoo
Posted: Thu Nov 25, 2021 11:51 am

eccerr0r wrote:
Wish distccd would keep track of how much RAM it and its workers are taking

It's hard to imagine why distccd should monitor the memory consumption of its descendants. This sounds like asking make -j N to limit total memory consumption; I think this should be an OS function.

You can try setting ulimit on a per-process basis; a cgroup controls an entire process group, which may or may not do what you want.

If your distcc helper has more duties than compiling, it may be easier to use a cgroup, since that way you set an overall upper bound. But ulimit is a bit easier to set up in terms of configuration.

Use rc_ulimit in /etc/conf.d/distccd to set a limit on the distccd service if you use OpenRC.

See below for more information about setting ulimit in OpenRC.
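For example, something like this in the conf.d file (the 2 GiB figure is just a guess at a sane per-process cap, tune it for your machines):

Code:
# /etc/conf.d/distccd
# Cap the address space of distccd and of every compiler it spawns at
# ~2 GiB per process.  ulimit -v takes KiB: 2097152 KiB = 2 GiB.
# A cc1plus that tries to grow past this gets allocation failures and
# dies; the limit is per process, so the daemon and the other jobs
# carry on.
rc_ulimit="-v 2097152"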
eccerr0r
Posted: Thu Nov 25, 2021 4:05 pm

pingtoo wrote:
eccerr0r wrote:
Wish distccd would keep track of how much RAM it and its workers are taking

It's hard to imagine why distccd should monitor the memory consumption of its descendants. This sounds like asking make -j N to limit total memory consumption; I think this should be an OS function.

As a volunteer, you don't want to be killing yourself... there's no sense in it, when you're better off helping when you can instead of being dead and unable to help at all.

Quote:
You can try setting ulimit on a per-process basis; a cgroup controls an entire process group, which may or may not do what you want.

If your distcc helper has more duties than compiling, it may be easier to use a cgroup, since that way you set an overall upper bound. But ulimit is a bit easier to set up in terms of configuration.

Use rc_ulimit in /etc/conf.d/distccd to set a limit on the distccd service if you use OpenRC.

See below for more information about setting ulimit in OpenRC.

I'll have to think about this. I don't want the distccd service itself to die if it goes over the memory limit; I'd hope it's just the gcc or clang job that dies, or better yet, never gets started.
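Maybe wrapping the compiler on the helper would get that behaviour. A rough, untested sketch (the wrapper path, the 2 GiB cap, and the assumption that the helper resolves gcc through this wrapper are all mine):

Code:
#!/bin/sh
# /usr/local/bin/gcc-limited -- hypothetical wrapper run in place of gcc
# on the helper.  The rlimit applies only to this process and what it
# execs, so a runaway compile dies with an out-of-memory error while
# distccd itself is untouched.
ulimit -v 2097152           # ~2 GiB of address space, in KiB
exec /usr/bin/gcc "$@"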
Hu
Posted: Thu Nov 25, 2021 5:51 pm

I agree it would be nice if this were easy, but the details are messy. Suppose a distccd that can accept up to 8 jobs, and that has the logic you describe for not taking jobs that would put it over its memory threshold. In the general case, how does it know when it has too many jobs? What if the build coordinator sends that distccd 8 jobs from qtwebengine, and due to scheduling and network latency, all 8 jobs are received before any of them start a gcc? The memory requirements will not be fully understood until gcc is running and has done some amount of work on the input, so during the buffering phase the daemon will keep thinking it has plenty of capacity to spare, and will continue to accept jobs until it reaches its count of 8. Yet because qtwebengine is so expensive, the daemon should have stopped at 4, or maybe fewer.

Conversely, suppose the same daemon receives 8 jobs for C files that are nothing but hardcoded data tables, with no processing logic. Those will require a tiny amount of memory, so stopping at 8 may be too cautious.

To do what you want, the daemon needs to adequately anticipate how much memory a given compile job will need at its worst, and needs to anticipate that even while the job is still buffering. For C++ in particular, input file size is not well correlated with compiler memory requirements. There are programs whose source text is short enough to fit in a post here, but whose memory requirements are huge, so we cannot simply take the size of the input, multiply by a constant, and call that the memory required.

Another way to approach this would be to have a smarter client and server, such that you could let the volunteer start to overload, kill its jobs due to Out-of-Memory, and report that back to the distcc client in a way that it can retry the job, possibly on a different machine, rather than immediately report failure to the build tool (make, ninja, etc.). As I understand it, as is, when a volunteer kills a job for any reason, even a transient one like Out-of-Memory, the build tool is immediately told of the failure, and then the entire build dies. If the distcc client on the build coordinator were a bit more optimistic about the chances of a successful retry, we could avoid killing the whole build.
eccerr0r
Posted: Thu Nov 25, 2021 6:05 pm

True, there's no simple solution. Ideally the build server (make/ninja/waf/...) would have a better idea of what to distribute and what not to, and where, based on knowing which resources are available on which machines.

However, I'd say it's OK for the helper distccd to take on a bunch of jobs whose size is unknown at submission time and then kill them as it finds out it cannot complete them. This should be signaled back to the build server as "cannot complete due to resources" versus "cannot complete due to a syntax error". It could also apply heuristics: if it gets a large source file and it's C++, it could assign a multiplier based on the number of classes (or some other heuristic keyword that's simple to check) and use that to decide whether to take the job at all. And that should be reflected in its response, i.e. whether it tried and failed (OOM killer or whatnot) versus failing a heuristic check.

Another possibility is distccd monitoring memory growth over time: if it gets a nasty C++ file that's hard to build, it will notice memory growth 10 seconds into the build, versus a constants file that gcc might finish processing in 5 seconds... Again, this is heuristics, but it seems there are some possible optimizations; something like the crude watchdog below is the kind of thing I mean.
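Purely a sketch (the 2 GB cutoff and the 5-second poll are arbitrary, and a real version would watch the growth rate from inside distccd and report "failed for lack of memory" upstream rather than just killing):

Code:
#!/bin/sh
# Every 5 seconds, kill any cc1plus (the C++ compiler proper) whose
# resident set has grown past ~2 GB.
LIMIT_KB=2000000
while sleep 5; do
    for pid in $(pgrep -x cc1plus); do
        rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
        [ -n "$rss_kb" ] && [ "$rss_kb" -gt "$LIMIT_KB" ] && kill "$pid"
    done
done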

Distcc on the submitting machine actually seems to know when something fails on a helper, and it tries again on the local machine before giving up and returning a failure to make/ninja/waf/..., so it's okay for a helper to fail a job, ideally saying why it decided to fail it. It seems distcc should be able to resubmit on a different machine if it gets a heuristic "no resources" error or an OOM-killer failure, as opposed to a "syntax error", before trying on localhost -- but that's another optimization to talk about. (Ultimately, if it fails on localhost, waf/ninja/make should fail.)

Currently it seems the signaling between helpers and requesters could be made better, I'd think. The distccd should know why a job failed and report that to the requester, and distcc seems to time jobs out possibly prematurely, which is another issue that could be tweaked...

I have a bunch of quad-core machines that could help out with builds; it's just that memory is the big constraint. Perhaps it's a matter of balancing loads so that machines end up with better average memory utilization, instead of packing loads onto specific machines so that one machine can be shut off if needed...