Gentoo Forums

distccd: preventing accidental helper deaths?
eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9675
Location: almost Mile High in the USA

Posted: Wed Nov 24, 2021 10:25 pm    Post subject: distccd: preventing accidental helper deaths?

Well, I've posted before that it's possible to kill distccd helper machines by submitting multiple huge C++ distcc jobs to a specific machine. I'm just wondering if there's a way for a helper machine to reject or give up on a job if it notices the job is using too much memory?

The helper machines have swap, but a swap storm could happen before they run out of swap. I'd like them to terminate jobs well before swap is exhausted.

It would be nice if there were a global limit covering all the jobs a helper is running, but per-job is an acceptable workaround... anyone know? Note the helper could be a multicore machine, so there's some finagling that needs to go on.

Note that this is more about preventing accidental DoS than preventing unscrupulous users from abusing machines. Then again, you could call qtwebengine builds an unscrupulous user of CPU resources...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
pingtoo
l33t
Joined: 10 Sep 2021
Posts: 912
Location: Richmond Hill, Canada

Posted: Wed Nov 24, 2021 10:57 pm

It is possible to limit the number of jobs distccd will run concurrently; see man distccd.

Most likely you want to modify /etc/conf.d/distccd so that DISTCCD_OPTS includes -j N, where N follows the usual advice for setting MAKEOPTS=-jN. Assume a C++ job can use up to 2 GB each; then N = (total memory in GB) / 2.
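For example, a rough sketch of /etc/conf.d/distccd for a helper with 16 GB of RAM might look like this (the numbers and the allowed subnet are only placeholders, adjust for your own boxes):

Code:
# /etc/conf.d/distccd -- example for a helper with 16 GB RAM
# Worst-case C++ compile assumed to take ~2 GB, so 16 / 2 = 8 jobs.
DISTCCD_OPTS="--jobs 8"

# Keep whatever options you already use, e.g. port, logging and the
# subnet that is allowed to submit jobs:
DISTCCD_OPTS="${DISTCCD_OPTS} --port 3632"
DISTCCD_OPTS="${DISTCCD_OPTS} --log-level notice"
DISTCCD_OPTS="${DISTCCD_OPTS} --allow 192.168.1.0/24"

Keep in mind this only caps the number of concurrent compiles, not how much memory each one takes.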
eccerr0r
Posted: Wed Nov 24, 2021 11:21 pm

Actually, I specifically want to limit memory consumption. I'm fine with running 8 x 512 MB jobs, but 2 x 4 GB jobs will sink the ship, so in this case a job count is not a good proxy for memory consumption, and capping it would just limit the throughput of the smaller jobs.
Hu
Moderator
Joined: 06 Mar 2007
Posts: 21586

Posted: Thu Nov 25, 2021 2:02 am

Perhaps this is a job for the cgroups memory controller, so you can constrain how much memory distccd + its descendants can consume?
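Roughly, with a cgroup v2 hierarchy mounted at /sys/fs/cgroup, something along these lines would throttle the helper at one threshold and OOM-kill within the group at another, instead of taking down the whole box (the group name and the 6G/7G figures are only placeholders):

Code:
# Enable the memory controller for child groups, then make one for distcc.
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/distcc

# Start reclaiming/throttling the group at 6 GiB; hard limit at 7 GiB,
# beyond which the kernel OOM-kills inside this group only.
echo 6G > /sys/fs/cgroup/distcc/memory.high
echo 7G > /sys/fs/cgroup/distcc/memory.max

# Move the running distccd processes (and thus their future gcc children)
# into the group; cgroup.procs takes one PID per write.
for pid in $(pidof distccd); do
    echo "$pid" > /sys/fs/cgroup/distcc/cgroup.procs
done

I believe newer OpenRC can also apply cgroup limits per service from the conf.d file, if you would rather not poke /sys/fs/cgroup by hand.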
eccerr0r
Posted: Thu Nov 25, 2021 3:41 am

Yeah, I was thinking about that as a stopgap solution.
Wish distccd would keep track of how much RAM it and its workers are taking... and if it's over a limit, simply reject new jobs instead of partially running one and then killing it arbitrarily. Then again, cgroups would save the machine if the memory estimate turns out to be grossly underestimated at the time of the decision...

Better than killing the whole box, I suppose? Hm... now I need to figure out cgroups, which is another issue...
pingtoo
Posted: Thu Nov 25, 2021 11:51 am

eccerr0r wrote:
Wish distccd would keep track of how much RAM it and its workers are taking

It's hard to imagine why distccd should monitor the memory consumption of its descendants. This sounds like asking make -j N to limit total memory consumption; I think this should be an OS function.

You can try setting ulimit on a per-process basis; a cgroup controls an entire process group, which may or may not do what you want.

If your distcc helper has more duties than compiling, it may be easier to use a cgroup, since that way you set an overall upper bound. But ulimit is a bit easier to set up in terms of configuration.

Use rc_ulimit in /etc/conf.d/distccd to set a limit on the distccd service if you use OpenRC.

See below for more information about setting ulimit in OpenRC.
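For example, something like this in the conf.d file (the 2 GiB figure is just a guess at a sane per-process cap, tune it for your machines):

Code:
# /etc/conf.d/distccd
# Cap the address space of distccd and of every compiler it spawns at
# ~2 GiB per process.  ulimit -v takes KiB: 2097152 KiB = 2 GiB.
# A cc1plus that tries to grow past this gets allocation failures and
# dies; the limit is per process, so the daemon and the other jobs
# carry on.
rc_ulimit="-v 2097152"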
eccerr0r
Posted: Thu Nov 25, 2021 4:05 pm

pingtoo wrote:
eccerr0r wrote:
Wish distccd would keep track of how much RAM it and its workers are taking

It's hard to imagine why distccd should monitor the memory consumption of its descendants. This sounds like asking make -j N to limit total memory consumption; I think this should be an OS function.

As a volunteer, you don't want to be killing yourself... there's no sense in it, when you're better off helping when you can instead of being dead and unable to help at all.

Quote:
You can try setting ulimit on a per-process basis; a cgroup controls an entire process group, which may or may not do what you want.

If your distcc helper has more duties than compiling, it may be easier to use a cgroup, since that way you set an overall upper bound. But ulimit is a bit easier to set up in terms of configuration.

Use rc_ulimit in /etc/conf.d/distccd to set a limit on the distccd service if you use OpenRC.

See below for more information about setting ulimit in OpenRC.

I'll have to think about this. I don't want the distccd service itself to die if it goes over the memory limit; I'd hope it's just the gcc or clang job that dies, or better yet, never gets started.
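Maybe wrapping the compiler on the helper would get that behaviour. A rough, untested sketch (the wrapper path, the 2 GiB cap, and the assumption that the helper resolves gcc through this wrapper are all mine):

Code:
#!/bin/sh
# /usr/local/bin/gcc-limited -- hypothetical wrapper run in place of gcc
# on the helper.  The rlimit applies only to this process and what it
# execs, so a runaway compile dies with an out-of-memory error while
# distccd itself is untouched.
ulimit -v 2097152           # ~2 GiB of address space, in KiB
exec /usr/bin/gcc "$@"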
Hu
Posted: Thu Nov 25, 2021 5:51 pm

I agree it would be nice if this were easy, but the details are messy. Suppose a distccd that can accept up to 8 jobs, and that has the logic you describe for not taking jobs that would put it over its memory threshold. In the general case, how does it know when it has too many jobs? What if the build coordinator sends that distccd 8 jobs from qtwebengine, and due to scheduling and network latency, all 8 jobs are received before any of them start a gcc? The memory requirements will not be fully understood until gcc is running and has done some amount of work on the input, so during the buffering phase the daemon will keep thinking it has plenty of capacity to spare, and will continue to accept jobs until it reaches its count of 8. Yet because qtwebengine is so expensive, the daemon should have stopped at 4, or maybe fewer.

Conversely, suppose the same daemon receives 8 jobs for C files that are nothing but hardcoded data tables, with no processing logic. Those will require a tiny amount of memory, so stopping at 8 may be too cautious.

To do what you want, the daemon needs to adequately anticipate how much memory a given compile job will need at its worst, and needs to anticipate that even while the job is still buffering. For C++ in particular, input file size is not well correlated with compiler memory requirements. There are programs whose source text is short enough to fit in a post here, but whose memory requirements are huge, so we cannot simply take the size of the input, multiply by a constant, and call that the memory required.

Another way to approach this would be to have a smarter client and server, such that you could let the volunteer start to overload, kill its jobs due to Out-of-Memory, and report that back to the distcc client in a way that it can retry the job, possibly on a different machine, rather than immediately report failure to the build tool (make, ninja, etc.). As I understand it, as is, when a volunteer kills a job for any reason, even a transient one like Out-of-Memory, the build tool is immediately told of the failure, and then the entire build dies. If the distcc client on the build coordinator were a bit more optimistic about the chances of a successful retry, we could avoid killing the whole build.
eccerr0r
Posted: Thu Nov 25, 2021 6:05 pm

True, there's no simple solution. Ideally the build server (make/ninja/waf/...) would have a better idea of what to distribute and what not to, and where, based on knowing which resources are available on which machines.

However, I'd say it's OK for the helper distccd to take on a bunch of jobs whose size is unknown at submission time and then kill them as it finds out it cannot complete them. This should be signaled back to the build server as "cannot complete due to resources" versus "cannot complete due to a syntax error". It could also apply heuristics: if it gets a large source file and it's C++, it could assign a multiplier based on the number of classes (or some other heuristic keyword that's simple to check) and use that to decide whether to take the job at all. And that should be reflected in its response, i.e. whether it tried and failed (OOM killer or whatnot) versus failing a heuristic check.

Another possibility is distccd monitoring memory growth over time: if it gets a nasty C++ file that's hard to build, it will notice memory growth 10 seconds into the build, versus a constants file that gcc might finish processing in 5 seconds... Again, this is heuristics, but it seems there are some possible optimizations; something like the crude watchdog below is the kind of thing I mean.
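Purely a sketch (the 2 GB cutoff and the 5-second poll are arbitrary, and a real version would watch the growth rate from inside distccd and report "failed for lack of memory" upstream rather than just killing):

Code:
#!/bin/sh
# Every 5 seconds, kill any cc1plus (the C++ compiler proper) whose
# resident set has grown past ~2 GB.
LIMIT_KB=2000000
while sleep 5; do
    for pid in $(pgrep -x cc1plus); do
        rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
        [ -n "$rss_kb" ] && [ "$rss_kb" -gt "$LIMIT_KB" ] && kill "$pid"
    done
done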

Distcc on the submitting machine actually seems to know when something fails on a helper, and it tries again on the local machine before giving up and returning a failure to make/ninja/waf/..., so it's okay for a helper to fail a job, ideally saying why it decided to fail it. It seems distcc should be able to resubmit on a different machine if it gets a heuristic "no resources" error or an OOM-killer failure, as opposed to a "syntax error", before trying on localhost -- but that's another optimization to talk about. (Ultimately, if it fails on localhost, waf/ninja/make should fail.)

Currently it seems the signaling between helpers and requesters could be made better, I'd think. The distccd should know why a job failed and report that to the requester, and distcc seems to time jobs out possibly prematurely, which is another issue that could be tweaked...

I have a bunch of quad-core machines that could help out with builds; it's just that memory is the big constraint. Perhaps it's a matter of balancing loads so that machines end up with better average memory utilization, instead of packing loads onto specific machines so that one machine can be shut off if needed...