cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Sat Jul 14, 2018 5:28 pm Post subject: [SOLVED] Load too high |
|
|
I have a quad-core machine and often htop measures my 1-minute load average in the 10-20 range. htop sorts by CPU% and, strangely, the max CPU% is 0.0 when this happens. During this high load, KDE is unusable and I have to ssh in to the box to reboot.
This high load seems to be triggered by an emerge world but even after I kill the emerge process, the high load continues.
How can I find out which processes are killing the CPU?
Any debugging hints will be gratefully received.
[SOLVED] Many thanks to both Ant P. and khayyam for their knowledgeable suggestions. I ended up following Ant P.'s instructions to install the BFQ IO scheduler in my gentoo-sources-4.17.6 kernel: I set the CONFIG_SCSI_MQ_DEFAULT=y flag and used his echo | tee one-liner in an /etc/local.d boot script. As a test I re-emerged chromium, libreoffice, and qtwebengine, long-running emerges which, in the past, have completely frozen the GUI. When the load was in the 10.0-20.0 range, responsiveness did, of course, suffer but the box never froze and during periods of that load seemed normally responsive. I am delighted with this solution as I had feared my only other recourse was to buy newer, more powerful hardware or drop Gentoo and use a binary distribution. [/SOLVED]
Last edited by cfgauss on Wed Jul 18, 2018 3:55 am; edited 1 time in total |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Sat Jul 14, 2018 6:10 pm Post subject: Re: Load too high |
|
|
cfgauss wrote: | I have a quad-core machine and often htop measures my 1-minute load average in the 10-20 range. htop sorts by CPU% and, strangely, the max CPU% is 0.0 when this happens. During this high load, KDE is unusable and I have to ssh in to the box to reboot. |
cfgauss ... to be clear, do you mean '0.10-20' or '10-20.0'? The former would be perfectly normal, the latter extreme. As you see no process using those resources, we should probably assume that the kernel is, so you should save the output of dmesg:
Code: | % ssh user@host "dmesg > dmesg-$(date +%s)" |
cfgauss wrote: | This high load seems to be triggered by an emerge world but even after I kill the emerge process, the high load continues. |
Which suggests you're triggering a kernel bug, or segfault ... so, again, we should see dmesg.
best ... khay |
|
|
|
cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Sat Jul 14, 2018 6:48 pm Post subject: Re: Load too high |
|
|
khayyam wrote: | cfgauss ... to be clear, do you mean '0.10-20' or '10-20.0'?
|
Sadly, 10.0-20.0.
khayyam wrote: | Code: | % ssh user@host "dmesg > dmesg-$(date +%s)" |
|
After a reboot, the system is normal. Would a dmesg in this state be useful or only a dmesg in a high-load state? |
|
|
|
PrSo Tux's lil' helper
Joined: 01 Jun 2017 Posts: 136
|
Posted: Sat Jul 14, 2018 7:21 pm Post subject: |
|
|
During that "stalling", can you see in iotop an extremely long queue of processes sitting at 99.9% in the IO> column?
I have been fighting this since 4.14 (now I am on 4.17), and sometimes, especially when compiling qtwebengine, the stalls occur with a "load avg" of 40.0!!!
And yes, it starts swapping at that moment. (The scheduler is deadline.)
Things get better when I set MAKEOPTS="-j4 -l3.6" (I have a 4-core AMD APU and I am using portage in tmpfs, which is limited to 5GB because of the SSD), according to https://wiki.gentoo.org/wiki/EMERGE_DEFAULT_OPTS
My machine has only 8 GB of RAM.
I have tried other schedulers as well but the same thing happens. BTW, setting CONFIG_SCSI_MQ_DEFAULT=y in the kernel config file causes a hang on resume from suspend. |
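For reference, the load-limiting settings mentioned above could look like this in /etc/portage/make.conf. The MAKEOPTS value is the one quoted in the post; the EMERGE_DEFAULT_OPTS line is an assumed example in the spirit of the linked wiki page, not something anyone in this thread reported using.

```shell
# /etc/portage/make.conf (fragment)

# Up to 4 parallel compile jobs, but make starts no new job
# while the load average is 3.6 or higher.
MAKEOPTS="-j4 -l3.6"

# Assumed example: build one package at a time and apply the same
# load ceiling to emerge itself.
EMERGE_DEFAULT_OPTS="--jobs=1 --load-average=3.6"
```

Both limits only decide when new jobs are started; jobs that are already running, or already blocked on disk, are not affected by them.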
|
|
|
cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Sat Jul 14, 2018 7:54 pm Post subject: |
|
|
PrSo wrote: | During that "stalling", can you see in iotop an extremely long queue of processes sitting at 99.9% in the IO> column? |
Just installed iotop and will look at this during the next high load incident.
PrSo wrote: | I have been fighting this since 4.14 (now I am on 4.17), and sometimes, especially when compiling qtwebengine, the stalls occur with a "load avg" of 40.0!!! |
I'm on 4.17.5 and have seen 40.0, also. High loads usually happen if a single emerge is lengthy: qtwebengine, libreoffice, chromium.
PrSo wrote: | Things get better when I set MAKEOPTS="-j4 -l3.6" |
I have the same MAKEOPTS as well as FEATURES="ccache". Neither seems to help.
PrSo wrote: | My machine has only 8 GB of RAM. |
As do I.
I would think if this really were a kernel bug that there would be posts everywhere. Since there aren't, I'm assuming the problem is related to my particular hardware/software setup. I sincerely hope I don't have to buy new hardware simply to run Gentoo. |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Sat Jul 14, 2018 8:18 pm Post subject: Re: Load too high |
|
|
cfgauss wrote: | After a reboot, the system is normal. Would a dmesg in this state be useful or only a dmesg in a high-load state? |
cfgauss ... the kernel ring buffer, from which dmesg is derived, will have been cleared by a reboot, so it has to be captured at the time of occurrence ... that is why I provided the command via ssh, as it would save the additional overhead of acquiring a shell, etc.
If you have a logger and are logging crit, err, warn, segfault, then it's possible that this is in /var/log/messages (or whatever file you log such things to).
best ... khay |
|
|
|
cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Sat Jul 14, 2018 8:31 pm Post subject: |
|
|
Here is the portion of /var/log/messages that appears to be kernel errors. Does this shed any light on the problem? |
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Sat Jul 14, 2018 9:04 pm Post subject: |
|
|
cfgauss wrote: | Here is the portion of /var/log/messages that appears to be kernel errors. Does this shed any light on the problem? |
cfgauss ... it does, but those hangs may be the result, rather than the cause, so we really need to see dmesg, 'ps auxwww', and your .config.
best ... khay |
|
|
|
cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Sat Jul 14, 2018 9:09 pm Post subject: |
|
|
khayyam wrote: | cfgauss wrote: | Here is the portion of /var/log/messages that appears to be kernel errors. Does this shed any light on the problem? |
cfgauss ... it does, but those hangs may be the result, rather than the cause, so we really need to see dmesg, 'ps auxwww', and your .config. |
During the next high load incident I can collect dmesg and ps auxwww. What, exactly, is .config? |
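On .config: this is the kernel configuration file written by make menuconfig. On Gentoo it normally sits at /usr/src/linux/.config, and it can also be read from /proc/config.gz if the running kernel was built with CONFIG_IKCONFIG_PROC. A minimal sketch for grabbing everything requested above in one go during the next incident; the function name collect_diag and the output file names are made up for this example.

```shell
#!/bin/sh
# Snapshot the state of a loaded machine into one directory.
# collect_diag and the file names below are arbitrary choices for this sketch.
collect_diag() {
    out="${1:-diag-$(date +%s)}"
    mkdir -p "$out" || return 1

    cat /proc/loadavg > "$out/loadavg.txt" 2>/dev/null || true
    dmesg             > "$out/dmesg.txt"   2>/dev/null || true   # may need root
    ps auxwww         > "$out/ps.txt"      2>/dev/null || true

    # Kernel .config: prefer the running kernel's copy, fall back to the source tree.
    if [ -r /proc/config.gz ]; then
        zcat /proc/config.gz > "$out/kernel.config" 2>/dev/null || true
    elif [ -r /usr/src/linux/.config ]; then
        cp /usr/src/linux/.config "$out/kernel.config" || true
    fi

    # Print the directory so the caller knows where the snapshot went.
    echo "$out"
}
```

Calling collect_diag with no argument from an ssh session writes a diag-<timestamp> directory in the current directory; passing a path chooses the directory explicitly.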
|
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Sat Jul 14, 2018 9:51 pm Post subject: |
|
|
Those kernel errors are all hanging on disk IO. Set a better default scheduler, bfq for disks or kyber/deadline for SSD. |
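Tasks blocked on disk IO sit in uninterruptible sleep (state D), and Linux counts those toward the load average even though they use no CPU. That fits the high load with 0.0 CPU% reported at the start of the thread. A minimal sketch for listing such tasks during an incident; the function names are made up for this example.

```shell
#!/bin/sh
# Keep the header line plus every row whose state column starts with D.
filter_d_state() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# List tasks in uninterruptible sleep. The wchan column shows the kernel
# function a task is sleeping in, when the kernel exposes it.
list_blocked_tasks() {
    ps -eo state,pid,ppid,wchan:32,comm | filter_d_state
}
```

Running list_blocked_tasks a few times in a row during a stall shows whether the same processes stay stuck in state D.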
|
|
|
bunder Bodhisattva
Joined: 10 Apr 2004 Posts: 5934
|
|
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Sat Jul 14, 2018 10:23 pm Post subject: |
|
|
That link only shows that CFQ wins on synthetic tests with a high-end laptop, fast SSD and stock Ubuntu. Good for a stereotypical kernel developer to shoo people away with a "works for me", not so much for real world users on real world hardware.
My suggestion comes from my own experience with 8-9 years of BFQ on Gentoo. Among other things, I've compiled Chromium on my netbook, eating into swap and load average going into space, while it remains somewhat responsive. |
|
|
|
bunder Bodhisattva
Joined: 10 Apr 2004 Posts: 5934
|
Posted: Sun Jul 15, 2018 1:21 pm Post subject: |
|
|
Load is relative: I have a VM that hits over 150 every time I run a test suite on it, and it's perfectly usable while the tests are running. I'm more inclined to believe the graphs, regardless of the "high end laptop", because they're trying to test raw disk performance and block schedulers. If they ran the test on a raspberry pi they would be spending a lot more time in wait cycles because the hardware is so slow (therefore tainting the test). |
|
|
|
cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Sun Jul 15, 2018 9:39 pm Post subject: |
|
|
Ant P. wrote: | Those kernel errors are all hanging on disk IO. Set a better default scheduler, bfq for disks or kyber/deadline for SSD. |
Does BFQ have any advantages over CFQ for rotational HDDs (rather than SSDs)? If so, could you provide a link for configuring BFQ (kernel/grub/udev (?) etc)?
Many thanks for this suggestion. |
|
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Mon Jul 16, 2018 4:15 pm Post subject: |
|
|
In the kernel:
make menuconfig: | [*] Enable the block layer --->
├ [*] Enable support for block device writeback throttling
├ [*] Multiqueue writeback throttling
└── IO Schedulers --->
├ Default I/O scheduler (BFQ)
└── <*> BFQ I/O scheduler |
Depending on the kernel version it may be called BFQ-MQ instead. The other two options are unrelated to BFQ but highly recommended as they help with IO overloading too.
You can check what you already have enabled using "cat /sys/block/sd*/queue/scheduler", and change it at runtime with e.g. "echo bfq | sudo tee /sys/block/sd*/queue/scheduler" |
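The check and the runtime switch above can be wrapped in two small helpers. This is only a sketch: the function names and the SYSFS_BLOCK override are made up for this example (the override exists so the helpers can be tried against a scratch directory), and writing to the real files needs root.

```shell
#!/bin/sh
# /sys/block is the real location; SYSFS_BLOCK is a hypothetical override for testing.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Print "<device>: <active scheduler>". The kernel marks the active one with [brackets].
show_schedulers() {
    for f in "$SYSFS_BLOCK"/sd*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev="$(basename "$(dirname "$(dirname "$f")")")"
        active="$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$f")"
        echo "$dev: $active"
    done
}

# Select the named scheduler on every sd* device that offers it.
set_scheduler() {
    want="$1"
    for f in "$SYSFS_BLOCK"/sd*/queue/scheduler; do
        [ -w "$f" ] || continue
        # Match whole names only, so "deadline" does not match "mq-deadline".
        if tr -d '[]' < "$f" | tr ' ' '\n' | grep -qx "$want"; then
            echo "$want" > "$f"
        fi
    done
}
```

As root, set_scheduler bfq followed by show_schedulers should then report bfq for each disk where the kernel lists it as available.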
|
|
|
cfgauss l33t
Joined: 18 May 2005 Posts: 688 Location: USA
|
Posted: Tue Jul 17, 2018 12:19 am Post subject: |
|
|
Ant P. wrote: | In the kernel:
make menuconfig: | [*] Enable the block layer --->
├ [*] Enable support for block device writeback throttling
├ [*] Multiqueue writeback throttling
└── IO Schedulers --->
├ Default I/O scheduler (BFQ)
└── <*> BFQ I/O scheduler |
Depending on the kernel version it may be called BFQ-MQ instead. The other two options are unrelated to BFQ but highly recommended as they help with IO overloading too.
You can check what you already have enabled using "cat /sys/block/sd*/queue/scheduler", and change it at runtime with e.g. "echo bfq | sudo tee /sys/block/sd*/queue/scheduler" |
Here's my IO Schedulers choices in my sys-kernel/gentoo-sources-4.17.6 kernel:
Code: | < > Deadline I/O scheduler
< > CFQ I/O scheduler
Default I/O scheduler (No-op) --->
<*> MQ deadline I/O scheduler
<*> Kyber I/O scheduler
<*> BFQ I/O scheduler
|
Also the Default I/O scheduler only lists the single-queue schedulers, Deadline, CFQ, and No-op.
Code: | ~$ cat /sys/block/sd*/queue/scheduler
[noop]
[noop]
|
Is it possible to install BFQ in this kernel? |
|
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Tue Jul 17, 2018 2:01 am Post subject: |
|
|
It looks like vanilla/gentoo doesn't have the traditional BFQ patch, so you either need to boot with scsi_mod.use_blk_mq=1 or set CONFIG_SCSI_MQ_DEFAULT=y.
The default option there won't change the mq scheduler (default is mq-deadline), so you'll need to put the echo|tee command in an /etc/local.d/ script to set it at boot. |
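Before adding the boot script it may be worth confirming that the multiqueue path is actually active after the reboot. A minimal sketch, assuming the scsi_mod parameter is exposed under /sys/module on kernels of this era; the function name and the BLK_MQ_PARAM override are made up for this example.

```shell
#!/bin/sh
# Assumed location of the scsi_mod.use_blk_mq parameter; overridable for testing.
BLK_MQ_PARAM="${BLK_MQ_PARAM:-/sys/module/scsi_mod/parameters/use_blk_mq}"

# Succeeds only if the parameter file exists and reads "Y".
scsi_blk_mq_enabled() {
    [ -r "$BLK_MQ_PARAM" ] && [ "$(cat "$BLK_MQ_PARAM")" = "Y" ]
}

if scsi_blk_mq_enabled; then
    echo "SCSI blk-mq is enabled; mq schedulers such as bfq should be selectable"
else
    echo "SCSI blk-mq is not enabled (or the parameter is not exposed here)"
fi
```

If this reports that blk-mq is not enabled, the scheduler file will keep listing only the single-queue schedulers, as in the noop output shown earlier.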
|
|
|
khayyam Watchman
Joined: 07 Jun 2012 Posts: 6227 Location: Room 101
|
Posted: Tue Jul 17, 2018 2:10 am Post subject: |
|
|
cfgauss ...
you can either set 'elevator=bfq' as a kernel parameter (ie, in grub.cfg, or what-have-you), or have the following run via 'local':
/etc/local.d/iosched-bfq.start: | #!/bin/sh
set -e
# ebegin/eend are found in OpenRC's helper directory
PATH="$PATH:/lib/rc/bin"
export PATH
sysfs_sched="/sys/block/sda/queue/scheduler"
# scheduler name: first argument if given, else taken from the
# script's filename (iosched-bfq.start -> bfq)
if [ -n "$1" ] ; then
    scheduler="$1"
else
    filename="$(basename -- "$0" ".${0##*.}")"
    scheduler="${filename#*-}"
fi
ebegin "Setting I/O scheduler to $scheduler"
# in case the required scheduler happens to be built as a module.
modprobe -q "${scheduler}-iosched" || exit $?
if [ -e "$sysfs_sched" ] && grep -q "$scheduler" "$sysfs_sched" ; then
    echo "$scheduler" > "$sysfs_sched"
fi
eend $? |
Note that the filename, or a parameter passed to the script, will set the "$scheduler" to be used.
best ... khay |
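The filename trick in the script above relies on two parameter expansions. Pulled out into a standalone function (the name name_to_scheduler is made up for this example), it behaves like this:

```shell
#!/bin/sh
# Derive a scheduler name from a local.d script path:
# strip the directory and the final extension, then drop everything
# up to and including the first "-".
name_to_scheduler() {
    filename="$(basename -- "$1" ".${1##*.}")"
    echo "${filename#*-}"
}

name_to_scheduler /etc/local.d/iosched-bfq.start          # prints: bfq
name_to_scheduler /etc/local.d/iosched-mq-deadline.start  # prints: mq-deadline
```

Since only the part up to the first "-" is removed, scheduler names that contain a hyphen survive intact, so copying or symlinking the script under a different name is enough to select a different scheduler.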
|
|
|
|