Gentoo Forums
Bad performance when system under full load
e8root
n00b

Joined: 09 Feb 2024
Posts: 72

PostPosted: Thu Apr 18, 2024 7:26 pm    Post subject: Bad performance when system under full load

Let's say I use kernel 6.6.21 (not really, but it doesn't matter), the latest stable, with default settings. The computer is a water-cooled Core i9 13900KF with HT disabled and fixed clocks. No stability issues, no throttling. Normally I run the performance frequency-scaling governor on kernel 6.8.x, apart from the tests I did on 6.6.21. The GPU is a Radeon 6900XT. The only boot options are EDID-related; I haven't touched anything related to the kernel, scheduling, etc. The profile is systemd 23.0. The issue happened before as well.

When I, e.g., run ebuild and play a game at the same time, the moment all cores are in use the game starts to stutter. The only fix I have found is to set the affinity of ebuild (and every other application that might want all cores) so that some cores stay free for the game. For whatever reason this doesn't work that well with E-cores unless I also restrict the game to exactly the cores I took away from ebuild - in other words, it looks like the kernel keeps scheduling the game on P-cores it isn't allowed to use.

Changing niceness doesn't really change much - in fact it doesn't change anything at all. I can run e.g. genkernel + ebuild at niceness -19 and the game at 19, and it plays the same as if I ran the builds at 19 and the game at -19.

The same thing happened on 6.8.x with various settings. I tried:
- changing the timer frequency
- various preemption models
- simple tick-based vs dynticks, plus the related timer-tick settings
- messing with various options for fine granularity, high-precision timers, etc.
- changing the BMQ CPU scheduler to the PDS CPU scheduler
- completely disabling the zram services

...and then some completely desperate measures:
- changing drivers in /sys/devices/system/cpu/cpuidle/ and lots of other options which make no sense given the issue
- disabling the idle driver (or whatever it's called) to force full clock speed, even though it doesn't look like an idle-clocks issue
- disabling mitigations (yes, I was getting desperate)

Nothing helps.

Of course it's not only a Wine-games-plus-ebuild issue. Running anything that uses all cores makes the system unresponsive, and at times it can't even handle playing YT, depending on what else I run. Even dragging windows around in X11 gets stuttery.

Does anyone have any idea how to fix this?

P.S. I put it here because it is not strictly related to games. It looks like a kernel issue.
_________________
Unix Wars - Episode V: AT&T Strikes Back
Ralphred
Guru

Joined: 31 Dec 2013
Posts: 501

PostPosted: Thu Apr 18, 2024 8:41 pm    Post subject: Re: Bad performance when system under full load

e8root wrote:
When I, e.g., run ebuild and play a game at the same time, the moment all cores are in use the game starts to stutter.
This is normal. The only way around it I can think of is setting -l in MAKEOPTS to leave "overhead" for whatever game you are playing, but you need to be mindful of memory usage, as that can cause stuttering too when things get swapped out.
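Something along these lines in /etc/portage/make.conf, for example (the numbers are only a starting point for a 24-core machine, not a recommendation):

Code:

MAKEOPTS="-j24 -l20"

With -l20, make stops spawning new jobs while the load average is above 20, which tends to leave a few cores' worth of headroom for the game.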

e8root wrote:
...with fixed clocks... Normally I run the performance frequency-scaling governor

This isn't the best way; leaving as much "thermal overhead" as possible will maximise boost-clock usage. When gaming I'll peak at 4.8GHz using powersave, but only at 4.2GHz using performance; this is because of the opportunistic nature of modern boost-frequency algorithms. Obviously, if Intel doesn't provide a frequency-scaling algorithm that copes with the variable load games exert, you'll just have to run performance and accept that it's "less than optimal".
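For reference, one way to switch the governor at runtime and watch the effect (this assumes the intel_pstate driver, which exposes performance and powersave):

Code:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo powersave | tee /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor
watch -n1 "grep MHz /proc/cpuinfo"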
e8root
n00b

Joined: 09 Feb 2024
Posts: 72

PostPosted: Fri Apr 19, 2024 6:12 am

Quote:
This is normal. The only way around it I can think of is setting -l in MAKEOPTS to leave "overhead" for whatever game you are playing, but you need to be mindful of memory usage, as that can cause stuttering too when things get swapped out.

If it always worked like that on every system then I might agree it's too hard a problem to solve.

Then again, I have always used Windows and I didn't see such issues there.
If you have enough memory, run the system, the game, and the heavy background application off separate hard drives, and don't have HyperThreading/SMT disabled, then the only way for your foreground application to run slower is background processes using memory and caches - which in practice means slightly lower frame rates in games, but not by much, and certainly nothing like stuttering.

So this is not an unsolvable issue - in fact there is no such issue, and I refuse to even acknowledge that Linux kernel developers could ever make a system that cannot handle background tasks correctly. Something somewhere is causing it. I am not yet one hundred percent sure it is really the kernel; maybe it has to do with Wine, as the issue is mostly visible in games and I don't have any native Linux games yet... but I intend to try some. I probably should have done that first, BUT I guessed maybe someone had already experienced this and resolved it, so no need to reinvent the wheel. Or at least someone could throw me a hint, or confirm that it can be solved (that rare Internet thing called "let's check if this happens on my system" - rare even when checking would take less than a minute...), in which case there would at least be hope.

--------
As for clocks etc., I run a flat rate near boost clock at fixed voltage. No temperature/throttling issues - see the "water cooled direct die" part of the OP. In fact, as far as temperatures go, compilers aren't that bad: even 100% CPU usage doesn't heat the CPU up as much as something like Cinebench would, let alone Prime95 with AVX2, and even with Cinebench R23 in a loop I would say temps aren't bad. I had much worse on a 13600KF on air and didn't have any issues, in Windows at least.
Also, I don't use HyperThreading, so that unfortunate tech cannot be causing any issues. Bad as HT is, it could at most reduce P-core performance to ~65% and wouldn't really cause stuttering - again, in Windows. In Linux, so far, I get severe stuttering and have to change thread affinities manually, or games are unplayable.
_________________
Unix Wars - Episode V: AT&T Strikes Back
logrusx
Veteran

Joined: 22 Feb 2018
Posts: 1558

PostPosted: Fri Apr 19, 2024 7:36 am

e8root wrote:
Quote:
This is normal. The only way around it I can think of is setting -l in MAKEOPTS to leave "overhead" for whatever game you are playing, but you need to be mindful of memory usage, as that can cause stuttering too when things get swapped out.

If it always worked like that on every system then I might agree it's too hard a problem to solve.

Then again, I have always used Windows and I didn't see such issues there.


Do not compare apples to space balls.

Linux kernel multitasking is geared towards server loads. The only way known to me to have system processes not preempt user processes is Con Kolivas' patches and his MuQSS scheduler, which are no more.

Best Regards,
Georgi
Goverp
Advocate

Joined: 07 Mar 2007
Posts: 2012

PostPosted: Fri Apr 19, 2024 9:15 am

A few things that might help (in no particular order), though no guarantees, and you're probably already aware of them:

  • Use the BFQ I/O scheduler instead of mq-deadline.
  • Ensure you have --load-average and --jobs in MAKEOPTS combined with the same variables (confusingly) in EMERGE_DEFAULT_OPTS, to leave some resources for non-emerge processes. This requires some finesse: --load-average means the same in both places, whereas the --jobs values multiply to give the number of concurrent compile processes permitted (see the first sketch after this list).
  • Ditto, ensure that the number of possible processes multiplied by 2GB (if you normally use gcc) or 1.5GB (for clang) still leaves memory for those other processes.
  • Have enough swap space - though in my experience with modern software, performance suffers once paging or, worse, swapping becomes significant.
  • Enable ZSWAP using Z3FOLD and LZO compression (see the second sketch after this list) - reportedly a lot of pages marked for swapping are mostly zeros, so the compression works well and fast, and thus it provides fast swap; it might also help the kernel realize that memory is under pressure (I don't know if that helps scheduling) without incurring as much overhead as disk swap. Whether that's worth the reduction in RAM available outside the ZSWAP area is open to evaluation. (IMHO shun ZRAM for swap, as IIUC it can't reclaim space once used for swapping.)
  • Remember that memory allocated to ZRAM or tmpfs for PORTAGE_TMPDIR (and implicitly used if MAKEOPTS includes --pipe) isn't available to applications - making emerge run faster that way will make other applications run slower.
  • I'm pretty sure NUMA is irrelevant, but someone who actually knows about it might say otherwise!
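To illustrate the multiplication (values purely illustrative, not a recommendation): with the settings below, emerge may build up to 4 packages at once and each make may run up to 8 jobs, so up to 32 compile processes can exist, while both load-average caps stop new jobs from being spawned once the load passes 20.

Code:

# /etc/portage/make.conf -- illustrative values
MAKEOPTS="-j8 -l20"
EMERGE_DEFAULT_OPTS="--jobs=4 --load-average=20"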

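And a sketch of the ZSWAP bullet, assuming the kernel was built with CONFIG_ZSWAP, CONFIG_Z3FOLD and LZO support (the same knobs are also available at runtime under /sys/module/zswap/parameters/):

Code:

# appended to the kernel command line
zswap.enabled=1 zswap.compressor=lzo zswap.zpool=z3fold zswap.max_pool_percent=20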
_________________
Greybeard
pietinger
Moderator

Joined: 17 Oct 2006
Posts: 4186
Location: Bavaria

PostPosted: Fri Apr 19, 2024 11:29 am

Goverp wrote:
I'm pretty sure NUMA is irrelevant, but someone who actually knows about it might say otherwise!

Yes, with this Intel CPU you don't need NUMA (it is really only important for systems that actually have NUMA).
_________________
https://wiki.gentoo.org/wiki/User:Pietinger
e8root
n00b

Joined: 09 Feb 2024
Posts: 72

PostPosted: Fri Apr 19, 2024 7:48 pm

logrusx wrote:
Do not compare apples to space balls.

Apples, no. The only experience I've had with Apple was a hackintosh with OSX 10.5 - just enough to get an idea of how much more bloated than even Vista that system was :)

Quote:
Linux kernel multitasking is geared towards server loads. The only way known to me to have system processes not preempt user processes is Con Kolivas' patches and his MuQSS scheduler, which are no more.

I'm not sure this is really the issue. I cannot imagine how spikes in latency (stutters) in a game could possibly help with server workloads.

IMHO the Linux kernel is optimized for Linus's computer, with his exact BIOS settings and use cases. Anything else goes in, even if it breaks something else. Except all these gaming patches - which I intend to try!

That said, I am wondering whether this isn't another SpeedStep situation, and maybe I disabled something in UEFI that wasn't needed by Windows and its "just works" scheduler. Or maybe Intel's P/E-core implementation is bad. The strange thing I observe is that the game still stutters even if I keep the build tasks off a group of E-cores, until I also restrict the game to exactly those E-cores. That strongly indicates something is up - though I don't have another computer to confirm whether the same would happen on, e.g., Zen if I did that with the last few cores.
_________________
Unix Wars - Episode V: AT&T Strikes Back
pietinger
Moderator

Joined: 17 Oct 2006
Posts: 4186
Location: Bavaria

PostPosted: Fri Apr 19, 2024 9:28 pm    Post subject: Re: Bad performance when system under full load

e8root wrote:
[...] Core i9 13900KF with HT disabled [...]

For whatever reason this doesn't work that well with E-cores unless I also restrict the game to exactly the cores I took away from ebuild - in other words, it looks like the kernel keeps scheduling the game on P-cores it isn't allowed to use.

Just for the record:

Your i9 has:
16 E(fficiency)-cores (Gracemont microarchitecture), and
8 P(erformance)-cores (Raptor Cove microarchitecture) supporting Hyper-Threading, i.e. 16 logical cores on that side,
... together, 32 logical cores.

If you disable SMT you will have 16 E-cores and 8 P-cores active ... together 24 physical cores ... and 24 logical cores!

My questions would be:

1. Have you set MAKEOPTS= in your /etc/portage/make.conf? To what value?
2. How do you disallow some cores for your games? Do you work with cgroups?

(I also have an i9-13900K ... with SMT enabled ... so I can use 32 logical cores ... and when I try to compile a bigger package with MAKEOPTS="-j32" I reach 100°C within 2 or 3 seconds ... although I have one of the most powerful AIO CPU water coolers ...)

If you want a check of your kernel .config, I would need your .config and all 3 files listed here: https://wiki.gentoo.org/wiki/User:Pietinger/Overview_of_System_Information
_________________
https://wiki.gentoo.org/wiki/User:Pietinger
pietinger
Moderator

Joined: 17 Oct 2006
Posts: 4186
Location: Bavaria

PostPosted: Fri Apr 19, 2024 9:47 pm

e8root wrote:
IMHO the Linux kernel is optimized for Linus's computer, with his exact BIOS settings and use cases. Anything else goes in, even if it breaks something else. Except all these gaming patches - which I intend to try!

No ... just no ... 8)

Maybe watch this YT video from the great kernel developer André Almeida:
"Kernel Recipes 2023 - Linux and gaming: the road to performance"
https://www.youtube.com/watch?v=KcKTWqLPaoM
_________________
https://wiki.gentoo.org/wiki/User:Pietinger
logrusx
Veteran

Joined: 22 Feb 2018
Posts: 1558

PostPosted: Fri Apr 19, 2024 11:00 pm

e8root wrote:

I cannot imagine how spikes in latency (stutters) in a game could possibly help with server workloads.


Imagination is not at play here. That's exactly what happens when throughput is prioritized over responsiveness. This crap Intel has come up with only makes the issue more complicated.

Having said that, I'm out.

Best Regards,
Georgi
Anon-E-moose
Watchman

Joined: 23 May 2008
Posts: 6100
Location: Dallas area

PostPosted: Fri Apr 19, 2024 11:26 pm

You could try taskset.

I use it to run my Windows VMs; it sets aside whatever cores/threads you tell it to, for whatever purpose.

Do keep in mind that even if half of the cores are set aside for a VM, performance will still be affected ... somewhat.
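For example (the CPU numbers here assume the P-cores are logical CPUs 0-7 and the E-cores 8-23; verify the real layout with lscpu --extended first):

Code:

# start a build confined to the E-cores (package name is a placeholder)
taskset -c 8-23 emerge --oneshot app-misc/foo
# repin an already-running process and all of its threads to the P-cores
taskset -a -p -c 0-7 12345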
_________________
PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland
e8root
n00b

Joined: 09 Feb 2024
Posts: 72

PostPosted: Sun Apr 21, 2024 9:41 am

I have made a python script which, as a kind of toggle, sets PID 1 and all its children to avoid N cores, and sets gamescope (which I use to play games in a separate TTY) and its children (read: the game) to use those N cores, with the 8 P-cores as the default.
I can run not just 24 compiler/linker threads but multiples of that, and games run flawlessly as long as there is enough memory - and realistically I never saw the compiler come close to even 1 GB per thread, let alone 2, so with 64GB, memory is not an issue.
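Roughly, the idea looks like this - a simplified sketch, not my actual script; it assumes the P-cores are logical CPUs 0-7, the E-cores 8-23 (HT off), and that /proc/<pid>/task/<pid>/children is available (CONFIG_PROC_CHILDREN=y):

Code:

#!/usr/bin/env python3
# Sketch: keep the PID-1 tree off the P-cores, give them to the game.
import os

P_CORES = set(range(8))      # assumed P-core numbering
ALL_CORES = set(range(24))   # 8 P-cores + 16 E-cores, HT disabled

def descendants(pid):
    """Return pid plus all of its descendants, walking /proc."""
    pids = [pid]
    try:
        with open(f"/proc/{pid}/task/{pid}/children") as f:
            for child in f.read().split():
                pids += descendants(int(child))
    except OSError:
        pass                 # process vanished while we walked the tree
    return pids

def pin_tree(root, cpus):
    """Pin a whole process tree to the given CPU set."""
    for pid in descendants(root):
        try:
            os.sched_setaffinity(pid, cpus)
        except OSError:
            pass             # raced with an exiting process

def pid_of(name):
    """First PID whose comm matches name (simplistic lookup)."""
    for entry in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    return int(entry)
        except OSError:
            continue
    return None

pin_tree(1, ALL_CORES - P_CORES)   # everything else -> E-cores
game = pid_of("gamescope")
if game is not None:
    pin_tree(game, P_CORES)        # the game tree -> P-cores

(Run as root; a real version would also repin newly spawned threads, since sched_setaffinity applies per thread.)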

Hints:
- different zram compression - good point; zstd might not be ideal for latency-sensitive situations. So far, testing different kernel options with zram enabled or disabled, I haven't noticed many issues caused by zram. Something to consider, though - maybe do a synthetic test based on common usage, e.g. open a bazillion tabs in firefox/chrome plus all sorts of other things, and see how different algorithms fare using different QPIs.

- BFQ I/O - I'm pretty sure the I/O scheduler isn't the cause, but the whole I/O-scheduler topic is on the list of things I plan to investigate.

- make.conf settings - no memory issues. Number of threads per job - here I go with "more than optimal" for testing. Maybe I have wrong expectations about how it should work, and throwing multiple builds, each using the full core count, at the system simply overwhelms it; but then again, in my testing the amount of stutter differs from run to run, and at some settings it wasn't that bad - unfortunately I didn't note the exact settings, so I need to redo the tests.

- swap space - twice the memory, mostly so hibernation doesn't complain. I'm a big fan of hibernation - also because it allows running multiple sessions in multiple operating systems, e.g. both Linux and Windows can be hibernated to run the other system. Should my system ever experience swapping that zram cannot mitigate, I'd just buy two 32GB sticks for a total of 128GB - then I could even use Chrome :lol:

- zram /tmp and /var/tmp - I do remember

- numa - P/E cores seem to be handled solely by the intel_pstate driver and don't use typical NUMA config options.
That said, Alder/Raptor Lake are in fact structured in a way where this driver should take the NUMA-like characteristics of these chips into account to achieve the best performance. Whether it does - I'm not really sure.
What I mean is that, for example, E-cores come in groups of 4, and latency between threads within one E-core group is much lower than between groups, so if you ran e.g. a game that uses 4 threads equally, there would be a performance difference between it using 4 cores from one E-core group versus one core from each of four groups. By how much, I didn't test, and I guess the caches in the E-core groups can to some extent mitigate the latency penalty (especially on Raptor Lake), but I would still expect lower performance in the latter case, at least when the threads communicate often.

- makeopts - I had -j25 -l24, changed to -j24 -l24 - not sure which is more correct, but probably the latter, given that it already produces 100% CPU load, and running two ebuilds at the same time makes the latency/stuttering issue in the game worse.

- instant 100°C on an AIO-cooled 13900K...
Yeah, that is why I use a direct-die custom water loop. At 5.6/4.5 with no HT it sits around ~70°C after a longer while, and with HT ~80°C.
Also, I don't use auto settings in UEFI but ~1.32V for vcore, with all voltages set to override, so I can slowly optimize them over time. I haven't tested this much on the 13900KF, and these things depend on the motherboard, but on a 13600KF on an MSI Z790 board, running the CPU at auto versus slightly above ~1.1V (the 13600KF doesn't need much at its clocks) makes a difference of roughly 100W. I saw in reviews that to get >40K points in Cinebench just by removing power limits they had to draw almost 400W; I did it at ~310W +/- 10W with more (but not fully) optimized settings, so there are definite power savings from just tweaking voltages. In practice you can do an all-core overclock on these processors while reducing power consumption.
Personally I don't use Hyper-Threading, and that helps a lot. HT improves throughput but leaves P-cores with only ~65% per-thread performance when all cores are loaded. That wouldn't be an issue if schedulers were tuned for interactive use - for example, NOT putting background work on the P-core used by the thread whose window you are actively clicking in. Not even Windows has interactivity optimizations that aggressive... though with Raptor Lake there is in fact a setting, on by default, which forces background tasks onto E-cores, so it can accomplish that - maybe in a less-than-optimal way (you typically don't need all the P-cores sitting idle), and it doesn't always work, so I prefer to disable this "Thread Director" feature alongside HyperThreading and get roughly similar interactivity latency and background-task performance overall.

For Linux I might have to rethink my setup, do some tests, check whether intel_pstate perhaps misbehaves with Hyper-Threading disabled but not with it enabled, etc. So far HT didn't seem to affect the issue, so probably not, but I cannot claim to have tested every combination of settings. There were times when I would say latency was OK-ish, with only occasional and minimal stutters, but I forgot to note the settings down, thinking I'd remember them - and I didn't. I'm also not sure those settings would be optimal for allocating threads to P-cores first. There is, unfortunately, a lot to worry about with these CPUs, and if Intel did a poor job supporting every setting combination and use case, I might fall under "we didn't optimize for that".

That's also why I joked about the kernel being optimized for Linus's specific settings. There was, after all, an 'incident' where he rejected a PR (or maybe it was his enraged message about a PR that had been merged and was quickly reverted) for the 6.8 kernel branch, because his Threadripper, with its settings, didn't play nice with the changes. I can only imagine that if Linus used a Raptor Lake CPU and were an avid gamer who also builds software while playing on the same PC, Linux would be much better optimized for these use cases than it is now.
_________________
Unix Wars - Episode V: AT&T Strikes Back
pietinger
Moderator

Joined: 17 Oct 2006
Posts: 4186
Location: Bavaria

PostPosted: Sun Apr 21, 2024 11:06 am

e8root wrote:
[...] P/E cores seem to be handled solely by the intel_pstate driver and don't use typical NUMA config options. [...]

AFAIK Intel is still working on some optimizations for these kinds of CPUs ... we have to wait for 6.9 (or even 6.10) ...

https://www.phoronix.com/news/Linux-6.9-APIC-x86-CPU-Topology
https://www.phoronix.com/news/Linux-Per-CPU-NUMA-Node-Cpumask

e8root wrote:
[...] Yeah, that is why I use a direct-die custom water loop. [...]

Wow ... 8O ... did you do it yourself?
_________________
https://wiki.gentoo.org/wiki/User:Pietinger
ThePsyjo
n00b

Joined: 24 Apr 2024
Posts: 1

PostPosted: Wed Apr 24, 2024 2:36 pm

Since I switched to =sys-kernel/gentoo-sources-6.8.5 I have had bad performance under full load too - so much so that the desktop was barely usable. When updating the config I found 'CONFIG_SCHED_BMQ' interesting and just enabled it.

What I found out today is that 'CONFIG_SCHED_ALT=y' together with 'CONFIG_SCHED_BMQ=y' is the reason for the stuttering under load, because it disables 'CONFIG_SCHED_AUTOGROUP=y'. After going back to

Code:

# CONFIG_SCHED_ALT is not set
# CONFIG_SCHED_BMQ is not set
CONFIG_SCHED_AUTOGROUP=y


everything runs smoothly again.
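For what it's worth, when CONFIG_SCHED_AUTOGROUP is built in, it can also be toggled at runtime, which makes for a quick A/B test of whether autogrouping is really what makes the difference:

Code:

cat /proc/sys/kernel/sched_autogroup_enabled
echo 1 > /proc/sys/kernel/sched_autogroup_enabled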
e8root
n00b

Joined: 09 Feb 2024
Posts: 72

PostPosted: Sat Apr 27, 2024 6:06 am

@ThePsyjo
Thanks - your settings didn't resolve the issue with stuttering in games under full load, or at least not fully, but I had noticed that on new kernels threads wouldn't be correctly assigned to P-cores.

Without a proper scheduler, Raptor Lake can show wild performance fluctuations - e.g. different benchmark results depending on where threads land (P- or E-cores) - and now it works solidly.

Can you paste your kernel config somewhere? Maybe I am missing something.
Also, are you using the experimental USE flag on the kernel sources?

@pietinger
Nice that Intel is still working on this topic.
I tried to build a 6.9-tip kernel, but genkernel complained about some VirtualBox headers not compiling, and building with make made GRUB complain about magic numbers. I guess I'll just wait for a normal unstable release in Gentoo.
_________________
Unix Wars - Episode V: AT&T Strikes Back
Back to top
View user's profile Send private message