Gentoo crashes when trying to do a 'heavy' task

dragonfire2003 · Post by **dragonfire2003** » Sun Apr 10, 2022 8:57 pm

Good morning, people of the Gentoo Forums! Recently, a very ODD behavior I noticed with my Gentoo install is that it crashes whenever I try to do a heavy task.
(And yes, said tasks can be done on other operating systems, even Arch.)
Specs:
- 24 GB of RAM
- RTX 2060
- AMD Ryzen 5
- SSD
I'm using the binary kernel.
Tasks that make Gentoo crash:
- Trying to render a video in DaVinci Resolve
- Trying to play a game in Wine
There are probably more that I haven't found yet, but those are the ones I need a fix for ASAP.
Two things I'm sure aren't related:
- Cooling issues (temperature is normal)
- RAM issues

alamahant · Post by **alamahant** » Sun Apr 10, 2022 9:01 pm

I noticed with my Gentoo install is that it crashes whenever I try to do a heavy task

What happens?
Does it become unresponsive or power off?
What does dmesg say?
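
If the machine powers off, the kernel messages leading up to the crash are normally gone from dmesg on the next boot, so it can help to stream them to disk while reproducing the problem. A rough sketch, assuming a util-linux dmesg that supports --follow; `log_synced` and the log path are names made up for this example:

```shell
#!/bin/sh
# Append each line read from stdin to a file and flush it to disk right away,
# so the last messages have a chance to survive a sudden power-off.
# log_synced is a hypothetical helper name, not part of any package.
log_synced() {
    logfile=$1
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$logfile"
        sync
    done
}

# Typical use, as root, before starting the heavy task
# (the log path is only an example):
#   dmesg --follow | log_synced /var/log/dmesg-crash.log
```

Calling sync after every line is slow, but the point here is only to keep the last few messages, not to be efficient.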

dragonfire2003 · Post by **dragonfire2003** » Sun Apr 10, 2022 9:14 pm

alamahant wrote:
I noticed with my Gentoo install is that it crashes whenever I try to do a heavy task
What happens?
Does it become unresponsive or power off?
What does dmesg say?

It powers off.
dmesg output: https://pastebin.com/NnyMqpZp
The last line is intriguing:

Code:

[   84.301576] xhci_hcd 0000:01:00.0: WARN: buffer overrun event for slot 3 ep 4 on endpoint

alamahant · Post by **alamahant** » Sun Apr 10, 2022 11:10 pm

I don't think it's the culprit, but what do you have connected via USB?
xhci_hcd seems to be USB-related.

dragonfire2003 · Post by **dragonfire2003** » Sun Apr 10, 2022 11:39 pm

alamahant wrote:I don't think it's the culprit, but what do you have connected via USB?
xhci_hcd seems to be USB-related.

Things I have connected:
- a fan
- my mouse
- my keyboard
- my microphone
That's it.

alamahant · Post by **alamahant** » Sun Apr 10, 2022 11:52 pm

Please install linux-firmware:
https://wiki.gentoo.org/wiki/AMD_microcode#Emerge
and maybe check whether the fan is to blame...

dragonfire2003 · Post by **dragonfire2003** » Mon Apr 11, 2022 12:33 am

alamahant wrote:Please install linux-firmware:
https://wiki.gentoo.org/wiki/AMD_microcode#Emerge
and maybe check whether the fan is to blame...

So I reached this part of the linux-firmware install:

Code:

Regenerate the grub config using following command:
root #grub-mkconfig -o /boot/grub/grub.cfg

When I run

Code:

grub-mkconfig -o /boot/grub/grub.cfg

this shows up:

Code:

/usr/sbin/grub-mkconfig: line 260: /boot/grub/grub.cfg.new: No such file or directory

I also tried to render a video in DaVinci Resolve without the fan and without the microphone connected: same result.
Edit: I also cannot edit any kernel settings because I'm using the binary kernel.

alamahant · Post by **alamahant** » Mon Apr 11, 2022 12:38 am

Please try:

Code:

ls  /boot/grub
mountpoint /boot
mount /boot
ls  /boot/grub

dragonfire2003 · Post by **dragonfire2003** » Mon Apr 11, 2022 12:41 am

alamahant wrote:Please try:
Code:
ls  /boot/grub
mountpoint /boot
mount /boot
ls  /boot/grub

Outputs, in order:

Code:

ls: cannot access '/boot/grub': No such file or directory
/boot is a mountpoint
mount: /boot: /dev/sda1 already mounted on /boot.
ls: cannot access '/boot/grub': No such file or directory

alamahant · Post by **alamahant** » Mon Apr 11, 2022 4:20 pm

Then please do:

Code:

umount /boot
ls /boot

Have you actually installed GRUB (grub-install..........)?
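
The grub-mkconfig error earlier in the thread (it could not create /boot/grub/grub.cfg.new) means the target directory /boot/grub does not exist. A small pre-flight check can catch that before running it. This is only a sketch: `check_grub_target` is a hypothetical helper, and the right fix (usually running grub-install for your firmware type) depends on the system.

```shell
#!/bin/sh
# Check that the directory for a GRUB config file exists and is writable.
# check_grub_target is a hypothetical helper name used only in this sketch.
check_grub_target() {
    cfg=$1
    dir=$(dirname "$cfg")
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir (has grub-install been run?)"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}

# Typical use, as root:
#   check_grub_target /boot/grub/grub.cfg && grub-mkconfig -o /boot/grub/grub.cfg
```

The check only looks at the directory; it says nothing about whether /boot is the right partition or whether GRUB itself is installed correctly.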

Post by **NeddySeagoon** » Mon Apr 11, 2022 5:35 pm

dragonfire2003,

As it's only you having this problem, it's something unique to you.
That usually means hardware, as we all share the same software.

Poweroff points to overheating, and the system shutting down to save itself from damage.

Being old and cynical, tell us how you know the temperatures and the RAM are good?

I've just had two faulty RAM sticks. The first one was easy to diagnose. Uncorrectable ECC errors at boot, so booting was not possible.
The second was harder. It too gave uncorrectable ECC errors eventually but it took over a week to pinpoint it to the RAM.
Note that this is ECC RAM too. Ordinary RAM is much harder to diagnose.

If you overclock, that includes XMP, turn it all off.

dragonfire2003 · Post by **dragonfire2003** » Mon Apr 11, 2022 7:05 pm

NeddySeagoon wrote:dragonfire2003,

As it's only you having this problem, it's something unique to you.
That usually means hardware, as we all share the same software.

Poweroff points to overheating, and the system shutting down to save itself from damage.

Being old and cynical, tell us how you know the temperatures and the RAM are good?

I've just had two faulty RAM sticks. The first one was easy to diagnose. Uncorrectable ECC errors at boot, so booting was not possible.
The second was harder. It too gave uncorrectable ECC errors eventually but it took over a week to pinpoint it to the RAM.
Note that this is ECC RAM too. Ordinary RAM is much harder to diagnose.

If you overclock, that includes XMP, turn it all off.

I saw some other people having the same problem in the past but whatever.

Poweroff points to overheating, and the system shutting down to save itself from damage.

Indeed, that should be what's causing my system to shut down, but it doesn't make sense! I have 6 fans along with an external one, and I live in the 9th coldest city in Brazil. I also checked the temperature and it seems fine!

Being old and cynical, tell us how you know the temperatures and the RAM are good?

Temperature seems fine (50°, which is the usual) and I've checked my RAM sticks and they also seem fine.
(I've done a lot of diagnostics; nothing seems wrong with them and there are no weird errors at startup.)

eccerr0r · Post by **eccerr0r** » Mon Apr 11, 2022 7:07 pm

Don't forget bad motherboards. I've had that happen too: when the parts (CPU, RAM) were tested in another board, they worked fine.

And about XMP ... I have one computer that if I disable XMP, the machine won't boot Linux. Running memtest86+ I get tons of errors. With it enabled, machine boots and runs fine, and memtest86+ comes clean. *shrug* not sure what's up with this.

Post by **NeddySeagoon** » Mon Apr 11, 2022 7:13 pm

dragonfire2003,

Tell us how you measure the temperature
Tell us how you tested the RAM.

Did you assemble the system yourself?
If so tell us how the heatsink is fitted to the CPU. Thermal paste and so on.

CooSee · Post by **CooSee** » Mon Apr 11, 2022 10:26 pm

You should try another kernel.

I tried a long-term kernel once, but the system behaved weirdly, so I stayed with the current gentoo-sources.

Regarding the binary kernel (no offence): I never liked it, because too many things are enabled that I never need.

Or maybe try another distro, e.g. Garuda via USB, to see if your system behaves the same way.

Good luck.

mike155 · Post by **mike155** » Mon Apr 11, 2022 11:36 pm

What about the USB error messages in dmesg?

Code:

[   20.387963] usb 1-9: Not enough bandwidth for new device state.
[   20.387968] usb 1-9: Not enough bandwidth for altsetting 1
[   20.387969] usb 1-9: 1:1: usb_set_interface failed (-28)
[   20.393089] usb 1-9: Not enough bandwidth for new device state.
[   20.393090] usb 1-9: Not enough bandwidth for altsetting 1
[   20.393091] usb 1-9: 1:1: usb_set_interface failed (-28)
....

I would definitely try to fix the issue.
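
For what it's worth, the -28 in those lines is ENOSPC, which here means the host controller could not reserve enough bandwidth for the device on port 1-9. One rough way to see which devices are affected is to count the messages per device. A sketch, assuming the message format shown above; `usb_bw_errors` is a name invented for this example:

```shell
#!/bin/sh
# Count "Not enough bandwidth" kernel messages per USB device.
# Reads dmesg-style text on stdin and prints "<count> <device>" lines.
# usb_bw_errors is a hypothetical helper name.
usb_bw_errors() {
    grep 'Not enough bandwidth' |
        sed -n 's/.*usb \([0-9.-]*\): Not enough bandwidth.*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Typical use (reading the kernel log may need root):
#   dmesg | usb_bw_errors
```

The device name (1-9 here) can then be matched against the output of lsusb -t to find out which physical device it is.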

dragonfire2003 · Post by **dragonfire2003** » Tue Apr 12, 2022 1:43 am

NeddySeagoon wrote:dragonfire2003,

Tell us how you measure the temperature
Tell us how you tested the RAM.

Did you assemble the system yourself?
If so tell us how the heatsink is fitted to the CPU. Thermal paste and so on.

Tell us how you measure the temperature

I used my cousin's thermal camera

Tell us how you tested the RAM.

I used a few diagnostic tools and I opened it myself to check if there was anything wrong (I know what I'm doing and I have the tools for opening it)

dragonfire2003 · Post by **dragonfire2003** » Tue Apr 12, 2022 1:45 am

CooSee wrote:You should try another kernel.

I tried a long-term kernel once, but the system behaved weirdly, so I stayed with the current gentoo-sources.

Regarding the binary kernel (no offence): I never liked it, because too many things are enabled that I never need.

Or maybe try another distro, e.g. Garuda via USB, to see if your system behaves the same way.

Good luck.

I wish I could try another kernel, but because of NVIDIA's bullsh*t I can't.
I also tried other systems from a USB stick and even dual boot, and everything works fine
(including exporting videos in DaVinci and such).

dragonfire2003 · Post by **dragonfire2003** » Tue Apr 12, 2022 1:46 am

eccerr0r wrote:Don't forget bad motherboards. I've had that happen too: when the parts (CPU, RAM) were tested in another board, they worked fine.

And about XMP ... I have one computer that if I disable XMP, the machine won't boot Linux. Running memtest86+ I get tons of errors. With it enabled, machine boots and runs fine, and memtest86+ comes clean. *shrug* not sure what's up with this.

Every piece of hardware is working fine and I'm 100% sure about that. I have tested everything on my PC and it all works fine.

I get no errors with memtest86.

eccerr0r · Post by **eccerr0r** » Tue Apr 12, 2022 4:11 am

So, video card problems? Video card drivers forcing you to use specific kernels ... use older video card drivers?

You're ruling out everything, but it's your computer that differs from those of the rest of us, who are not having problems with the same system software...

Goverp · Post by **Goverp** » Tue Apr 12, 2022 7:32 am

A thought, probably irrelevant, but AMD CPUs are notoriously sensitive to heatsink paste. If you fit your own fan and don't get the paste right, in the past at least, you'd get thermal problems or, in the worst case, damage.

Post by **NeddySeagoon** » Tue Apr 12, 2022 8:49 am

dragonfire2003,

A thermal camera will not tell you about your CPU transistor junction temperature, which is one of the ones that matters.

Install lm-sensors and configure it for your motherboard. That will require kernel support if you don't have it.
There may be some kernel-provided temperatures in /sys/class/thermal/... The output is in millidegrees Celsius.

Code:

$ cat /sys/class/thermal/thermal_zone0/temp
42842

That's 42.842 °C.
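
The same conversion can be scripted for every zone at once. A minimal sketch, assuming the kernel exposes thermal zones under /sys/class/thermal (which zones exist, and what they measure, varies by board); `millic_to_c` is a helper name made up for this example:

```shell
#!/bin/sh
# Convert a millidegree-Celsius reading (as found in sysfs) to degrees Celsius.
# millic_to_c is a hypothetical helper; it assumes a non-negative integer.
millic_to_c() {
    printf '%d.%03d\n' "$(( $1 / 1000 ))" "$(( $1 % 1000 ))"
}

# Print every thermal zone the kernel exposes; skip zones that cannot be read.
for zone in /sys/class/thermal/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    temp=$(cat "$zone/temp" 2>/dev/null) || continue
    type=$(cat "$zone/type" 2>/dev/null)
    printf '%s (%s): %s C\n' "${zone##*/}" "$type" "$(millic_to_c "$temp")"
done
```

Running it in a loop while the heavy task is going shows how the readings change under load, which is the number that matters here.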

Boot into memtest86 or memtest86+ and run a few cycles.

Run prime95, which is a good CPU stress test.

You need to tell us what you did and provide results.
Your assertion that it's not the hardware, when it's only you that is having problems, is unlikely to be correct.

Post by **pjp** » Tue Apr 12, 2022 3:50 pm

dragonfire2003 wrote:I saw some other people having the same problem in the past but whatever.

I had a similar problem in the past, and it turned out to be a hardware problem.

The reason people ask what you've done isn't to question your knowledge or abilities. Even very experienced people make mistakes or miss things. The questions are to help others gain a level of confidence that they agree with your analysis. Providing details helps get past that more quickly.

My problem wasn't discovered by running memtest for ~5 hours. I had to run it for >24 hours before I found what turned out to be a motherboard memory slot problem. Only one specific "heavy" task caused the reboot.

Post by **Chiitoo** » Wed Apr 13, 2022 12:42 pm

I'll throw in a power supply unit gone bad, mostly just because I've had that happen way too often... and it can be easy to test if one happens to have more than one lying about.

Speaking of MemTest86 (not +), the free version from PassMark, I used it to confirm bad RAM late last year, but didn't want to RMA it right under Christmas.

Now I wanted to finally go through with it, but wanted to test it once more and... got no errors. :V

Turns out there was a regression introduced in release 9.3, which affects the currently most recent release 9.4.1000 as well.

I got some debug builds from PassMark, and we were able to confirm the issue and it should be fixed in the next release (curiously the paid version is supposedly unaffected). Specifically, errors during the test number 13, hammer test, were not being triggered due to a "single-sided" version of the test being used instead of a "double-sided" version.

eccerr0r · Post by **eccerr0r** » Wed Apr 13, 2022 3:55 pm

Goverp wrote:A thought, probably irrelevant, but AMD CPUs are notoriously sensitive to heatsink paste. If you fit your own fan and don't get the paste right, in the past at least, you'd get thermal problems or, in the worst case, damage.

TBH all CPUs with high power dissipation and "low" temperature tolerance are subject to heatsink paste issues... at least causing thermal throttling events. I knew of the old Athlon XPs that would literally immolate if you did not have a heatsink on (and probably similar if your heatsink paste was not up to snuff) but are the newer ones as sensitive? Haven't gotten a new CPU in ages...