Troubleshooting a non-booting kernel

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

Hi All,

I can usually help myself when it comes to troubleshooting kernel configurations, but now i have one that beats me ...
The hardware involved is a Clevo N240JU laptop (actually branded as BTO), my kernel is configured with all required modules built-in.
The problem is that my kernel doesn't boot most of the time, but sometimes it does. When it does, everything works as expected. I have three kernels in use: 5.4.28 (which always works), 5.4.72 (which works most of the time, but sometimes has the same problem), and now recently 5.10.27 (which boots sometimes, but most of the time it doesn't).

When i select it in the grub menu, this is the output: (beware of any typos, i had to type it manually)

NeddySeagoon · Posted: Sat May 08, 2021 10:42 am Post subject:

pa4wdh.

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

pietinger · Posted: Sat May 08, 2021 12:32 pm Post subject:

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

I just opened my laptop to see what type of memory is in there. It's a single 8GB module, so i can't remove it to experiment with different memory configurations. Sure, the memtest+ output is pretty convincing, but i'd expect general system instability when it's really as bad as memtest+ says it is.

I also got lucky and my new kernel booted. I checked dmesg and noticed the first thing it does is a CPU microcode update. I removed the microcode stuff and now my new kernel boots reliably (tested 10+ times).

pietinger · Posted: Sat May 08, 2021 3:15 pm Post subject:

pa4wdh,

Have you compiled your microcode statically into your kernel?

If yes: bear in mind that it is then a blob inside your kernel and is NOT loaded at boot time from /lib/firmware/... This means your older kernels may contain a different version. You can check by booting them and looking at the very beginning of "dmesg". Here is mine as an example:

NeddySeagoon · Posted: Sat May 08, 2021 4:10 pm Post subject:

pietinger,

Loading CPU microcode from /lib/firmware has been broken for a long time.
It either needs to be in its own initrd or built into the kernel.

The problem relates to updating the microcode that is being used to control the CPU, so it has to be done 'early'.
It's much worse than self modifying code. At least you know that the code being updated is not being executed at the same instant.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

I'm indeed building the microcode blob into the kernel. Just to be sure i checked using the instructions on the wiki, and i'm using the right microcode file.

The microcode versions included in the different versions are:
kernel 5.4.28: ref 0x00d6, 2019-10-03
kernel 5.4.72: ref 0x00e0, 2020-06-24
kernel 5.10.27: ref 0x00e2, 2020-07-14 (now not included anymore :-) )
The CPU is an i5-6200U.
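
To compare the list above against what a running kernel actually reports, the `microcode` field in /proc/cpuinfo can be read per logical CPU. The following is only a minimal sketch: the function name `microcode_revisions` is made up for illustration, and it assumes the x86 /proc/cpuinfo layout with one `microcode : 0x...` line per processor entry.

```python
import re

def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text.

    Hypothetical helper: assumes x86-style lines such as "microcode : 0xe2".
    """
    revisions = set()
    for line in cpuinfo_text.splitlines():
        match = re.match(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions

# Usage on a live system (not executed here):
#   with open("/proc/cpuinfo") as handle:
#       print([hex(rev) for rev in sorted(microcode_revisions(handle.read()))])
```

All logical CPUs would normally report the same revision, so more than one value in the returned set is itself worth a closer look.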
_________________
The gentoo way of bringing peace to the world:
USE="-war" emerge --newuse @world

My shared code repository: https://code.pa4wdh.nl.eu.org
Music, Free as in Freedom: https://www.jamendo.com

pietinger · Posted: Sat May 08, 2021 5:46 pm Post subject:

pietinger · Posted: Sat May 08, 2021 5:50 pm Post subject:

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

I've done some tests with the microcode, all done with the 5.10.27 kernel, just swapped the microcode file supplied in the kernel.
ref 0x00d6, 2019-10-03: Boots without problems
ref 0x00e0, 2020-06-24: Boots without problems (like it did before, the problems were so rare i never took time to troubleshoot it)
ref 0x00e2, 2020-07-14: Fails to boot
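
When swapping blobs like this it helps to confirm which revision and date a given file really carries before building it in. The sketch below reads the first update header of a blob and checks its checksum. It assumes the layout Intel documents for microcode updates (a 48-byte little-endian header, a BCD date stored as 0xMMDDYYYY, and all 32-bit words of the update summing to zero); the function names are made up for illustration, and a file holding several concatenated updates would need a loop around this.

```python
import struct

# Nine little-endian 32-bit header fields followed by 12 reserved bytes.
HEADER_FORMAT = "<9I12x"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 48 bytes

def read_microcode_header(blob):
    """Decode the first update header in a microcode blob (assumed layout)."""
    (header_version, revision, date, signature, checksum,
     loader_revision, processor_flags, data_size, total_size) = struct.unpack_from(
        HEADER_FORMAT, blob, 0)
    return {
        "revision": revision,
        # The date field is BCD: month in the top byte, then day, then year.
        "date": f"{date & 0xFFFF:04x}-{date >> 24:02x}-{(date >> 16) & 0xFF:02x}",
        "signature": signature,
        # A total size of 0 denotes the legacy fixed size of 2048 bytes.
        "total_size": total_size or 2048,
    }

def checksum_ok(blob):
    """Return True if the 32-bit words of the first update sum to zero."""
    size = read_microcode_header(blob)["total_size"]
    words = struct.unpack_from(f"<{size // 4}I", blob, 0)
    return (sum(words) & 0xFFFFFFFF) == 0
```

A passing checksum only shows the file is internally consistent. It says nothing about whether the update behaves well on a particular CPU.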

pietinger · Posted: Sun May 09, 2021 9:31 am Post subject:

pa4wdh,

so maybe it is not a hardware problem; rather, it's a problem with your microcode ;-)

Many greetings,
Peter

(I love threads like this one, because I also learn more)

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

I don't really think the microcode itself is wrong. The download (via portage) has been verified, and as far as i know the file itself is also signed. If it were wrong, it would simply refuse to install or always fail to boot. It's the "sometimes" that makes it strange.
If i could take a wild guess i would say the memory inside the processor that holds the microcode update during runtime might be faulty, but as far as i know there is no way to test that.

NeddySeagoon · Posted: Sun May 09, 2021 10:13 am Post subject:

pa4wdh,

The microcode is always executed from the microcode RAM. The embedded (in the CPU) microcode is copied to the microcode RAM as part of the CPU internal reset.
The storage for the embedded microcode is far too slow to operate at the CPU's rated clock speed.

Check the Intel web site for the newest microcode for your CPU. Just occasionally, updates are released within a few days of each other.

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

The Intel website isn't really informative when it comes to microcode, but their GitHub seems to be ok. The file listed there for my processor is the same version i already have. I compared the sha512sum of the ucode file i already had with that of the newly downloaded file and they match, so that also rules out a faulty download of the ucode i used so far.

Hu · Moderator Joined: 06 Mar 2007 Posts: 21633

Your copy of the microcode may be what its authors published, but we still have an open question of whether what they published is just broken for your setup. For example, if they assumed certain performance characteristics that your CPU cannot reliably deliver, or assumed the kernel would behave in a certain way, their code may intermittently fail on your CPU. The fact that the older microcode is completely reliable is interesting, and makes it particularly aggravating that microcode is extremely opaque. We cannot meaningfully analyze whether the version that fails has an implementation defect.

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

Very true Hu, and exactly the reason why i usually avoid closed software at all costs, but with microcode i don't have a choice.

Also interesting: yesterday while using the laptop i had it crash once, and as far as i saw it was for no good reason. A reboot later and the same action worked just fine. Which might be an indication that the memory is actually faulty. I think i'll just buy an 8GB memory module, it either solves my problem or it's a nice upgrade


pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

Today i finally got some time to get a new memory module.

The new module is marketed as super-duper-fast, and indeed, memtest86+ finds the errors much faster now

So now i have two memory modules and my laptop has two slots. No matter which combination of modules and slots i try, the result is always the same. I also noticed that the addresses where memtest86+ finds the errors are also always the same.
That makes me believe that if there is some faulty hardware, it's probably the memory controller which i believe to be inside the CPU. I say if because i still have a hard time believing the huge amount of errors reported by memtest86+ combined with the stability i experience when running it.
The issue with the microcode has also stayed the same. Microcode loading enabled gives a kernel which doesn't boot 9 out of 10 times.

Well ... at least i can compile rust in tmpfs now


Hu · Moderator Joined: 06 Mar 2007 Posts: 21633

Perhaps the fault is in the memory controller, and is only visible if the number of requests per millisecond exceeds some threshold. In such a case, memtest would trigger it since it does almost nothing other than memory accesses, but more typical uses might not trigger it because they keep stalling out to do real work by waiting for I/O, running computations on the CPU, etc.

Does your firmware offer the ability to configure RAM access timing? Can you try underclocking the RAM to see if that suppresses the errors?

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

Thanks for the suggestion Hu.
Unfortunately the firmware is quite limited and doesn't allow me to make any settings regarding memory timings. I can check if there's a firmware update available.

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

I did some more testing today. Main thing is that i don't really trust the memtest output, so i need another way to test.

First i let memtest86+ run a bit longer than i did so far. I usually stopped it after a few minutes because even in such a short time 60K+ errors are reported. Now i left it running for about 30 minutes, it found 2,5M+ errors, basically reporting every memory address faulty (which makes it slow because it prints every error).
Then i did a regular boot, switched off swap and created a tmpfs holding a file as big as it would allow me. This file is almost 16G, since the rest of my desktop software barely uses anything. If i make it any bigger the OOM killer starts killing random stuff. The file is created from /dev/urandom and i created a sha512sum of it. Now i have 4 terminals running a loop which waits some random time and then runs sha512sum -c for that file to see if any bit errors occur. This has been running for 30 minutes now without any errors; top reports only 12.5M available.
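
For anyone who wants to repeat this kind of check, the idea can be sketched roughly as follows. This is not the exact loop used above (that one was plain shell with sha512sum -c); the function names, chunk size and pause handling are assumptions made for illustration, and the path should point into a tmpfs so the data stays in RAM.

```python
import hashlib
import os
import random
import time

def sha512_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so it never has to fit in memory twice."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_random_file(path, size_bytes, chunk_size=1 << 20):
    """Fill a file with random bytes and return the SHA-512 of what was written."""
    digest = hashlib.sha512()
    remaining = size_bytes
    with open(path, "wb") as handle:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            handle.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()

def verify_repeatedly(path, expected, rounds, max_pause=0.0):
    """Re-hash the file `rounds` times with random pauses; count mismatches."""
    mismatches = 0
    for _ in range(rounds):
        time.sleep(random.uniform(0.0, max_pause))
        if sha512_of(path) != expected:
            mismatches += 1
    return mismatches
```

Hashing the data while it is being written, rather than hashing the finished file, means a bit flipped on the way into memory would also show up as a mismatch. Running several copies of `verify_repeatedly` in parallel mimics the four terminals.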

Do you have any other/better suggestions to test the memory while linux is running?

user · Apprentice Joined: 08 Feb 2004 Posts: 202

pa4wdh · l33t Joined: 16 Dec 2005 Posts: 812

Thanks for the tip user (nice nick by the way :wink: )

I've installed memtester, stopped as many processes as possible and ran it with the largest amount of memory it would accept, which turned out to be 15750M. The tests took a while and all test results are ok.
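
As a rough illustration of what a user-space pattern test does, here is a minimal sketch that fills a buffer with fixed byte patterns and counts the bytes that read back differently. It is not a substitute for memtester, which locks its memory and runs many more patterns; the function names and the choice of patterns here are assumptions made for illustration only.

```python
def fill(buffer, pattern, chunk=1 << 20):
    """Overwrite every byte of `buffer` with `pattern`, one chunk at a time."""
    block = bytes([pattern]) * chunk
    with memoryview(buffer) as view:
        for start in range(0, len(buffer), chunk):
            end = min(start + chunk, len(buffer))
            view[start:end] = block[: end - start]

def count_mismatches(buffer, pattern):
    """Return how many bytes in `buffer` differ from `pattern`."""
    return len(buffer) - buffer.count(bytes([pattern]))

def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern into a fresh buffer and count bytes that read back wrong."""
    buffer = bytearray(size_bytes)
    errors = 0
    for pattern in patterns:
        fill(buffer, pattern)
        errors += count_mismatches(buffer, pattern)
    return errors
```

A process like this cannot choose which physical addresses it gets, so a clean result only covers whatever pages the kernel happened to hand out.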