Gentoo Forums
Troubleshooting a non-booting kernel
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 08, 2021 9:21 am    Post subject: Troubleshooting a non-booting kernel

Hi All,

I can usually help myself when it comes to troubleshooting kernel configurations, but now i have one that beats me ...
The hardware involved is a Clevo N240JU laptop (actually branded as BTO), my kernel is configured with all required modules built-in.
The problem is that my kernel doesn't boot most of the time, but sometimes it does. When it does, everything works as expected. I have three kernels in use: 5.4.28 (which always works), 5.4.72 (which works most of the time, but sometimes has the same problem), and now, recently, 5.10.27 (which boots sometimes, but most of the time it doesn't).

When i select it in the grub menu, this is the output: (beware of any typos, i had to type it manually)
Code:

early console in extract_kernel
input_data: 0x00000000027982e0
input_len: 0x0000000000988185
output: 0x000000001000000
output_len: 0x000000000020de89c
kernel_total_size: 0x0000000001e2c000
needed_size: 0x000000002200000
trampoline_32bit: 0x0000000000099000
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

Decompressing Linux... Parsing ELF... Performing relocations... done.
Booting the kernel.

When it works i get the usual texts of all hardware being detected and initialized, but when it fails it stays like this (kept it like this for 15 minutes). When that happens i can only switch it off with a long press on the power button to shut it down.
The 5.10.27 config can be found here: https://ernstagn.home.xs4all.nl/gentoo/logs/config-5.10.27

Any clues?
_________________
The gentoo way of bringing peace to the world:
USE="-war" emerge --newuse @world

My shared code repository: https://code.pa4wdh.nl.eu.org
Music, Free as in Freedom: https://www.jamendo.com
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54237
Location: 56N 3W

Posted: Sat May 08, 2021 10:42 am    Post subject:

pa4wdh.

Quote:
... sometimes ...
That sounds like a hardware problem.

Does it really not boot, or does it boot but display nothing on the console? There are two ways to check.
1) After a successful boot following a failed boot, do you get messages that a filesystem was not cleanly unmounted ... or about replaying a journal?
2) Can you log in over ssh after a 'failed' boot?

Then there are some basic hardware tests.
Boot into memtest86 and run a few cycles. That's a good test of the memory subsystem. Do not run it from inside Linux as that's not useful.
Failures need further investigation. They do not always mean faulty RAM.

Run Prime95. That will test your CPU and cooling system. Keep an eye on the CPU temperature while it runs.
Run at least one complete cycle, but if the CPU temperature does not stabilise, stop it and fix your cooling system.
If you have a laptop, you may well go into thermal throttling.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 08, 2021 11:10 am    Post subject:

Quote:

That sounds like a hardware problem.

Agree, but that doesn't explain why older kernels always boot correctly.

Quote:

Does it really not boot, or does it boot but display nothing on the console. There are two ways to check.
1) After a successful boot, following a failed boot, do you get messages about filesystem was not cleanly unmounted ... or replaying journal ?
2) Can you log in over ssh after a 'failed' boot ?

On a failed boot the system freezes, with no HDD activity whatsoever. Only a "hard" shutdown via the power button works; if the kernel were still alive, ctrl+alt+del should trigger a reboot.
1) On a next (working) boot there's nothing about repairing filesystems.
2) I can't actually try that, i have to enter a LUKS passphrase before networking is configured.

I just installed memtest86+. It reports loads of errors (60k+ within a minute). It's so many that i'm struggling to believe it, since the machine has been compiling rust in tmpfs using all its RAM plus some swap without problems.
As for cooling, it even happens when the system has been off for hours and is booted the first time and the fan is not running (and is working, which can be clearly heard during compiling :) ).
pietinger
Moderator


Joined: 17 Oct 2006
Posts: 4148
Location: Bavaria

Posted: Sat May 08, 2021 12:32 pm    Post subject:

pa4wdh wrote:
I just installed memtest86+. It reports loads of errors (60k+ within a minute). It's so much i'm struggling to believe it, [...]

I go with Neddy and also believe in a hardware problem.
pa4wdh wrote:
Agree, but that doesn't explain why older kernels always boot correctly. [...] since it has been compiling rust in tmpfs using all it's ram + some swap without problems.

Now the big question: When did you do this? When did you compile your old kernel? When did you compile your new kernel?

I don't think it's the kernel config; maybe the problem was compiled in at a time when you had problems with your RAM ...
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 08, 2021 1:31 pm    Post subject:

I just opened my laptop to see what type of memory is in there. It's a single 8GB module, so i can't remove it to experiment with different memory configurations. Sure, the memtest+ output is pretty convincing, but i'd expect general system instability when it's really as bad as memtest+ says it is.

I also got lucky and my new kernel booted. I checked dmesg and noticed the first thing it does is a CPU microcode update. I removed the microcode stuff and now my new kernel boots reliably (tested 10+ times).

Quote:

Now the big question: When you did this ? When have you compiled your old kernel ? When have you compiled your new kernel ?

This question can be answered :)
5.10.27 was compiled yesterday (or with my recent change: today) with GCC 10.2.0
5.4.72 was compiled February 2nd with GCC 9.2.0
5.4.28 was compiled April 2nd 2020 with GCC 9.2.0
Boot tests have all been done today.
pietinger
Moderator


Joined: 17 Oct 2006
Posts: 4148
Location: Bavaria

Posted: Sat May 08, 2021 3:15 pm    Post subject:

pa4wdh,

have you compiled your microcode statically into your kernel?

If yes: you should know that it is then a blob inside your kernel and is NOT loaded at boot time from your /lib/firmware/... This means you could have a different version in each of your older kernels. You can check by booting them and looking at the very beginning of "dmesg". Here is mine as an example:
Code:
May  8 13:12:25 localhost syslogd[1688]: syslogd v2.2.2: restart.
May  8 13:12:20 localhost kernel: microcode: microcode updated early to revision 0xe2, date = 2020-07-14
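
To compare revisions across several kernels it can help to pull just the revision number out of that line. A minimal sketch, assuming the message wording shown above (other kernel versions may phrase it differently); the function name ucode_revision is made up for illustration:

```shell
#!/bin/sh
# Print the revision from a kernel log line such as
#   "microcode: microcode updated early to revision 0xe2, date = 2020-07-14"
# Reads log lines on stdin; prints nothing if no line matches.
# NOTE: ucode_revision is a hypothetical helper, not an existing tool.
ucode_revision() {
    sed -n 's/.*microcode updated early to revision \(0x[0-9a-fA-F]*\).*/\1/p'
}

# Example with the log line quoted above. On a running system one would
# pipe the output of `dmesg` into the function instead.
echo 'May  8 13:12:20 localhost kernel: microcode: microcode updated early to revision 0xe2, date = 2020-07-14' | ucode_revision
```

Running it under each kernel in turn gives one value per kernel to compare.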
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54237
Location: 56N 3W

Posted: Sat May 08, 2021 4:10 pm    Post subject:

pietinger,

Loading CPU microcode from /lib/firmware has been broken for a long time.
It either needs to be in its own initrd or built into the kernel.

The problem relates to updating the microcode that is being used to control the CPU, so it has to be done 'early'.
It's much worse than self modifying code. At least you know that the code being updated is not being executed at the same instant.
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 08, 2021 5:13 pm    Post subject:

I'm indeed building the microcode blob into the kernel. Just to be sure i checked using the instructions on the wiki, and i'm using the right microcode file.

The microcode versions included in the different versions are:
kernel 5.4.28: ref 0x00d6, 2019-10-03
kernel 5.4.72: ref 0x00e0, 2020-06-24
kernel 5.10.27: ref 0x00e2, 2020-07-14 (now not included anymore :-) )
The CPU is an i5-6200U.
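
For checking which file belongs to a given CPU: the files under /lib/firmware/intel-ucode/ are named family-model-stepping in hexadecimal. A small sketch that builds such a name from the decimal "cpu family", "model" and "stepping" values in /proc/cpuinfo. The helper name and the example numbers are only illustrative; read the real values from your own machine.

```shell
#!/bin/sh
# Build the family-model-stepping file name used under
# /lib/firmware/intel-ucode/ from three decimal numbers, as shown in the
# "cpu family", "model" and "stepping" fields of /proc/cpuinfo.
# NOTE: ucode_filename is a hypothetical helper for illustration.
ucode_filename() {
    printf '%02x-%02x-%02x\n' "$1" "$2" "$3"
}

# Illustrative input values only; take the real ones from /proc/cpuinfo.
ucode_filename 6 78 3    # prints 06-4e-03
```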
pietinger
Moderator


Joined: 17 Oct 2006
Posts: 4148
Location: Bavaria

Posted: Sat May 08, 2021 5:46 pm    Post subject:

NeddySeagoon wrote:
Loading CPU microcode form /lib/firmware has been broken for a long time.

Neddy,

thanks a lot! I didn't know, because I have built my microcode into the kernel for many years now (I have never loaded it at runtime).
pietinger
Moderator


Joined: 17 Oct 2006
Posts: 4148
Location: Bavaria

Posted: Sat May 08, 2021 5:50 pm    Post subject:

pa4wdh wrote:
kernel 5.10.27: ref 0x00e2, 2020-07-14 (now not included anymore :-) )
The CPU is an i5-6200U.

Interesting ... :-)
I have an i7-6700 with the same microcode version you have (/had) and kernel 5.10.35 (but I was on 5.10.27 as well) => no problems at all.

(If you want to play a little, you could install the previous microcode and build it into your 5.10 kernel, just to check whether it is really the microcode (but I don't think so).)
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sun May 09, 2021 9:17 am    Post subject:

I've done some tests with the microcode, all done with the 5.10.27 kernel, just swapped the microcode file supplied in the kernel.
ref 0x00d6, 2019-10-03: Boots without problems
ref 0x00e0, 2020-06-24: Boots without problems (like it did before, the problems were so rare i never took time to troubleshoot it)
ref 0x00e2, 2020-07-14: Fails to boot
pietinger
Moderator


Joined: 17 Oct 2006
Posts: 4148
Location: Bavaria

Posted: Sun May 09, 2021 9:31 am    Post subject:

pa4wdh,

so maybe it is not a hardware problem; rather, it's a problem with your microcode ;-)

Many greetings,
Peter


(I love threads like this one, because I also learn more)
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sun May 09, 2021 9:37 am    Post subject:

I don't really think the microcode itself is wrong. The download (via portage) has been verified, and as far as i know the file itself is also signed. If that would be wrong it would simply refuse to install or always fail to boot. It's the "sometimes" that makes it strange.
If i could take a wild guess i would say the memory inside the processor that holds the microcode update during runtime might be faulty, but as far as i know there is no way to test that.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54237
Location: 56N 3W

Posted: Sun May 09, 2021 10:13 am    Post subject:

pa4wdh,

The microcode is always executed from the microcode RAM. The microcode embedded in the CPU is copied to the microcode RAM as part of the CPU's internal reset.
The storage for the embedded microcode is far too slow to operate at the CPU's rated clock speed.

Check the Intel web site for the newest microcode for your CPU. Just occasionally, updates are released within a few days of each other.
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sun May 09, 2021 11:12 am    Post subject:

The intel website isn't really informative when it comes to microcode, their github seems to be ok. The file listed there for my processor is the same version i already have. I verified the sha512sum of the already present ucode file and the new downloaded file and they match, so that also rules out a faulty download of the ucode i used so far.
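
The checksum comparison can be scripted along these lines. This is only a sketch: same_sha512 is a made-up helper name, and the two file paths are whatever copies you want to compare.

```shell
#!/bin/sh
# Compare two files by their SHA-512 digest.
# Returns 0 if the digests match, 1 if they differ, 2 if a file is unreadable.
# NOTE: same_sha512 is a hypothetical helper for illustration.
same_sha512() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Feed the files on stdin so only the digest is compared, not the name.
    a=$(sha512sum < "$1" | cut -d' ' -f1)
    b=$(sha512sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Usage, with placeholder file names:
#   same_sha512 installed-ucode-file downloaded-ucode-file && echo "files match"
```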
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21633

Posted: Sun May 09, 2021 4:14 pm    Post subject:

Your copy of the microcode may be what its authors published, but we still have an open question of whether what they published is just broken for your setup. For example, if they assumed certain performance characteristics that your CPU cannot reliably deliver, or assumed the kernel would behave in a certain way, their code may intermittently fail on your CPU. The fact that the older microcode is completely reliable is interesting, and makes it particularly aggravating that microcode is extremely opaque. We cannot meaningfully analyze whether the version that fails has an implementation defect.
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Mon May 10, 2021 5:22 am    Post subject:

Very true Hu, and exactly the reason why i usually avoid closed software at all costs, but with microcode i don't have a choice.

Also interesting: yesterday while using the laptop i had it crash once, and as far as i saw it was for no good reason. A reboot later the same action worked just fine, which might be an indication that the memory is actually faulty. I think i'll just buy an 8GB memory module, it either solves my problem or it's a nice upgrade :)
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 15, 2021 2:24 pm    Post subject:

Today i finally got some time to get a new memory module.

The new module is marketed as super-duper-fast, and indeed, memtest86+ finds the errors much faster now :)

So now i have two memory modules and my laptop has two slots. No matter which combination of modules and slots i try, the result is always the same. I also noticed that the addresses where memtest86+ finds the errors are also always the same.
That makes me believe that if there is some faulty hardware, it's probably the memory controller which i believe to be inside the CPU. I say if because i still have a hard time believing the huge amount of errors reported by memtest86+ combined with the stability i experience when running it.
The issue with the microcode has also stayed the same. Microcode loading enabled gives a kernel which doesn't boot 9 out of 10 times.

Well ... at least i can compile rust in tmpfs now :)
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21633

Posted: Sun May 16, 2021 5:30 pm    Post subject:

Perhaps the fault is in the memory controller, and is only visible if the number of requests per millisecond exceeds some threshold. In such a case, memtest would trigger it since it does almost nothing other than memory accesses, but more typical uses might not trigger it because they keep stalling out to do real work by waiting for I/O, running computations on the CPU, etc.

Does your firmware offer the ability to configure RAM access timing? Can you try underclocking the RAM to see if that suppresses the errors?
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Mon May 17, 2021 6:14 am    Post subject:

Thanks for the suggestion Hu.
Unfortunately the firmware is quite limited and doesn't allow me to make any settings regarding memory timings. I can check if there's a firmware update available.
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 22, 2021 11:51 am    Post subject:

I did some more testing today. Main thing is that i don't really trust the memtest output, so i need another way to test.

First i let memtest86+ run a bit longer than i did so far. I usually stopped it after a few minutes because even in such a short time 60K+ errors are reported. Now i left it running for about 30 minutes, it found 2,5M+ errors, basically reporting every memory address faulty (which makes it slow because it prints every error).
Then i did a regular boot, switched off swap and created a tmpfs with a huge file as big as it would allow me. This file is almost 16G big, since the rest of my desktop software barely uses anything. If i make it any bigger the OOM killer starts killing random stuff :). The file is created from /dev/urandom and i created a sha512sum of it. Now i have 4 terminals running a loop which waits some random time and runs sha512sum -c for that file to see if any bit errors occur. This has been running for 30 minutes now without any errors, top reports only 12.5M available.
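
A minimal sketch of such a verification loop, not the exact commands used here. The function name verify_loop, the pass count and the pause length are assumptions, as are the paths in the usage comment.

```shell
#!/bin/sh
# Re-verify a checksum file repeatedly and stop at the first mismatch.
#   $1 = checksum file created with `sha512sum FILE > FILE.sha512`
#   $2 = number of passes
#   $3 = maximum pause between passes, in seconds
# NOTE: verify_loop is a hypothetical helper for illustration.
verify_loop() {
    pass=1
    while [ "$pass" -le "$2" ]; do
        # --quiet suppresses the "OK" line; failures are still reported.
        if ! sha512sum -c --quiet "$1"; then
            echo "mismatch on pass $pass" >&2
            return 1
        fi
        # Pause 0..$3 seconds. awk supplies the random number because
        # POSIX sh has no $RANDOM.
        sleep "$(awk -v max="$3" 'BEGIN { srand(); print int(rand() * (max + 1)) }')"
        pass=$((pass + 1))
    done
}

# Usage, with a hypothetical tmpfs mount point:
#   head -c 1G /dev/urandom > /mnt/ramtest/blob
#   sha512sum /mnt/ramtest/blob > /mnt/ramtest/blob.sha512
#   verify_loop /mnt/ramtest/blob.sha512 100 30
```

Starting the same loop in several terminals reproduces the parallel setup described above.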

Do you have any other/better suggestions to test the memory while linux is running?
user
Apprentice


Joined: 08 Feb 2004
Posts: 202

Posted: Sat May 22, 2021 2:45 pm    Post subject:

Quote:
Do you have any other/better suggestions to test the memory while linux is running?

yes, sys-apps/memtester
Code:
# eshowkw sys-apps/memtester
Keywords for sys-apps/memtester:
      |                             |   u   | 
      | a   a     p s   a   r       |   n   | 
      | m   r h   p p   l i i m m s | e u s | r
      | d a m p p c a x p a s 6 i 3 | a s l | e
      | 6 r 6 p p 6 r 8 h 6 c 8 p 9 | p e o | p
      | 4 m 4 a c 4 c 6 a 4 v k s 0 | i d t | o
------+-----------------------------+-------+-------
4.5.0 | + ~ o o + + + + ~ ~ o o ~ o | 7 o 0 | gentoo


If bad memory is found, exclude it from use at the next boot with the kernel parameter memmap=

For example, if memtester found bad memory at address 0x5afcdc70, exclude the surrounding memory range with:
Code:
memmap=64K$0x5afc0000
(low:0x5afc0000 0x5afcdc70 max:0x5afcffff)
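
The base address in that example is the bad address rounded down to a 64 KiB boundary. A small sketch of the arithmetic follows; memmap_64k is a made-up helper name. Some bootloaders (GRUB 2, for instance) may need the `$` escaped when the parameter goes into their config.

```shell
#!/bin/sh
# Print a memmap= argument that reserves the aligned 64 KiB block
# containing the given address.
# NOTE: memmap_64k is a hypothetical helper for illustration.
memmap_64k() {
    # Clear the low 16 bits to round down to a 64 KiB boundary.
    base=$(( $1 & ~0xffff ))
    printf 'memmap=64K$0x%x\n' "$base"
}

memmap_64k 0x5afcdc70    # prints memmap=64K$0x5afc0000
```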
pa4wdh
l33t


Joined: 16 Dec 2005
Posts: 812

Posted: Sat May 22, 2021 5:07 pm    Post subject:

Thanks for the tip user (nice nick by the way :wink: )

I've installed memtester, stopped as many processes as possible and ran it with the largest amount of memory it would accept, which turned out to be 15750M. The tests took a while and all test results are ok.
All times are GMT