mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Wed Dec 29, 2021 11:13 pm Post subject: kernel 5.15.x breaks root on DHCP+NFS |
|
|
Hi...
Every year's end I make the jump from old LTS to new LTS kernel, meaning this year from 5.10.x to 5.15.x
I boot Gentoo via a kernel that mounts its root fs from NFS:
CONFIG_CMDLINE="ip=dhcp root=/dev/nfs nfsroot=192.168.x.x:/gentoo,tcp,vers=4.1 ...."
and so far everything has worked fine.
With 5.15, however, this no longer works: the kernel hangs at boot and after some time emits
"VFS: Unable to mount root fs via NFS"
I have now spent half a day trying every kernel after 5.10: 5.11, 5.12, 5.13, up to 5.14.21. They all work fine.
5.15.x fails; I tried various combinations of the "new" kernel config options.
5.16-rc7 fails, too.
Looking at the DHCP server log, I don't see a DHCP query from the booting kernel before it tries to mount the NFS root, which would explain the hang: no network is coming up.
So I suspect that the kernel's DHCP IP autoconfiguration is failing, rather than the NFS mount itself.
Still, I see 5.15 brought "exciting NFS changes", maybe these NFS core changes broke something?
Surely someone would have noticed this failing since 5.15 has been out already for a while?
I'm running out of ideas how to debug this further and get 5.15 running....
...do you know of something to google for or try?
Thank you! |
|
alamahant Advocate
Joined: 23 Mar 2019 Posts: 3879
|
Posted: Wed Dec 29, 2021 11:44 pm Post subject: |
|
|
Do you have
Code: |
CONFIG_CMDLINE_BOOL=y
|
? |
|
mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Thu Dec 30, 2021 12:13 am Post subject: |
|
|
alamahant wrote: | Do you have
Code: |
CONFIG_CMDLINE_BOOL=y
|
? |
Yes. |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54244 Location: 56N 3W
|
Posted: Thu Dec 30, 2021 1:54 pm Post subject: |
|
|
mortonP,
Pastebin your 5.15 kernel .config file please.
Check your DHCP server log for signs that an IP was requested and offered. |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21635
|
Posted: Thu Dec 30, 2021 4:29 pm Post subject: |
|
|
Can you drop into an initramfs rescue shell, and look around to determine what is and is not working? You wrote at the beginning that you don't see the DHCP query in the DHCP server log. Can you collect a network packet capture, to confirm that the query was never even sent to the system running the DHCP server? |
|
Anon-E-moose Watchman
Joined: 23 May 2008 Posts: 6098 Location: Dallas area
|
Posted: Thu Dec 30, 2021 6:26 pm Post subject: |
|
|
There's a good chance that an option has either changed or been added for the NFS-related stuff. I'd check that whole subsystem, rather than use the defaults from 5.10. |
|
mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Thu Dec 30, 2021 9:43 pm Post subject: |
|
|
I figured it out, basically by brute-force bisecting the .config changes between 5.14 and 5.15; I do not really understand what many of the options do.
5.14:
│ Symbol: E1000E [=y]
│ Type : tristate
│ Defined at drivers/net/ethernet/intel/Kconfig:58
│ Prompt: Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support
│ Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_INTEL [=y] && PCI [=y] && (!SPARC32 || BROKEN [=n])
5.15:
│ Symbol: E1000E [=y]
│ Type : tristate
│ Defined at drivers/net/ethernet/intel/Kconfig:58
│ Prompt: Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support
│ Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_INTEL [=y] && PCI [=y] && (!SPARC32 || BROKEN [=n]) && PTP_1588_CLOCK_OPTIONAL [=y]
There is no initramfs.
The kernel image itself runs DHCP, mounts the root fs via NFS, and then boots normally as if from a local disk.
This is only possible if all necessary drivers are compiled into the kernel image, including the network device drivers.
From 5.14 to 5.15 the Intel NIC drivers gained an additional dependency, && PTP_1588_CLOCK_OPTIONAL, which was not =y in my config.
So e1000e was automatically demoted to =m as well, and the kernel image lost networking....
...oops
I don't know how to feel about this now; I spent 2 days debugging it.
But once again I learned something, and I hope you did too.
Sorry for bothering - in retrospect the symptoms and the cause absolutely make sense... |
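A minimal sketch of how this kind of breakage could be caught before rebooting: parse the new .config and report every symbol that must be built in (=y) for an initramfs-less NFS root but is not. The symbol list below is an assumption for the setup described in this thread (e1000e NIC, in-kernel DHCP autoconfiguration); it is not a complete list and needs adapting to the hardware in question. The function names are made up for illustration.

```python
import re

# Assumed list of symbols that must be =y for this thread's setup
# (e1000e NIC, kernel-level DHCP, NFS root, no initramfs). Adapt as needed.
REQUIRED_BUILTIN = [
    "E1000E",
    "PTP_1588_CLOCK_OPTIONAL",
    "IP_PNP",
    "IP_PNP_DHCP",
    "NFS_FS",
    "ROOT_NFS",
]


def parse_config(text):
    """Map symbol names (without the CONFIG_ prefix) to their values.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    values = {}
    for line in text.splitlines():
        match = re.match(r"CONFIG_(\w+)=(.*)$", line)
        if match:
            values[match.group(1)] = match.group(2)
            continue
        match = re.match(r"# CONFIG_(\w+) is not set$", line)
        if match:
            values[match.group(1)] = "n"
    return values


def not_builtin(config, required=REQUIRED_BUILTIN):
    """Return the required symbols that are not =y, with their current value.

    Symbols missing from the config are treated as 'n'.
    """
    return {
        sym: config.get(sym, "n")
        for sym in required
        if config.get(sym, "n") != "y"
    }


# Hypothetical .config fragment resembling the broken 5.15 configuration.
sample = """\
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_NFS_FS=y
CONFIG_ROOT_NFS=y
CONFIG_PTP_1588_CLOCK=m
CONFIG_PTP_1588_CLOCK_OPTIONAL=m
CONFIG_E1000E=m
"""

print(not_builtin(parse_config(sample)))
# prints {'E1000E': 'm', 'PTP_1588_CLOCK_OPTIONAL': 'm'}
```

For a real tree, the text would come from something like open("/usr/src/linux/.config").read(); an empty result means every listed symbol is built in.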
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21635
|
Posted: Thu Dec 30, 2021 10:27 pm Post subject: |
|
|
Perhaps there should be an initramfs, so you can drop in and look around when things don't work.
Do you even need this kernel to have CONFIG_MODULES=y? If not, consider disabling module support, which might encourage oldconfig to behave better when next this kind of thing happens. For a kernel booted over the network, I would think that having all kernel functionality built in is a net win, unless you routinely don't use significant amounts of the kernel, but want them available as modules for those rare days you use them.
How did your kernel end up with PTP_1588_CLOCK_OPTIONAL not set to =y? As I read the Kconfig language, it should have been =y, unless you had made PTP support a module:
v5.15.19:drivers/ptp/Kconfig: |
config PTP_1588_CLOCK
    [...]
    depends on NET && POSIX_TIMERS
    default ETHERNET
[...]
config PTP_1588_CLOCK_OPTIONAL
    tristate
    default y if PTP_1588_CLOCK=n
    default PTP_1588_CLOCK
|
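To make the tristate reasoning concrete, here is a small sketch that models only the two default lines of PTP_1588_CLOCK_OPTIONAL quoted above, plus the Kconfig rule that "&&" in a dependency expression evaluates to the minimum of its operands (with n < m < y). It is an illustration with made-up function names, not a reimplementation of Kconfig, and it assumes all other E1000E dependencies are =y.

```python
# Kconfig tristate values, ordered n < m < y.
N, M, Y = 0, 1, 2
NAMES = {N: "n", M: "m", Y: "y"}


def ptp_optional_default(ptp_1588_clock):
    """Default value of PTP_1588_CLOCK_OPTIONAL per the quoted excerpt:
    'default y if PTP_1588_CLOCK=n', otherwise 'default PTP_1588_CLOCK'.
    """
    return Y if ptp_1588_clock == N else ptp_1588_clock


def e1000e_upper_bound(ptp_optional, other_deps=Y):
    """'A && B' evaluates to min(A, B), so the weakest dependency caps
    the highest value E1000E can take."""
    return min(other_deps, ptp_optional)


for ptp in (N, M, Y):
    optional = ptp_optional_default(ptp)
    bound = e1000e_upper_bound(optional)
    print(
        f"PTP_1588_CLOCK={NAMES[ptp]} -> "
        f"PTP_1588_CLOCK_OPTIONAL={NAMES[optional]} -> "
        f"E1000E at most ={NAMES[bound]}"
    )
# prints:
# PTP_1588_CLOCK=n -> PTP_1588_CLOCK_OPTIONAL=y -> E1000E at most =y
# PTP_1588_CLOCK=m -> PTP_1588_CLOCK_OPTIONAL=m -> E1000E at most =m
# PTP_1588_CLOCK=y -> PTP_1588_CLOCK_OPTIONAL=y -> E1000E at most =y
```

Only the middle case, PTP support built as a module, caps the driver at =m, which matches the symptom reported in this thread.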
|
grknight Retired Dev
Joined: 20 Feb 2015 Posts: 1660
|
Posted: Fri Dec 31, 2021 2:18 am Post subject: |
|
|
mortonP wrote: | I figured it out, by basically brute-force bisecting the .config changes between 5.14 and 5.15 - much of the options I do not really understand what they do. |
It is never a bad idea to run the old and new configs through the /usr/src/linux/scripts/diffconfig tool to see what has changed. This is especially worthwhile between major.minor releases, just in case. |
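diffconfig itself takes the two config files as arguments and prints every difference. As a narrower illustration of the idea, the sketch below looks only for the one kind of change that bit this thread: symbols that were =y in the old config and are =m in the new one. It is a simplified stand-in, not the real tool, and the function names and sample configs are made up.

```python
import re


def parse_config(text):
    """Map symbol names (without the CONFIG_ prefix) to their values."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"CONFIG_(\w+)=(.*)$", line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


def demoted_to_module(old_text, new_text):
    """Return the sorted symbols that went from built-in (=y) to module (=m)."""
    old = parse_config(old_text)
    new = parse_config(new_text)
    return sorted(
        sym for sym, value in old.items()
        if value == "y" and new.get(sym) == "m"
    )


# Hypothetical fragments of a 5.14 and a 5.15 config.
old_config = """\
CONFIG_E1000E=y
CONFIG_NFS_FS=y
"""
new_config = """\
CONFIG_E1000E=m
CONFIG_NFS_FS=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=m
"""

print(demoted_to_module(old_config, new_config))
# prints ['E1000E']
```

On a kernel that boots without an initramfs, anything in that list that is needed before the root fs is mounted deserves a closer look.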
|
mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Sun Jan 02, 2022 2:38 pm Post subject: |
|
|
Hu wrote: | Perhaps there should be an initramfs, so you can drop in and look around when things don't work. ;)
How did your kernel end up with PTP_1588_CLOCK_OPTIONAL not set to =y? |
I haven't used an initramfs for years... One kernel image file is enough to keep track of? :-)
I redid the 5.10 -> 5.15 .config upgrade and ended up with PTP_1588 as a module again; either I'm too stupid or there is another dependency somewhere.... |
|
mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Sun Jan 02, 2022 2:41 pm Post subject: |
|
|
grknight wrote: |
It is never a bad idea to run the old and new configs through the /usr/src/linux/scripts/diffconfig tool to see what has changed. |
Ooooh... that's a nice tool, didn't know about that yet. Thank you! :-) |
|
mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Sun Jan 02, 2022 2:56 pm Post subject: |
|
|
I have now also upgraded the NFS server (also Gentoo) from 5.10 to 5.15, and the client-side early boot now fails with
mount: /foobar... : mount(2) system call failed: Object is remote.
*sigh* Something changed server side, too...
Edit:
The client-side mount error occurs when the NFS directory exported on the server side contains a bind mount. So far this was not a problem; with 5.15 it seemingly is.
The client-side hang on boot is actually a loooong delay: it waits a minute for "random fast init done". Missing entropy seems to be a common problem on clients without a keyboard. |
|
toralf Developer
Joined: 01 Feb 2004 Posts: 3922 Location: Hamburg
|
Posted: Sun Jan 02, 2022 4:50 pm Post subject: Re: kernel 5.15.x breaks root on DHCP+NFS |
|
|
mortonP wrote: | Hi...
Every year's end I make the jump from old LTS to new LTS kernel, meaning this year from 5.10.x to 5.15.x | It will probably become an LTS, but as of today that has not been officially announced. |
|
mortonP Tux's lil' helper
Joined: 22 Dec 2015 Posts: 84
|
Posted: Sun Jan 02, 2022 6:08 pm Post subject: Re: kernel 5.15.x breaks root on DHCP+NFS |
|
|
toralf wrote: | mortonP wrote: | Hi...
Every year's end I make the jump from old LTS to new LTS kernel, meaning this year from 5.10.x to 5.15.x | It will become probably an LTS but as of today it is not officially announced. |
According to https://www.kernel.org/category/releases.html it is an LTS. |
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9679 Location: almost Mile High in the USA
|
Posted: Sat Aug 13, 2022 2:21 am Post subject: |
|
|
I was about to make a new post on this but now I think it would have been a dupe...
I was trying to build a fresh PXE boot system. Got the client machine to pull up a kernel just fine, but it fails to find init.
I see the request on the NFS server for rpc.mountd so DHCP and the mount request went out, but basically it sits there after the kernel dhcp client picks up data from the DHCP server...
It sits there for almost 100 seconds before it times out complaining about not finding init.
I suspect I'll have to try something other than a 5.15 kernel to see if my settings are correct or not. Very weird. Also I'll have to redo my initramfs as it does not do nfs at all, so no debug through that (though after dropping into the shell on the initramfs, it clearly can ping on the network, etc., so appears to be an NFS mounting issue at this point.)
---
Hmmm, never mind. I might have different problems here after all; I just got excited by the delay I saw. ip=dhcp should have dumped out the DHCP data, which it does for me in the kernel output, but it hangs for about 100 seconds just after that, then claims it can't find /sbin/init. Fortunately or unfortunately, I see the same results on qemu as I do on physical hardware... |
|
gjaekel n00b
Joined: 24 Nov 2022 Posts: 1
|
Posted: Thu Nov 24, 2022 12:44 pm Post subject: Another core reason, but the same result |
|
|
I ran into a comparable problem today while switching from kernel 5.10 to 5.15 on a Cisco blade center. Here, the blades also boot via PXE: the kernel and initrd are pulled via BOOTP, and the IP is configured by DHCP.
The boot process fails "inside" the initrd while attempting to switch to the new NFS root. It turns out that the rootpath was empty. Unfortunately, the init script does not detect this and handle it as an error. As a result, strange things happen, ending in a kernel panic. By using a kernel command-line parameter I was able to interrupt the boot process and have a look at the kernel messages.
The blade is configured to provide two NICs (eth0 and eth1). It turns out that eth1 "now" becomes ready before eth0. And by accident, it got an answer there from a foreign DHCP server using a DHCP pool which offers no options such as the rootpath.
As written, this is an accident and the result of someone else's misconfiguration. But it seems never to happen when booting the 5.10 kernel, and always with the 5.15 kernel. Please note that booting the blade is very uncomfortable, because the BIOS hardware test takes more than 2 minutes; i.e. "never" and "always" should be read as 3 of 3 times. Therefore, it seems to be related to some "minor" change: with the newer kernel, eth1 also becomes "ready and up" (or does so before eth0), and the DHCP client picks up the announcement on that network.
I was able to solve my issue by using a different parameter value, instead of the simple one used formerly, on the kernel command line provided by the DHCP server. |
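The exact parameters did not survive in the post above, so the following is only an assumption about the kind of change meant: the kernel's built-in IP autoconfiguration accepts a colon-separated ip= value whose sixth field names the device, which restricts DHCP to that interface. The sketch just assembles such a string; the helper name is made up, and an initrd that handles networking itself may expect a different syntax.

```python
def build_ip_param(device, autoconf="dhcp", client_ip="", server_ip="",
                   gateway="", netmask="", hostname=""):
    """Assemble an ip= kernel parameter in the field order
    client-ip:server-ip:gw-ip:netmask:hostname:device:autoconf.

    Fields left empty are filled in by autoconfiguration.
    """
    fields = [client_ip, server_ip, gateway, netmask, hostname,
              device, autoconf]
    return "ip=" + ":".join(fields)


# Restrict DHCP autoconfiguration to eth0 only.
print(build_ip_param("eth0"))
# prints ip=:::::eth0:dhcp
```

With a plain ip=dhcp the kernel may use whichever interface answers first, which would fit the eth1 behaviour described above.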
|