Gentoo Forums
kernel 5.15.x breaks root on DHCP+NFS
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Wed Dec 29, 2021 11:13 pm    Post subject: kernel 5.15.x breaks root on DHCP+NFS

Hi...

At the end of every year I make the jump from the old LTS kernel to the new one, which this year means from 5.10.x to 5.15.x.
I boot Gentoo via a kernel that mounts its root fs from NFS:
CONFIG_CMDLINE="ip=dhcp root=/dev/nfs nfsroot=192.168.x.x:/gentoo,tcp,vers=4.1 ...."
and so far everything has worked fine.
With 5.15, however, this no longer works: the kernel hangs at boot and after some time emits
"VFS: Unable to mount root fs via NFS"

I have now spent half a day trying every kernel after 5.10: 5.11, 5.12, 5.13, up to 5.14.21. They all work fine.
5.15.x fails; I tried various combinations of the "new" kernel config options.
5.16-rc7 fails, too.

Looking at the DHCP server log, I don't see the DHCP query that the booting kernel should send before mounting the NFS root. That would explain the hang: no network is coming up.
So I suspect that the kernel's DHCP IP autoconfiguration is failing, rather than the NFS mount.
Still, I see that 5.15 brought "exciting NFS changes"; maybe these NFS core changes broke something?
Surely someone would have noticed this failing, since 5.15 has already been out for a while?

I'm running out of ideas for how to debug this further and get 5.15 running....
...do you know of something to google for or try?

Thank you!
alamahant
Advocate
Joined: 23 Mar 2019
Posts: 3879

Posted: Wed Dec 29, 2021 11:44 pm

Do you have
Code:

CONFIG_CMDLINE_BOOL=y

?
_________________
:)
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Thu Dec 30, 2021 12:13 am

alamahant wrote:
Do you have
Code:

CONFIG_CMDLINE_BOOL=y

?


Yes.
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54244
Location: 56N 3W

Posted: Thu Dec 30, 2021 1:54 pm

mortonP,

Pastebin your 5.15 kernel .config file please.

Check your dhcp server log for signs that an IP was requested and offered.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Hu
Moderator
Joined: 06 Mar 2007
Posts: 21635

Posted: Thu Dec 30, 2021 4:29 pm

Can you drop into an initramfs rescue shell, and look around to determine what is and is not working? You wrote at the beginning that you don't see the DHCP query in the DHCP server log. Can you collect a network packet capture, to confirm that the query was never even sent to the system running the DHCP server?
Anon-E-moose
Watchman
Joined: 23 May 2008
Posts: 6098
Location: Dallas area

Posted: Thu Dec 30, 2021 6:26 pm

There's a good chance that an option for NFS-related stuff has either changed or been added. I'd check that whole subsystem rather than use the defaults from 5.10.
_________________
PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Thu Dec 30, 2021 9:43 pm

I figured it out by basically brute-force bisecting the .config changes between 5.14 and 5.15; I do not really understand what many of the options do.

5.14:
│ Symbol: E1000E [=y]
│ Type : tristate
│ Defined at drivers/net/ethernet/intel/Kconfig:58
│ Prompt: Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support
│ Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_INTEL [=y] && PCI [=y] && (!SPARC32 || BROKEN [=n])

5.15:
│ Symbol: E1000E [=y]
│ Type : tristate
│ Defined at drivers/net/ethernet/intel/Kconfig:58
│ Prompt: Intel(R) PRO/1000 PCI-Express Gigabit Ethernet support
│ Depends on: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_INTEL [=y] && PCI [=y] && (!SPARC32 || BROKEN [=n]) && PTP_1588_CLOCK_OPTIONAL [=y]


There is no initramfs.
The kernel image itself runs DHCP, mounts the root fs via NFS and then does a normal boot as if from local disk.

This is only possible if all necessary drivers are compiled into the kernel image, including the network device drivers.
Going from 5.14 to 5.15, the Intel NIC driver gets an additional dependency, && PTP_1588_CLOCK_OPTIONAL, which is not =y by default.
So e1000e automatically becomes =m as well, and the kernel image loses networking....
...oops

I don't know how to feel about this now, having spent 2 days debugging it.
But I learned something again, and I hope you did too.

Sorry for bothering you. In retrospect, the symptoms and the cause absolutely make sense...
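
Editor's sketch: for a kernel that boots without an initramfs, a driver that silently drops from =y to =m is easy to miss. The following minimal check, assuming Python is available on the build host, reads the text of a .config and reports required symbols that are not built in. E1000E comes from this thread; the rest of the REQUIRED_BUILTIN list and the sample config are illustrative assumptions, not a complete list for NFS root.

```python
import re

# Symbols that should be built in (=y) for a kernel that mounts its root over
# NFS without an initramfs. E1000E is the NIC driver from this thread; the
# other entries are an assumption for illustration, adjust to your setup.
REQUIRED_BUILTIN = ["E1000E", "IP_PNP_DHCP", "ROOT_NFS"]


def parse_config(text):
    """Map symbol name -> value for the lines of a kernel .config."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"CONFIG_(\w+)=(.*)", line)
        if m:
            values[m.group(1)] = m.group(2)
            continue
        m = re.match(r"# CONFIG_(\w+) is not set", line)
        if m:
            values[m.group(1)] = "n"
    return values


def not_builtin(text, required=REQUIRED_BUILTIN):
    """Return the required symbols whose value is anything other than 'y'."""
    values = parse_config(text)
    return {
        sym: values.get(sym, "n")
        for sym in required
        if values.get(sym, "n") != "y"
    }


# Hypothetical config fragment resembling the broken 5.15 build.
sample = """\
CONFIG_IP_PNP_DHCP=y
CONFIG_ROOT_NFS=y
CONFIG_PTP_1588_CLOCK=m
CONFIG_E1000E=m
"""
print(not_builtin(sample))
```

Running such a check after every `make oldconfig` would have flagged the e1000e change before the first failed boot.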
Hu
Moderator
Joined: 06 Mar 2007
Posts: 21635

Posted: Thu Dec 30, 2021 10:27 pm

Perhaps there should be an initramfs, so you can drop in and look around when things don't work. ;)

Do you even need this kernel to have CONFIG_MODULES=y? If not, consider disabling module support, which might encourage oldconfig to behave better when next this kind of thing happens. For a kernel booted over the network, I would think that having all kernel functionality built in is a net win, unless you routinely don't use significant amounts of the kernel, but want them available as modules for those rare days you use them.

How did your kernel end up with PTP_1588_CLOCK_OPTIONAL not set to =y? As I read the Kconfig language, it should have been =y, unless you had made PTP support a module:
v5.15.19:drivers/ptp/Kconfig (excerpt):
config PTP_1588_CLOCK
	depends on NET && POSIX_TIMERS
	default ETHERNET

config PTP_1588_CLOCK_OPTIONAL
	tristate
	default y if PTP_1588_CLOCK=n
	default PTP_1588_CLOCK
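
Editor's sketch: the reasoning above can be modelled in a few lines. In the Kconfig language, tristate values are ordered n < m < y and `&&` evaluates to the minimum of its operands, so a `depends on` expression caps how high a symbol can go. The function names below are made up for illustration, and `other_deps` stands in for the rest of the E1000E dependency list, assumed to be =y.

```python
# Kconfig tristate values are ordered n < m < y; "&&" takes the minimum.
ORDER = {"n": 0, "m": 1, "y": 2}


def tri_and(*values):
    """Evaluate a Kconfig '&&' expression over tristate values."""
    return min(values, key=ORDER.__getitem__)


def ptp_optional(ptp_1588_clock):
    """Value of PTP_1588_CLOCK_OPTIONAL per the two 'default' lines quoted above."""
    # default y if PTP_1588_CLOCK=n
    # default PTP_1588_CLOCK
    return "y" if ptp_1588_clock == "n" else ptp_1588_clock


def e1000e_limit(ptp_1588_clock, other_deps="y"):
    """Highest value E1000E can take given its 'depends on' expression."""
    return tri_and(other_deps, ptp_optional(ptp_1588_clock))


for ptp in ("n", "m", "y"):
    print(ptp, ptp_optional(ptp), e1000e_limit(ptp))
```

Under this model, only PTP_1588_CLOCK=m pulls E1000E down to a module, which matches the suspicion that PTP support had been configured as a module.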
grknight
Retired Dev
Joined: 20 Feb 2015
Posts: 1660

Posted: Fri Dec 31, 2021 2:18 am

mortonP wrote:
I figured it out by basically brute-force bisecting the .config changes between 5.14 and 5.15; I do not really understand what many of the options do.


It is never a bad idea to run the old and new configs through the /usr/src/linux/scripts/diffconfig tool to see what has changed. This is especially useful between major.minor releases, just in case.
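
Editor's sketch: the idea behind such a comparison can be shown in a few lines. This is not a replacement for diffconfig and does not imitate its output format; it only illustrates how a y -> m change surfaces when two configs are compared symbol by symbol. Both config fragments are hypothetical.

```python
def read_symbols(text):
    """Parse .config text into {symbol: value}; unset symbols become 'n'."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            symbols[name[len("CONFIG_"):]] = value
        elif line.startswith("# CONFIG_") and line.endswith(" is not set"):
            symbols[line[len("# CONFIG_"):-len(" is not set")]] = "n"
    return symbols


def changed(old_text, new_text):
    """Yield (symbol, old, new) for every symbol whose value differs.

    A symbol missing from one of the configs is reported as None there.
    """
    old, new = read_symbols(old_text), read_symbols(new_text)
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            yield name, before, after


# Hypothetical fragments of a working and a broken config.
old_cfg = "CONFIG_E1000E=y\nCONFIG_NFS_FS=y\n"
new_cfg = "CONFIG_E1000E=m\nCONFIG_NFS_FS=y\nCONFIG_PTP_1588_CLOCK=m\n"
for name, before, after in changed(old_cfg, new_cfg):
    print(name, before, "->", after)
```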
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Sun Jan 02, 2022 2:38 pm

Hu wrote:
Perhaps there should be an initramfs, so you can drop in and look around when things don't work. ;)

How did your kernel end up with PTP_1588_CLOCK_OPTIONAL not set to =y?


I haven't used an initramfs for years... One kernel image file is enough to keep track of? :-)

I redid the 5.10 -> 5.15 .config upgrade and again ended up with PTP_1588 as a module. Either I'm too stupid or there is another dependency somewhere....
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Sun Jan 02, 2022 2:41 pm

grknight wrote:


It is never a bad idea to run the old and new configs through the /usr/src/linux/scripts/diffconfig tool to see what has changed.


Ooooh... that's a nice tool, didn't know about that yet. Thank you! :-)
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Sun Jan 02, 2022 2:56 pm

Now I have also upgraded the NFS server (also Gentoo) from 5.10 to 5.15, and the client-side early boot now fails with
mount: /foobar... : mount(2) system call failed: Object is remote.
*sigh* Something changed server-side, too...

Edit:
The client-side mount error occurs when the NFS directory exported by the server contains a bind mount. So far this was not a problem, but with 5.15 it seemingly is.
The hang on boot on the client side is actually a loooong delay, waiting a minute for "random fast init done". This seems to be a common problem on clients without a keyboard: entropy is missing.
toralf
Developer
Joined: 01 Feb 2004
Posts: 3922
Location: Hamburg

Posted: Sun Jan 02, 2022 4:50 pm    Post subject: Re: kernel 5.15.x breaks root on DHCP+NFS

mortonP wrote:
Hi...

At the end of every year I make the jump from the old LTS kernel to the new one, which this year means from 5.10.x to 5.15.x.

It will probably become an LTS, but as of today that has not been officially announced.
mortonP
Tux's lil' helper
Joined: 22 Dec 2015
Posts: 84

Posted: Sun Jan 02, 2022 6:08 pm    Post subject: Re: kernel 5.15.x breaks root on DHCP+NFS

toralf wrote:
mortonP wrote:
Hi...

At the end of every year I make the jump from the old LTS kernel to the new one, which this year means from 5.10.x to 5.15.x.

It will probably become an LTS, but as of today that has not been officially announced.


According to https://www.kernel.org/category/releases.html it is an LTS.
eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Sat Aug 13, 2022 2:21 am

I was about to make a new post on this but now I think it would have been a dupe...
I was trying to build a fresh PXE boot system. Got the client machine to pull up a kernel just fine, but it fails to find init.
I see the request on the NFS server for rpc.mountd so DHCP and the mount request went out, but basically it sits there after the kernel dhcp client picks up data from the DHCP server...
It sits there for almost 100 seconds before it times out complaining about not finding init.

I suspect I'll have to try something other than a 5.15 kernel to see if my settings are correct or not. Very weird. Also I'll have to redo my initramfs as it does not do nfs at all, so no debug through that (though after dropping into the shell on the initramfs, it clearly can ping on the network, etc., so appears to be an NFS mounting issue at this point.)

---

hmmm. nevermind, might have different problems here after all, just got excited from the delay seen. ip=dhcp should have dumped out the dhcp data which it does for me on kernel output, but it hangs for about 100 seconds just after that, then claims it can't find /sbin/init. Fortunately or unfortunately I see the same results on qemu as I do on physical hardware...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
gjaekel
n00b
Joined: 24 Nov 2022
Posts: 1

Posted: Thu Nov 24, 2022 12:44 pm    Post subject: Another core reason, but the same result

I ran into a comparable problem today while switching from kernel 5.10 to 5.15 on a Cisco blade center. Here, the blade also boots via PXE: the kernel and initrd are pulled via BOOTP, and the IP configuration comes from DHCP.

The boot process fails "inside" the initrd while attempting to switch to the new NFS root. It turned out that the rootpath was empty. Unfortunately, the init script does not detect this and handle it as an error, so strange things happen that end in a kernel panic. By using the kernel command-line parameter
Code:
rdinit=/bin/sh
I was able to interrupt the boot process and have a look at the kernel messages.

The blade is configured to provide two NICs (eth0 and eth1). It turns out that eth1 "now" becomes ready before eth0 and, by accident, got an answer from a foreign DHCP server whose DHCP pool offers no options such as the rootpath.

As written, this happens by accident and is the result of someone else's misconfiguration. But it seems never to happen when booting the 5.10 kernel, and always with the 5.15 kernel. Please note that booting the blade is very inconvenient because the BIOS hardware test takes more than 2 minutes, so "never" and "always" should be read as 3 of 3 times. It therefore seems to be related to some "minor" change: with the newer kernel, eth1 also becomes "ready and up" (or does so before eth0), and the DHCP client picks up the offer on that network.

I was able to solve my issue by using
Code:
ip=:::::eth0:dhcp
instead of the simple form used previously,
Code:
ip=dhcp
on the kernel command line provided by the DHCP server.
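
Editor's sketch: the run of colons is easier to read once the positional fields are named. The kernel's nfsroot documentation gives the layout as ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-ip>:<dns1-ip>:<ntp0-ip>. The helper below is a hypothetical illustration of that layout, not the kernel's own parser.

```python
# Positional fields of the kernel's ip= parameter, in order.
IP_FIELDS = [
    "client_ip", "server_ip", "gw_ip", "netmask", "hostname",
    "device", "autoconf", "dns0_ip", "dns1_ip", "ntp0_ip",
]


def parse_ip_param(value):
    """Split the value of ip= into named fields; empty fields are omitted."""
    if ":" not in value:
        # Short form such as ip=dhcp: the single word is the autoconf method.
        return {"autoconf": value}
    parts = value.split(":")
    return {name: part for name, part in zip(IP_FIELDS, parts) if part}


print(parse_ip_param("dhcp"))
print(parse_ip_param(":::::eth0:dhcp"))
```

The second form pins autoconfiguration to eth0, so an answer arriving on eth1 from a foreign DHCP server is no longer used.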
All times are GMT