Gentoo on Hawk/Raptor

Posted: Sat Nov 05, 2022 5:53 pm
by NeddySeagoon
Team,

I have one of these Hawks. It's what a Raspberry Pi wants to be when it grows up :)
All appears well if I run it on a 5.15.x kernel. 5.16.x to 5.19.x all generate RCU grace period timeouts once they have been up for anywhere between 30 minutes and 26 days.
The Raptor in the title uses the same CPU but has the second memory channel fitted.

I've not seen the RCU grace period timeouts on 6.0.x yet, but it doesn't run very long before the kernel panics and makes a mess on the console. The console is serial over LAN, thanks to the board management computer.

Code:

# [80368.354113] Internal error: Oops: 96000004 [#1] SMP
[80368.366624] Modules linked in: vhost_net vhost vhost_iotlb tap tun i2c_dev crct10dif_ce
[80368.382250] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.0.7-gentoo #1
[80368.396238] Hardware name: MiTAC HAWK EV-883832-X3-0001/HAWK, BIOS 1.2 06/27/2020
[80368.411319] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[80368.425924] pc : aer_irq+0x178/0x230
[80368.437064] lr : aer_irq+0xbc/0x230
[80368.448016] sp : ffff800008003ed0
[80368.458723] x29: ffff800008003ed0 x28: ffff8000096b8000 x27: 0000009fe650a000
[80368.473320] x26: 0000009fe650a000 x25: ffff8000092499f8 x24: ffff80000980b1ee
[80368.487843] x23: ffff0008073db400 x22: 0000000000000100 x21: 0000000000000130
[80368.502301] x20: ffff0008074a0080 x19: ffff000802223000 x18: 0000000000000000
[80368.516689] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000004000
[80368.531015] x14: 0000000000027100 x13: 0000000001c76924 x12: 003d0900f29fa5bd
[80368.545311] x11: 0000000000000000 x10: 0000000100000008 x9 : 0000000001c76924
[80368.559587] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[80368.573797] x5 : 0000000000000000 x4 : 0000000000000001 x3 : 0000000000000000
[80368.587917] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[80368.601914] Call trace:
[80368.611069]  aer_irq+0x178/0x230
[80368.620864]  __handle_irq_event_percpu+0x5c/0x17c
[80368.632011]  handle_irq_event+0x4c/0x180
[80368.642321]  handle_fasteoi_irq+0xbc/0x270
[80368.652628]  generic_handle_domain_irq+0x3c/0x6c
[80368.663309]  gic_handle_irq+0x6c/0xfc
[80368.672969]  call_on_irq_stack+0x2c/0x38
[80368.682729]  do_interrupt_handler+0xa4/0xb0
[80368.692575]  el1_interrupt+0x34/0x64
[80368.701625]  el1h_64_irq_handler+0x18/0x34
[80368.711012]  el1h_64_irq+0x68/0x6c
[80368.719567]  cpuidle_enter_state+0x130/0x36c
[80368.728920]  cpuidle_enter+0x38/0x60
[80368.737459]  cpuidle_idle_call+0x134/0x190
[80368.746506]  do_idle+0xac/0x110
[80368.754570]  cpu_startup_entry+0x28/0x30
[80368.763398]  kernel_init+0x0/0x140
[80368.771660]  arch_post_acpi_subsys_init+0x0/0x18
[80368.781137]  start_kernel+0x498/0x4f8
[80368.789614]  __primary_switched+0xbc/0xc4
[80368.798350] Code: 17ffffbc d2800001 d3410400 b94037e3 (3940b822) 
[80368.809150] ---[ end trace 0000000000000000 ]---
[80368.818386] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[80368.829893] SMP: stopping secondary CPUs
[80368.839470] Kernel Offset: 0x120000 from 0xffff800008000000
[80368.849552] PHYS_OFFSET: 0x80000000
[80368.857590] CPU features: 0x0000,00045021,00001086
[80368.866911] Memory Limit: none
[80368.874399] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
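
As a rough aid to reading the trace above, the instruction the kernel marks in parentheses on the Code: line can be decoded by hand. The Python sketch below is illustrative only and not part of the original report: the helper names are made up, it recognises just the one A64 encoding it needs (LDRB with an unsigned immediate offset), and it reads only an excerpt of the oops. If the decode is right, the faulting instruction is a byte load from x1 + 46 with x1 equal to zero, which would be consistent with a NULL pointer being dereferenced in aer_irq rather than with an RCU stall.

```python
import re

# Excerpt of the oops posted above (only the lines this sketch reads).
OOPS = """\
[80368.425924] pc : aer_irq+0x178/0x230
[80368.587917] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[80368.798350] Code: 17ffffbc d2800001 d3410400 b94037e3 (3940b822)
"""


def faulting_word(oops):
    """Return the instruction word that the Code: line marks with parentheses."""
    match = re.search(r"Code:.*\(([0-9a-f]{8})\)", oops)
    if match is None:
        raise ValueError("no Code: line with a marked instruction")
    return int(match.group(1), 16)


def decode_ldrb_uimm(word):
    """Decode A64 'LDRB Wt, [Xn, #imm]' (unsigned offset).

    Returns (rt, rn, imm), or None for any other encoding.
    """
    if (word >> 22) != 0b0011100101:
        return None
    imm = (word >> 10) & 0xFFF  # byte loads do not scale the offset
    rn = (word >> 5) & 0x1F
    rt = word & 0x1F
    return rt, rn, imm


def register_value(oops, name):
    """Read a 64-bit register such as 'x1' from the register dump."""
    match = re.search(rf"\b{name}\s*: ([0-9a-f]{{16}})", oops)
    if match is None:
        raise ValueError(f"register {name} not found")
    return int(match.group(1), 16)


if __name__ == "__main__":
    decoded = decode_ldrb_uimm(faulting_word(OOPS))
    if decoded is None:
        print("marked instruction is not an LDRB (unsigned offset)")
    else:
        rt, rn, imm = decoded
        base = register_value(OOPS, f"x{rn}")
        print(f"ldrb w{rt}, [x{rn}, #{imm}]  ; x{rn} = {base:#x}, "
              f"address = {base + imm:#x}")
```

For the excerpt shown, the sketch reports a load from address 0x2e. Looking up aer_irq+0x178 in a vmlinux built with debug info would be the more reliable way to find the source line involved.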

I don't think it's hardware. If it were, why does 5.15.x appear to work?
Also, the Mudan, which is like a cut-down Hawk, has problems with 5.16.x kernels. My Mudan was retired in favour of the Hawk about that time.
They have a lot in common though. The Mudan has an X-Gene 1 CPU and the Hawk has an X-Gene 3 CPU, which is like four X-Gene 1s in the same package.

It seems to be independent of load and CPU temperature.

Thoughts, ideas, questions and hints at how and what to bisect would be appreciated.

Posted: Sat Nov 05, 2022 7:24 pm
by pingtoo
Neddy,

I couldn't tell from your posted dmesg output whether this is an RCU problem. However, when I searched the kernel documentation tree I found "Using RCU’s CPU Stall Detector", which may offer some help.

Posted: Sat Nov 05, 2022 7:26 pm
by NeddySeagoon
pingtoo,

It's not the RCU stall, at least I don't think it is.

Code:

Kernel panic - not syncing: Oops: Fatal exception in interrupt

It's from the 6.0.7 kernel.

Thank you for that link.

RCU stalls look like ... http://0x0.st/oEK4.txt