Gentoo Forums
[SOLVED] kernel 4.12: cpu stall with dm-raid
ecko
Tux's lil' helper

Joined: 04 Jul 2010
Posts: 102

Posted: Sun Aug 06, 2017 2:56 pm    Post subject: [SOLVED] kernel 4.12: cpu stall with dm-raid

Hello. Since I upgraded from 4.11 to 4.12, I get CPU stalls at random moments (the system is a desktop used for office work and is mostly idle). During a stall, I/O is frozen (including the SATA disks and the USB mouse, although the PS/2 keyboard is fine); programs already in memory stay responsive as long as they don't need I/O. The Unix utility "top" reports md_raid occupying 100% of a core (/home is a RAID1 array provided by the Linux kernel's software RAID), while iotop reports no particular I/O activity.

What can I do?

dmesg below (running gentoo-sources-4.12.4)

Code:

[  249.148386] INFO: rcu_sched self-detected stall on CPU
[  249.148390]  0-...: (2099 ticks this GP) idle=27e/140000000000001/0 softirq=4532/4533 fqs=1049
[  249.148390]   (t=2100 jiffies g=2467 c=2466 q=27)
[  249.148392] NMI backtrace for cpu 0
[  249.148393] CPU: 0 PID: 3162 Comm: md0_raid1 Not tainted 4.12.4-gentoo #1
[  249.148394] Hardware name: System manufacturer System Product Name/P8P67 PRO, BIOS 1253 01/20/2011
[  249.148394] Call Trace:
[  249.148396]  <IRQ>
[  249.148399]  dump_stack+0x4d/0x67
[  249.148401]  nmi_cpu_backtrace+0x95/0xa0
[  249.148411]  ? irq_force_complete_move+0xe0/0xe0
[  249.148412]  nmi_trigger_cpumask_backtrace+0x91/0xc0
[  249.148413]  arch_trigger_cpumask_backtrace+0x14/0x20
[  249.148415]  rcu_dump_cpu_stacks+0x93/0xce
[  249.148417]  rcu_check_callbacks+0x767/0x8b0
[  249.148419]  ? tick_sched_handle.isra.7+0x30/0x30
[  249.148420]  update_process_times+0x2a/0x50
[  249.148421]  tick_sched_handle.isra.7+0x29/0x30
[  249.148422]  tick_sched_timer+0x3d/0x70
[  249.148423]  __hrtimer_run_queues+0xda/0x210
[  249.148424]  hrtimer_interrupt+0xac/0x1f0
[  249.148426]  local_apic_timer_interrupt+0x33/0x50
[  249.148427]  smp_apic_timer_interrupt+0x33/0x50
[  249.148429]  apic_timer_interrupt+0x86/0x90
[  249.148430] RIP: 0010:mutex_lock+0x10/0x30
[  249.148430] RSP: 0018:ffffc900004efd58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  249.148431] RAX: 0000000000000000 RBX: ffff88039da3c000 RCX: ffff88039d482400
[  249.148432] RDX: ffff88039d992cc0 RSI: 0000000000000000 RDI: ffff88039da3c368
[  249.148432] RBP: ffffc900004efd98 R08: 0000000000000000 R09: 0000000000000000
[  249.148433] R10: ffffc900004efeb0 R11: 0000000000000000 R12: ffff88039da3c000
[  249.148433] R13: ffff88039d482428 R14: ffff88039d992cc0 R15: 0000000000000000
[  249.148434]  </IRQ>
[  249.148436]  ? bitmap_daemon_work+0x27/0x340
[  249.148438]  md_check_recovery+0x22/0x460
[  249.148440]  raid1d+0x4c/0x900 [raid1]
[  249.148442]  md_thread+0x115/0x140
[  249.148442]  ? md_thread+0x115/0x140
[  249.148444]  ? wake_atomic_t_function+0x60/0x60
[  249.148445]  kthread+0x104/0x140
[  249.148446]  ? md_register_thread+0xe0/0xe0
[  249.148447]  ? kthread_create_on_node+0x40/0x40
[  249.148448]  ret_from_fork+0x22/0x30
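
For anyone comparing several of these reports, here is a minimal Python sketch that pulls the stalled task, the kernel release, and the stall length out of a trace like the one above. The function name summarize_stall and the two regular expressions are my own assumptions based only on the line layout seen in this thread; they are not part of any kernel tool.

```python
import re

# Assumed layout, taken from the traces in this thread:
#   "CPU: 0 PID: 3162 Comm: md0_raid1 Not tainted 4.12.4-gentoo #1"
#   "(t=2100 jiffies g=2467 c=2466 q=27)"
CPU_LINE = re.compile(
    r"CPU: (\d+) PID: (\d+) Comm: (\S+) .*?(\d+\.\d+\.\d+\S*) #\d+"
)
STALL_LINE = re.compile(r"\(t=(\d+) jiffies")


def summarize_stall(report):
    """Collect CPU, PID, task name, kernel release and stall length (jiffies)."""
    summary = {}
    for line in report.splitlines():
        cpu_match = CPU_LINE.search(line)
        if cpu_match:
            summary["cpu"] = int(cpu_match.group(1))
            summary["pid"] = int(cpu_match.group(2))
            summary["task"] = cpu_match.group(3)
            summary["release"] = cpu_match.group(4)
        stall_match = STALL_LINE.search(line)
        if stall_match:
            summary["jiffies"] = int(stall_match.group(1))
    return summary
```

Feeding it the output of dmesg should give something like task "md0_raid1" and release "4.12.4-gentoo" for the trace above. Converting the jiffies value to seconds needs the kernel's CONFIG_HZ setting, which is not stated in this thread, so the sketch leaves it as a raw count.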


Last edited by ecko on Wed Sep 13, 2017 8:34 am; edited 1 time in total

LIsLinuxIsSogood
Veteran

Joined: 13 Feb 2016
Posts: 1179

Posted: Mon Aug 07, 2017 12:42 am

If I were you (and I'm not)... have you tried booting into single-user mode without the /home partition mounted? If you can get into the operating system without relying on the second disk (the mirror), you may be able to determine whether the problem is related to the RAID features newly added to the kernel, which are described here (https://fossbytes.com/linux-kernel-4-12-download-features/).

It is a shot in the dark, but since all RAID features rely on two or more disks, perhaps there is a related bug. If the problem goes away after detaching the mirror, you might then be able to add it back afterwards (problem-free).

Any luck?

ecko
Tux's lil' helper

Joined: 04 Jul 2010
Posts: 102

Posted: Tue Aug 08, 2017 12:23 pm

LIsLinuxIsSogood wrote:
have you tried booting into single user mode without the /home partition mounted?


Thanks for the suggestion. I rebooted with /home unmounted (added the noauto option in fstab) and left the machine at the X login screen for 10 hours overnight; no problem occurred. When I mounted /home and logged into the system, the problem appeared after 3 hours.

(To make sure, I will repeat the test during the night, when the machine is totally idle.) The test was done with 4.12.5 (released 2 days ago with 3 commits related to RAID).

I just noticed in the logs that the stall is often (but not always) followed, exactly 30 seconds later, by complaints about the clocksource.

Code:

clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource:                       'hpet' wd_now: c3882729 wd_last: 3045c21d mask: ffffffff
clocksource:                       'tsc' cs_now: 9788d643a734 cs_last: 96ffcb8fa416 mask: ffffffffffffffff
sched_clock: Marking unstable (48823520000829, 1115764495)<-(48824723369567, -87604243)
tsc: Marking TSC unstable due to clocksource watchdog
clocksource: Switched to clocksource hpet
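
To confirm which clocksource the kernel ended up on after a message like this, the sysfs files under /sys/devices/system/clocksource/clocksource0 can be read directly. A minimal Python sketch follows; the function name read_clocksources and its base parameter are assumptions made for illustration, only the sysfs file names come from the kernel.

```python
from pathlib import Path

# Standard sysfs directory for clocksource selection on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available
```

Calling read_clocksources() with no argument on the affected machine would be expected to report hpet as current after the log lines above, but that is an inference from the messages, not something I have verified.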

radio_flyer
Guru

Joined: 04 Nov 2004
Posts: 317
Location: Northern California

Posted: Wed Aug 16, 2017 3:54 pm

You're not running KDE are you? If so, Baloo will hang I/O hard for that long.

ecko
Tux's lil' helper

Joined: 04 Jul 2010
Posts: 102

Posted: Wed Aug 16, 2017 4:41 pm

radio_flyer wrote:
You're not running KDE are you? If so, Baloo will hang I/O hard for that long.


I use a simple fluxbox setup and baloo is not installed. I use app-misc/recoll as the indexer; it updates from a cron job at a known time of day that does not correlate with the observed problem. Also, iotop does not report I/O activity during the problem, so I suspect an I/O lockup due to a bug in the Linux RAID code. I am now in the process of bisecting the kernel. The problem sometimes shows up only after 1 day of uptime, so I will need one more week to go through the remaining 10 bisection steps.
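
As a rough back-of-the-envelope check on that estimate: a bisection halves the candidate range at every step, so the step count grows with the base-2 logarithm of the number of commits. The Python sketch below illustrates this under a simplifying assumption of linear history; git bisect's own "roughly N steps" figure can differ because of merges. The function names are hypothetical.

```python
from datetime import timedelta


def bisect_steps(candidates):
    """Worst-case number of test builds to isolate one bad commit.

    Equals ceil(log2(candidates)), computed with integers only.
    """
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    return (candidates - 1).bit_length()


def worst_case_duration(steps, soak_time):
    """Upper bound on elapsed time if every step needs a full soak test."""
    return steps * soak_time
```

With 10 steps left and up to a day of uptime before a kernel can be called good, worst_case_duration(10, timedelta(days=1)) gives 10 days. Steps that land on a bad kernel finish sooner, because the stall shows itself, which is consistent with an estimate of about a week.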

snIP3r
l33t

Joined: 21 May 2004
Posts: 853
Location: germany

Posted: Tue Aug 22, 2017 3:06 pm

Hi all!

I have a similar issue:

Code:

Aug 21 18:45:28 area52 kernel: INFO: rcu_sched self-detected stall on CPU
Aug 21 18:45:28 area52 kernel: \x090-...: (2099 ticks this GP) idle=53a/140000000000001/0 softirq=871134/871134 fqs=1049
Aug 21 18:45:28 area52 kernel: \x09 (t=2100 jiffies g=777122 c=777121 q=140)
Aug 21 18:45:28 area52 kernel: NMI backtrace for cpu 0
Aug 21 18:45:28 area52 kernel: CPU: 0 PID: 2480 Comm: md127_raid1 Not tainted 4.12.5-gentoo #1
Aug 21 18:45:28 area52 kernel: Hardware name: ASUSTeK COMPUTER INC. P9D-X Series/P9D-X Series, BIOS 0704 03/28/2014
Aug 21 18:45:28 area52 kernel: Call Trace:
Aug 21 18:45:28 area52 kernel:  <IRQ>
Aug 21 18:45:28 area52 kernel:  dump_stack+0x4d/0x6a
Aug 21 18:45:28 area52 kernel:  nmi_cpu_backtrace+0x9b/0xa0
Aug 21 18:45:28 area52 kernel:  ? irq_force_complete_move+0xf0/0xf0
Aug 21 18:45:28 area52 kernel:  nmi_trigger_cpumask_backtrace+0x8f/0xc0
Aug 21 18:45:28 area52 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20
Aug 21 18:45:28 area52 kernel:  rcu_dump_cpu_stacks+0x8f/0xca
Aug 21 18:45:28 area52 kernel:  rcu_check_callbacks+0x701/0x850
Aug 21 18:45:28 area52 kernel:  ? tick_sched_handle.isra.17+0x30/0x30
Aug 21 18:45:28 area52 kernel:  update_process_times+0x2a/0x50
Aug 21 18:45:28 area52 kernel:  tick_sched_handle.isra.17+0x2d/0x30
Aug 21 18:45:28 area52 kernel:  tick_sched_timer+0x38/0x70
Aug 21 18:45:28 area52 kernel:  __hrtimer_run_queues+0xde/0x210
Aug 21 18:45:28 area52 kernel:  hrtimer_interrupt+0xa3/0x190
Aug 21 18:45:28 area52 kernel:  local_apic_timer_interrupt+0x33/0x60
Aug 21 18:45:28 area52 kernel:  smp_apic_timer_interrupt+0x33/0x50
Aug 21 18:45:28 area52 kernel:  apic_timer_interrupt+0x86/0x90
Aug 21 18:45:28 area52 kernel: RIP: 0010:md_check_recovery+0x5b/0x460
Aug 21 18:45:28 area52 kernel: RSP: 0018:ffffc9000229bda8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Aug 21 18:45:28 area52 kernel: RAX: 0000000000000000 RBX: ffff880220eb8800 RCX: ffff88022638b500
Aug 21 18:45:28 area52 kernel: RDX: ffffc9000229be40 RSI: 0000000000000000 RDI: ffff880220eb8800
Aug 21 18:45:28 area52 kernel: RBP: ffffc9000229bdc0 R08: 0000000000000000 R09: 0000000000001b5d
Aug 21 18:45:28 area52 kernel: R10: ffffc9000229beb0 R11: 0000000000000000 R12: ffff880220eb8800
Aug 21 18:45:28 area52 kernel: R13: ffff88022638b528 R14: ffff880220e95480 R15: 0000000000000000
Aug 21 18:45:28 area52 kernel:  </IRQ>
Aug 21 18:45:28 area52 kernel:  raid1d+0x4c/0x7f0
Aug 21 18:45:28 area52 kernel:  md_thread+0x10d/0x140
Aug 21 18:45:28 area52 kernel:  ? md_thread+0x10d/0x140
Aug 21 18:45:28 area52 kernel:  ? wake_up_bit+0x30/0x30
Aug 21 18:45:28 area52 kernel:  kthread+0x104/0x140
Aug 21 18:45:28 area52 kernel:  ? md_register_thread+0xe0/0xe0
Aug 21 18:45:28 area52 kernel:  ? kthread_create_on_node+0x40/0x40
Aug 21 18:45:28 area52 kernel:  ret_from_fork+0x22/0x30


or

Code:

Aug 21 19:49:11 area52 kernel: INFO: rcu_sched self-detected stall on CPU
Aug 21 19:49:11 area52 kernel: \x090-...: (2099 ticks this GP) idle=642/140000000000001/0 softirq=1024655/1024655 fqs=1049
Aug 21 19:49:11 area52 kernel: \x09 (t=2100 jiffies g=912276 c=912275 q=121)
Aug 21 19:49:11 area52 kernel: NMI backtrace for cpu 0
Aug 21 19:49:11 area52 kernel: CPU: 0 PID: 2489 Comm: md124_raid1 Not tainted 4.12.5-gentoo #1
Aug 21 19:49:11 area52 kernel: Hardware name: ASUSTeK COMPUTER INC. P9D-X Series/P9D-X Series, BIOS 0704 03/28/2014
Aug 21 19:49:11 area52 kernel: Call Trace:
Aug 21 19:49:11 area52 kernel:  <IRQ>
Aug 21 19:49:11 area52 kernel:  dump_stack+0x4d/0x6a
Aug 21 19:49:11 area52 kernel:  nmi_cpu_backtrace+0x9b/0xa0
Aug 21 19:49:11 area52 kernel:  ? irq_force_complete_move+0xf0/0xf0
Aug 21 19:49:11 area52 kernel:  nmi_trigger_cpumask_backtrace+0x8f/0xc0
Aug 21 19:49:11 area52 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20
Aug 21 19:49:11 area52 kernel:  rcu_dump_cpu_stacks+0x8f/0xca
Aug 21 19:49:11 area52 kernel:  rcu_check_callbacks+0x701/0x850
Aug 21 19:49:11 area52 kernel:  ? tick_sched_handle.isra.17+0x30/0x30
Aug 21 19:49:11 area52 kernel:  update_process_times+0x2a/0x50
Aug 21 19:49:11 area52 kernel:  tick_sched_handle.isra.17+0x2d/0x30
Aug 21 19:49:11 area52 kernel:  tick_sched_timer+0x38/0x70
Aug 21 19:49:11 area52 kernel:  __hrtimer_run_queues+0xde/0x210
Aug 21 19:49:11 area52 kernel:  hrtimer_interrupt+0xa3/0x190
Aug 21 19:49:11 area52 kernel:  local_apic_timer_interrupt+0x33/0x60
Aug 21 19:49:11 area52 kernel:  smp_apic_timer_interrupt+0x33/0x50
Aug 21 19:49:11 area52 kernel:  apic_timer_interrupt+0x86/0x90
Aug 21 19:49:11 area52 kernel: RIP: 0010:_raw_spin_lock_irqsave+0x6/0x30
Aug 21 19:49:11 area52 kernel: RSP: 0018:ffffc900022e3db0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Aug 21 19:49:11 area52 kernel: RAX: 0000000000000000 RBX: ffff88022615c414 RCX: 0000000000000000
Aug 21 19:49:11 area52 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88022615c414
Aug 21 19:49:11 area52 kernel: RBP: ffffc900022e3dc0 R08: 0000000000000000 R09: 0000000000000d9b
Aug 21 19:49:11 area52 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff880220f26800
Aug 21 19:49:11 area52 kernel: R13: ffff88022615c428 R14: ffff8802263ca780 R15: 0000000000000000
Aug 21 19:49:11 area52 kernel:  </IRQ>
Aug 21 19:49:11 area52 kernel:  raid1d+0xa1/0x7f0
Aug 21 19:49:11 area52 kernel:  md_thread+0x10d/0x140
Aug 21 19:49:11 area52 kernel:  ? md_thread+0x10d/0x140
Aug 21 19:49:11 area52 kernel:  ? wake_up_bit+0x30/0x30
Aug 21 19:49:11 area52 kernel:  kthread+0x104/0x140
Aug 21 19:49:11 area52 kernel:  ? md_register_thread+0xe0/0xe0
Aug 21 19:49:11 area52 kernel:  ? kthread_create_on_node+0x40/0x40
Aug 21 19:49:11 area52 kernel:  ret_from_fork+0x22/0x30


or this

Code:

Aug 21 20:58:30 area52 kernel: INFO: rcu_sched self-detected stall on CPU
Aug 21 20:58:30 area52 kernel: \x090-...: (2099 ticks this GP) idle=f1a/140000000000001/0 softirq=1200414/1200414 fqs=1049
Aug 21 20:58:30 area52 kernel: \x09 (t=2100 jiffies g=1041137 c=1041136 q=198)
Aug 21 20:58:30 area52 kernel: NMI backtrace for cpu 0
Aug 21 20:58:30 area52 kernel: CPU: 0 PID: 2489 Comm: md124_raid1 Tainted: G        W       4.12.5-gentoo #1
Aug 21 20:58:30 area52 kernel: Hardware name: ASUSTeK COMPUTER INC. P9D-X Series/P9D-X Series, BIOS 0704 03/28/2014
Aug 21 20:58:30 area52 kernel: Call Trace:
Aug 21 20:58:30 area52 kernel:  <IRQ>
Aug 21 20:58:30 area52 kernel:  dump_stack+0x4d/0x6a
Aug 21 20:58:30 area52 kernel:  nmi_cpu_backtrace+0x9b/0xa0
Aug 21 20:58:40 area52 kernel:  ? irq_force_complete_move+0xf0/0xf0
Aug 21 20:58:40 area52 kernel:  nmi_trigger_cpumask_backtrace+0x8f/0xc0
Aug 21 20:58:40 area52 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20
Aug 21 20:58:40 area52 kernel:  rcu_dump_cpu_stacks+0x8f/0xca
Aug 21 20:58:40 area52 kernel:  rcu_check_callbacks+0x701/0x850
Aug 21 20:58:40 area52 kernel:  ? tick_sched_handle.isra.17+0x30/0x30
Aug 21 20:58:40 area52 kernel:  update_process_times+0x2a/0x50
Aug 21 20:58:40 area52 kernel:  tick_sched_handle.isra.17+0x2d/0x30
Aug 21 20:58:40 area52 kernel:  tick_sched_timer+0x38/0x70
Aug 21 20:58:40 area52 kernel:  __hrtimer_run_queues+0xde/0x210
Aug 21 20:58:40 area52 kernel:  hrtimer_interrupt+0xa3/0x190
Aug 21 20:58:40 area52 kernel:  local_apic_timer_interrupt+0x33/0x60
Aug 21 20:58:40 area52 kernel:  smp_apic_timer_interrupt+0x33/0x50
Aug 21 20:58:40 area52 kernel:  apic_timer_interrupt+0x86/0x90
Aug 21 20:58:40 area52 kernel: RIP: 0010:raid1d+0x47/0x7f0
Aug 21 20:58:40 area52 kernel: RSP: 0018:ffffc900022e3dd0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
Aug 21 20:58:40 area52 kernel: RAX: ffff88022615c418 RBX: ffff88022615c400 RCX: ffff88022615c400
Aug 21 20:58:40 area52 kernel: RDX: ffffc900022e3e40 RSI: 0000000000000000 RDI: ffff880220f26800
Aug 21 20:58:40 area52 kernel: RBP: ffffc900022e3e90 R08: 0000000000000000 R09: 00000000000010e3
Aug 21 20:58:40 area52 kernel: R10: ffffc900022e3eb0 R11: 0000000000000000 R12: ffff880220f26800
Aug 21 20:58:40 area52 kernel: R13: ffff88022615c428 R14: ffff8802263ca780 R15: 0000000000000000
Aug 21 20:58:40 area52 kernel:  </IRQ>
Aug 21 20:58:40 area52 kernel:  md_thread+0x10d/0x140
Aug 21 20:58:40 area52 kernel:  ? md_thread+0x10d/0x140
Aug 21 20:58:40 area52 kernel:  ? wake_up_bit+0x30/0x30
Aug 21 20:58:40 area52 kernel:  kthread+0x104/0x140
Aug 21 20:58:40 area52 kernel:  ? md_register_thread+0xe0/0xe0
Aug 21 20:58:40 area52 kernel:  ? kthread_create_on_node+0x40/0x40
Aug 21 20:58:40 area52 kernel:  ret_from_fork+0x22/0x30


As far as I have analyzed it, it is related to my RAID config: it happens when my RAID disks (two SATA drives) spin up after having been idle. My previously used kernel 4.4.6 showed no such errors, so I will also check the newly introduced features...
Perhaps someone has an idea about the issue?

greets
snIP3r
_________________
Intel i3-4130T on ASUS P9D-X
Kernel 5.15.88-gentoo SMP
-----------------------------------------------
if your problem is fixed please add something like [solved] to the topic!

snIP3r
l33t

Joined: 21 May 2004
Posts: 853
Location: germany

Posted: Tue Aug 22, 2017 3:49 pm

looks like this is about our issue:

https://lkml.org/lkml/2017/8/6/197

araxon
Tux's lil' helper

Joined: 25 May 2011
Posts: 83

Posted: Tue Sep 05, 2017 5:22 pm

Same here. Under high disk load, the server throws a similar message and then stops all disk I/O. It cannot even write an error log, so it took me days to track it down. I managed to log the errors remotely once I noticed that networking survives a bit longer. There is no RAID5/6 on the server, only RAID1, but the error seems to be md_raid related.
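
For reference, which RAID levels are actually in use can be read from /proc/mdstat. Below is a minimal Python sketch of that check. The function name parse_mdstat and the regular expression are my own assumptions; the sketch only handles plain lines of the form "md3 : active raid1 sdb3[1] sda3[0]" and skips inactive or auto-read-only arrays.

```python
import re

# Matches array lines such as "md3 : active raid1 sdb3[1] sda3[0]".
# Lines with extra state words, e.g. "active (auto-read-only)", are skipped.
ARRAY_LINE = re.compile(r"^(md\S+) : (\S+) (raid\d+|linear|multipath) (.+)$")


def parse_mdstat(text):
    """Map each array name to its state, RAID level and member devices."""
    arrays = {}
    for line in text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            name, state, level, members = match.groups()
            arrays[name] = {
                "state": state,
                "level": level,
                # "sdb3[1]" or "sdb3[1](F)" -> "sdb3"
                "members": [m.split("[")[0] for m in members.split()],
            }
    return arrays
```

On a live system the input would come from open("/proc/mdstat").read(), and the set of "level" values then shows at a glance whether anything other than raid1 is configured.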

I am able to reproduce the crash pretty regularly on this hardware, so if you have anything non-destructive that can be tried, I may be able to test it.

Code:
Sep  5 18:55:26 10.0.0.149 kernel: INFO: rcu_sched self-detected stall on CPU
Sep  5 18:55:26 10.0.0.149 kernel: \x090-...: (2099 ticks this GP) idle=c7e/140000000000001/0 softirq=1075486/1075486 fqs=1049
Sep  5 18:55:26 10.0.0.149 kernel: \x09 (t=2100 jiffies g=588759 c=588758 q=3532)
Sep  5 18:55:26 10.0.0.149 kernel: NMI backtrace for cpu 0
Sep  5 18:55:26 10.0.0.149 kernel: CPU: 0 PID: 124 Comm: md3_raid1 Tainted: G        W       4.12.5-gentoo #1
Sep  5 18:55:26 10.0.0.149 kernel: Hardware name: HPE ML10Gen9/ML10Gen9, BIOS 1.003 07/27/2016
Sep  5 18:55:26 10.0.0.149 kernel: Call Trace:
Sep  5 18:55:26 10.0.0.149 kernel:  <IRQ>
Sep  5 18:55:26 10.0.0.149 kernel:  dump_stack+0x4d/0x6a
Sep  5 18:55:26 10.0.0.149 kernel:  nmi_cpu_backtrace+0x95/0xa0
Sep  5 18:55:26 10.0.0.149 kernel:  ? irq_force_complete_move+0xf0/0xf0
Sep  5 18:55:26 10.0.0.149 kernel:  nmi_trigger_cpumask_backtrace+0x88/0xd0
Sep  5 18:55:26 10.0.0.149 kernel:  arch_trigger_cpumask_backtrace+0x14/0x20
Sep  5 18:55:26 10.0.0.149 kernel:  rcu_dump_cpu_stacks+0x93/0xce
Sep  5 18:55:26 10.0.0.149 kernel:  rcu_check_callbacks+0x767/0x8b0
Sep  5 18:55:26 10.0.0.149 kernel:  ? acct_account_cputime+0x17/0x20
Sep  5 18:55:26 10.0.0.149 kernel:  ? tick_sched_do_timer+0x40/0x40
Sep  5 18:55:26 10.0.0.149 kernel:  update_process_times+0x2a/0x50
Sep  5 18:55:26 10.0.0.149 kernel:  tick_sched_handle.isra.15+0x2d/0x40
Sep  5 18:55:26 10.0.0.149 kernel:  tick_sched_timer+0x38/0x70
Sep  5 18:55:26 10.0.0.149 kernel:  __hrtimer_run_queues+0xda/0x210
Sep  5 18:55:26 10.0.0.149 kernel:  hrtimer_interrupt+0xac/0x1f0
Sep  5 18:55:26 10.0.0.149 kernel:  local_apic_timer_interrupt+0x33/0x50
Sep  5 18:55:26 10.0.0.149 kernel:  smp_apic_timer_interrupt+0x33/0x50
Sep  5 18:55:26 10.0.0.149 kernel:  apic_timer_interrupt+0x86/0x90
Sep  5 18:55:26 10.0.0.149 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Sep  5 18:55:26 10.0.0.149 kernel: RSP: 0018:ffffc9000065fda0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Sep  5 18:55:26 10.0.0.149 kernel: RAX: 0000000000000000 RBX: ffff8802693ee800 RCX: 0000000000000001
Sep  5 18:55:26 10.0.0.149 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8802693eea80
Sep  5 18:55:26 10.0.0.149 kernel: RBP: ffffc9000065fdc0 R08: 0000000000000000 R09: 0000000000000000
Sep  5 18:55:26 10.0.0.149 kernel: R10: ffffc9000065feb0 R11: 0000000000000000 R12: 0000000000000000
Sep  5 18:55:26 10.0.0.149 kernel: R13: ffff8802681c0328 R14: ffff880268fea880 R15: 0000000000000000
Sep  5 18:55:26 10.0.0.149 kernel:  </IRQ>
Sep  5 18:55:26 10.0.0.149 kernel:  ? md_check_recovery+0x2b7/0x460
Sep  5 18:55:26 10.0.0.149 kernel:  raid1d+0x4c/0x8e0
Sep  5 18:55:26 10.0.0.149 kernel:  md_thread+0x115/0x140
Sep  5 18:55:26 10.0.0.149 kernel:  ? md_thread+0x115/0x140
Sep  5 18:55:26 10.0.0.149 kernel:  ? wake_atomic_t_function+0x60/0x60
Sep  5 18:55:26 10.0.0.149 kernel:  kthread+0x103/0x140
Sep  5 18:55:26 10.0.0.149 kernel:  ? find_pers+0x70/0x70
Sep  5 18:55:26 10.0.0.149 kernel:  ? kthread_create_on_node+0x40/0x40
Sep  5 18:55:26 10.0.0.149 kernel:  ret_from_fork+0x22/0x30

snIP3r
l33t

Joined: 21 May 2004
Posts: 853
Location: germany

Posted: Tue Sep 05, 2017 5:55 pm

Yes, it's md related. I switched back to my previously used kernel and see no such errors, so I am waiting for the next stable kernel...

ecko
Tux's lil' helper

Joined: 04 Jul 2010
Posts: 102

Posted: Thu Sep 07, 2017 10:13 am

My bisection led to this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d5e72dfdf0fa29a21143fd72746c6f43295ce9f "This update includes the usual round of major driver updates".

I did some limited testing with 4.13-rc7 and for now the problem did not show up. I'll test for longer with 4.13 before declaring it solved.

ecko
Tux's lil' helper

Joined: 04 Jul 2010
Posts: 102

Posted: Wed Sep 13, 2017 8:36 am

After several days of tests, the problem does not happen with kernel 4.13.

araxon
Tux's lil' helper

Joined: 25 May 2011
Posts: 83

Posted: Fri Sep 15, 2017 7:10 pm

ecko wrote:
After several days of tests, the problem does not happen with kernel 4.13.

I have been trying to trigger the error all day on kernel 4.12.12, and so far it seems fixed there as well.

masc
n00b

Joined: 29 Dec 2008
Posts: 29

Posted: Mon Sep 18, 2017 7:57 am

araxon wrote:
ecko wrote:
After several days of tests, the problem does not happen with kernel 4.13.

I have been trying to trigger the error all day on kernel 4.12.12, and so far it seems fixed there as well.


it seems to be fixed in `4.12.11` as well.

peppev
n00b

Joined: 10 Aug 2009
Posts: 26
Location: Italy

Posted: Mon Sep 18, 2017 8:03 am

masc wrote:
araxon wrote:
ecko wrote:
After several days of tests, the problem does not happen with kernel 4.13.

I have been trying to trigger the error all day on kernel 4.12.12, and so far it seems fixed there as well.


it seems to be fixed in `4.12.11` as well.


One of my systems is under the stable 4.12.15 and this morning dmesg reports:

Code:

[154509.424066] INFO: rcu_sched self-detected stall on CPU
[154509.424075]         0-...: (59999 ticks this GP) idle=32a/140000000000001/0 softirq=7363992/7363992 fqs=14633
[154509.424076]          (t=60000 jiffies g=3560363 c=3560362 q=845)
[154509.424081] NMI backtrace for cpu 0
[154509.424087] CPU: 0 PID: 3095 Comm: md3_raid1 Tainted: P           O    4.12.5-gentoo #1
[154509.424088] Hardware name:                  /D925XECV2                      , BIOS CV92510A.86A.0504.2006.1128.1903 11/28/2006
[154509.424090] Call Trace:
[154509.424093]  <IRQ>
[154509.424101]  dump_stack+0x4d/0x63
[154509.424105]  nmi_cpu_backtrace+0x76/0x85
[154509.424109]  ? irq_force_complete_move+0xd5/0xd5
[154509.424112]  nmi_trigger_cpumask_backtrace+0x51/0xb2
[154509.424115]  arch_trigger_cpumask_backtrace+0x14/0x16
[154509.424119]  rcu_dump_cpu_stacks+0x89/0xb6
[154509.424123]  rcu_check_callbacks+0x232/0x5eb
[154509.424127]  ? raise_softirq_irqoff+0x9/0x1e
[154509.424130]  update_process_times+0x2a/0x4f
[154509.424134]  tick_sched_handle+0x2f/0x3b
[154509.424136]  tick_sched_timer+0x34/0x5a
[154509.424139]  __hrtimer_run_queues+0xba/0x182
[154509.424142]  hrtimer_interrupt+0x67/0x105
[154509.424145]  local_apic_timer_interrupt+0x46/0x49
[154509.424148]  smp_apic_timer_interrupt+0x24/0x34
[154509.424152]  apic_timer_interrupt+0x86/0x90
[154509.424157] RIP: 0010:do_raw_spin_lock+0xd/0x1c
[154509.424159] RSP: 0018:ffffc90000d03da0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[154509.424162] RAX: 0000000000000000 RBX: ffff880092059800 RCX: 0000000000000000
[154509.424164] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880092059a80
[154509.424166] RBP: ffffc90000d03da8 R08: ffff880037810000 R09: ffff880037810000
[154509.424168] R10: ffff8800920599e8 R11: 0000000000000372 R12: ffff88009167e300
[154509.424170] R13: ffff8800947f93d0 R14: ffff880037810000 R15: 0000000000000000
[154509.424172]  </IRQ>
[154509.424176]  ? _raw_spin_lock+0x9/0xb
[154509.424180]  md_check_recovery+0x21c/0x3cc
[154509.424184]  raid1d+0x3b/0x6e5
[154509.424187]  md_thread+0x110/0x14a
[154509.424190]  ? md_thread+0x110/0x14a
[154509.424193]  ? wake_up_atomic_t+0x27/0x27
[154509.424195]  ? md_do_sync+0xca0/0xca0
[154509.424199]  kthread+0xf7/0xfc
[154509.424202]  ? init_completion+0x23/0x23
[154509.424204]  ret_from_fork+0x22/0x30


This never happened with the previous stable gentoo-sources kernels (the last one was 4.9.34).

Does it look related to the problem discussed in this thread?

masc
n00b

Joined: 29 Dec 2008
Posts: 29

Posted: Mon Sep 18, 2017 8:10 am

peppev wrote:

Does it look related to the problem discussed in this thread?


it certainly looks like it.

peppev
n00b

Joined: 10 Aug 2009
Posts: 26
Location: Italy

Posted: Mon Sep 18, 2017 9:06 am

masc wrote:
peppev wrote:

Does it look related to the problem discussed in this thread?


it certainly looks like it.


Well, I guess we have a problem with 4.12.5?

In less than a week from when I emerged it, I found 3 blocking bugs.

1) the tape changer sg device driver doesn't work, needs a patch;
2) it has a very nasty bug in the netfilter conntrack code which leads to random panics;
3) it seems, from this thread, to have a problem with mdadm arrays.

Mmm ... stable?

araxon
Tux's lil' helper

Joined: 25 May 2011
Posts: 83

Posted: Wed Sep 20, 2017 12:14 pm

peppev wrote:

One of my systems is under the stable 4.12.15 and this morning dmesg reports:

Code:

...
[154509.424087] CPU: 0 PID: 3095 Comm: md3_raid1 Tainted: P           O    4.12.5-gentoo #1
...


Seems like 4.12.5, not 4.12.15. In other words, it is the exact same bug we are discussing here. Try upgrading to a later kernel, as suggested in this thread.
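
Since 4.12.5 and 4.12.15 are easy to mix up by eye, here is a small Python sketch that compares release strings numerically. The bounds are an assumption taken only from reports in this thread (4.11 unaffected, 4.12.11 the earliest 4.12.y release a poster found clean), not from a kernel changelog, and the function names are hypothetical.

```python
import re

# Assumptions from posts in this thread, not from a changelog:
# the stall first appeared with 4.12, and 4.12.11 is the earliest
# 4.12.y release that a poster reported as apparently fixed.
FIRST_REPORTED_BAD = (4, 12, 0)
FIRST_REPORTED_GOOD = (4, 12, 11)


def release_tuple(release):
    """Turn '4.12.5-gentoo' into (4, 12, 5); a missing patch level counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_reported_affected_range(release):
    """True if the release falls in the range this thread reports as affected."""
    return FIRST_REPORTED_BAD <= release_tuple(release) < FIRST_REPORTED_GOOD
```

For the running kernel, the release string can be obtained with platform.release() from the standard library. The string in the trace above, "4.12.5-gentoo", falls inside the range, while "4.12.15" would not.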

araxon
Tux's lil' helper

Joined: 25 May 2011
Posts: 83

Posted: Wed Sep 20, 2017 12:21 pm

peppev wrote:
Well, I guess we have a problem with 4.12.5?

In less than a week from when I emerged it, I found 3 blocking bugs.

1) the tape changer sg device driver doesn't work, needs a patch;
2) it has a very nasty bug in the netfilter conntrack code which leads to random panics;
3) it seems, from this thread, to have a problem with mdadm arrays.

Mmm ... stable?

4.12.5 seems to have been removed from Gentoo portage already. There is not much more to be done here.

peppev
n00b

Joined: 10 Aug 2009
Posts: 26
Location: Italy

Posted: Fri Sep 22, 2017 6:05 pm

araxon wrote:
peppev wrote:

One of my systems is under the stable 4.12.15 and this morning dmesg reports:

Code:

...
[154509.424087] CPU: 0 PID: 3095 Comm: md3_raid1 Tainted: P           O    4.12.5-gentoo #1
...


Seems like 4.12.5, not 4.12.15. In other words, it is the exact same bug we are discussing here. Try upgrading to a later kernel, as suggested in this thread.


Apologies for the obvious "typo"; of course it is 4.12.5.

I installed 4.12.12; the mtx and conntrack bugs have been corrected, and the patches that had already been available for months are present in the new kernel.

Let's see if the mdadm problem is solved; I have no idea how to check for it in the kernel source.

I can only say that I was probably really "unlucky" in my monthly emerge schedule, being "trapped" in a kernel in such bad shape.

Still, some checks before declaring a kernel "stable" would be appreciated.

It is really disappointing to see systems that had been rock solid for years under Gentoo, with terabytes of data stored on their disks, panic like crazy because of a stupid "typo" (as the original developer called it in this thread: https://www.spinics.net/lists/kernel/msg2558062.html).

I hope to be luckier in the future.

araxon
Tux's lil' helper

Joined: 25 May 2011
Posts: 83

Posted: Tue Sep 26, 2017 7:39 am

peppev wrote:

Let's see if the mdadm problem is solved; I have no idea how to check for it in the kernel source.


I was able to crash my machine on kernel 4.12.5 (with the mdadm bug) in a matter of hours. I have been on 4.12.12 for the past 10 days, 24/7, and it seems that it no longer has this particular bug.

peppev wrote:

Still, some checks before declaring a kernel "stable" would be appreciated.

It is really disappointing to see systems that had been rock solid for years under Gentoo, with terabytes of data stored on their disks, panic like crazy because of a stupid "typo" (as the original developer called it in this thread: https://www.spinics.net/lists/kernel/msg2558062.html).

I hope to be luckier in the future.


Yes, that is embarrassing, and I too would be happier if it did not happen in the future. But we are getting this whole Gentoo miracle thing for free, so I'm grateful either way. Excellent value for the (zero) money spent.

peppev
n00b

Joined: 10 Aug 2009
Posts: 26
Location: Italy

Posted: Tue Sep 26, 2017 5:51 pm

araxon wrote:
peppev wrote:

Let's see if the mdadm problem is solved; I have no idea how to check for it in the kernel source.


I was able to crash my machine on kernel 4.12.5 (with the mdadm bug) in a matter of hours. I have been on 4.12.12 for the past 10 days, 24/7, and it seems that it no longer has this particular bug.

peppev wrote:

Still, some checks before declaring a kernel "stable" would be appreciated.

It is really disappointing to see systems that had been rock solid for years under Gentoo, with terabytes of data stored on their disks, panic like crazy because of a stupid "typo" (as the original developer called it in this thread: https://www.spinics.net/lists/kernel/msg2558062.html).

I hope to be luckier in the future.


Yes, that is embarrassing, and I too would be happier if it did not happen in the future. But we are getting this whole Gentoo miracle thing for free, so I'm grateful either way. Excellent value for the (zero) money spent.


I'm as grateful to Gentoo as you are, not only for the metadistro, which I can use in an "uncountable" number of ways, but also for keeping me as close to upstream as is possible these days, especially at my age ;-(

And I understand how "impossible" it may be to check all the "branches" a kernel can take while running its code.

But this particular version of the kernel seems to have been released as "stable" in a great hurry, missing a bunch of patches that had already been available for months.

That never happened before in the seven years I have used Gentoo on my "production" systems.

Just wondering why.

As for the mdadm bug, I'm still in the "check state".

I have a dozen systems still running 4.12.5 with mdadm arrays, and they show no trace of the problem.

Just one of my systems printed the "stall" warning in its dmesg, without any other apparent problem.

I bet this bug, if it is a "single bug", is not an "easy one" and may not be "over", at least until we find a kernel patch that shows the reason for the stall message.

Hu
Administrator

Joined: 06 Mar 2007
Posts: 21844

Posted: Wed Sep 27, 2017 1:01 am

This is the typical problem caused by different definitions of "stable." Upstream stable kernels start as the most recent Linus release (excluding release-candidates and snapshots), then add patches tagged as fixes (usually, but not always, tagged as such by the patch's author). Upstream typically performs basic build tests, but relies on the authors of the individual fixes to test functionality. There is typically some overlap where a previous stable kernel will receive additional fixes after a newer major series is available, but the same caveat applies. Users and, to some extent, distributions want to treat "stable" as implying a lack of serious new bugs. In a general sense, the stable series kernels from upstream are more stable than the base Linus kernel from which they derive, since they only take fixes on top of that kernel rather than big new features. However, each new Linus kernel features extensive changes relative to the prior Linus kernel, any of which could be bad if its respective author did not adequately test it.

All times are GMT