[SOLVED] kernel 4.12: cpu stall with dm-raid

ecko · Tux's lil' helper Joined: 04 Jul 2010 Posts: 102

Hello. Since I upgraded from 4.11 to 4.12, I get CPU stalls at random moments (the system is a desktop used for office work and is mostly idle). During a stall, I/O is frozen (including the SATA disk and the USB mouse, although the PS/2 keyboard is fine); programs already in memory stay responsive as long as they don't need I/O. "top" shows md_raid occupying 100% of a core (/home is on a Linux kernel software RAID1), while iotop reports no particular I/O activity.

What can I do?

dmesg below (running gentoo-sources-4.12.4)

LIsLinuxIsSogood · Veteran Joined: 13 Feb 2016 Posts: 1179

If I were you (and I'm not)... have you tried booting into single-user mode without the /home partition mounted? If you can get into the operating system without relying on the second disk (the mirror), you may be able to tell whether this is related at all to the RAID features newly added to the kernel, which are listed here (https://fossbytes.com/linux-kernel-4-12-download-features/).

It is a shot in the dark, but since all RAID features rely on two or more disks, perhaps there is a related bug. Alternatively, if the problem goes away after detaching the mirror, you might be able to add it back afterwards (problem-free).

Any luck?

radio_flyer · Posted: Wed Aug 16, 2017 3:54 pm

You're not running KDE are you? If so, Baloo will hang I/O hard for that long.

snIP3r · l33t Joined: 21 May 2004 Posts: 853 Location: germany

hi all!

i have a similar issue:

snIP3r · l33t Joined: 21 May 2004 Posts: 853 Location: germany

looks like this is about our issue:

https://lkml.org/lkml/2017/8/6/197
_________________
Intel i3-4130T on ASUS P9D-X
Kernel 5.15.88-gentoo SMP
-----------------------------------------------
if your problem is fixed please add something like [solved] to the topic!

araxon · Tux's lil' helper Joined: 25 May 2011 Posts: 83

Same here. Under high disk load, the server throws a similar message and then stops all disk I/O. It cannot even write an error log, so it took me days to track down. I did manage to log the errors remotely, though, after noticing that networking stays up a bit longer. There is no RAID5/6 on the server, only RAID1, but the error seems to be md_raid related.

I am able to reproduce the crash pretty regularly on this hardware, so if you have anything non-destructive that can be tried, I may be able to test it.
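
For anyone who needs to capture these messages off the box as well: the post above does not say how the remote logging was set up. One common approach is the kernel's netconsole, which sends kernel messages as UDP datagrams to another machine. The sketch below is a minimal receiver for such datagrams, not the poster's actual setup; the port number, the log file name, and the function names are all assumptions made for illustration.

```python
import socket
import time


def format_record(payload: bytes, sender: str, now: float) -> str:
    """Turn one received datagram into a timestamped (UTC) log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    # Kernel messages are normally plain text; replace anything undecodable.
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender} {text}"


def listen(port: int = 6666, path: str = "remote-kernel.log") -> None:
    """Append every UDP datagram received on `port` to the file at `path`.

    The port and file name are example values; use whatever the sending
    machine is configured to target.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    with open(path, "a", encoding="utf-8") as log:
        while True:
            payload, (host, _src_port) = sock.recvfrom(65535)
            log.write(format_record(payload, host, time.time()) + "\n")
            # Flush after every line so nothing is lost if the sender dies.
            log.flush()
```

Calling `listen()` on a second machine blocks and keeps appending to the log file until interrupted. Because the stalled machine can no longer write to its own disks, the receiving side is the only place these lines end up.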

snIP3r · l33t Joined: 21 May 2004 Posts: 853 Location: germany

yes, it's md related. i switched back to the kernel i used before - no such errors. so for now i am waiting for the next stable kernel...

ecko · Tux's lil' helper Joined: 04 Jul 2010 Posts: 102

My bisection led to this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d5e72dfdf0fa29a21143fd72746c6f43295ce9f "This update includes the usual round of major driver updates".

I did some limited testing with 4.13-rc7, and so far the problem has not shown up. I'll test for longer with 4.13 before declaring it solved.
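
For readers unfamiliar with bisection: it is a binary search over the commit history between a known-good and a known-bad kernel, which is what `git bisect` automates. The sketch below only illustrates the idea on a plain list. The commit names and the `is_bad` check are hypothetical stand-ins for "build this commit, boot it, and see whether the stall appears".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return commits[bad]


# Hypothetical history: the stall is introduced by commit "c".
history = ["v4.11", "a", "b", "c", "d", "v4.12"]
stalls = {"c", "d", "v4.12"}
print(first_bad(history, lambda commit: commit in stalls))  # prints "c"
```

Each step halves the remaining range, so even thousands of commits need only a dozen or so test boots. Since the stalls here occur at random moments, a commit can only be marked good after it has run long enough without one, which makes every step slow.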

ecko · Tux's lil' helper Joined: 04 Jul 2010 Posts: 102

After several days of tests, the problem does not happen with kernel 4.13.

Hu · Administrator Joined: 06 Mar 2007 Posts: 21844

This is the typical problem caused by different definitions of "stable." Upstream stable kernels start as the most recent Linus release (excluding release-candidates and snapshots), then add patches tagged as fixes (usually, but not always, tagged as such by the patch's author). Upstream typically performs basic build tests, but relies on the authors of the individual fixes to test functionality. There is typically some overlap where a previous stable kernel will receive additional fixes after a newer major series is available, but the same caveat applies. Users and, to some extent, distributions want to treat "stable" as implying a lack of serious new bugs. In a general sense, the stable series kernels from upstream are more stable than the base Linus kernel from which they derive, since they only take fixes on top of that kernel rather than big new features. However, each new Linus kernel features extensive changes relative to the prior Linus kernel, any of which could be bad if its respective author did not adequately test it.