Gentoo Forums
Kernel 6.1.81 smartd won't run; trace in dmesg [SOLVED]
figueroa
Advocate
Joined: 14 Aug 2005
Posts: 2963
Location: Edge of marsh USA

Posted: Wed Mar 13, 2024 4:09 am    Post subject: Kernel 6.1.81 smartd won't run; trace in dmesg [SOLVED]

I've just upgraded two machines to gentoo-sources-6.1.81. On both machines now, upon boot, I get a trace that looks like the following:
Code:
[Tue Mar 12 23:54:41 2024] ------------[ cut here ]------------
[Tue Mar 12 23:54:41 2024] WARNING: CPU: 7 PID: 4020 at drivers/scsi/scsi_lib.c:214 scsi_execute_cmd+0x3a/0x240
[Tue Mar 12 23:54:41 2024] Modules linked in: uas x86_pkg_temp_thermal mei_hdcp kvm_intel rt2800pci eeprom_93cx6 rt2x00pci rt2800mmio rt2x00mmio rt2800lib kvm crc_ccitt rt2x00lib at24 mac80211 regmap_i2c libarc4 cfg80211 irqbypass mei_me e1000e firewire_ohci firewire_core mei f71882fg coretemp
[Tue Mar 12 23:54:41 2024] CPU: 7 PID: 4020 Comm: smartd Not tainted 6.1.81-gentoo #1
[Tue Mar 12 23:54:41 2024] Hardware name: Hewlett-Packard h8-1260t/2AB5, BIOS 7.12 10/12/2011
[Tue Mar 12 23:54:41 2024] RIP: 0010:scsi_execute_cmd+0x3a/0x240
[Tue Mar 12 23:54:41 2024] Code: f4 89 d6 55 44 89 c5 53 48 83 ec 10 48 8b 5c 24 50 48 89 0c 24 48 85 db 0f 84 9b 01 00 00 48 83 3b 00 74 24 83 7b 08 60 74 1e <0f> 0b 41 bd ea ff ff ff 48 83 c4 10 44 89 e8 5b 5d 41 5c 41 5d 41
[Tue Mar 12 23:54:41 2024] RSP: 0018:ffffaa6f0173fcc0 EFLAGS: 00010287
[Tue Mar 12 23:54:41 2024] RAX: ffffaa6f0173fd20 RBX: ffffaa6f0173fd20 RCX: ffff99b3939c9400
[Tue Mar 12 23:54:41 2024] RDX: 0000000000000022 RSI: 0000000000000022 RDI: ffff99b380d45000
[Tue Mar 12 23:54:41 2024] RBP: 0000000000000200 R08: 0000000000000200 R09: 0000000000002710
[Tue Mar 12 23:54:41 2024] R10: ffffaa6f0173fee8 R11: 0000000000000000 R12: ffffaa6f0173fd50
[Tue Mar 12 23:54:41 2024] R13: ffff99b380d45000 R14: 0000000000002710 R15: 0000000000000000
[Tue Mar 12 23:54:41 2024] FS:  00007f1023662480(0000) GS:ffff99b6af5c0000(0000) knlGS:0000000000000000
[Tue Mar 12 23:54:41 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Mar 12 23:54:41 2024] CR2: 00007ffce0074ff8 CR3: 0000000118fd4005 CR4: 00000000000606e0
[Tue Mar 12 23:54:41 2024] Call Trace:
[Tue Mar 12 23:54:41 2024]  <TASK>
[Tue Mar 12 23:54:41 2024]  ? scsi_execute_cmd+0x3a/0x240
[Tue Mar 12 23:54:41 2024]  ? __warn+0x74/0xc0
[Tue Mar 12 23:54:41 2024]  ? scsi_execute_cmd+0x3a/0x240
[Tue Mar 12 23:54:41 2024]  ? report_bug+0xe2/0x150
[Tue Mar 12 23:54:41 2024]  ? handle_bug+0x3a/0x70
[Tue Mar 12 23:54:41 2024]  ? exc_invalid_op+0x13/0x60
[Tue Mar 12 23:54:41 2024]  ? asm_exc_invalid_op+0x16/0x20
[Tue Mar 12 23:54:41 2024]  ? scsi_execute_cmd+0x3a/0x240
[Tue Mar 12 23:54:41 2024]  ? ata_cmd_ioctl+0x1dd/0x2f0
[Tue Mar 12 23:54:41 2024]  ata_cmd_ioctl+0x13f/0x2f0
[Tue Mar 12 23:54:41 2024]  scsi_ioctl+0x32b/0x900
[Tue Mar 12 23:54:41 2024]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x140
[Tue Mar 12 23:54:41 2024]  ? scsi_block_when_processing_errors+0x1d/0xf0
[Tue Mar 12 23:54:41 2024]  blkdev_ioctl+0x100/0x280
[Tue Mar 12 23:54:41 2024]  __x64_sys_ioctl+0x88/0xc0
[Tue Mar 12 23:54:41 2024]  do_syscall_64+0x38/0x90
[Tue Mar 12 23:54:41 2024]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[Tue Mar 12 23:54:41 2024] RIP: 0033:0x7f102332c98b
[Tue Mar 12 23:54:41 2024] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[Tue Mar 12 23:54:41 2024] RSP: 002b:00007ffce0074990 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Mar 12 23:54:41 2024] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f102332c98b
[Tue Mar 12 23:54:41 2024] RDX: 00007ffce0074bf0 RSI: 000000000000031f RDI: 0000000000000003
[Tue Mar 12 23:54:41 2024] RBP: 000055b9c3a9c760 R08: 0000000000000000 R09: 0000000000000000
[Tue Mar 12 23:54:41 2024] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[Tue Mar 12 23:54:41 2024] R13: 00007ffce0075620 R14: 00007ffce0074bf0 R15: 00007ffce0075620
[Tue Mar 12 23:54:41 2024]  </TASK>
[Tue Mar 12 23:54:41 2024] ---[ end trace 0000000000000000 ]---

What's up with that? It's Greek to me (not to disparage any Greeks who read this).
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi


Last edited by figueroa on Tue Mar 19, 2024 3:12 am; edited 2 times in total
figueroa
Posted: Thu Mar 14, 2024 3:33 am

I greatly dislike answering my own question, but I'm afraid I must. smartd (/etc/init.d/smartd) will not start under gentoo-sources-6.1.81.

Recompiling smartmontools didn't help. After rebooting to gentoo-sources-6.1.74, everything is back to normal.

This failure to start smartd happened on two VERY different systems. One system is an x86 Gigabyte motherboard running an AMD Phenom(tm) 8650 Triple-Core Processor, and the second system is an x86_64 HP h8-1260t motherboard running an Intel(R) Core(TM) i7-2600 CPU.

I'm assuming many people are running this kernel, so I'm not calling it SOLVED. It's just solved for me for the time being.
_________________
Hu
Moderator
Joined: 06 Mar 2007
Posts: 21637

Posted: Thu Mar 14, 2024 2:36 pm

Although that failure to start is inconvenient for you, it is a very good thing for debugging. First, it means that this is a user-visible regression, and fixing those is considered second only to preserving security fixes. If you can find the specific commit that broke it, I think it highly likely you can either argue to get it reverted or argue to get a fix so that smartd works on future 6.1.x kernels. Second, since it fails to start, this is a relatively straightforward test case: (1) boot a suspect kernel; (2) try to start smartd; (3) check whether smartd started without error and without kernel warnings. This is much nicer than bugs where the kernel must run for hours before it fails, or where a failure happens only once out of every several reboots.
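Step (3) of that test case can be automated. What follows is only a rough sketch, assuming the warning always has the shape shown in the trace at the top of this thread; the function names `find_warnings` and `kernel_is_bad` are made up for illustration and are not part of any existing tool.

```python
import re

# Header line of a kernel WARNING, for example:
#   WARNING: CPU: 7 PID: 4020 at drivers/scsi/scsi_lib.c:214 scsi_execute_cmd+0x3a/0x240
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+"
)
# The following "CPU: ... PID: ... Comm: ..." line names the triggering process.
COMM_RE = re.compile(r"PID: (?P<pid>\d+) Comm: (?P<comm>\S+)")


def find_warnings(dmesg_text):
    """Return one dict per WARNING header found in the given dmesg text."""
    warnings = []
    comm_by_pid = {}
    for line in dmesg_text.splitlines():
        comm = COMM_RE.search(line)
        if comm:
            comm_by_pid[comm.group("pid")] = comm.group("comm")
        warn = WARN_RE.search(line)
        if warn:
            warnings.append(warn.groupdict())
    # Attach the process name (if one was logged) to each warning by PID.
    for warning in warnings:
        warning["comm"] = comm_by_pid.get(warning["pid"])
    return warnings


def kernel_is_bad(dmesg_text, comm="smartd", func="scsi_execute_cmd"):
    """True if the log holds a WARNING in `func` triggered by process `comm`."""
    return any(
        w["comm"] == comm and w["func"] == func
        for w in find_warnings(dmesg_text)
    )
```

The text could come from something like `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, run after trying to start smartd. A True result means "bad kernel" and a False result means "good kernel", provided smartd itself also started cleanly, which this sketch does not check.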

To your specific problem: Normally, the next step would be to try to find the specific commit (not just kernel release) that broke this. I see 1512 commits present in v6.1.81 and not present in v6.1.74. If we assume this was broken upstream, rather than by a Gentoo-specific patch, then we need to find which of those 1512 commits is at fault. A straight git bisect will need log2(1512) =~ 11 steps to find this. However, after the first draft of this post, I did some basic analysis of the faulting function, and I see the WARN introduced in a commit that is present in v6.1.81 and absent in v6.1.74. Therefore, without any attempt to understand what the code is meant to do, I posit that this check is just wrong as committed. The WARN was introduced in scsi: core: Add struct for args to execution functions, which makes it v6.1.80-4-gcf33e6ca12d8. That is, it is the fourth commit added after 6.1.80, so I expect you would find 6.1.80 to be good and 6.1.81 to be bad. It may (or may not) be the case that the corresponding upstream commit was correct as written, due to a change present in the later kernel that is absent in 6.1. That would require either testing a newer kernel series, or getting input from someone who understands this code and can explain what the check was meant to prevent.
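For anyone who wants to check the arithmetic, here is a small sketch. It assumes a plain linear history, so the step count is only an estimate, and it assumes the usual `<tag>-<count>-g<abbreviated hash>` layout of a describe-style name like the one quoted above. Both function names are hypothetical helpers, not git commands.

```python
import math
import re


def bisect_steps(n_commits):
    """Estimate the build-and-test rounds a bisection over n_commits needs."""
    return math.ceil(math.log2(n_commits))


# "<tag>-<commits since tag>-g<abbreviated hash>"
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<abbrev>[0-9a-f]+)$")


def parse_describe(name):
    """Split a describe-style name into (tag, commits_since_tag, short_hash)."""
    match = DESCRIBE_RE.match(name)
    if match is None:
        raise ValueError(f"not in tag-count-ghash form: {name!r}")
    return match.group("tag"), int(match.group("count")), match.group("abbrev")


print(bisect_steps(1512))                         # 11
print(parse_describe("v6.1.80-4-gcf33e6ca12d8"))  # ('v6.1.80', 4, 'cf33e6ca12d8')
```

The second result is where the reading "fourth commit after v6.1.80" comes from: the count field is 4 and the base tag is v6.1.80.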
figueroa
Posted: Thu Mar 14, 2024 3:44 pm

Thanks, Hu. That's a good lead and findings. I guess bugzilla is my friend.

ADDED 20240315: Reported: https://bugs.gentoo.org/927079 (date edited 20240323)
_________________
Last edited by figueroa on Sun Mar 24, 2024 3:29 am; edited 1 time in total
figueroa
Posted: Tue Mar 19, 2024 3:11 am

The problem that began with an update to gentoo-sources-6.1.81 preventing smartd from running was resolved by upgrading the kernel to gentoo-sources-6.8.1.

You can follow the saga at the bug report https://bugs.gentoo.org/927079#c9 but the reason this upgrade solved the problem has not yet been revealed.
_________________
Hu
Posted: Tue Mar 19, 2024 3:10 pm

figueroa wrote:
ADDED 20040215
You reported this bug 20 years ago? ;)

figueroa wrote:
The problem that began with an update to gentoo-sources-6.1.81 preventing smartd from running was resolved by upgrading the kernel to gentoo-sources-6.8.1.

You can follow the saga at the bug report https://bugs.gentoo.org/927079#c9 but the reason this upgrade solved the problem has not yet been revealed.
This may be a case of what I speculated above. I posit that in some kernel released after v6.1, but before v6.8.1, some as-yet-unidentified commit changed the behavior of this code path such that this new WARN is valid when that commit is present. That hypothetical commit was not backported to v6.1.81, so the commit containing the WARN is wrong as committed in v6.1.81, but is correct in v6.8.1. This would not be the first time that has happened, since backport decisions do not necessarily receive a detailed review and approval from the author of the backported commit, who we can assume is best positioned to understand its prerequisites. (Although even there, the author may not be aware that the commit depends on a change which has not been backported.) When the community is lucky, a "bad" backport like this causes a build failure, and the stable kernel maintainers readily recognize that this is a bad change. When we are not, the bad backport causes a runtime bug, possibly much more subtle than the one that prompted this thread.

It's also possible that this is a simple case of a bad commit in the v6.2..v6.8.1 range, which was wrong as written even when submitted to Linus, which was fixed by something present by the time of v6.8.1, and that the bad commit was backported, but the fix for the bad commit was not backported - possibly because no one has yet recognized which commit needs to be backported.

You could still try to get this change reverted in v6.1.x, but as I read the commit log message, it was brought in because it was a known prerequisite for something else that was supposed to go into v6.1.x. Thus, you may face more pushback, or there may be a stronger interest in finding what in v6.8.1 fixed this, and backporting that fix to v6.1.x.
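If someone does identify a candidate fix in mainline, whether it already reached v6.1.x can be checked mechanically. Stable backports get new hashes, so the usual trick is to look for the mainline hash that stable commit messages cite, conventionally as "commit <sha> upstream." or "[ Upstream commit <sha> ]". The sketch below assumes those two forms only; the function names are invented for illustration, and the hashes in the usage example are placeholders rather than real commits.

```python
import re

# The two conventional ways a stable commit message cites its mainline origin.
UPSTREAM_RE = re.compile(
    r"^\s*(?:commit ([0-9a-f]{40}) upstream\.?"
    r"|\[ Upstream commit ([0-9a-f]{40}) \])\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def upstream_ids(message):
    """Return the set of mainline hashes cited in one stable commit message."""
    return {(a or b).lower() for a, b in UPSTREAM_RE.findall(message)}


def is_backported(upstream_sha, stable_messages):
    """True if any of the stable commit messages cites upstream_sha."""
    wanted = upstream_sha.lower()
    return any(wanted in upstream_ids(msg) for msg in stable_messages)


# Placeholder hashes, for illustration only.
FIX = "a" * 40
OTHER = "b" * 40
messages = [
    f"scsi: example subject\n\ncommit {OTHER} upstream.\n\nBody text.\n",
    "unrelated: no upstream reference here\n",
]
print(is_backported(OTHER, messages))  # True
print(is_backported(FIX, messages))    # False
```

The messages could be collected with something like `git log --format=%B v6.1.80..v6.1.81` in a stable tree. A False result only says the hash is not cited in that range, not that an equivalent change is absent.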
figueroa
Posted: Sun Mar 24, 2024 3:32 am

Hu wrote:
figueroa wrote:
ADDED 20040215
You reported this bug 20 years ago? ;)

That was embarrassing. I did just fix it in quoted post. :oops:
_________________
figueroa
Posted: Sun Apr 07, 2024 6:37 pm

Hu recommended that I install kernel gentoo-sources-6.8.1 since I'd found that 6.1.81 would not start smartd. Kernel 6.8.1 solved the problem and I've been running it for two weeks now without an issue. But thinking long term, what kernel series do I follow now to keep from being left behind? I'm not comfortable staying on the bleeding edge of kernel development, and my older hardware has no need for it.

Noticing that the newest LTS kernel is the 6.6 series, I merged, built, and installed gentoo-sources-6.6.21, the current stable LTS version. That seems to have hit the sweet spot for me. SMART (/etc/init.d/smartd) runs as intended, and dmesg reveals no important errors or warnings, and /var/log/boot shows all processes called for have started normally.

That is all. Original problem still solved. :D
_________________
All times are GMT