Gentoo Forums
MCELOG, AMD and rasdaemon
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2007

Posted: Thu Jul 12, 2018 5:04 pm    Post subject: MCELOG, AMD and rasdaemon

While clearing lint from my kernel config I came across the MCE handling and did some digging to understand what I should have (which was not what I had). The results seem just about worth sharing.

Please assume "IIUC" in front of every sentence following - I've only a superficial understanding.

The kernel supports Machine Check Exceptions for such things as memory errors, hardware glitches, and possibly thermal problems. The "old" support put something in /dev/mcelog, and the package app-admin/mcelog contained a program to do something with the results. "Something" could be log it, or print a diagnosis on the terminal. The package contains a daemon that can be added to a run level (boot or default, I guess) and a command line version.

To work with current kernels, you need to enable
Code:
Processor type and features
  [*] Machine Check / overheating reporting
     [ ]   Support for deprecated /dev/mcelog character device
     [ ]   Intel MCE features
     [*]   AMD MCE features

and the help for the deprecated support says
Quote:
Enable support for /dev/mcelog which is needed by the old mcelog userspace logging daemon. Consider switching to the new generation rasdaemon solution.


I don't know about Intel boxes, but on all but ancient AMD boxes, this is all pointless. The mcelog package doesn't support any AMD CPU newer than K8. If you try to use it, it says "CPU is unsupported" and "Please load edac_mce_amd module". That second message confuses everyone: mcelog doesn't need the module. The mcelog daemon simply doesn't work on these CPUs, and the edac_mce_amd module provides a substitute function. However, it's not much of a substitute.

To get the edac_mce_amd module, you need to configure:
Code:
Device Drivers
    <*> EDAC (Error Detection And Correction) reporting  --->
        <*>   Decode MCEs in human-readable form (only on AMD for now)
        <M>   AMD64 (Opteron, Athlon64)

I think the "Decode" option puts a readable diagnostic in syslog. The edac_mce_amd module handles ECC memory errors, but I don't have that sort of memory.
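
For anyone checking an existing kernel: as far as I know, the menu entries above correspond to the symbols CONFIG_X86_MCE, CONFIG_X86_MCELOG_LEGACY, CONFIG_X86_MCE_INTEL, CONFIG_X86_MCE_AMD, CONFIG_EDAC, CONFIG_EDAC_DECODE_MCE and CONFIG_EDAC_AMD64. Here is a minimal sketch that reports their state from a .config file. The function name check_kconfig and the example path are made up for illustration.

```shell
#!/bin/sh
# check_kconfig FILE SYMBOL...
# Print the state of each kernel config symbol (given without the CONFIG_
# prefix) as found in FILE: "y", "m", or "not set".
check_kconfig() {
    cfg=$1
    shift
    for sym in "$@"; do
        # Enabled symbols appear as CONFIG_FOO=y or CONFIG_FOO=m.
        val=$(sed -n "s/^CONFIG_${sym}=\([ym]\)\$/\1/p" "$cfg")
        echo "${sym}: ${val:-not set}"
    done
}

# Example (the path is an assumption; adjust it to your kernel tree):
# check_kconfig /usr/src/linux/.config X86_MCE X86_MCELOG_LEGACY \
#     X86_MCE_INTEL X86_MCE_AMD EDAC EDAC_DECODE_MCE EDAC_AMD64
```

A symbol that is commented out ("# CONFIG_FOO is not set") or missing altogether is reported as "not set".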

As far as I can tell, the rasdaemon mentioned as a replacement for the mcelog daemon is also a weak substitute. It's supposed to be the beginning of a complete Reliability, Availability and Serviceability (RAS) infrastructure, but like the module, it currently only handles ECC memory.

You can enable the kernel bits, but the daemons only handle ECC memory, and I don't have any. The kernel might report overheating, but I wouldn't count on it, and I thought ACPI and its friends were supposed to handle that anyway.

TL;DR So as far as I can tell, on AMD almost all of this is useless. In particular, the handbook probably should say mcelog is only for Intel systems.
_________________
Greybeard
janos666
n00b


Joined: 15 Nov 2015
Posts: 30

Posted: Sun Oct 29, 2023 12:52 pm

I guess it's better to resurrect this thread than to start a new one.
I have had ECC memory, with a motherboard chipset and CPU that support it, for a long time, but I never seemed to have access to error statistics.
I recently found rasdaemon, but I have trouble setting it up. The only readings I get out of it that look as expected are the motherboard name (which is a bit confusing, but the board has an Intel C232 chipset):
Code:
ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Gigabyte Technology Co., Ltd. model X150M-PRO ECC-CF

The CPU is: Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
The memory is: 4x Micron 18ASF2G72AZ-2G3B1
and the guesstimated memory layout (this seems to be fine for a 2-channel, 4-stick config):
Code:
ras-mc-ctl --guess-labels
memory stick 'ChannelA-DIMM0' is located at 'BANK 0'
memory stick 'ChannelA-DIMM1' is located at 'BANK 1'
memory stick 'ChannelB-DIMM0' is located at 'BANK 2'
memory stick 'ChannelB-DIMM1' is located at 'BANK 3'


As for the rest of the rasdaemon outputs...
rasdaemon is added to the default runlevel of OpenRC, but it crashes. Running it in the foreground reveals why:
Code:
rasdaemon -f
rasdaemon: Can't locate a mounted debugfs

even though I compile the kernel with DEBUG_FS=y and DEBUG_FS_ALLOW_ALL=y and it is confirmed to be mounted according to the mount command's output:
Code:
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
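
One thing that may be worth ruling out: rasdaemon listens to kernel trace events, so as far as I understand it needs a tracing directory inside debugfs, not just the mount itself. That is an assumption about the cause, not something confirmed in this thread. A minimal sketch to check both conditions follows. The function names are made up for illustration.

```shell
#!/bin/sh
# debugfs_mountpoint MOUNTS_FILE
# Print the mount point of the first debugfs entry in a /proc/mounts-style
# file; return non-zero if there is none.
debugfs_mountpoint() {
    awk '$3 == "debugfs" { print $2; found = 1; exit } END { exit !found }' "$1"
}

# has_tracing_dir DEBUGFS_DIR
# Succeed if DEBUGFS_DIR contains a "tracing" directory.
has_tracing_dir() {
    [ -d "$1/tracing" ]
}

# Example:
# if d=$(debugfs_mountpoint /proc/mounts) && has_tracing_dir "$d"; then
#     echo "tracing available under $d"
# else
#     echo "no debugfs tracing directory found"
# fi
```

If debugfs is mounted but the tracing directory is missing, the kernel was probably built without tracing support, which would be a different problem from DEBUG_FS itself.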

And then there is this:
Code:
ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

I am not sure whether I have the correct EDAC driver(s) for this particular chipset/CPU(-IMC) [or whether one exists in the mainline kernel at all], because going by the kernel menuconfig help text, this platform might not be supported. But I enabled all the EDAC drivers that seem at least somewhat close:
CONFIG_EDAC_I7CORE (the description says it's for the very old, Nehalem-based CPU-IMC)
CONFIG_EDAC_SBRIDGE (again, described as supporting somewhat older CPU generations than mine)
CONFIG_EDAC_SKX (this is the one I probably need, because my CPU is Skylake generation and workstation/server class, although at the low end. The same CPU architecture is sold for both low-end desktop and workstation/server use under various names from Pentium to Xeon, with features and cache sizes enabled or disabled accordingly. This driver, though, is probably intended for the HEDT and "real server" variants, Skylake -E or -S, which share the core architecture and a similar feature set but use a different silicon chip, not merely different laser cuts on caches and a different microcode config.)
CONFIG_EDAC_I10NM (this is for later architectures, but I enabled it anyway, just as I enabled the older variants, just in case...)
dmesg confirms EDAC MC is loaded (but not much else):
Code:
dmesg | grep -i edac
[    0.396047] EDAC MC: Ver: 3.0.0


It can't find labels (not very surprising, but I would expect the sane-looking guessed labels to be used in the absence of preconfigured ones):
Code:
ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for Gigabyte Technology Co., Ltd. model X150M-PRO ECC-CF

--register-labels prints the same error.

Code:
ras-mc-ctl --layout
ras-mc-ctl: Error: No memories found at via edac.

Maybe that is simply because no available EDAC driver is compatible with my hardware (but that would be rather strange, since coverage seems good for both desktop and workstation/server parts in general, and this is a workstation/server chipset with a workstation/server Xeon CPU).

This is a confusing error (for me), because Gentoo installed SQLite and nothing mentions a need for manual initialization:
Code:
ras-mc-ctl --summary
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/sbin/ras-mc-ctl line 1172.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1173.


These are not surprising after the above:
Code:
ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/sbin/ras-mc-ctl line 1332.
ras-mc-ctl: Error: mc_event table missing from /var/lib/rasdaemon/ras-mc_event.db. Run 'rasdaemon --record'.


This is not surprising if no driver is compatible with the hardware:
Code:
ras-mc-ctl --error-count
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.


The rest are irrelevant.
janos666
n00b


Joined: 15 Nov 2015
Posts: 30

Posted: Tue Oct 31, 2023 2:21 am

Ahh, I finally figured out the correct EDAC driver for my CPU/IMC:
Code:
dmesg | grep -i edac
[    0.394848] EDAC MC: Ver: 3.0.0
[    5.387759] EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)

However, rasdaemon still crashes, and running it in the foreground still gives the error about debugfs not being mounted (even though it certainly is, and there is a ras folder on it).
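
With an EDAC driver bound (ie31200_edac in the dmesg output above), the error counters should be readable straight from sysfs without rasdaemon. Below is a minimal sketch, assuming the usual layout of /sys/devices/system/edac/mc/mcN/ with ce_count and ue_count files. The function name edac_counts is made up for illustration.

```shell
#!/bin/sh
# edac_counts [BASE_DIR]
# For each memory controller under BASE_DIR (default: the EDAC sysfs
# directory), print "<mc> CE=<corrected> UE=<uncorrected>".
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc*; do
        # Skip the unexpanded glob (no controller registered) and
        # controllers that lack either counter file.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        echo "${mc##*/} CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
    done
}

# Example: edac_counts
```

No output means no memory controller is registered, which would be consistent with the earlier "No DIMMs found in /sys" error.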
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2007

Posted: Tue Oct 31, 2023 2:28 pm

IMHO rasdaemon has an asinine design for most users. I can see the benefit of storing EDAC error information in a database if you're running a cloud service with hundreds of boxes, but for most of us, that database is unlikely to receive 50 records before we junk the PC! Then to tie it to the kernel debugging API is, to be polite, weird.

After a fair bit of thought I've come to the following conclusions:
  • the kernel's existing EDAC error handling, which writes an error to syslog (I don't remember the severity, but it's noticeable), provides adequate information; telling us which chip on which board is unlikely to be that important, as my first response would be to run memtest86 anyway;
  • a database needs looking at, so that's pointless; and
  • I'd like an alert when there are any soft errors (or more than x within a certain time period, but I suspect x=1 is enough), so I can get a feel for whether there is a trend.

From which I've decided the thing to use is a syslog filter (depending on your logging setup) that sends me an email if various MCE records appear, including EDAC errors.

Haven't gotten around to writing it yet!
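
A rough starting point for such a filter might look like the sketch below. Both match patterns are assumptions about how the kernel words its reports (a "[Hardware Error]" prefix for MCE records, "EDAC MCn: <count> CE/UE ..." for memory errors). The function names, log path and recipient are placeholders, and the mail step assumes a mailx-style mail command is installed.

```shell
#!/bin/sh
# filter_hw_errors
# Read log lines on stdin and print only those that look like MCE or EDAC
# error reports. Both patterns are assumptions about the kernel's wording.
filter_hw_errors() {
    grep -E '\[Hardware Error\]|EDAC MC[0-9]+: [0-9]+ (CE|UE) '
}

# mail_if_any RECIPIENT
# Collect matching lines from stdin and mail them, but only if there are any.
# Assumes a mailx-style "mail -s SUBJECT ADDRESS" command.
mail_if_any() {
    hits=$(filter_hw_errors)
    [ -n "$hits" ] || return 0
    printf '%s\n' "$hits" | mail -s "Hardware errors on $(uname -n)" "$1"
}

# Example (log path and address are placeholders):
# mail_if_any root < /var/log/messages
```

Informational lines such as "EDAC MC: Ver: 3.0.0" are deliberately not matched, so only actual error reports would trigger a mail. Run from cron it would need some way to remember which lines were already reported, which this sketch does not attempt.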
_________________
Greybeard
All times are GMT