Gentoo Forums
rasdaemon/ras-mc-ctl --status says "drivers not loaded."
c00l.wave
Apprentice


Joined: 24 Aug 2003
Posts: 264

Posted: Sat May 14, 2022 10:12 am    Post subject: ras-mc-ctl --status says "drivers not loaded."

I got a new AMD machine with ECC RAM. On previous machines I used edac-utils in addition to mcelog, but those tools are supposedly deprecated and replaced by the all-in-one tool app-admin/rasdaemon. I am not sure whether it is working properly, though:

Code:

# ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.


It seems to be missing something but unfortunately doesn't tell me what exactly. EDAC and MCE support is compiled directly into the kernel (not as a module) and, judging from dmesg, appears to be working:

Code:

# dmesg | grep -iE '(edac|mce)'
[    0.353539] MCE: In-kernel MCE decoding enabled.
[    0.427186] EDAC MC: Ver: 3.0.0
[    0.896907] EDAC amd64: MCT channel count: 2
[    0.897048] EDAC MC0: Giving out device to module amd64_edac controller F17h_M70h: DEV 0000:00:18.3 (INTERRUPT)
[    0.898863] EDAC amd64: F17h_M70h detected (node 0).
[    0.898982] EDAC MC: UMC0 chip selects:
[    0.898983] EDAC amd64: MC: 0:     0MB 1:     0MB
[    0.899101] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[    0.899219] EDAC MC: UMC1 chip selects:
[    0.899219] EDAC amd64: MC: 0:     0MB 1:     0MB
[    0.899335] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[    0.899452] EDAC amd64: using x16 syndromes.
[    0.899569] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[    0.899735] AMD64 EDAC driver v3.5.0


Code:

# grep -iE '(edac|mce|ras)' /usr/src/linux/.config
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_CRASH_CORE=y
# CONFIG_DRM_PANEL_RASPBERRYPI_TOUCHSCREEN is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_GHES=y
CONFIG_EDAC_AMD64=y
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
# CONFIG_EDAC_SKX is not set
# CONFIG_EDAC_I10NM is not set
# CONFIG_EDAC_PND2 is not set
# CONFIG_EDAC_IGEN6 is not set
CONFIG_RAS=y
CONFIG_RAS_CEC=y
# CONFIG_RAS_CEC_DEBUG is not set


Did I miss something, or is ras-mc-ctl --status broken? Maybe it only checks for loaded modules and cannot detect support that was compiled into the kernel?
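
One check that does not depend on module loading is to look at sysfs: the EDAC core creates one mcN directory per registered memory controller under /sys/devices/system/edac/mc, whether the driver is a module or built in. A minimal sketch follows; the function name edac_list_controllers and its optional directory argument are made up for illustration and are not part of rasdaemon:

```shell
# List memory controllers that the EDAC core has registered in sysfs.
# $1 (optional) overrides the sysfs directory, mainly for testing.
# Returns 0 if at least one controller was found, 1 otherwise.
edac_list_controllers() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local mc name size found=1
    for mc in "$base"/mc[0-9]*; do
        [ -d "$mc" ] || continue          # unmatched glob stays literal
        name="unknown"
        size="?"
        [ -r "$mc/mc_name" ] && name=$(cat "$mc/mc_name")
        [ -r "$mc/size_mb" ] && size=$(cat "$mc/size_mb")
        printf '%s: %s (%s MB)\n' "${mc##*/}" "$name" "$size"
        found=0
    done
    return "$found"
}
```

Running `edac_list_controllers || echo "no EDAC memory controllers registered"` should print one line per controller if the driver is active, regardless of what ras-mc-ctl --status says.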

Overall, I do get some data out of ras-mc-ctl despite the report that drivers are missing, although ras-mc-ctl appears to have some bugs (script warnings in the output):

Code:

# ras-mc-ctl --layout
Use of uninitialized value $max_pos[3] in modulus (%) at /usr/sbin/ras-mc-ctl line 905.
Use of uninitialized value $d in numeric ge (>=) at /usr/sbin/ras-mc-ctl line 906.
Use of uninitialized value $d in sprintf at /usr/sbin/ras-mc-ctl line 909.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 828.
    +-----------------------------------------------------------------------------------------------+
    |                                              mc0                                              |
    |        csrow0         |        csrow1         |        csrow2         |        csrow3         |
    | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  |
----+-----------------------------------------------------------------------------------------------+

0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----+-----------------------------------------------------------------------------------------------+


Code:

# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
Disk errors summary:
        0:2048 has 2 errors
No MCE errors.


Code:

# ras-mc-ctl --guess-labels
memory stick 'DIMM 0' is located at 'P0 CHANNEL A'
memory stick 'DIMM 1' is located at 'P0 CHANNEL A'
memory stick 'DIMM 0' is located at 'P0 CHANNEL B'
memory stick 'DIMM 1' is located at 'P0 CHANNEL B'


Code:

# ras-mc-ctl --register-labels
Use of uninitialized value in lc at /usr/sbin/ras-mc-ctl line 796.
ras-mc-ctl: Error: No dimm labels for ASRockRack model B565D4-V1L


Code:

# ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASRockRack model B565D4-V1L


Code:

# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

Disk errors
1 2022-04-19 12:17:49 +0200 error: dev=0:2048, sector=149021912, nr_sector=256, error='unknown block error', rwbs='R', cmd='',
2 2022-04-19 14:40:59 +0200 error: dev=0:2048, sector=17181184000, nr_sector=256, error='unknown block error', rwbs='R', cmd='',

No MCE errors.


Restarting /etc/init.d/rasdaemon also leaves some odd log messages, including "apache2 has detected an error", even though I am working with rasdaemon, not apache2:

Code:

May 14 12:08:04 [/etc/init.d/rasdaemon] apache2 has detected an error in your setup:
May 14 12:08:05 [rasdaemon] Huh! something got wrong. Aborting._
May 14 12:08:05 [rasdaemon] ras:mc_event event disabled_
May 14 12:08:05 [rasdaemon] ras:aer_event event disabled_
May 14 12:08:05 [rasdaemon] mce:mce_record event disabled_
May 14 12:08:05 [rasdaemon] Can't write to set_event_
May 14 12:08:05 [rasdaemon] ras:non_standard_event event disabled_
May 14 12:08:05 [rasdaemon] ras:arm_event event disabled_
May 14 12:08:05 [rasdaemon] Can't write to set_event_
May 14 12:08:05 [rasdaemon] block:block_rq_complete event disabled_
May 14 12:08:05 [rasdaemon] ras:mc_event event enabled_
May 14 12:08:05 [rasdaemon] ras:aer_event event enabled_
May 14 12:08:05 [rasdaemon] ras:mc_event event enabled_
May 14 12:08:05 [rasdaemon] Enabled event ras:mc_event_
May 14 12:08:05 [rasdaemon] mce:mce_record event enabled_
May 14 12:08:05 [rasdaemon] Can't write to set_event_
May 14 12:08:05 [rasdaemon] ras:non_standard_event event enabled_
May 14 12:08:05 [rasdaemon] ras:aer_event event enabled_
May 14 12:08:05 [rasdaemon] Enabled event ras:aer_event_
May 14 12:08:05 [rasdaemon] ras:arm_event event enabled_
May 14 12:08:05 [rasdaemon] Can't write to set_event_
May 14 12:08:05 [rasdaemon] block:block_rq_complete event enabled_
May 14 12:08:05 [rasdaemon] ras:non_standard_event event enabled_
May 14 12:08:05 [rasdaemon] Enabled event ras:non_standard_event_
May 14 12:08:05 [rasdaemon] ras:arm_event event enabled_
May 14 12:08:05 [rasdaemon] Enabled event ras:arm_event_
May 14 12:08:05 [rasdaemon] mce:mce_record event enabled_
May 14 12:08:05 [rasdaemon] Enabled event mce:mce_record_
May 14 12:08:05 [rasdaemon] Can't get traces from ras:extlog_mem_event_
May 14 12:08:05 [rasdaemon] net:net_dev_xmit_timeout event enabled_
May 14 12:08:05 [rasdaemon] Enabled event net:net_dev_xmit_timeout_
May 14 12:08:05 [rasdaemon] Can't get traces from devlink:devlink_health_report_
May 14 12:08:05 [rasdaemon] block:block_rq_complete event enabled_
May 14 12:08:05 [rasdaemon] Enabled event block:block_rq_complete_
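
The event names in this log are kernel tracepoints, and each tracepoint that exists in the running kernel shows up as a directory events/<system>/<event> below the tracing directory. A plausible reading of the "Can't get traces from ..." lines is that those tracepoints are simply not present in this kernel, which can be checked directly. A minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (older setups use /sys/kernel/debug/tracing) and that it is run as root; the function name ras_check_events is hypothetical:

```shell
# Report which of the tracepoints named in the rasdaemon log exist.
# $1 (optional) overrides the tracing directory.
ras_check_events() {
    local tracing="${1:-/sys/kernel/tracing}"
    local ev
    for ev in ras:mc_event ras:aer_event ras:non_standard_event \
              ras:arm_event ras:extlog_mem_event mce:mce_record \
              block:block_rq_complete net:net_dev_xmit_timeout \
              devlink:devlink_health_report; do
        # "ras:mc_event" maps to events/ras/mc_event
        if [ -d "$tracing/events/${ev%%:*}/${ev#*:}" ]; then
            echo "$ev present"
        else
            echo "$ev missing"
        fi
    done
}
```

Without root privileges the tracing directory is usually unreadable, so every event would be reported as missing.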


I have some doubts about the reliability of this tool/daemon. Is this how the output is supposed to look? Will it correctly report errors from EDAC and MCE, or am I left blind if I rely on it alone?
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &
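
As a cross-check that does not go through rasdaemon at all, the EDAC sysfs interface exposes per-controller counters of corrected (ce_count) and uncorrected (ue_count) errors. The sketch below just prints them; the function name edac_error_counts and its optional directory argument are assumptions made for illustration:

```shell
# Print corrected/uncorrected error counters for each EDAC memory
# controller. $1 (optional) overrides the sysfs directory.
edac_error_counts() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local mc ce ue
    for mc in "$base"/mc[0-9]*; do
        # Skip anything that does not expose both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count")
        echo "${mc##*/}: corrected=$ce uncorrected=$ue"
    done
}
```

If these counters rise while `ras-mc-ctl --errors` still shows no memory errors, that would suggest rasdaemon is missing events.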
stefantalpalaru
n00b


Joined: 11 Jan 2009
Posts: 24
Location: Italy

Posted: Tue Aug 30, 2022 10:17 am

This article covers everything you need to know: https://www.setphaserstostun.org/posts/monitoring-ecc-memory-on-linux-with-rasdaemon/

I would add that "ras-mc-ctl --status" only looks inside "/proc/modules" to see whether the relevant kernel modules are loaded. If those drivers have been compiled directly into the kernel instead of being built as modules, it will wrongly report "drivers not loaded".
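
To illustrate the difference, here is a rough sketch of a check that distinguishes a loaded module from a built-in driver. It assumes the kernel's modules.builtin list was installed under /lib/modules/$(uname -r)/, which may not be the case on every custom kernel; the function name driver_state is made up, and the module name amd64_edac is taken from the dmesg output above:

```shell
# Classify a driver as "module", "builtin" or "absent".
# $1 = module name
# $2 = list of loaded modules (default: /proc/modules)
# $3 = list of built-in modules (default: modules.builtin of this kernel)
driver_state() {
    local name="$1"
    local loaded="${2:-/proc/modules}"
    local builtin="${3:-/lib/modules/$(uname -r)/modules.builtin}"
    if [ -r "$loaded" ] && grep -q "^${name} " "$loaded"; then
        echo module
    # File names may use '-' where module names use '_', so normalize.
    elif [ -r "$builtin" ] && tr '-' '_' < "$builtin" | grep -q "/${name}\.ko\$"; then
        echo builtin
    else
        echo absent
    fi
}
```

On the machine from the first post, `driver_state amd64_edac` would be expected to print "builtin", which is the case that a check against /proc/modules alone cannot see.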
All times are GMT