Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Thu Nov 02, 2017 5:10 pm Post subject: MCE Errors |
|
|
Hello everyone,
I got a lot of "MCE Errors" thrown at me at boot and on all TTYs for the past day. I installed app-admin/mcelog and added CONFIG_X86_MCELOG_LEGACY=y to my kernel to be able to look at them at ease, and it doesn't look good:
Code: | CPUID Vendor Intel Family 6 Model 58
mcelog: Trigger `cache-error-trigger' exited with status 1
mcelog: Trigger `cache-error-trigger' exited with status 1
mcelog: Cannot collect child 7516: No child processes
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MISC 3022024086 ADDR 1031680
TIME 1509642356 Thu Nov 2 18:05:56 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: yellow
Large number of corrected cache errors. System operating, but might lead
to uncorrected errors soon
MCA: corrected filtering (some unreported errors in same region)
Instruction CACHE Level-2 Instruction-Fetch Error
CPU 1 on socket 0 has large number of corrected cache errors in Level-2 Instruction
System operating correctly, but might lead to uncorrected cache errors soon
Cannot find sysfs cache for CPU 1Running trigger `cache-error-trigger'
STATUS cc56550000071152 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 8
MISC 3022024086 ADDR 1031680
TIME 1509642356 Thu Nov 2 18:05:56 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: yellow
Large number of corrected cache errors. System operating, but might lead
to uncorrected errors soon
MCA: corrected filtering (some unreported errors in same region)
Instruction CACHE Level-2 Instruction-Fetch Error
CPU 0 on socket 0 has large number of corrected cache errors in Level-2 Instruction
System operating correctly, but might lead to uncorrected cache errors soon
Cannot find sysfs cache for CPU 0Running trigger `cache-error-trigger'
STATUS cc5654c000071152 MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58 |
Does anyone have experience with this kind of error? Does it mean my CPU is dying?
I got two system freezes so far (once while in a VM and once while starting Firefox).
Appreciate any insight _________________ Gentoo stable with bits of ~amd64 // Xfce 4.13 + Compiz Reloaded. |
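For readers who want to cross-check what mcelog printed, the STATUS value can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (valid/overflow/uncorrected flags in the top bits, threshold status in bits 54:53, corrected error count in bits 52:38, and the cache hierarchy error code format 000F 0001 RRRR TTLL in the low 16 bits). The function and table names are made up for illustration, and the threshold and count fields are only meaningful when MCGCAP advertises them.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed on mcelog's "STATUS" line.
# Bit positions are assumed from the Intel SDM's architectural layout;
# all names here are hypothetical helpers, not part of mcelog.

REQUESTS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPES = {0: "instruction", 1: "data", 2: "generic"}
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
THRESHOLD = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "threshold": THRESHOLD[(status >> 53) & 0x3],
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
        "cache": None,
    }
    # Cache hierarchy errors look like 000F 0001 RRRR TTLL (F = filtering bit).
    if (mca_code & 0xEF00) == 0x0100:
        info["cache"] = {
            "filtered": bool((mca_code >> 12) & 1),
            "request": REQUESTS.get((mca_code >> 4) & 0xF, "unknown"),
            "type": TYPES.get((mca_code >> 2) & 0x3, "unknown"),
            "level": LEVELS[mca_code & 0x3],
        }
    return info


if __name__ == "__main__":
    for raw in ("cc56550000071152", "cc5654c000071152"):
        print(raw, decode_status(int(raw, 16)))
```

Run against the two STATUS values in the log above, this should agree with mcelog's own text: overflow set, corrected (not uncorrected), threshold yellow, and an instruction-fetch error in the L2 instruction cache with corrected filtering.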
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9678 Location: almost Mile High in the USA
|
Posted: Thu Nov 02, 2017 5:20 pm Post subject: |
|
|
Yes, usually it's a bad sign if it had been working before.
If it has always been throwing MCE errors, you should look into a new BIOS and try another kernel (or another distribution), mostly to rule out kernel config issues.
If it just started doing it with a known stable configuration, check the fan for cleanliness and the heatsink/thermal compound, and stop overclocking or even underclock it. Otherwise you may well be looking at hardware replacements (motherboard, CPU, possibly PSU). _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Thu Nov 02, 2017 5:39 pm Post subject: |
|
|
Thanks for the reply, eccerr0r (username almost checks out!)
It never threw this kind of error before. The thing is, sometimes it boots with the errors and sometimes without, with no hardware change in between...
Indeed, some other posts suggested various overheating issues or a bad PSU. I cleared the dust, put another fan to work and rechecked all connections. The errors are still there, but it seems my system can run (at least for what I tried so far: light browsing).
I'll look into underclocking and yes, maybe it's time to keep an eye on thrift store deals.
Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Thu Nov 02, 2017 5:52 pm Post subject: |
|
|
Damn, it's pretty crazy: htop and cat /proc/cpuinfo were showing 3 cores, then 2, then 1... instead of 4 (i5-3470).
Edit: OK, mcelog is disabling CPU cores due to the L2 cache errors... makes sense.
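A quick way to see which cores have been taken offline, without eyeballing htop, is to compare the kernel's list of present CPUs with its list of online CPUs. This is a minimal sketch, assuming a Linux system that exposes `/sys/devices/system/cpu/present` and `/sys/devices/system/cpu/online`; the function names are hypothetical.

```python
# Sketch: list CPUs that are present but currently offline, using sysfs.
# Assumes Linux; parse_cpu_list and offline_cpus are illustrative names.
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-1,3' into [0, 1, 3]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def offline_cpus(base: str = "/sys/devices/system/cpu") -> list[int]:
    """Return the CPUs the kernel knows about that are not online."""
    root = Path(base)
    present = parse_cpu_list((root / "present").read_text())
    online = parse_cpu_list((root / "online").read_text())
    return sorted(set(present) - set(online))


if __name__ == "__main__":
    try:
        print("offline CPUs:", offline_cpus())
    except OSError as exc:
        print("could not read sysfs:", exc)
```

On a healthy quad-core the list should be empty; on a box in the state described above, the cores that were taken offline would be expected to show up here.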
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9678 Location: almost Mile High in the USA
|
Posted: Thu Nov 02, 2017 10:37 pm Post subject: |
|
|
I'd look into M/B or PSU issues rather than the CPU if more than one core is failing... a CPU failure is still possible, but it's not likely.
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Thu Nov 02, 2017 11:07 pm Post subject: |
|
|
Ouch, that's bad. Neat that it manages to keep running like that though... |
krinn Watchman
Joined: 02 May 2003 Posts: 7470
|
Posted: Fri Nov 03, 2017 1:52 am Post subject: |
|
|
eccerr0r wrote: | I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely. |
Generally heat leads to malfunction, which means randomness. When it's always the same core and always the same error type, well, you'd better put your bets on a lottery ticket...
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9678 Location: almost Mile High in the USA
|
Posted: Fri Nov 03, 2017 2:54 am Post subject: |
|
|
Well, unless the problem is in the clock distribution or all cores share the same L2, a bug that shows up in one core should remain in that core.
However, in that second case, or with any other shared resource, the cores should know an error is coming from the L2/shared resource and that it's not the core's fault... that is, if the detection logic is smart, it should be disabling L2 chunks instead.
Of course you likely would not get the same error each time, but it should happen on the same core if it were a normal chip problem. Heating the chip is actually a "normal" problem and the error should stay on the same core. All bets are off if it was a pre-production chip that had other issues that are not "normal" problems...
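To check whether the errors really stay on one core or bank, the `CPU n BANK m` lines in mcelog's text output can be tallied. A minimal sketch, assuming the record format shown in the first post; the `tally` helper is a made-up name, and where the log text comes from (a file, a pipe) is left to the reader.

```python
# Sketch: count machine-check records per (CPU, bank) from mcelog text output.
# Only lines of the form "CPU <n> BANK <m>" are counted; tally is a
# hypothetical helper written for this example.
import re
from collections import Counter

RECORD = re.compile(r"^[ \t]*CPU (\d+) BANK (\d+)", re.MULTILINE)


def tally(log_text: str) -> Counter:
    """Return a Counter keyed by (cpu, bank)."""
    return Counter((int(cpu), int(bank)) for cpu, bank in RECORD.findall(log_text))


# Excerpt of the output posted at the top of the thread.
SAMPLE = """\
MCE 0
CPU 1 BANK 8
MISC 3022024086 ADDR 1031680
MCE 1
CPU 0 BANK 8
MISC 3022024086 ADDR 1031680
CPU 0 on socket 0 has large number of corrected cache errors in Level-2 Instruction
"""

if __name__ == "__main__":
    for (cpu, bank), count in sorted(tally(SAMPLE).items()):
        print(f"CPU {cpu} bank {bank}: {count} record(s)")
```

In the posted excerpt both records come from bank 8 but from different CPUs (0 and 1), which is the pattern being debated here.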
Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Fri Nov 03, 2017 8:26 am Post subject: |
|
|
Quote: | Ouch, that's bad. Neat that it manages to keep running like that though... |
Yes, pretty neat to see the core hot-plugging functionality working flawlessly =)
Quote: | I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely. |
I also think there's a good chance it's the PSU, but right now I lack another compatible one to test.
Quote: | Generally heat leads to malfunction, which means randomness. When it's always the same core and always the same error type, well, you'd better put your bets on a lottery ticket... |
Indeed, the thing is that right now my temps are just fine.
After the cleaning, I've been able to play The Witcher 3 for 2 hours in a Windows VM (via KVM/VFIO), which I guess is stressful enough for the hardware. No problems at all, but the errors are still there.
Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Thu Nov 23, 2017 9:31 pm Post subject: |
|
|
OK, so since then I got one dead motherboard... it just abruptly stopped working and never made it through POST again.
I purchased a second MB (different chipset but compatible) to try to salvage the hardware (it's an i5 and 16GB of RAM after all), but the MCE errors are still present after setting up the new MB with my old Gentoo install.
Time for another CPU, I guess?
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54232 Location: 56N 3W
|
Posted: Thu Nov 23, 2017 10:11 pm Post subject: |
|
|
Myu,
If you have moved all the bits to a new motherboard, it does not leave much.
That assumes new, not just different.
The CPU cache has single bit error detection and correction, hence it can work correctly with one bit errors.
Two bit errors will always be detected but cannot be corrected, so the core will be shut down. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
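The "correct one bit, detect two bits" behaviour described in this post is what a SECDED code provides. Below is a toy sketch using an extended Hamming(8,4) code on a 4-bit value. It only illustrates the principle: real cache ECC works on much wider words, and nothing here is claimed to match Intel's actual implementation. The `encode` and `decode` names are made up for the example.

```python
# Toy SECDED demo: extended Hamming(8,4).
# word[0] is an overall parity bit, word[1..7] is a Hamming(7,4) codeword
# with parity bits at positions 1, 2, 4 and data bits at 3, 5, 6, 7.
# Illustration only, not how any particular CPU lays out its cache ECC.


def encode(nibble: int) -> list[int]:
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2  # make the total number of set bits even
    return word


def decode(received: list[int]):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    word = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume one. Syndrome 0 means word[0] itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with good overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status


if __name__ == "__main__":
    good = encode(0b1011)
    one_flip = list(good)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(good), decode(one_flip), decode(two_flips))
```

A single flipped bit is repaired silently, which is what the corrected errors in the log represent; a second flip in the same word can only be reported, not fixed.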
krinn Watchman
Joined: 02 May 2003 Posts: 7470
|
Posted: Fri Nov 24, 2017 12:22 am Post subject: |
|
|
Myu wrote: | Ok so since then I got one dead motherboard... |
Might also be worth seeing some exorcist |
Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Fri Nov 24, 2017 6:38 am Post subject: |
|
|
@NeddySeagoon
Yes indeed, I moved all the bits onto the new MB, but it's not a brand new one since it's hard to find those new at a decent price these days (Intel Core 3rd Gen).
Most of the time the system works fine, correcting errors along the way according to mcelog (as you said, under a certain limit all goes well), but I'm not keen on keeping it like that if it has a tendency to fry my hardware eventually.
I have another test in mind: swap the CPU for another socket-compatible one I have lying around. If I stop getting MCEs, the i5 is probably dead; if not, another part is faulty.
@krinn
I know, right? This is getting higher and higher on my troubleshooting list.
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54232 Location: 56N 3W
|
Posted: Fri Nov 24, 2017 10:12 am Post subject: |
|
|
Myu,
Look at the Vcore regulator on the motherboard next to the CPU. It takes in the 12 V from the PSU, via the dedicated connector with the black and yellow wires, and converts it to the voltages used by the CPU and RAM. Look for bulging or leaking capacitors.
When they begin to fail the CPU operating voltages are no longer properly regulated.
A motherboard swap and a CPU swap are both good tests.
You would be unlucky to get two failing Vcore regulators.
Myu Apprentice
Joined: 22 Oct 2014 Posts: 164 Location: Belgium
|
Posted: Fri Nov 24, 2017 5:43 pm Post subject: |
|
|
Hello Neddy,
From what I can tell, the VRMs on the MB look sane, nothing special about them, which is good.
I swapped the CPU and so far there's not a single MCE error in sight. I'll keep monitoring it, but it seems a clear indication that it's the CPU after all.
Thanks for your help, much appreciated as always.