Gentoo Forums
MCE Errors
Forum: Kernel & Hardware
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Thu Nov 02, 2017 5:10 pm    Post subject: MCE Errors

Hello everyone,

I've been getting a lot of "MCE Errors" thrown at me at boot and on all TTYs for the past day. I installed app-admin/mcelog and added CONFIG_X86_MCELOG_LEGACY=y to my kernel so I could look at them more easily, and it doesn't look good:

Code:
CPUID Vendor Intel Family 6 Model 58
mcelog: Trigger `cache-error-trigger' exited with status 1
mcelog: Trigger `cache-error-trigger' exited with status 1
mcelog: Cannot collect child 7516: No child processes
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8
MISC 3022024086 ADDR 1031680
TIME 1509642356 Thu Nov  2 18:05:56 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: yellow
Large number of corrected cache errors. System operating, but might lead
to uncorrected errors soon
MCA: corrected filtering (some unreported errors in same region)
Instruction CACHE Level-2 Instruction-Fetch Error
CPU 1 on socket 0 has large number of corrected cache errors in Level-2 Instruction
System operating correctly, but might lead to uncorrected cache errors soon
Cannot find sysfs cache for CPU 1Running trigger `cache-error-trigger'
STATUS cc56550000071152 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 8
MISC 3022024086 ADDR 1031680
TIME 1509642356 Thu Nov  2 18:05:56 2017
MCG status:
MCi status:
Error overflow
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: yellow
Large number of corrected cache errors. System operating, but might lead
to uncorrected errors soon
MCA: corrected filtering (some unreported errors in same region)
Instruction CACHE Level-2 Instruction-Fetch Error
CPU 0 on socket 0 has large number of corrected cache errors in Level-2 Instruction
System operating correctly, but might lead to uncorrected cache errors soon
Cannot find sysfs cache for CPU 0Running trigger `cache-error-trigger'
STATUS cc5654c000071152 MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
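
For anyone who wants to double-check what mcelog prints, the STATUS value in the log above can be decoded by hand. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM applies to this CPU; `decode_mci_status` and the lookup tables are names made up for illustration and are not part of mcelog.

```python
# Decode an IA32_MCi_STATUS value as printed on mcelog's "STATUS" line.
# Assumption: the architectural bit layout from the Intel SDM applies.
# decode_mci_status and the tables below are hypothetical helper names.

THRESHOLD = {0: "no tracking", 1: "green", 2: "yellow", 3: "reserved"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
CACHE_TYPE = {0: "instruction", 1: "data", 2: "generic", 3: "reserved"}
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}


def decode_mci_status(status: int) -> dict:
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # MCA error code, bits 15:0
    info = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "threshold": THRESHOLD[(status >> 53) & 0x3],  # bits 54:53
        "corrected_count": (status >> 38) & 0x7FFF,    # bits 52:38
        "mca_code": code,
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL.
    if (code & 0xEF00) == 0x0100:
        info["corrected_filtering"] = bool(code & 0x1000)
        info["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        info["cache_type"] = CACHE_TYPE[(code >> 2) & 0x3]
        info["cache_level"] = CACHE_LEVEL[code & 0x3]
    return info
```

Feeding it 0xcc56550000071152 gives overflow set, uncorrected clear, threshold status yellow, and an L2 instruction-fetch cache error with corrected filtering, which matches the text mcelog printed.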


Does anyone have experience with this kind of error? Does it mean my CPU is dying?

I've had two system freezes so far (once while in a VM and once while starting Firefox).

Appreciate any insight :)
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 6339
Location: almost Mile High in the USA

Posted: Thu Nov 02, 2017 5:20 pm

Yes, usually it's a bad sign if it had been working fine before.

If it has always thrown MCE errors, look into a BIOS update and try another kernel (or another distribution), mostly to rule out kernel config issues.

If it just started doing it with a known-stable configuration, check that the fan is clean and that the heatsink and thermal compound are in good shape, and stop overclocking or even underclock it. Otherwise you may well be looking at hardware replacements (motherboard, CPU, possibly PSU).
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Thu Nov 02, 2017 5:39 pm

Thanks for the reply eccerr0r (username almost checks out ! :mrgreen:)

It never threw this kind of error before. The thing is, sometimes it boots with the errors and sometimes without, even with no hardware change in between...

Indeed, some other posts suggested various overheating issues or a bad PSU. I cleared the dust, put another fan to work, and rechecked all connections. The errors are still there, but it seems my system can run (at least for what I've tried so far: light browsing).

I'll look into underclocking and yes, maybe it's time to keep an eye on thrift store deals.
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Thu Nov 02, 2017 5:52 pm

Damn, it's pretty crazy: htop and cat /proc/cpuinfo were showing 3 cores, then 2, then 1... instead of 4 (i5-3470).

Edit: OK, mcelog is disabling CPU cores due to L2 cache errors... makes sense.
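
To see which cores have been taken offline without eyeballing htop, the kernel exposes CPU lists under /sys/devices/system/cpu. A rough Python sketch, assuming the usual `present` and `online` files are there; `parse_cpu_list` and `offline_cpus` are hypothetical helper names:

```python
from pathlib import Path
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Expand a kernel CPU list such as '0-1,3' into [0, 1, 3]."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a file containing only a newline
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def offline_cpus(base: str = "/sys/devices/system/cpu") -> List[int]:
    """Return CPUs that are present in the system but not currently online."""
    present = parse_cpu_list(Path(base, "present").read_text())
    online = set(parse_cpu_list(Path(base, "online").read_text()))
    return [cpu for cpu in present if cpu not in online]
```

Calling `offline_cpus()` on the affected machine should list the cores that were taken offline, presumably by the cache-error-trigger script seen in the log.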
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 6339
Location: almost Mile High in the USA

Posted: Thu Nov 02, 2017 10:37 pm

I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Ant P.
Advocate

Joined: 18 Apr 2009
Posts: 4507

Posted: Thu Nov 02, 2017 11:07 pm

Ouch, that's bad. Neat that it manages to keep running like that though...
_________________
*.ebuild // /etc/service/*
krinn
Watchman

Joined: 02 May 2003
Posts: 5966

Posted: Fri Nov 03, 2017 1:52 am

eccerr0r wrote:
I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely.

Generally heat can lead to malfunction, which means randomness; when you always have the same core and always the same error type, well, you'd better put any bet on a lottery ticket...
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 6339
Location: almost Mile High in the USA

Posted: Fri Nov 03, 2017 2:54 am

Well, unless the problem is in the clock distribution or all cores share the same L2, a bug that shows up in one core should remain in that core.
However, in that second case, or with any other shared resource, the cores should know the error is coming from the L2/shared resource and is not the core's fault... that is, if the detection logic is smart, it should be disabling L2 chunks instead.

Of course you likely would not get the same error each time, but it should happen on the same core if it were a normal chip problem. Heating the chip is actually a "normal" problem, and the error should stay on the same core. All bets are off if it was a pre-production chip that had other issues that are not "normal" problems...
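
One way to check whether the errors really stay on the same core and bank is to tally the `CPU n BANK m` lines from the mcelog output. A small Python sketch, assuming the log format shown in the first post; `tally_mce` is a made-up helper name:

```python
import re
from collections import Counter

# Matches record lines such as "CPU 1 BANK 8", but not the summary lines
# that start with "CPU 1 on socket 0 ...".
RECORD = re.compile(r"^CPU (\d+) BANK (\d+)\s*$", re.MULTILINE)


def tally_mce(log_text: str) -> Counter:
    """Count machine-check records per (cpu, bank) pair in mcelog text."""
    return Counter(
        (int(cpu), int(bank)) for cpu, bank in RECORD.findall(log_text)
    )
```

On the two records quoted in the first post this gives one event for CPU 1, bank 8 and one for CPU 0, bank 8; a longer log would show whether a single core dominates.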
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Fri Nov 03, 2017 8:26 am

Quote:
Ouch, that's bad. Neat that it manages to keep running like that though...


Yes, pretty neat to see the CPU core hot-plugging functionality working flawlessly =)

Quote:
I'd look into M/B or PSU issues and not the CPU if more than one core is failing... though it's still possible for a CPU failure, it's not likely.


I also think there's a good chance it's the PSU, but right now I lack another compatible one to test.

Quote:
Generally heat can lead to malfunction, which means randomness; when you always have the same core and always the same error type, well, you'd better put any bet on a lottery ticket...


Indeed, and the thing is, right now my temps are just fine.

After the cleaning, I was able to play The Witcher 3 for 2 hours in a Windows VM (via KVM/VFIO), which I guess is stressful enough for the hardware. No problems at all, but the errors are still there.
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Thu Nov 23, 2017 9:31 pm

Ok so since then I got one dead motherboard... it just abruptly stopped working, never to POST again.

I purchased a second MB (different chipset but compatible) to try to salvage the hardware (it's an i5 and 16GB of RAM after all), but the MCE errors are still present after setting up the new MB with my old Gentoo install.

Time for another CPU, I guess?
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 39261
Location: 56N 3W

Posted: Thu Nov 23, 2017 10:11 pm

Myu,

If you have moved all the bits to a new motherboard, that does not leave much.
That assumes new, not just different.

The CPU cache has single bit error detection and correction, hence it can work correctly with one bit errors.
Two bit errors will always be detected but cannot be corrected, so the core will be shut down.
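
To illustrate the single-error-correct, double-error-detect idea, here is a toy extended Hamming(8,4) code in Python. It is only a sketch of the principle: the actual code used in the CPU's cache is not described in this thread and almost certainly protects wider words; `encode` and `decode` are illustrative names.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming word.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (1, 2 and 4 are parity, 3, 5, 6 and 7 are data).
    """
    word = [0] * 8
    word[3] = nibble & 1
    word[5] = (nibble >> 1) & 1
    word[6] = (nibble >> 2) & 1
    word[7] = (nibble >> 3) & 1
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2
    return word


def decode(word: list) -> tuple:
    """Return (nibble, status); nibble is None when uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(word) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the bad position
        # (0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status
```

Flipping any one bit of an encoded word still decodes to the original value with status "corrected", while flipping any two bits is reported as "uncorrectable".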
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
krinn
Watchman

Joined: 02 May 2003
Posts: 5966

Posted: Fri Nov 24, 2017 12:22 am

Myu wrote:
Ok so since then I got one dead motherboard...

Might also be worth seeing some exorcist :)
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Fri Nov 24, 2017 6:38 am

@NeddySeagoon

Yes indeed, I moved all the bits onto the new MB, but it's not a brand-new one, since those are hard to find new at a decent price these days (Intel Core 3rd Gen).

Most of the time the system works fine, correcting errors along the way according to mcelog (as you said, under a certain limit, all goes well), but I'm not keen on keeping it like that if it has a tendency to fry my hardware eventually :o

I have another test in mind: swap the CPU for another socket-compatible one I have lying around. If I stop getting MCEs, the i5 is probably dead; if not, another part is faulty.

@krinn

:mrgreen: I know right, this is getting higher and higher on my troubleshooting list
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 39261
Location: 56N 3W

Posted: Fri Nov 24, 2017 10:12 am

Myu,

Look at the Vcore regulator on the motherboard next to the CPU. It takes in the 12 V from the PSU, via the dedicated connector with the black and yellow wires, and converts it to the voltages used by the CPU and RAM. Look for bulging or leaking capacitors.
When they begin to fail, the CPU operating voltages are no longer properly regulated.

A motherboard swap and a CPU swap are both good tests.

You would be unlucky to get two failing Vcore regulators.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Myu
Tux's lil' helper

Joined: 22 Oct 2014
Posts: 130
Location: Belgium

Posted: Fri Nov 24, 2017 5:43 pm

Hello Neddy,

From what I can tell, the VRMs on the MB look sane; there's nothing unusual about them, which is good.

I swapped the CPU and so far there's not a single MCE error in sight. I'll keep monitoring this, but it seems a clear indication that it's the CPU after all.

Thanks for your help, much appreciated as always :)
_________________
Gentoo stable (with bits of ~amd64) / Games ! (Linux & vfio-pci ) // Xfce

Feel free to PM me if you would like a simple ebuild and I'll see what I can do :]
All times are GMT