Gentoo Forums
Hardware Error?
Zepp
Posted: Sun Sep 14, 2008 3:31 am    Post subject: Hardware Error?

My computer crashed several times today, in both Windows and Linux, so I was suspecting a hardware error. Well, it crashed again in Linux, and when I tried to shut it down and bring it back up I got this:


Code:

...
HARDWARE ERROR
CPU 0: Machine Check Exception         4 Bank 4: b200000000070f0f
TSC 1b51cc9213
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel Panic - not syncing: Machine check


Umm what is broken? Did my computer just go up in a ball of flames? :|
Akkara
Posted: Sun Sep 14, 2008 6:27 am

It is possible the machine just died.

But usually there's a cause.

Was there anything different going on than usual? Warmer than usual? A lot more humid? Power spike? Recently changed any hardware? Bumped into the case? Dropped something into the case? Ventilation still OK? Dust accumulation?

Make sure all the connectors are fully inserted and try wiggling them. Try re-seating the RAM, and if it still machine-checks, reseating the processor.
Zepp
Posted: Sun Sep 14, 2008 2:02 pm

Akkara wrote:
It is possible the machine just died.

But usually there's a cause.

Was there anything different going on than usual? Warmer than usual? A lot more humid? Power spike? Recently changed any hardware? Bumped into the case? Dropped something into the case? Ventilation still OK? Dust accumulation?

Make sure all the connectors are fully inserted and try wiggling them. Try re-seating the RAM, and if it still machine-checks, reseating the processor.


It's been a bit warmer but nothing significant, and it's connected to a UPS so I hope it wasn't a surge. I haven't changed anything, and there were no big bumps or anything really.

I ran memtest86 last night. It had completed 6 passes when I got up and is still going, with no failures.
Zepp
Posted: Sun Sep 14, 2008 5:15 pm

Hmm, so I am trying just the second stick of RAM now and it hasn't MCE'd yet; it's been about 30 minutes. What are the odds it passed memtest86 7 full passes but still had a bad stick of memory?
Akkara
Posted: Mon Sep 15, 2008 2:09 am

Quote:
What are the odds it passed memtest86 7 full passes but still had a bad stick of memory?


I've seen that happen. Try the user-level memory tester, memtester (it's in portage). Run it as root, and have it test all but a few hundred MB (or however much a freshly booted system plus X, or whatever you run, ends up using).

Edit/addendum: also it might not be a bad stick. Are you overclocking anything? Try turning it down some.

Or mobo capacitors might be starting to age and the Vcore or Vmemory regulator could be getting iffy.
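
To make the "all but a few hundred MB" suggestion concrete, here is a small Python sketch that works out a memtester command line from the contents of /proc/meminfo. The helper name `memtester_command` and the 300 MB default reserve are assumptions made up for this illustration; adjust the reserve to whatever your freshly booted system actually uses. The sketch only builds the argument list and does not run anything.

```python
def memtester_command(meminfo_text, reserve_mb=300, loops=1):
    """Build a memtester argument list that leaves reserve_mb untested.

    meminfo_text is the text of /proc/meminfo. reserve_mb is a guess
    (an assumption, not a measured value) at what the running system needs.
    """
    total_kb = None
    for line in meminfo_text.splitlines():
        # The line looks like: "MemTotal:        2097152 kB"
        if line.startswith("MemTotal:"):
            total_kb = int(line.split()[1])
            break
    if total_kb is None:
        raise ValueError("MemTotal not found in meminfo text")

    test_mb = total_kb // 1024 - reserve_mb
    if test_mb <= 0:
        raise ValueError("reserve is larger than installed memory")

    # memtester takes the amount to test (with a unit suffix) and a loop count.
    return ["memtester", f"{test_mb}M", str(loops)]
```

Feed it the text of /proc/meminfo, then run the resulting command as root so memtester can lock the memory it tests.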
Zepp
Posted: Mon Sep 15, 2008 4:29 am

Akkara wrote:
Quote:
What are the odds it passed memtest86 7 full passes but still had a bad stick of memory?


I've seen that happen. Try the user-level memory tester, memtester (it's in portage). Run it as root, and have it test all but a few hundred MB (or however much a freshly booted system plus X, or whatever you run, ends up using).

Edit/addendum: also it might not be a bad stick. Are you overclocking anything? Try turning it down some.

Or mobo capacitors might be starting to age and the Vcore or Vmemory regulator could be getting iffy.


I don't overclock anything.
eccerr0r
Posted: Mon Sep 15, 2008 8:23 pm

Just for reference, a machine check exception (MCE) is raised when the processor finds itself in a state it cannot recover from. A lot of the time it's triggered by an error in a cache tag or in one of the parity-protected state tables on the CPU. The usual causes are bad power or cooling, a cosmic ray strike, CPU failure (overclocking falls into this bucket), or *really* poorly written software. Definitely try to rule out what you can; when an MCE occurs, the machine is usually stuck in a state where you can't do much post-mortem without special tools.

I'd definitely start out by checking power supply, and motherboard capacitors... how old is the m/b?

Usually memory without ECC or parity protection would not produce an MCE, as it has no way to determine whether a fatal bit flip occurred. However, there may be a chipset flag that tells the CPU to take a machine check, depending on how the CPU was programmed to behave. That is kind of iffy, though: on a commodity machine, not much is protected from random bit flips that could propagate into the CPU as a fatal error.
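
As a rough illustration of what the status word in the panic message encodes, here is a minimal Python sketch that splits `b200000000070f0f` into the architectural flag bits of an x86 MCi_STATUS register. The function name `decode_mci_status` and the result layout are made up for this example, and this is not how mcelog works internally. The meaning of the two lower error-code fields is vendor- and bank-specific, so treat the output as a hint rather than a diagnosis.

```python
# Architectural flag bits in the top byte of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds the failing address
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status_hex):
    """Split a 64-bit MCi_STATUS value (hex string) into flags and codes."""
    value = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    return {
        "flags": flags,
        # Bits 15:0 hold the MCA error code.
        "mca_error_code": value & 0xFFFF,
        # Bits 31:16 hold a model-specific (extended) error code.
        "model_specific_code": (value >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_mci_status("b200000000070f0f"))
```

For the value in the first post this reports VAL, UC, EN and PCC set, with an MCA error code of 0x0f0f and a model-specific code of 0x0007. In other words, it was a valid, uncorrected error with the processor context marked corrupt, which fits the kernel panicking instead of carrying on.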
Zepp
Posted: Mon Sep 15, 2008 8:40 pm

eccerr0r wrote:
Just for reference, a machine check exception (MCE) is raised when the processor finds itself in a state it cannot recover from. A lot of the time it's triggered by an error in a cache tag or in one of the parity-protected state tables on the CPU. The usual causes are bad power or cooling, a cosmic ray strike, CPU failure (overclocking falls into this bucket), or *really* poorly written software. Definitely try to rule out what you can; when an MCE occurs, the machine is usually stuck in a state where you can't do much post-mortem without special tools.

I'd definitely start out by checking power supply, and motherboard capacitors... how old is the m/b?

Usually memory without ECC or parity protection would not produce an MCE, as it has no way to determine whether a fatal bit flip occurred. However, there may be a chipset flag that tells the CPU to take a machine check, depending on how the CPU was programmed to behave. That is kind of iffy, though: on a commodity machine, not much is protected from random bit flips that could propagate into the CPU as a fatal error.


The entire computer was purchased in April 2006. I looked at the capacitors on the motherboard and didn't notice any that looked damaged, but beyond that I am not sure how to test whether it is the motherboard or the PSU.
All times are GMT