Gentoo Forums
Linux identification of hardware errors
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Mon May 06, 2013 1:40 am    Post subject: Linux identification of hardware errors

Is this a capability of Linux / OS tools, or is it unique to enterprise hardware capable of providing the information?

I'm used to Solaris & HP-UX having this information available, but when it came to an RHEL server, I was at a loss.

What do you use to track down hardware problems on Linux systems?
_________________
lolgov. 'cause where we're going, you don't have civil liberties.

In Loving Memory
1787 - 2008
notageek
Tux's lil' helper
Tux's lil' helper


Joined: 05 Jun 2008
Posts: 120
Location: Bangalore, India

Posted: Mon May 06, 2013 2:13 am

Usually only dmesg logs.
_________________
The problem is not the problem. The problem is your attitude about the problem. Do you understand? --Capt Jack Sparrow.
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Mon May 06, 2013 2:35 am

Does it specifically report memory errors for example, or is it more vague? What about CPU errors?
_________________
notageek
Tux's lil' helper
Tux's lil' helper


Joined: 05 Jun 2008
Posts: 120
Location: Bangalore, India

Posted: Mon May 06, 2013 2:52 am

It will report memory, CPU, HDD, or any other error. Whether the message is vague or specific depends on the driver running the hardware, since that is where the message comes from, and on how much you know about the Linux kernel, or at least how its code is organized. You'll have to figure out whether it is the hardware or the driver.

For disks, it will report a bunch of information that will help you determine if it's bad.

On CPU for instance, I posted quite a while back about not all of my cores showing up. I used dmesg to troubleshoot; the issue is still unresolved, though.
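
As a starting point for search terms, here is a minimal Python sketch that filters kernel log lines for strings that commonly accompany hardware trouble. The pattern list is an assumption made for illustration, not a complete or authoritative catalogue; the exact wording of kernel messages varies by kernel version and driver, so treat a match as a hint to read the surrounding lines, not as a diagnosis. The names `HARDWARE_PATTERNS`, `classify_line`, and `scan_log` are made up for this example.

```python
import re

# Illustrative patterns only (an assumption, not a complete list); the exact
# wording of kernel messages varies by kernel version and driver.
HARDWARE_PATTERNS = {
    "machine check": re.compile(r"machine check|\bmce\b|hardware error", re.I),
    "memory (EDAC)": re.compile(r"\bEDAC\b"),
    "disk I/O": re.compile(r"I/O error|medium error|\bata\d+.*(exception|failed)", re.I),
    "PCIe AER": re.compile(r"\bAER\b|PCIe Bus Error", re.I),
}


def classify_line(line):
    """Return the names of all pattern groups that match one log line."""
    return [name for name, pattern in HARDWARE_PATTERNS.items() if pattern.search(line)]


def scan_log(lines):
    """Yield (categories, line) for every line that matches at least one group."""
    for line in lines:
        categories = classify_line(line)
        if categories:
            yield categories, line.rstrip("\n")
```

To use it, save the output of dmesg (or a syslog file) to a text file and pass the open file object to `scan_log`; each result pairs the matched categories with the original line.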
_________________
energyman76b
Advocate
Advocate


Joined: 26 Mar 2003
Posts: 2026
Location: Germany

Posted: Mon May 06, 2013 5:20 pm

CPU: mce - available in consumer hardware
memory: ECC - available in consumer hardware
pcie has error reporting facilities - but I don't know if they are available in consumer hardware.
_________________
AidanJT wrote:

Libertardian denial of reality is wholly unimpressive and unconvincing, and simply serves to demonstrate what a bunch of delusional fools they all are.

Satan's got perfectly toned abs and rocks a c-cup.
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Mon May 06, 2013 11:34 pm

notageek wrote:
Whether the message is vague or specific depends on the driver running the hardware, since that is where the message comes from, and on how much you know about the Linux kernel, or at least how its code is organized. You'll have to figure out whether it is the hardware or the driver.
OK, that doesn't really help. Are you aware of any example search terms? "There might or might not be something reported" isn't exactly helpful, so I'm trying to identify what exactly it is I need to know so I may identify the errors.

notageek wrote:
On CPU for instance, I posted quite a while back about not all of my cores showing up. I used dmesg to troubleshoot; the issue is still unresolved, though.
I'll see if I can find the thread.

Thanks.


energyman76b wrote:
CPU: mce - available in consumer hardware
memory: ECC - available in consumer hardware
pcie has error reporting facilities - but I don't know if they are available in consumer hardware.
Right, but that doesn't explain how to track down or observe reported errors from linux. Are the errors reported through the kernel to system logs, or some other inconsistent solution?
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Mon May 06, 2013 11:59 pm

notageek wrote:
On CPU for instance, I posted quite a while back about not all of my cores showing up. I used dmesg to troubleshoot; the issue is still unresolved, though.
This one? Not all CPU cores detected intermittently.

Interesting. That it doesn't occur with Fedora 11 makes it appear to be a kernel issue.

I'd have to see if I can track down an example, but I was thinking more along the lines of bit errors which could be memory or possibly a CPU. Otherwise both memory & CPU are functional, at least until the error is encountered another time.
_________________
PaulBredbury
Watchman
Watchman


Joined: 14 Jul 2005
Posts: 7310

Posted: Tue May 07, 2013 12:02 am

pjp wrote:
to system logs

mcelog outputs to the system log (although this is customizable), e.g.:
Code:
mcelog: failed to prefill DIMM database from DMI data
notageek
Tux's lil' helper
Tux's lil' helper


Joined: 05 Jun 2008
Posts: 120
Location: Bangalore, India

Posted: Tue May 07, 2013 1:26 am

pjp wrote:
notageek wrote:
On CPU for instance, I posted quite a while back about not all of my cores showing up. I used dmesg to troubleshoot; the issue is still unresolved, though.
This one? Not all CPU cores detected intermittently.

Interesting. That it doesn't occur with Fedora 11 makes it appear to be a kernel issue.

I'd have to see if I can track down an example, but I was thinking more along the lines of bit errors which could be memory or possibly a CPU. Otherwise both memory & CPU are functional, at least until the error is encountered another time.
That was the example.

Yes, it is a kernel issue (probably) or a hardware issue. I was castigated in these forums, when I suggested Fedora has an almost proprietary kernel.

The other point is that how useful it is depends on the message. dmesg gets your work done most of the time, and if you see an obscure error, it is either a hardware issue or a driver/kernel issue.
_________________
Bones McCracker
Veteran
Veteran


Joined: 14 Mar 2006
Posts: 1564
Location: U.S.A.

Posted: Tue May 07, 2013 2:18 am

energyman76b wrote:
CPU: mce - available in consumer hardware
memory: ECC - available in consumer hardware
pcie has error reporting facilities - but I don't know if they are available in consumer hardware.

There are any number of interacting means (buses and protocols) by which hardware errors may be internally or externally communicated by a PC (ECC, MCE, ACPI, SMBus, PMBus, I2C, DMI, SNMP, WMI, IPMI, etc.). The bottom line from a single machine user perspective is that it all gets dumped into the logs. From a multi-machine admin perspective, the enterprise tools can collect and handle it (gathering from DMI, SNMP, WMI, IPMI, and network logging, and handling by any of the various systems management suites).
_________________
juniper wrote:
I use ubuntu, which is why I am posting here.
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Tue May 07, 2013 2:25 am

@PaulBredbury: Thanks. I'll put mcelog on my list.


notageek wrote:
I was castigated in these forums, when I suggested Fedora has an almost proprietary kernel.
IMO RH is very proprietary-like. CentOS is not RHEL, so RHEL is not truly available. Things which work on RHEL do NOT always work on CentOS, further proving the point (as far as I'm concerned). I wouldn't be shocked if they tweaked the kernel with "inside knowledge" which was still "made available" even if obscure. I'm not a fan of RH. IMO they meet the "letter of the law" but not the intent.


notageek wrote:
The other point is that how useful it is depends on the message. dmesg gets your work done most of the time, and if you see an obscure error, it is either a hardware issue or a driver/kernel issue.
That makes a little more sense. I'll see if I can find some examples.
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Tue May 07, 2013 2:29 am

BoneKracker wrote:
The bottom line from a single machine user perspective is that it all gets dumped into the logs.
If true, then it should be a matter of just identifying how the information is logged, which seems difficult to track down. Admittedly I haven't spent a long time searching, and I don't have an actual error I'm looking for, but my initial searches didn't reveal much that was useful. I'll have to spend some time guessing at keywords.
_________________
Bones McCracker
Veteran
Veteran


Joined: 14 Mar 2006
Posts: 1564
Location: U.S.A.

Posted: Tue May 07, 2013 3:07 am

There are error injectors you can load as a module and then use to simulate some types of hardware errors. Some log analysis programs come with examples, and people share their rules. Most commercial systems management products come preconfigured to deal with common problems.

I'm not really clear on what you're trying to accomplish.
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Tue May 07, 2013 3:21 am

I can think of two examples, both memory related, mainly because those are most common. One is when a random reboot occurs and memory is the culprit. The server otherwise runs fine, until a sudden reboot.

On Solaris, I can track the number of errors on a particular memory module so I know exactly which module to replace. Recurring instances could be an indication of a bad CPU. The module is also easily associated with a specific CPU.

I'm trying to identify similar means of identifying hardware problems under linux.
_________________
Bones McCracker
Veteran
Veteran


Joined: 14 Mar 2006
Posts: 1564
Location: U.S.A.

Posted: Tue May 07, 2013 3:36 am

Okay, I suggest you read the kernel's documentation pertaining to Error Detection And Correction.
Code:
/usr/src/linux/Documentation/edac.txt

Then, go here:
http://buttersideup.com/edacwiki/Main_Page

Then Google-hunt for more recent information on the tools and terms mentioned.
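
For the per-module error counts asked about earlier, here is a rough Python sketch that reads the counters the EDAC subsystem exposes through sysfs. It assumes an EDAC driver for the memory controller is loaded and that the tree lives under /sys/devices/system/edac/mc with per-controller `ce_count` / `ue_count` files and `csrow*` subdirectories; the layout differs between kernel versions, so check what your system actually exposes. The function names are made up for this example.

```python
from pathlib import Path

# Default sysfs location exposed by the kernel's EDAC subsystem; whether it is
# populated depends on the hardware and on an EDAC driver being loaded.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_count(path):
    """Read an integer counter file, returning None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_error_counts(root=EDAC_MC_ROOT):
    """Map each memory controller (mc0, mc1, ...) to its CE/UE totals and per-csrow CE counts."""
    report = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        report[mc.name] = {
            "ce_count": read_count(mc / "ce_count"),  # corrected errors
            "ue_count": read_count(mc / "ue_count"),  # uncorrected errors
            "csrows": {
                row.name: read_count(row / "ce_count")
                for row in sorted(mc.glob("csrow[0-9]*"))
            },
        }
    return report
```

An empty result most likely means no EDAC driver is loaded for the memory controller. A corrected-error count that keeps climbing on one csrow is the closest analogue here to the per-module counters described for Solaris, though mapping a csrow to a physical slot still depends on the board.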
_________________
energyman76b
Advocate
Advocate


Joined: 26 Mar 2003
Posts: 2026
Location: Germany

Posted: Tue May 07, 2013 4:32 pm

pjp wrote:
I can think of two examples, both memory related, mainly because those are most common. One is when a random reboot occurs and memory is the culprit. The server otherwise runs fine, until a sudden reboot.

On Solaris, I can track the number of errors on a particular memory module so I know exactly which module to replace. Recurring instances could be an indication of a bad CPU. The module is also easily associated with a specific CPU.

I'm trying to identify similar means of identifying hardware problems under linux.


random reboot = triple fault. There is nothing to log because the reset happens automatically, outside of the kernel's control.
http://en.wikipedia.org/wiki/Triple_fault
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Tue May 07, 2013 11:38 pm

Thank you, both.

The triple fault sounds less like hardware...
Quote:
Possible causes of triple faults

Triple faults indicate a problem with the operating system kernel or device drivers. In modern operating systems, a triple fault is typically caused by a buffer overflow or underflow in a device driver which writes over the interrupt descriptor table. When the next interrupt happens, the processor cannot call either the needed interrupt handler or the double fault handler because the descriptors in the IDT are corrupted.[citation needed]

_________________
energyman76b
Advocate
Advocate


Joined: 26 Mar 2003
Posts: 2026
Location: Germany

Posted: Wed May 08, 2013 3:55 pm

pjp wrote:
Thank you, both.

The triple fault sounds less like hardware...
Quote:
Possible causes of triple faults

Triple faults indicate a problem with the operating system kernel or device drivers. In modern operating systems, a triple fault is typically caused by a buffer overflow or underflow in a device driver which writes over the interrupt descriptor table. When the next interrupt happens, the processor cannot call either the needed interrupt handler or the double fault handler because the descriptors in the IDT are corrupted.[citation needed]


I don't know where you got that. But triple fault is typical for memory errors and power fluctuations.
_________________
eccerr0r
Advocate
Advocate


Joined: 01 Jul 2004
Posts: 3898
Location: USA

Posted: Wed May 08, 2013 7:37 pm

Recent hardware has started incorporating enterprise features, but yes, these error detection hardware features used to be the realm of only high availability/enterprise machines. And Linux being ported to this enterprise hardware now means code for dealing with these errors is trickling down. A lot of the time the code has to be tailored to the hardware.

But even so, not all failure modes are detected. Plus, a lot of the failures are machine specific, even CPU specific, and the information needed to decode the data about what caused the problem isn't always available...

I do get some MCE logs on my AthlonXP, but I have not been able to find documentation on how to decode the bits... Grr...
Code:
Dec 15 02:00:03 doujima mcelog: Unknown CPU type vendor 2 family 6 model 8
Dec 15 02:00:03 doujima HARDWARE ERROR. This is *NOT* a software problem!
Dec 15 02:00:03 doujima Please contact your hardware vendor
Dec 15 02:00:03 doujima MCE 0
Dec 15 02:00:03 doujima CPU 0 BANK 2
Dec 15 02:00:03 doujima ADDR 3fe26240
Dec 15 02:00:03 doujima TIME 1355562003 Sat Dec 15 02:00:03 2012
Dec 15 02:00:03 doujima STATUS 940040000000017a MCGSTATUS 0
Dec 15 02:00:03 doujima MCGCAP 104 APICID 0 SOCKETID 0
Dec 15 02:00:03 doujima CPUID Vendor AMD Family 6 Model 8

WtF?
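
For what it's worth, the upper bits of STATUS and the low bits of MCGCAP follow a layout that is architectural on x86 machine-check hardware, so part of this can be picked apart by hand. The Python sketch below assumes the values in the log are hexadecimal and uses the bit positions and the cache-hierarchy error-code encoding as documented for the x86 machine-check architecture; whether this particular AMD family 6 part follows that encoding exactly is an assumption, and model-specific bits (such as bit 46 here) are left undecoded. Function names are made up for the example.

```python
REQUEST_TYPES = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TRANSACTION_TYPES = ["instruction", "data", "generic", "reserved"]
CACHE_LEVELS = ["L0", "L1", "L2", "generic"]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural flag bits and error codes."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a valid error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: hardware could not correct the error
        "enabled": bit(60),          # EN: reporting was enabled for this error
        "misc_valid": bit(59),       # MISCV: MCi_MISC holds extra data
        "addr_valid": bit(58),       # ADDRV: the logged address (ADDR) is valid
        "context_corrupt": bit(57),  # PCC: processor context may be corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def decode_mcg_cap(mcg_cap):
    """Return the bank count (bits 7:0) and the MCG_CTL-present flag (bit 8)."""
    return {"banks": mcg_cap & 0xFF, "mcg_ctl_present": bool((mcg_cap >> 8) & 1)}


def describe_cache_error(code):
    """Decode the '0000 0001 RRRR TTLL' cache-hierarchy form, or return None if it does not fit."""
    if (code >> 8) != 0x01:
        return None
    rrrr = (code >> 4) & 0xF
    request = REQUEST_TYPES[rrrr] if rrrr < len(REQUEST_TYPES) else "reserved"
    return {
        "request": request,
        "transaction": TRANSACTION_TYPES[(code >> 2) & 0x3],
        "level": CACHE_LEVELS[code & 0x3],
    }
```

Under those assumptions, `decode_mci_status(0x940040000000017A)` reports a valid, enabled error with a valid address and the UC bit clear, and the low 16 bits (0x017a) fit the cache-hierarchy pattern as an evict on a generic L2 transaction. `decode_mcg_cap(0x104)` gives four banks with MCG_CTL present.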
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed to be advocating?
energyman76b
Advocate
Advocate


Joined: 26 Mar 2003
Posts: 2026
Location: Germany

Posted: Wed May 08, 2013 8:03 pm

eccerr0r wrote:
Recent hardware has started incorporating enterprise features, but yes, these error detection hardware features used to be the realm of only high availability/enterprise machines. And Linux being ported to this enterprise hardware now means code for dealing with these errors is trickling down. A lot of the time the code has to be tailored to the hardware.

But even so, not all failure modes are detected. Plus, a lot of the failures are machine specific, even CPU specific, and the information needed to decode the data about what caused the problem isn't always available...

I do get some MCE logs on my AthlonXP, but I have not been able to find documentation on how to decode the bits... Grr...
Code:
Dec 15 02:00:03 doujima mcelog: Unknown CPU type vendor 2 family 6 model 8
Dec 15 02:00:03 doujima HARDWARE ERROR. This is *NOT* a software problem!
Dec 15 02:00:03 doujima Please contact your hardware vendor
Dec 15 02:00:03 doujima MCE 0
Dec 15 02:00:03 doujima CPU 0 BANK 2
Dec 15 02:00:03 doujima ADDR 3fe26240
Dec 15 02:00:03 doujima TIME 1355562003 Sat Dec 15 02:00:03 2012
Dec 15 02:00:03 doujima STATUS 940040000000017a MCGSTATUS 0
Dec 15 02:00:03 doujima MCGCAP 104 APICID 0 SOCKETID 0
Dec 15 02:00:03 doujima CPUID Vendor AMD Family 6 Model 8

WtF?


that one is easy:
STATUS 940040000000017a

this setup is really old. Contact vendor to buy some new one.

MCGCAP 104

I really hate you.

You are lucky. 105 would have meant 'I am going to kill your duck'.
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Wed May 08, 2013 11:24 pm

lol


energyman76b wrote:
I don't know where you got that. But triple fault is typical for memory errors and power fluctuations.
It came from your wikipedia link to Triple fault.
_________________
energyman76b
Advocate
Advocate


Joined: 26 Mar 2003
Posts: 2026
Location: Germany

Posted: Thu May 09, 2013 12:00 am

pjp wrote:
lol


energyman76b wrote:
I don't know where you got that. But triple fault is typical for memory errors and power fluctuations.
It came from your wikipedia link to Triple fault.


I just gave you the link, I never read it.

Seriously, what is more likely: that some driver running on millions of boxes is misbehaving just for you?
or
some hardware fault?
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Thu May 09, 2013 3:05 am

Well, since it is new to me, I can't say one way or the other. Given what you have indicated, it would seem like a hardware error.

What it sounded like it was describing wasn't inherently a common problem everyone using a driver would encounter. So if it can be a driver, or it can be hardware, that helps. A crash with a driver not known to have problems would indicate hardware or a rare condition bug. Obviously hardware would be easier to test in that case.

I've seen an arrangement of hardware result in discovery of a driver bug not otherwise encountered, but given that hardware arrangement, the bug was observable under repeatable conditions.
_________________
energyman76b
Advocate
Advocate


Joined: 26 Mar 2003
Posts: 2026
Location: Germany

Posted: Thu May 09, 2013 2:51 pm

I've had my share of random reboots in the past. Every time it boiled down to:
memory.
or
power.

With my personal boxes, family, friends, at work. So, if someone tells me about random reboots, the first thing I do today:
get a different PSU
then
start testing the RAM
_________________
pjp
Administrator
Administrator


Joined: 16 Apr 2002
Posts: 16106
Location: Colorado

Posted: Fri May 10, 2013 12:06 am

Makes sense. My original hope with the thread was to identify errors rather than random replacement of parts and lengthy memory testing. Not for personal use, but business use. But it seems like the hardware and/or software features aren't yet in place.
_________________