Gentoo Forums
[SOLVED] Machine Check Exception on new Opteron server
humbletech99
Veteran
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

Posted: Thu Sep 21, 2006 2:39 pm    Post subject: [SOLVED] Machine Check Exception on new Opteron server

I set up a new amd64 Gentoo server yesterday on an Opteron, but within a few hours of it being up I got a "Machine Check Exception" and the thing froze up. I had to go to the local console to see this and then had to hard reboot the machine. It wasn't really doing much at the time other than compiling a couple of things. The server is a dual-CPU dual-core machine (4 cores, that is) with 8GB RAM and 12 SCSI disks + 2 SATA disks for the OS.

The error from the console is below:

Code:
HARDWARE ERROR
CPU 2: Machine Check Exception:                                    4 Bank 4:  f615200133000813
TSC 5ac60e50b6a ADDR 1d251ec00
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check


I have been googling around since yesterday but haven't found anything conclusive.

I've tried running mcelog and got the following:
Code:
# mcelog --k8 /dev/mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a4d0cd72d5a8
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a56b2eba7649
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a60591585bda
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a69ff2a635e8
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
MCE 4
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a73a53f42ca9
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
MCE 5
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a7d4b6934fdf
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0
MCE 6
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC a86f17e0a6a8
ADDR 191b0b000
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = c12f
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d417c000c1080a13 MCGSTATUS 0
MCE 7
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a86f17e0c311
ADDR 23c400000
  Northbridge GART error
       bit61 = error uncorrected
  TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0



Does anybody know anything about this?
_________________
The Human Equation:

value(geeks) > value(mundanes)


Last edited by humbletech99 on Sat Dec 09, 2006 8:11 pm; edited 1 time in total
Keruskerfuerst
Advocate
Advocate


Joined: 01 Feb 2006
Posts: 2246
Location: near Augsburg, Germany

Posted: Fri Sep 22, 2006 5:50 am

What exact type of AMD processor and mainboard do you have?
humbletech99
Veteran
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

Posted: Fri Sep 22, 2006 8:46 am

The full specs from the purchase order say

Processor : Dual AMD Opteron 275 2.2 GHz DUAL CORE, s940 1MB cache 64 bit (2 way)95watt
Motherboard : Tyan K8SRE,S2892G3NR,nForce4,1xPCI-e x16,1xPCI-e x4, 4 3xPCI-X,S-ATAII Raid, 2xGB, 8xDimm


Code:
 # cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 275
stepping        : 2
cpu MHz         : 2200.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4422.95
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 275
stepping        : 2
cpu MHz         : 2200.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4420.51
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 275
stepping        : 2
cpu MHz         : 2200.000
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4420.53
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 275
stepping        : 2
cpu MHz         : 2200.000
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4420.41
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Code:
# lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
01:08.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 10)
08:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
08:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
08:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
08:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
09:02.0 RAID bus controller: 3ware Inc 7xxx/8xxx-series PATA/SATA-RAID (rev 01)
09:03.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 03)
0a:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID (rev 02)
0b:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
0b:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
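
The /proc/cpuinfo dump above already says which socket each logical CPU sits on, which may help narrow down which socket's memory controller reported an error such as the one logged against CPU 2. A minimal sketch, assuming the usual "key : value" layout with a blank line between processors; `SAMPLE` is a trimmed copy of the fields from this thread and `cpu_topology` is a hypothetical helper name:

```python
# Sketch: map logical CPU numbers to (socket, core) from /proc/cpuinfo text.
# SAMPLE keeps only the three fields needed; on a live system you would pass
# the contents of /proc/cpuinfo instead.

SAMPLE = """\
processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 0
core id         : 1

processor       : 2
physical id     : 1
core id         : 0

processor       : 3
physical id     : 1
core id         : 1
"""


def cpu_topology(cpuinfo_text):
    """Return {logical_cpu: (physical_id, core_id)} for each processor block."""
    topology = {}
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip lines without a "key : value" shape
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            topology[int(fields["processor"])] = (
                int(fields.get("physical id", -1)),  # -1 if the field is absent
                int(fields.get("core id", -1)),
            )
    return topology


if __name__ == "__main__":
    for cpu, (socket, core) in sorted(cpu_topology(SAMPLE).items()):
        print(f"CPU {cpu}: socket {socket}, core {core}")
```

With the values from this machine, CPU 2 maps to physical id 1, core 0, i.e. the first core of the second Opteron.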

_________________
The Human Equation:

value(geeks) > value(mundanes)
Keruskerfuerst
Advocate
Advocate


Joined: 01 Feb 2006
Posts: 2246
Location: near Augsburg, Germany

Posted: Fri Sep 22, 2006 1:29 pm

Northbridge Chipkill ECC error
Chipkill ECC syndrome = c12f
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d417c000c1080a13 MCGSTATUS 0
MCE 7

memory error

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC a4d0cd72d5a8
ADDR 23c400000
Northbridge GART error
bit61 = error uncorrected
TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0

mainboard error

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC a86f17e0a6a8
ADDR 191b0b000
Northbridge Chipkill ECC error
Chipkill ECC syndrome = c12f
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d417c000c1080a13 MCGSTATUS 0

memory error

I think you should begin by replacing the mainboard, then check the memory and, if necessary, replace the modules.
After that, there are also the CPUs.
Maybe the power supply is defective. <--- you should check this first
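
The "corrected" / "uncorrected" / "overflow" lines that mcelog prints come straight out of the STATUS value, so the two records quoted above can be cross-checked by hand. A rough sketch, assuming the bit positions below (63 valid, 62 overflow, 61 uncorrected, 60 enabled, 58 address valid, 46 corrected ECC on the K8 northbridge bank, syndrome split across bits 54:47 and 31:24). That layout is an assumption checked only against the two STATUS values in this thread, and `decode_status` is a hypothetical helper, not part of mcelog:

```python
# Sketch: decode a few fields of a machine-check STATUS register value.
# Bit positions are assumptions that reproduce the mcelog --k8 output
# quoted in this thread; consult the CPU vendor documentation before
# relying on them.

FLAG_BITS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "error uncorrected",
    60: "error reporting enabled",
    58: "address valid",
    46: "corrected ecc error",  # assumed K8 northbridge-specific bit
}


def decode_status(status):
    """Return flags, error codes and the ECC syndrome for a STATUS value."""
    flags = [
        name
        for bit, name in sorted(FLAG_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    # Assumed Chipkill syndrome layout: high byte in bits 31:24,
    # low byte in bits 54:47. Only meaningful for ECC errors.
    syndrome = (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF)
    return {
        "flags": flags,
        "extended_code": (status >> 16) & 0xF,
        "error_code": status & 0xFFFF,
        "syndrome": syndrome,
    }


if __name__ == "__main__":
    for raw in ("a40000000005001b", "d417c000c1080a13"):
        info = decode_status(int(raw, 16))
        print(raw, info["flags"], f"syndrome={info['syndrome']:04x}")
```

Under these assumptions the GART records decode as uncorrected with no ECC bit set, while the Chipkill record decodes as corrected with overflow and a syndrome of c12f, matching what mcelog printed.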


Last edited by Keruskerfuerst on Fri Sep 22, 2006 2:16 pm; edited 2 times in total
humbletech99
Veteran
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

Posted: Fri Sep 22, 2006 1:36 pm

Oh come on! The memory and the mobo can't both be defective; it's a new machine. I'd bet the mobo is slightly defective instead. That said, I've read messages on a kernel mailing list about a guy who swapped his mobo for another one of the exact same model and the same thing happened. It only stopped when he changed to a different brand of mobo, which would seem to indicate a subtle design defect.

I haven't had this problem a second time, despite crunching all the disks simultaneously and compiling a fair amount of software as well...

Perhaps it was a one-off fluke and I'll be OK...

Or perhaps that's wishful thinking. This is supposed to be an important server when it goes into production (which should be any day now).


EDIT: actually, the GART to which the error refers was using a virtual IOMMU, since I hadn't switched the IOMMU function on in the BIOS. I had something in the log complaining about this, so I switched on the IOMMU today (see below).
Code:
kern-warning   2006-09-21 14:34:34   Checking aperture...
kern-warning   2006-09-21 14:34:34   CPU 0: aperture @ 0 size 32 MB
kern-warning   2006-09-21 14:34:34   No AGP bridge found
kern-warning   2006-09-21 14:34:34   Your BIOS doesn't leave a aperture memory hole
kern-warning   2006-09-21 14:34:34   Please enable the IOMMU option in the BIOS setup
kern-warning   2006-09-21 14:34:34   This costs you 64 MB of RAM
kern-warning   2006-09-21 14:34:34   Mapping aperture over 65536 KB of RAM @ 4000000


Therefore the error must have occurred entirely in RAM. So hopefully if there is a real hardware problem then it will be in the RAM which is more easily replaced.
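
The kernel log above says the aperture was mapped over 65536 KB of RAM at 4000000. As a quick arithmetic check of what that covers, here is a small sketch; it assumes the "@ 4000000" value is a hexadecimal physical address, and `aperture_range` / `in_aperture` are hypothetical helper names:

```python
# Sketch: work out the physical address range covered by the fallback
# GART aperture reported in the kernel log.
# Assumption: "@ 4000000" is a hexadecimal physical address.

APERTURE_BASE = 0x4000000   # from "Mapping aperture over 65536 KB of RAM @ 4000000"
APERTURE_SIZE_KB = 65536    # from the same log line


def aperture_range(base, size_kb):
    """Return (first_byte, last_byte) of the aperture, both inclusive."""
    size_bytes = size_kb * 1024
    return base, base + size_bytes - 1


def in_aperture(addr, base=APERTURE_BASE, size_kb=APERTURE_SIZE_KB):
    """True if a physical address falls inside the aperture range."""
    first, last = aperture_range(base, size_kb)
    return first <= addr <= last


if __name__ == "__main__":
    first, last = aperture_range(APERTURE_BASE, APERTURE_SIZE_KB)
    print(f"aperture: {first:#x}-{last:#x} ({APERTURE_SIZE_KB // 1024} MB)")
```

That works out to 64 MB starting at the 64 MB mark, which agrees with the "This costs you 64 MB of RAM" line. `in_aperture` can be used to test any logged address against that window, bearing in mind that the aperture log and an MCE dump may come from different boots.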

I am going to leave it running memtest86 over the weekend to try to see if the memory is ok.
_________________
The Human Equation:

value(geeks) > value(mundanes)
Keruskerfuerst
Advocate
Advocate


Joined: 01 Feb 2006
Posts: 2246
Location: near Augsburg, Germany

Posted: Fri Sep 22, 2006 3:19 pm

I had a mainboard in my computer, which was defective from the beginning.
feld
Guru
Guru


Joined: 29 Aug 2004
Posts: 593
Location: WI, USA

Posted: Fri Sep 22, 2006 7:24 pm

humbletech99 wrote:
oh come on! The memory and the mobo can't both be defective, it's a new machine.


I had a bad stick of ram in my first batch when I built my Opteron machine. I've had situations where both were bad, too. It happens more than you might think.
_________________
< bmg505> I think the first line in reiserfsck is

if (random(65535)< 65500) { hose(partition); for (i=0;i<100000000;i++) print_crap(); }
humbletech99
Veteran
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

Posted: Fri Sep 22, 2006 9:05 pm

Yeah, I know, I've had loads of hardware problems both at home and at work over time. Anyway, back to this stupid machine check exception. I will run memtest86 this weekend and feed the results back to the hardware supplier. I think I will have to have them come round to replace the RAM at least, and possibly one processor.
_________________
The Human Equation:

value(geeks) > value(mundanes)
OldTango
Guru
Guru


Joined: 21 Feb 2004
Posts: 596

Posted: Sat Dec 09, 2006 5:35 pm

humbletech99 wrote:
yeah I know, I've had loads of hardware problems both at home and at work over time. anyway, back to this stupid machine check exception. I will run memtest86 this weekend and feed the results back to the hardware supplier. I think I will have to have them come round to replace the ram at least and possibly one processor.
I have an older Tyan Tiger S2875 mobo with dual Opteron 246s on it. I get the exact same memory errors as you are receiving. I have run memtest on this system for 2 days, running different tests with ECC off and ECC on. With ECC off, absolutely zero errors were reported. With ECC on, 2 errors were reported on the 3rd pass. Both errors were on bank0 and both were corrected. They were not reported again on 6 more consecutive passes.

I assume you have solved the GART errors?

As for the memory errors, it is very possible you have some heat issues, which is what is happening in my case. When I removed the side cover of my PC the errors disappeared. I took a few steps to improve cooling and that has helped a great deal with these errors. I only receive them now when the system has been on for a few hours and is being loaded heavily; however, the system never locks or crashes as a result of these errors.

My mobo is poorly designed: the CPUs sit too close together and too close to the bank0 RAM slot, making it difficult to get CPU coolers that will fit and do their job. This is where most of the heat is generated.

A poor or bad power supply can also cause these errors.

This is a guess on my part, but the first 2 items on many checklists for these errors are heat and power.
humbletech99
Veteran
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

Posted: Sat Dec 09, 2006 8:11 pm

Actually it was due to the RAM, which is what I first suspected. Memtest didn't show any errors. I could only force the issue under heavy load. After changing the memory the issue disappeared and hasn't recurred for the last 2 months, so I think it's safe to say that was the problem.

We changed the memory for a different brand as well.
_________________
The Human Equation:

value(geeks) > value(mundanes)