Gentoo Forums
Deciphering mcelog after hard lock up
ksool
Guru


Joined: 27 May 2006
Posts: 337
Location: Cambridge, MA

Posted: Thu Nov 29, 2007 3:10 am    Post subject: Deciphering mcelog after hard lock up

I've got a dual Opteron with 8 GB of RAM that occasionally locks up hard under heavy load (no Ethernet, no terminal).
I suspected bad RAM, so I ran memtest86 for three days, but it failed to find anything.

Here's the mcelog:

Code:

cat /var/log/mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache TSC 56d93012b7c9
ADDR 1f61bfe40
  Data cache ECC error (syndrome 51)
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS d428c00000000833 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit TSC 56d93012c175
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 56d93012c535
ADDR 1f61bfe78
  Northbridge ECC error
  ECC syndrome = 51
       bit32 = err cpu0
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 9428c00100000813 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 741f0900e073
RIP 33:5cabec ADDR 1f4a77bf0
  Northbridge ECC error
  ECC syndrome = 5
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS f402a00000000a13 MCGSTATUS 7
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 73ac2df94de49
ADDR 1d2b97c78
  Northbridge ECC error
  ECC syndrome = 5b
       bit32 = err cpu0
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 942dc00100000813 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache TSC 76e2f811f9c8e
ADDR 1d09b7f40
  Data cache ECC error (syndrome a4)
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS 9452400000000833 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 7742a4bf1d2c3
ADDR 1f307ff70
  Northbridge ECC error
  ECC syndrome = 57
       bit32 = err cpu0
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 942bc00100000813 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC d4a65cbc90c
RIP 33:5ce0b4 ADDR 1f321f778
  Northbridge ECC error
  ECC syndrome = 6
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS f403200000000a13 MCGSTATUS 7
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 8b4f97d40194f
RIP 33:5caceb ADDR 1dd5b74b8
  Northbridge ECC error
  ECC syndrome = c
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS b406200000000a13 MCGSTATUS 7
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit TSC 229d6e2c9cb09
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 229d6e2c9d58c
ADDR 1f42777b8
  Northbridge ECC error
  ECC syndrome = 5d
       bit32 = err cpu0
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 942ec00100000813 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC c178d1d8781
ADDR 1dd2bff78
  Northbridge ECC error
  ECC syndrome = 5b
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 942dc00000000a13 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 60cafdf541181
ADDR 15fab7770
  Northbridge ECC error
  ECC syndrome = 54
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 942a400000000a13 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache TSC 62a48ee35c73b
ADDR 1938d7cc0
  Data cache ECC error (syndrome bc)
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      data read mem transaction
      memory access, level generic'
STATUS d45e400000000833 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit TSC 7dd527768b691
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS 9000400000000863 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 723f90fe3872
RIP 23:818e05b ADDR 1f91577b8
  Northbridge ECC error
  ECC syndrome = a
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS b405200000000a13 MCGSTATUS 7


Any ideas what it could be if it's not bad RAM? TIA
BradN
Advocate


Joined: 19 Apr 2002
Posts: 2391
Location: Wisconsin (USA)

Posted: Thu Nov 29, 2007 3:33 am    Post subject: Re: Deciphering mcelog after hard lock up

I'm seeing a lot of "Northbridge ECC error" entries (10 of the 16 records in your log, including all four uncorrected ones). These would seem to suggest a bad motherboard or processor, or possibly a bad power supply.
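
If you want to count the records per unit rather than eyeball them, something like the following would do it. This is only a rough sketch: it assumes the header lines always look like "CPU <cpu> <bank> <bank name> TSC <hex>", as they do in the log above, and the function name tally_banks is made up for illustration.

```python
import re
from collections import Counter

# Matches header lines such as "CPU 1 4 northbridge TSC 56d93012c535".
# Assumption: every record header in the log follows this shape.
HEADER = re.compile(r"^CPU (\d+) (\d+) (.+?) TSC [0-9a-f]+$")


def tally_banks(log_text):
    """Count machine-check records per (cpu, bank number, bank name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            cpu, bank, name = match.groups()
            counts[(int(cpu), int(bank), name)] += 1
    return counts


# A few header lines copied from the log above.
sample = """\
CPU 1 0 data cache TSC 56d93012b7c9
CPU 1 2 bus unit TSC 56d93012c175
CPU 1 4 northbridge TSC 56d93012c535
CPU 1 4 northbridge TSC 741f0900e073
"""

if __name__ == "__main__":
    for key, count in tally_banks(sample).most_common():
        print(key, count)
```

To run it on the real file, pass in open("/var/log/mcelog").read() instead of the sample string.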

Also, I'm not sure whether "CPU 1" means the second CPU is the one reporting the error, but it may be worth disabling the second processor and seeing if the problem goes away. If that fixes it, you could then swap the physical processors: if the errors stay with the socket rather than following the chip, you can mostly rule out a bad processor.
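
The STATUS values can also be unpacked by hand to separate the corrected errors from the uncorrected ones. A minimal sketch, with some assumptions: bits 63 to 57 follow the architectural MCi_STATUS layout, bits 46 and 45 use the meanings mcelog itself prints in the log, and the syndrome position (bits 54:47) is inferred from the fact that it reproduces the syndrome values shown in the log. The function and flag names are my own, not mcelog's.

```python
# Bit positions in the 64-bit STATUS value.
# 63-57: architectural MCi_STATUS flags.
# 46/45: labels taken from the mcelog output in this thread.
FLAGS = {
    63: "valid",
    62: "overflow",
    61: "uncorrected",
    60: "enabled",
    59: "misc valid",
    58: "addr valid",
    57: "processor context corrupt",
    46: "corrected ecc",
    45: "uncorrected ecc",
}


def decode_status(status_hex):
    """Return the names of the flag bits set in a STATUS hex string."""
    status = int(status_hex, 16)
    return [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


def syndrome(status_hex):
    """Extract bits 54:47, assumed here to hold the ECC syndrome."""
    return (int(status_hex, 16) >> 47) & 0xFF


if __name__ == "__main__":
    # STATUS values copied from the log above.
    for value in ("d428c00000000833", "f402a00000000a13"):
        print(value, decode_status(value), hex(syndrome(value)))
```

The records with "uncorrected" set are the ones that also carry MCGSTATUS 7 and a RIP in the log, so those are the ones to line up against the lockups.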