Gentoo Forums
Two months of debugging - unstable computer
uraes
Tux's lil' helper


Joined: 28 Nov 2002
Posts: 135
Location: Estonia

Posted: Thu Sep 26, 2013 8:52 am    Post subject: Two months of debugging - unstable computer

Long story short: I have been trying to stabilize my Gentoo installation for almost two months without success. Until now I thought the problem was in the Gentoo kernel (tried: 3.8.13, 3.10.7, 3.10.10, 3.11.0, 3.11.1), as I was unable to reproduce the instability on any other installation I tried: Win7, Ubuntu, Kubuntu, Mint, Fedora, Estobuntu. I removed all hard disks, the nvidia video card and the PCI ethernet card, and ran memtest for a day. The live OSes showed nothing, but Gentoo kept crashing randomly (and before the OS crashed, programs such as gkrellm, firefox and konsole sometimes crashed too): sometimes just the GUI froze, sometimes the keyboard lights were blinking, and sometimes I also got random kernel traces on screen.

But now, after running Estobuntu for a full day (a movie was playing almost the whole time, I was sharing an Ubuntu ISO over BitTorrent, etc., so the computer was busy all the time), I got some new information. In dmesg there were:

Code:

[ 9026.127373] [Hardware Error]: Machine check events logged
[12869.614938] [Hardware Error]: Machine check events logged


The user was kicked out of the GUI session, and in mcelog I found this:

Code:

root@buntu:~# more /var/log/mcelog
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: mcelog read: No such device
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1380103761 Wed Sep 25 13:09:21 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 2 BANK 0
TIME 1380107608 Wed Sep 25 14:13:28 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60


Can somebody explain this? Google was not much help, as the related posts seemed to be about overclocking. My computer is not overclocked. The fans are running normally and there is no overheating.

Motherboard: Gigabyte GA-Z87X-UD3H s1150
Memory: 4x 8G DDR3 1600C11 Kingston
CPU: Intel Core i7-4770K 3.5G/8M
Hypnos
Advocate


Joined: 18 Jul 2002
Posts: 2889
Location: Omnipresent

Posted: Thu Sep 26, 2013 9:46 am

Post the output of emerge --info
_________________
Personal overlay | Simple backup scheme
uraes
Tux's lil' helper


Joined: 28 Nov 2002
Posts: 135
Location: Estonia

Posted: Thu Sep 26, 2013 10:56 am

My "emerge --info" output is here : http://pastebin.ca/2458674

I no longer think this is purely a Gentoo problem, as the mcelog output in my first post was produced under Estobuntu (an Estonian version of Ubuntu), AND the Gentoo livecd (20121221) was also unstable: three hangs in 24 hours. Gentoo just seems to be more intense or active in some areas, so crashes may happen within 30 minutes.
ulenrich
Veteran


Joined: 10 Oct 2010
Posts: 1480

Posted: Thu Sep 26, 2013 11:35 am

CFLAGS=" -march=native -O2 -pipe "
CXXFLAGS=" -march=native -O2 -pipe "
Further you could try Gentoo~unstable release!
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21490

Posted: Fri Sep 27, 2013 1:50 am

ulenrich wrote:
CFLAGS=" -march=native -O2 -pipe "
CXXFLAGS=" -march=native -O2 -pipe "
Further you could try Gentoo~unstable release!
The OP's current CFLAGS and CXXFLAGS are reasonable. Adding -march=native might improve performance in some cases, but will not correct problems caused by failing hardware. Suggesting that he switch to newer packages is also not helpful. According to the mcelog output, there is a hardware fault. The particular error claims to have been corrected, but there may be related errors that are not correctable. The faulty component must be replaced.
Ant P.
Watchman


Joined: 18 Apr 2009
Posts: 6920

Posted: Fri Sep 27, 2013 3:34 am

It's happening on different cores, so either the CPU as a whole has some non-thermal issue like bad power, or you've got bad RAM.
uraes
Tux's lil' helper


Joined: 28 Nov 2002
Posts: 135
Location: Estonia

Posted: Fri Sep 27, 2013 6:48 am

The -march flag should make no difference, as this is the default. And the problem is no longer Gentoo-only, as I was able to reproduce it on another distro too. The question is: what does this mcelog output mean? My guess is that it is some weird hardware problem?

I am also adding three photos of the traces I managed to capture when the computer hung (as you can see, they are random and in pretty "weird" places):

http://picpaste.com/IMG_5822_s-Kj5CxlzB.JPG
http://picpaste.com/IMG_5966_s-rH0bWC6U.JPG
http://picpaste.com/IMG_5976_r-W0ML5OzV.JPG
uraes
Tux's lil' helper


Joined: 28 Nov 2002
Posts: 135
Location: Estonia

Posted: Fri Sep 27, 2013 7:05 am

Ant P. wrote:
It's happening on different cores, so either the CPU as a whole has some non-thermal issue like bad power, or you've got bad RAM.


I'm now trying to monitor temperatures too, but shouldn't overheating affect the fans as well, i.e. shouldn't they run at maximum speed if the CPU thinks it's too hot?

As for the RAM: I have just started a new run with two modules removed (so the computer now has 16 GB).
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Fri Sep 27, 2013 7:15 am

As MCELOG says, this is a hardware error.
Check your chipset to make sure it's not overheating too. Sometimes I wonder about Gigabyte boards, they don't have fans on their chipsets but that heatsink gets quite hot. (I have a Gigabyte Z68AP-D3 and EP43-UD3L boards, neither have fans on the chipset)
Checking with RAM chips removed was a good idea.

Since you have a K-series chip, try underclocking to see if it helps; in particular, see what happens if you reduce BCLK below 100MHz. You could also try increasing the DRAM and/or chipset voltage.

It's weird that an ubuntu doesn't work, though you should try a stock ubuntu if you can. Their optimizations tend to allow any CPU to work.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
uraes
Tux's lil' helper


Joined: 28 Nov 2002
Posts: 135
Location: Estonia

Posted: Tue Oct 01, 2013 7:27 am

Just got a call from warranty repairs: the motherboard was broken and has been replaced with an MSI board. We'll see how it works :)
Thanks for every bit of advice.
lecbee
n00b


Joined: 29 Oct 2013
Posts: 1

Posted: Tue Oct 29, 2013 10:36 am

Hello,

I have pretty much the same error, many many times:

TIME 1383039501 Tue Oct 29 10:38:21 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1383039550 Tue Oct 29 10:39:10 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60

This is on CentOS 6.4 x86-64
Motherboard: ASUS B85M-E s1150
Memory: 2x 4G DDR3 Crucial
CPU: Intel Core i7-4770 3.4G/8M

The mcelog shipped with CentOS is not up to date. I recompiled a newer one, and the "MCA: Unknown Error 5" is in fact an "MCA: Internal parity error", as you can see in this commit:
https://kernel.googlesource.com/pub/scm/utils/cpu/mce/mcelog/+/bec51ee686f29abd48c6ee4b67cff72135e80156%5E!/
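
For illustration, here is a small sketch of how the low 16 bits of STATUS map to a readable name. The table lists only a few of the simple architectural error codes as I understand them from the Intel SDM. The table and the function name are made up for this example and are not taken from mcelog's source.

```python
# Simple architectural MCA error codes (bits [15:0] of MCi_STATUS).
# Assumption: names follow the "simple error codes" table in the Intel SDM.
# This is an illustrative subset, not mcelog's actual decoder.
SIMPLE_MCA_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def mca_code_name(status: int) -> str:
    """Return a readable name for the MCA error code in a STATUS value."""
    code = status & 0xFFFF  # keep bits [15:0] only
    # Fall back to the wording an older mcelog prints for unknown codes.
    return SIMPLE_MCA_CODES.get(code, f"Unknown Error {code:x}")


# STATUS value copied from the log above.
print(mca_code_name(0x90000040000F0005))
```

An old decoder without the entry for code 5 would fall through to the "Unknown Error" branch, which is consistent with the output shown above.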

Anyway that doesn't help to fix the error.

@uraes
Now that you have your new motherboard, did that fix the problem?
kheper
n00b


Joined: 19 Nov 2013
Posts: 1

Posted: Tue Nov 19, 2013 5:39 pm    Post subject: Same errors for Xeon E3-1275v3 (Haswell)

I'm getting the same errors as the previous poster on a Xeon E3-1275v3 (Haswell), but only when I'm running VirtualBox with IO-APIC enabled and a FreeBSD/OpenBSD guest that is compiling ports; if I disable IO-APIC, there are no errors. It also happens under VMware Player with FreeBSD. I have yet to see these errors while not running a VM, and they don't happen while running a Linux VM: for example, I ran emerge world in a Gentoo VM, over 400 packages, without a single error, and I did it twice to be sure. After weeks of uptime there has been no MCE event of this kind outside a VM context; memtest and prime95 torture tests run without errors, and I compiled various things on the Linux host without errors. I'm using kernel 3.12.
l3u
Advocate


Joined: 26 Jan 2005
Posts: 2540
Location: Konradsreuth (Germany)

Posted: Mon Apr 14, 2014 2:54 pm

I'm also seeing machine check events when running a Windows SBS 2003 32 bit virtual machine with qemu on my Xeon E3 Haswell system. I found a thread about this on the VMware forums: https://communities.vmware.com/thread/452344
However, even after changing the qemu machine, I only got fewer machine check events; they did not go away.

I filed a bug about this in qemu's bug tracker: https://bugs.launchpad.net/qemu/+bug/1307225
Perhaps somebody who experiences the same problem would like to confirm it there.
pa1983
Tux's lil' helper


Joined: 09 Jan 2004
Posts: 101

Posted: Thu Apr 17, 2014 10:43 pm

uraes wrote:
Just got a call from warranty repairs: the motherboard was broken and has been replaced with an MSI board. We'll see how it works :)
Thanks for every bit of advice.


The same happened to me on a Tyan K8WE board with dual Opteron 280 and 8x1GB PC3200 ECC/REG. I was surfing when the system locked up. I rebooted and was greeted by a kernel crash saying it was a hardware error and not a software error. After some testing I discovered that one memory channel on the board had broken.
I removed both DIMMs in that channel and the kernel booted. If I added them back, I got the same errors you had. I got my hands on a second K8WE board, and both CPU and RAM worked in that one. That board died, though, after its capacitors started leaking while I had it in storage, so in the end I never really got around to putting it back together other than for testing. I ended up getting new components.
_________________
NAS: i3 4360 3.7Ghz, 20Gb ram, 256Gb SSD, 65Tb HDD, NIC: Intel 2x1Gbit, Realtek 2.5Gbit
ROUTER: J1900 2Ghz, 8Gb ram, 128Gb SSD, NIC: 2x1Gbit, WIFI: Atheros AR9462 and AR5005G
l3u
Advocate


Joined: 26 Jan 2005
Posts: 2540
Location: Konradsreuth (Germany)

Posted: Fri Apr 18, 2014 2:40 pm

But in contrast to the virtualization issue, this has been a real hardware problem …
hp3325
n00b


Joined: 20 Dec 2014
Posts: 1

Posted: Sat Dec 20, 2014 1:51 pm    Post subject: This is a spurious MCE event

This is Intel erratum HSD131. From http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf:
Quote:

HSD131. Spurious Corrected Errors May be Reported
Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits [15:0]) of 0x0005. If CMCI is enabled, these spurious corrected errors also signal interrupts.
Implication: When this erratum occurs, software may see corrected errors that are benign. These corrected errors may be safely ignored.
Workaround: None identified.
Status: For the steppings affected, see the Summary Table of Changes.
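
To make the bit fields in the erratum concrete, here is a rough sketch that splits a STATUS value into exactly the fields named above and checks them against the erratum's description. The function names are my own, and only the fields mentioned in the erratum text are decoded.

```python
# STATUS value reported in the mcelog output earlier in this thread.
LOGGED_STATUS = 0x90000040000F0005


def decode_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into the fields named in HSD131."""
    return {
        "valid": bool((status >> 63) & 1),               # bit 63
        "uncorrected": bool((status >> 61) & 1),         # bit 61
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits [31:16]
        "mca_error_code": status & 0xFFFF,               # bits [15:0]
    }


def matches_hsd131(bank: int, status: int) -> bool:
    """True if a bank/status pair fits the erratum's description."""
    fields = decode_status(status)
    return (
        bank == 0  # the erratum names IA32_MC0_STATUS, i.e. bank 0
        and fields["valid"]
        and not fields["uncorrected"]
        and fields["model_specific_code"] == 0x000F
        and fields["mca_error_code"] == 0x0005
    )


print(decode_status(LOGGED_STATUS))
print(matches_hsd131(0, LOGGED_STATUS))
```

Note that matching the pattern only says the event looks like the erratum; a machine that also hangs or panics, as in the first post, can still have a real hardware fault.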

Ideally, the benign check events would be filtered in the kernel. At least in FreeBSD, the problem has already been addressed:
http://svnweb.freebsd.org/base?view=revision&revision=269052

Code:

/*
 * Skip spurious corrected parity errors generated by desktop Haswell
 * (see HSD131 erratum) unless reporting is enabled.
 * Note that these errors also have been observed with D0-stepping,
 * while the revision 014 desktop Haswell specification update only
 * talks about C0-stepping.
 */
if (rec->mr_cpu_vendor_id == CPU_VENDOR_INTEL &&
    rec->mr_cpu_id == 0x306c3 && rec->mr_bank == 0 &&
    rec->mr_status == 0x90000040000f0005 && !intel6h_HSD131)
        return (1);

return (0);
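
On Linux, the same idea can be applied after the fact to mcelog's text output. The following is only a sketch under assumptions: it expects records in the format pasted earlier in this thread, the function names are made up, and unlike the FreeBSD check it does not look at the CPU vendor or CPUID, only at bank and STATUS.

```python
import re

# Same constant as in the FreeBSD check above.
SPURIOUS_STATUS = 0x90000040000F0005


def parse_records(text: str) -> list:
    """Extract cpu, bank and status from mcelog's human-readable output."""
    records = []
    cpu = bank = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CPU (\d+) BANK (\d+)", line)
        if m:
            cpu, bank = int(m.group(1)), int(m.group(2))
            continue
        # Matches "STATUS <hex> ..." but not "MCGSTATUS" or "MCi status:".
        m = re.match(r"STATUS ([0-9a-fA-F]+)", line)
        if m and cpu is not None:
            records.append(
                {"cpu": cpu, "bank": bank, "status": int(m.group(1), 16)}
            )
            cpu = bank = None
    return records


def is_spurious(rec: dict) -> bool:
    """Bank 0 plus the exact STATUS value described by HSD131."""
    return rec["bank"] == 0 and rec["status"] == SPURIOUS_STATUS


sample = """\
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1380103761 Wed Sep 25 13:09:21 2013
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
"""

records = parse_records(sample)
remaining = [r for r in records if not is_spurious(r)]
print(len(records), "record(s) parsed,", len(remaining), "left after filtering")
```

Anything left in the filtered list is not explained by the erratum and deserves a closer look.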


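To stop the kernel from reporting corrected errors, use the mce=ignore_ce kernel boot option. Note that this hides all corrected errors, not only the spurious HSD131 ones.

On Ubuntu, set it in /etc/default/grub and run update-grub afterwards:
GRUB_CMDLINE_LINUX_DEFAULT="mce=ignore_ce"

On Red Hat, append the following to the kernel line in /boot/grub/grub.conf:
mce=ignore_ce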
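
After rebooting, one way to confirm that the option actually reached the kernel is to look at /proc/cmdline. A minimal sketch; the function name is made up for illustration, and it assumes the option is spelled mce=ignore_ce as in the kernel's boot-parameter documentation.

```python
def has_ignore_ce(cmdline: str) -> bool:
    """Return True if the kernel command line contains mce=ignore_ce."""
    # Options on the kernel command line are separated by whitespace.
    return "mce=ignore_ce" in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print("mce=ignore_ce active:", has_ignore_ce(f.read()))
    except OSError:
        # /proc/cmdline only exists on Linux.
        print("could not read /proc/cmdline")
```

Because the check compares whole tokens, a misspelled or doubled option will show up as not active.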