Author |
Message |
uraes Tux's lil' helper
Joined: 28 Nov 2002 Posts: 135 Location: Estonia
|
Posted: Thu Sep 26, 2013 8:52 am Post subject: Two months of debugging - unstable computer |
|
|
Long story short: I have been trying to stabilize my Gentoo installation for almost two months without success. Until now I thought the problem was in the Gentoo kernel (tried: 3.8.13, 3.10.7, 3.10.10, 3.11.0, 3.11.1), as I was unable to reproduce the instability on any other installation I tried: Win7, Ubuntu, Kubuntu, Mint, Fedora, Estobuntu. I removed all hard disks, the NVIDIA video card and the PCI Ethernet card, and ran memtest for a day. The live OSes showed nothing, but Gentoo kept crashing randomly (and before the OS crashed, programs such as gkrellm, firefox and konsole sometimes crashed too): sometimes just the GUI froze, sometimes the keyboard lights were blinking, and sometimes I got random kernel traces on screen.
But now, after running Estobuntu for a full day (a movie was playing almost all the time, I was sharing an Ubuntu ISO over BitTorrent, etc., so the computer was doing something all the time), I got some new information. In dmesg there were:
Code: |
[ 9026.127373] [Hardware Error]: Machine check events logged
[12869.614938] [Hardware Error]: Machine check events logged
|
The user was kicked out of the GUI, and in mcelog I found this:
Code: |
root@buntu:~# more /var/log/mcelog
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: mcelog read: No such device
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1380103761 Wed Sep 25 13:09:21 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
mcelog: Unsupported new Family 6 Model 3c CPU: only decoding architectural errors
Hardware event. This is not a software error.
MCE 1
CPU 2 BANK 0
TIME 1380107608 Wed Sep 25 14:13:28 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
|
Can somebody explain this? I did not get much help from Google, as the related posts seemed to be about overclocking. My computer is not overclocked. The fans are running normally and there is no overheating.
Motherboard: Gigabyte GA-Z87X-UD3H s1150
Memory: 4x 8G DDR3 1600C11 Kingston
CPU: Intel Core i7-4770K 3.5G/8M |
|
Back to top |
|
|
Hypnos Advocate
Joined: 18 Jul 2002 Posts: 2889 Location: Omnipresent
|
|
|
|
uraes Tux's lil' helper
Joined: 28 Nov 2002 Posts: 135 Location: Estonia
|
Posted: Thu Sep 26, 2013 10:56 am Post subject: |
|
|
My "emerge --info" output is here : http://pastebin.ca/2458674
I no longer think this is purely a Gentoo problem, as the mcelog output in my first post was produced under Estobuntu (the Estonian version of Ubuntu), AND the Gentoo livecd (20121221) was also unstable: three hangs in 24 hours. Gentoo just seems to be more intense or active in some areas, so crashes may happen within 30 minutes. |
|
|
|
ulenrich Veteran
Joined: 10 Oct 2010 Posts: 1480
|
Posted: Thu Sep 26, 2013 11:35 am Post subject: |
|
|
CFLAGS=" -march=native -O2 -pipe "
CXXFLAGS=" -march=native -O2 -pipe "
Further you could try Gentoo~unstable release! |
|
|
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21595
|
Posted: Fri Sep 27, 2013 1:50 am Post subject: |
|
|
ulenrich wrote: | CFLAGS=" -march=native -O2 -pipe "
CXXFLAGS=" -march=native -O2 -pipe "
Further you could try Gentoo~unstable release! | The OP's current CFLAGS and CXXFLAGS are reasonable. Adding -march=native might improve performance in some cases, but will not correct problems caused by failing hardware. Suggesting that he switch to newer packages is also not helpful. According to the mcelog output, there is a hardware fault. The particular error claims to have been corrected, but there may be related errors that are not correctable. The faulty component must be replaced. |
|
|
|
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Fri Sep 27, 2013 3:34 am Post subject: |
|
|
It's happening on different cores, so either the CPU as a whole has some non-thermal issue like bad power, or you've got bad RAM. |
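The per-core observation can be checked mechanically rather than by eye. Below is a minimal Python sketch that tallies records per CPU, bank and status value. It assumes the plain-text layout shown in the mcelog output in the first post ("CPU n BANK n" followed by a "STATUS ..." line); the function name `tally_events` and the abridged `SAMPLE` text are illustrative, not part of mcelog.

```python
import re
from collections import Counter

# Abridged copy of the two records from the first post (illustrative only).
SAMPLE = """\
MCE 0
CPU 1 BANK 0
TIME 1380103761 Wed Sep 25 13:09:21 2013
STATUS 90000040000f0005 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 60
MCE 1
CPU 2 BANK 0
TIME 1380107608 Wed Sep 25 14:13:28 2013
STATUS 90000040000f0005 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 60
"""

CPU_BANK_RE = re.compile(r"\s*CPU (\d+) BANK (\d+)")
STATUS_RE = re.compile(r"\s*STATUS ([0-9a-fA-F]+)")


def tally_events(log_text):
    """Count machine check records per (cpu, bank, status) triple."""
    counts = Counter()
    cpu_bank = None
    for line in log_text.splitlines():
        m = CPU_BANK_RE.match(line)
        if m:
            # Remember which CPU/bank the following STATUS line belongs to.
            cpu_bank = (int(m.group(1)), int(m.group(2)))
            continue
        m = STATUS_RE.match(line)
        if m and cpu_bank is not None:
            counts[cpu_bank + (int(m.group(1), 16),)] += 1
            cpu_bank = None
    return counts


if __name__ == "__main__":
    for (cpu, bank, status), n in sorted(tally_events(SAMPLE).items()):
        print(f"CPU {cpu} bank {bank} status {status:#018x}: {n} event(s)")
```

If every core reports the same bank and the same status value, that points at something shared between the cores rather than at one damaged core.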
|
|
|
uraes Tux's lil' helper
Joined: 28 Nov 2002 Posts: 135 Location: Estonia
|
|
|
|
uraes Tux's lil' helper
Joined: 28 Nov 2002 Posts: 135 Location: Estonia
|
Posted: Fri Sep 27, 2013 7:05 am Post subject: |
|
|
Ant P. wrote: | It's happening on different cores, so either the CPU as a whole has some non-thermal issue like bad power, or you've got bad RAM. |
I'm now trying to monitor temperatures as well, but shouldn't overheating affect the fans too, i.e. shouldn't they run at maximum speed if the CPU thinks it's too hot?
As for the RAM: I have just started a new run with two modules removed (so the computer now has 16 GB). |
|
|
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9677 Location: almost Mile High in the USA
|
Posted: Fri Sep 27, 2013 7:15 am Post subject: |
|
|
As MCELOG says, this is a hardware error.
Check your chipset to make sure it's not overheating too. Sometimes I wonder about Gigabyte boards, they don't have fans on their chipsets but that heatsink gets quite hot. (I have a Gigabyte Z68AP-D3 and EP43-UD3L boards, neither have fans on the chipset)
Checking with RAM chips removed was a good idea.
Since you have a K-series chip, try underclocking to see if it helps; in particular, see what happens if you reduce BCLK below 100MHz. Increasing the DRAM and/or chipset voltage may also help.
It's weird that an ubuntu doesn't work, though you should try a stock ubuntu if you can. Their optimizations tend to allow any CPU to work. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
|
|
uraes Tux's lil' helper
Joined: 28 Nov 2002 Posts: 135 Location: Estonia
|
Posted: Tue Oct 01, 2013 7:27 am Post subject: |
|
|
Just got a call from warranty repairs: the motherboard was broken, and it has been replaced with an MSI board. We'll see how it works.
Thanks for every bit of advice. |
|
|
|
lecbee n00b
Joined: 29 Oct 2013 Posts: 1
|
Posted: Tue Oct 29, 2013 10:36 am Post subject: |
|
|
Hello,
I have pretty much the same error, many many times:
TIME 1383039501 Tue Oct 29 10:38:21 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1383039550 Tue Oct 29 10:39:10 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Unknown Error 5
STATUS 90000040000f0005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 60
This is on CentOS 6.4 x86-64
Motherboard: ASUS B85M-E s1150
Memory: 2x 4G DDR3 Crucial
CPU: Intel Core i7-4770 3.4G/8M
The mcelog shipped with CentOS is not up to date. I recompiled it, and the "MCA: Unknown Error 5" is in fact an "MCA: Internal parity error", as you can see in this commit:
https://kernel.googlesource.com/pub/scm/utils/cpu/mce/mcelog/+/bec51ee686f29abd48c6ee4b67cff72135e80156%5E!/
Anyway that doesn't help to fix the error.
@uraes
Now that you have your new motherboard, did that fix the problem? |
|
|
|
kheper n00b
Joined: 19 Nov 2013 Posts: 1
|
Posted: Tue Nov 19, 2013 5:39 pm Post subject: Same errors for Xeon E3-1275v3 (Haswell) |
|
|
I'm seeing the same errors as the previous poster with a Xeon E3-1275v3 (Haswell), but only when running VirtualBox with IO-APIC enabled and a FreeBSD/OpenBSD guest compiling ports. If I disable IO-APIC, there are no errors. It also happens under VMware Player with FreeBSD. I have yet to see these errors while not running a VM, and they don't happen while running a Linux VM: for example, I ran emerge world on a Gentoo VM, over 400 packages, without a single error, and I did it twice to be sure. After weeks of uptime there has been no MCE event of this kind outside a VM context; memtest and prime95 torture tests run without errors, and I compiled various things on the Linux host without errors. I'm using kernel 3.12. |
|
|
|
l3u Advocate
Joined: 26 Jan 2005 Posts: 2545 Location: Konradsreuth (Germany)
|
Posted: Mon Apr 14, 2014 2:54 pm Post subject: |
|
|
I'm also seeing machine check events when running a Windows SBS 2003 32 bit virtual machine with qemu on my Xeon E3 Haswell system. I found a thread about this on the vmware forums: https://communities.vmware.com/thread/452344 – but even after changing the qemu machine, I only got fewer machine check events; they were not gone.
I filed a bug about this in qemu's bug tracker: https://bugs.launchpad.net/qemu/+bug/1307225 – perhaps somebody who experiences the same problem wants to confirm it. |
|
|
|
pa1983 Tux's lil' helper
Joined: 09 Jan 2004 Posts: 101
|
Posted: Thu Apr 17, 2014 10:43 pm Post subject: |
|
|
uraes wrote: | Just got call from warranty repairs.. motherboard was broken, changed to MSI. Gotta see, how it works
Thanks of every bit of advice. |
The same happened to me on a Tyan K8WE board with dual Opteron 280 and 8x1GB PC3200 ECC/REG. I was surfing when the system locked up. I rebooted and was greeted by a kernel crash saying it was a hardware error, not a software error. After some testing I discovered that one memory channel on the board had broken.
I removed both DIMMs in that channel and the kernel booted. If I added them back, I got the same errors you had. I got my hands on a second K8WE board, and both CPU and RAM worked in that. Though that board died after the capacitors started leaking while I had it in storage, so in the end I never really got around to putting it back together other than for testing. Ended up getting new components. _________________ NAS: i3 4360 3.7Ghz, 20Gb ram, 256Gb SSD, 65Tb HDD, NIC: Intel 2x1Gbit, Realtek 2.5Gbit
ROUTER: J1900 2Ghz, 8Gb ram, 128Gb SSD, NIC: 2x1Gbit, WIFI: Atheros AR9462 and AR5005G |
|
|
|
l3u Advocate
Joined: 26 Jan 2005 Posts: 2545 Location: Konradsreuth (Germany)
|
Posted: Fri Apr 18, 2014 2:40 pm Post subject: |
|
|
But in contrast to the virtualization issue, this has been a real hardware problem … |
|
|
|
hp3325 n00b
Joined: 20 Dec 2014 Posts: 1
|
Posted: Sat Dec 20, 2014 1:51 pm Post subject: These are spurious MCE events |
|
|
This is Intel erratum HSD131. From http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf:
Quote: |
HSD131. Spurious Corrected Errors May be Reported
Problem: Due to this erratum, spurious corrected errors may be logged in the IA32_MC0_STATUS
register with the valid field (bit 63) set, the uncorrected error field (bit 61) not set, a
Model Specific Error Code (bits [31:16]) of 0x000F, and an MCA Error Code (bits
[15:0]) of 0x0005. If CMCI is enabled, these spurious corrected errors also signal
interrupts.
Implication: When this erratum occurs, software may see corrected errors that are benign. These
corrected errors may be safely ignored.
Workaround: None identified.
Status: For the steppings affected, see the Summary Table of Changes.
|
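The bit fields named in the erratum can be checked directly against the STATUS value reported earlier in the thread. Here is a minimal Python sketch that uses only the fields quoted above (bit 63, bit 61, bits [31:16], bits [15:0]); the function names `decode_mci_status` and `matches_hsd131` are hypothetical, and the bank check assumes that IA32_MC0_STATUS corresponds to "BANK 0" in the mcelog output.

```python
def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into the fields the erratum names."""
    return {
        "valid": bool((status >> 63) & 1),           # bit 63
        "uncorrected": bool((status >> 61) & 1),     # bit 61
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits [31:16]
        "mca_error_code": status & 0xFFFF,           # bits [15:0]
    }


def matches_hsd131(bank, status):
    """Return True if a record has the signature described in HSD131."""
    f = decode_mci_status(status)
    return (
        bank == 0
        and f["valid"]
        and not f["uncorrected"]
        and f["model_specific_code"] == 0x000F
        and f["mca_error_code"] == 0x0005
    )


if __name__ == "__main__":
    # STATUS value and bank taken from the mcelog output in this thread.
    status = 0x90000040000F0005
    print(decode_mci_status(status))
    print("matches HSD131 signature:", matches_hsd131(0, status))
```

A match only means the record has the same shape as the benign errors the erratum describes. It says nothing about other machine check records the system may log.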
Ideally, the benign check events would be filtered in the kernel. At least in FreeBSD, the problem has already been addressed:
http://svnweb.freebsd.org/base?view=revision&revision=269052
Code: |
/*
 * Skip spurious corrected parity errors generated by desktop Haswell
 * (see HSD131 erratum) unless reporting is enabled.
 * Note that these errors also have been observed with D0-stepping,
 * while the revision 014 desktop Haswell specification update only
 * talks about C0-stepping.
 */
if (rec->mr_cpu_vendor_id == CPU_VENDOR_INTEL &&
    rec->mr_cpu_id == 0x306c3 && rec->mr_bank == 0 &&
    rec->mr_status == 0x90000040000f0005 && !intel6h_HSD131)
	return (1);
return (0);
|
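To stop the kernel from logging and reporting corrected errors, use the mce=ignore_ce kernel boot option. Note that this ignores all corrected machine check errors, not only the spurious ones. On Ubuntu, add it to /etc/default/grub and then run update-grub:
GRUB_CMDLINE_LINUX_DEFAULT="mce=ignore_ce"
On Red Hat, append the following to the kernel line in /boot/grub/grub.conf:
mce=ignore_ce |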
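After a reboot, it is worth confirming that the option actually reached the kernel. A minimal Python sketch, assuming a Linux system where /proc/cmdline holds the boot parameters; the helper name `has_ignore_ce` is hypothetical, and it only looks for the exact token mce=ignore_ce.

```python
def has_ignore_ce(cmdline):
    """Return True if the kernel command line contains mce=ignore_ce."""
    # Boot parameters are whitespace-separated tokens.
    return "mce=ignore_ce" in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline exposes the command line the running kernel was booted with.
    with open("/proc/cmdline") as fh:
        current = fh.read()
    if has_ignore_ce(current):
        print("mce=ignore_ce is active")
    else:
        print("mce=ignore_ce is NOT on the kernel command line")
```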
|
|
|
|