lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Fri Nov 10, 2006 8:20 pm Post subject: System Instability... (no longer resolved, new crash output) |
|
|
I recently installed Gentoo on a system I've had running Linux for 5 years. It was running RedHat previously, but a hard drive crash required a fresh Linux install and I went with Gentoo. The only changes between the system before Gentoo and after are that a 3ware SATA RAID controller has been installed and 2GB of RAM added. I ran memtest86 on the RAM before going further with the installation and it checked out.
The system runs Apache2 with PHP5 and mod_perl, and I recently experienced 2 kernel panics in a 6-hour period. I'm nowhere near the system, so I had to have someone take a picture of the console so I could see the errors. Here is a transcription of the errors as I can read them.
Code: |
EIP: 0060:[<f8957275>] Not tainted VLI
EFLAGS: 00010286 (2.6.17-gentoo-r8 <unreadable>)
eax: 00000000 ebx: f7f80080 ecx: f791cea0 edx: 00000000
esi: f7e07400 edi: c276be84 ebp: 00000000 esp: c276bd88
ds: 007b es: 007b ss: 0068
Process: apache2 (pid: 5565, threadinfo=c276a000 task=f7a84ad0)
Stack: c0110529 00000246 f7ff5700 c26bd260 23000246 00000023 c26bd000 f791cea0
f791ceb0 f8924ea0 000005ea 000000eb f7f80000 00000eb0 00000001 f8924000
c26bd260 00000000 c276be04 00000000 f89551d5 c0244995 80000000 f6026b9c
Call Trace:
<c0118529> <f89551d5>
<c0244995> <c024539e>
<c0243150> <c012efc8>
<c012961d> <c01295a6>
<c0104a94> <c01930c6>
<c013007b> <c01363e4>
<c0112d28> <c013557e>
<c0113718> <c0100ec3>
<c0102693>
Code: 73 24 8b 7b 14 81 c7 d8 00 00 00 b9 06 00 00 00 fc f3 a6 0f 85 0e fd ff ff
80 63 75 f8 e9 05 fd ff ff 83 42 20 01 e9 fc fc ff ff <0f> 0b e9 03 fe ff ff 0f
0b 89 f6 e9 80 fc ff ff b9 8a 72 95 f8
EIP: [<f8957275>] SS:ESP 0068:c276bd88
<0>Kernel panic - not syncing: Fatal exception in interrupt
|
I had to transcribe that by hand. Hopefully someone out there can make heads or tails of it. This is far beyond my ability.
Any thoughts? Any help on figuring out the source of the problem or how to fix it? I'm at a loss here.
Thanks in advance.
Last edited by lmegliol on Wed Dec 06, 2006 6:08 pm; edited 4 times in total |
 |
msalerno Veteran


Joined: 17 Dec 2002 Posts: 1338 Location: Sweating in South Florida
|
Posted: Fri Nov 10, 2006 10:07 pm Post subject: |
|
|
Wow, typing that out warrants a reply.
What is the output of your emerge --info and emerge -pv apache?
Have you re-emerged apache since you started getting the problems? Has it helped? |
 |
bonbons Apprentice

Joined: 04 Sep 2004 Posts: 250
|
Posted: Fri Nov 10, 2006 10:34 pm Post subject: |
|
|
In order to get a more useful trace, could you recompile your kernel with debugging symbols enabled?
From just the addresses it's hard to determine what caused the panic or where it occurred (mapping the addresses back to code would at best be possible with the kernel config).
To do so, enable the following:
Code: | Kernel Hacking ->
[*] Kernel Debugging
[*] Compile the kernel with frame pointers |
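With symbols available, the bracketed addresses in a trace can be matched against the kernel's System.map. A minimal sketch, assuming a System.map sorted by address with fixed-width lowercase hex addresses (the usual layout). The function name nearest_symbol is made up for illustration, and addresses that fall inside loadable modules will not be found in System.map:

```shell
# Print the last symbol whose address is <= the given address.
# Assumption: the map file is sorted by address and uses fixed-width
# lowercase hex, so plain string comparison orders addresses correctly.
# Usage: nearest_symbol <hex-address> <path-to-System.map>
nearest_symbol() {
    # Normalise the address to lowercase so it compares against the map.
    addr=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    # Appending "" forces awk to compare as strings, not numbers.
    awk -v a="$addr" '
        ($1 "") <= (a "") { sym = $3 }
        END { if (sym != "") print sym }
    ' "$2"
}
```

For example, "nearest_symbol c013b2e4 /path/to/System.map" prints the name of the function containing that address, provided the map belongs to the exact kernel build that crashed.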
In addition you may want to set up netconsole on that machine so you can skip the pictures and transcribing:
Code: | Device Drivers ->
Network device support ->
[m] Network console logging support |
This module should then be modprobed when the system is up with a line like the following:
Code: | modprobe netconsole netconsole=10000@192.168.0.123/eth0,10000@192.168.0.124/00:11:22:33:44:55 |
The option to netconsole has the form "netconsole=<src-port>@<src-ip>/<net-dev>,<dst-port>@<dst-ip>/<dst-mac>", where <dst-mac> is either the MAC address of the router or that of the logging host if that host is on the same subnet (no router in between).
To capture the output, listen on UDP port <dst-port> on the host with <dst-ip> (e.g. with "nc -l -u -p <dst-port> -s <dst-ip> | tee /path/to/log/file").
The output you get with netconsole should be the same as what you get on your physical console. |
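Since the parameter string is easy to get wrong by hand, here is a small sketch that assembles it from its six parts. The helper name netconsole_param is hypothetical, and the values in the usage line are just the example addresses from the modprobe line above:

```shell
# Build the netconsole= module parameter from its six parts.
# Argument order: src-port src-ip net-dev dst-port dst-ip dst-mac
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

It would be used as: modprobe netconsole "$(netconsole_param 10000 192.168.0.123 eth0 10000 192.168.0.124 00:11:22:33:44:55)". Keeping the parts separate makes it easier to swap the network device when a box has several interfaces.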
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Fri Nov 10, 2006 11:06 pm Post subject: |
|
|
I'll answer the first post first, and then get on top of the kernel and netconsole afterward. (Thanks for the netconsole info.)
Code: |
Portage 2.1.1 (default-linux/x86/2006.1/desktop, gcc-4.1.1, glibc-2.4-r3, 2.6.17-gentoo-r8 i686)
=================================================================
System uname: 2.6.17-gentoo-r8 i686 Intel(R) Pentium(R) 4 CPU 1.80GHz
Gentoo Base System version 1.12.5
Last Sync: Fri, 27 Oct 2006 18:30:01 +0000
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: 1.3.7, 2.0.30
dev-lang/python: 2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache: [Not Present]
dev-util/confcache: [Not Present]
sys-apps/sandbox: 1.2.17
sys-devel/autoconf: 2.13, 2.59-r7
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils: 2.16.1-r3
sys-devel/gcc-config: 1.3.13-r4
sys-devel/libtool: 1.5.22
virtual/os-headers: 2.6.17-r1
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=i686 -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/java-config/vms/ /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-march=i686 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks metadata-transfer sandbox sfperms strict"
GENTOO_MIRRORS="http://ftp.uoi.gr/mirror/OS/gentoo/ http://mirror.usu.edu/mirrors/gentoo/ http://ftp.rhnet.is/pub/gentoo/"
LINGUAS=""
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="acpi apache2 apm avi berkdb cracklib crypt curl dbm elibc_glibc fam gdbm gif hal imagemagick innodb input_devices_evdev input_devices_keyboard input_devices_mouse ipv6 java jpeg kernel_linux libwww mysql mysqli ncurses nls nptl nptlonly openssl pam pcre pdflib perl png python readline search spell ssl tcpd unicode userland_GNU video_cards_apm video_cards_ark video_cards_ati video_cards_chips video_cards_cirrus video_cards_cyrix video_cards_dummy video_cards_fbdev video_cards_glint video_cards_i128 video_cards_i740 video_cards_i810 video_cards_imstt video_cards_mga video_cards_neomagic video_cards_nsc video_cards_nv video_cards_rendition video_cards_s3 video_cards_s3virge video_cards_savage video_cards_siliconmotion video_cards_sis video_cards_sisusb video_cards_tdfx video_cards_tga video_cards_trident video_cards_tseng video_cards_v4l video_cards_vesa video_cards_vga video_cards_via video_cards_vmware video_cards_voodoo wddx x86 xml zlib"
Unset: CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, PORTAGE_RSYNC_EXTRA_OPTS
|
Code: | [ebuild R ] net-www/apache-2.0.58-r2 USE="apache2 ssl -debug -doc -ldap -mpm-itk -mpm-leader -mpm-peruser -mpm-prefork -mpm-threadpool -mpm-worker (-selinux) -static-modules -threads" 0 kB
|
|
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Sat Nov 11, 2006 12:43 am Post subject: |
|
|
In which package does one get the nc program? I thought it might be cancd, but that doesn't seem to be the case. |
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Sat Nov 11, 2006 1:10 am Post subject: |
|
|
Answer to my own question... nc is in net-analyzer/netcat. |
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Sat Nov 11, 2006 8:12 pm Post subject: |
|
|
I'm changing the subject on this one to something much more general.
Here's the deal: As I stated, this system ran fine for 5 years with RedHat. When the hard drive crashed, the system was shipped out to me so I could replace the hard drive. I installed the 3ware controller and drives and started the process of installing Gentoo.
The Gentoo install process was a nightmare. I've installed Gentoo before, and never had these types of problems. The system kept crashing mid-install. At the time I chalked this up to the installation program not being entirely compatible with my hardware. This didn't make much sense to me, since this system had run non-stop for 5 years without any strange crashes, until the hard drive crashed.
I had to restart the install process dozens of times. I suppose this should have been my first indication that something else was going on, but I finally pushed through thinking that once I got it installed I'd recompile the kernel to something a little more stable for my hardware. Before I compiled a new kernel I tried installing some other necessary packages. I got a few random segmentation faults. Some I could duplicate a few times, so I assumed they were code problems, but then they'd work. Strange. But I pushed through thinking that a kernel recompile would fix it.
Once the kernel was recompiled, the system ran beautifully. No other segmentation faults or compilation problems occurred. I figured the problem was fixed. I had no crashes, no problems, nothing. I installed some new RAM and ran memtest86 on it and everything looked great. I gave the system a clean bill of health and shipped it back.
Once it arrived, the system was hooked up, fired up and visitors to my web site started applying a load to it. 36 hours passed without incident, and I had no reason to believe anything was wrong. Then it crashed with a kernel panic.
I don't have a copy of the messages from that crash. We rebooted the system without trouble and 5 hours later it crashed with the messages I entered above. Reboot, but now things get interesting.
On reboot the first time the reboot hangs somewhere in the middle of starting services. Reboot again. This time I get some drive errors and the boot process stops midway through. I try to tackle those and reboot again. This time it gets to a prompt, but for some reason the hostname is wrong. It says "none login:". Strange. Reboot again. Works this time.
Then I tried to follow the recommendation above regarding compiling the kernel with netconsole. The kernel recompiled, but I couldn't get netconsole to work. I realized a couple of mistakes in the recompile, so I recompiled again. I tried to reboot. /sbin/shutdown appeared to no longer be working. The system just sits there. I have someone manually reboot the system. Hangs mid boot again, right near starting eth.lo. Try again. Stops with another drive inconsistency. Probably from the nasty reboots.
Long story short, something is royally messed up here. And it makes no sense to me at all that the system was working fine for 5 years and then the entire thing goes out after a drive crash. I suppose it is possible that the original drive crash was not a drive crash at all, but rather a motherboard problem of some sort.
So basically that's where I am with this. I'm not sure whether I should try replacing all major components one at a time until this thing works, or whether I should scrap the whole system, or what. My first step will be to run memtest86 again. After that... any thoughts? |
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Sat Nov 11, 2006 8:15 pm Post subject: |
|
|
One more thing. I did have one problem when compiling kernels. My system has three network adapters. When I tried to compile the drivers for all three of the adapters into the kernel, the system would not boot. I don't remember the exact problem, but it just wouldn't work. When I compiled them as modules, the system worked fine. |
 |
bonbons Apprentice

Joined: 04 Sep 2004 Posts: 250
|
Posted: Sun Nov 12, 2006 11:49 am Post subject: |
|
|
From your story above there could be some issue with disk access...
If memtest reveals no issue, you may want to do some load-test against your disk drives (check the disks and the controller), like running a benchmark, or doing mixtures of reading and writing from filesystem (extract kernel sources, search for files and such, even "cat /dev/<partition> > /dev/null")
During this process, looking at kernel output may be helpful (dmesg -n
When the system freezes again, you may hopefully see some complaints from the kernel on your console. (With that level any message should also be sent out using netconsole. If you have the opportunity to do so, you may want to experiment with netconsole on your LAN first to get some feeling for it. As you have many interfaces on your failing box, be careful to select the right one for netconsole; you can modprobe and rmmod netconsole to change its settings.)
If possible, monitor temperatures and voltages as well. |
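The raw read test mentioned above could be scripted roughly as follows. This is only a sketch: the function name read_passes is made up for illustration, and it reads whatever path it is given (a partition under /dev, or an ordinary file) from start to end, stopping at the first read error:

```shell
# Read a block device or file end to end a given number of times,
# stopping at the first pass on which the read fails.
# Usage: read_passes <path> <number-of-passes>
read_passes() {
    target=$1
    passes=$2
    i=1
    while [ "$i" -le "$passes" ]; do
        if ! cat "$target" > /dev/null; then
            echo "pass $i: read error on $target" >&2
            return 1
        fi
        echo "pass $i ok"
        i=$((i + 1))
    done
}
```

Run as root against the partition in question and leave it going alongside the other load, while watching the console or netconsole output for kernel complaints. A read-only pass like this does not modify the disk.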
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Sun Nov 19, 2006 11:51 pm Post subject: |
|
|
I finally got netconsole working on this system and just started some load testing. Netconsole apparently only works with eth0 on my system, and I was trying to use it over eth2, a back-end network for a few of my servers. I changed it to eth0 and all is well.
So, in an attempt to force this system to crash, I've been doing all-of-the-above recommended methods of putting load on the system. I've got three shell scripts running infinite loops unpacking kernel source, searching for files on the system, and cat'ing /dev/sda7 to /dev/null. The load on the system is around 4.00, and has been that way for the last 6 hours. So far no crashes.
I noticed that the memory is only 75% used during these load tests, and I'm not sure what implications that may have regarding the effectiveness of this load testing.
By the way, this is the same system that I am working to debug in this thread also.
Briefly going under the assumption that because the memory all seems to check out, and because the load testing isn't causing a crash, then the system must be fine... It can't be. I'd say that about 50% of the time, the boot process fails unexpectedly. (I haven't seen this happen since getting netconsole to work, so I can't report on it.) Something strange is up.
I appreciate everyone who has helped me thus far and will report more when I know more. Any new thoughts or tips are greatly appreciated.
Thanks. |
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Mon Dec 04, 2006 4:06 pm Post subject: |
|
|
System stability has improved on this system, though I don't have an exact reason why. There were some memory problems that may have been the source. I removed an older, slower memory chip that may have conflicted with two newer chips.
System is now in production and there have been no crashes since the extra chip was removed. Marking this thread as resolved. |
 |
lmegliol n00b

Joined: 12 Sep 2005 Posts: 68
|
Posted: Wed Dec 06, 2006 6:07 pm Post subject: |
|
|
I've removed the [solved] tag from this topic, because the system instability is back. The system ran perfectly for the last week or two. I rebooted earlier today and within an hour the system started crashing.
Luckily I have netconsole working. The kernel was also recompiled as described above. Here is the output from the crash...
Code: |
------------[ cut here ]------------
kernel BUG at <bad filename>:60395!
invalid opcode: 0000 [#1]
Modules linked in: floppy 3c59x e1000
CPU: 0
EIP: 0060:[<c013b2e4>] Not tainted VLI
EFLAGS: 00010286 (2.6.17-gentoo-r8 #31)
eax: ffffffff ebx: f77b2ba8 ecx: c1f170c0 edx: c1f170c0
esi: 086ea000 edi: c1f170c0 ebp: f767bdc4 esp: f767bdc4
ds: 007b es: 007b ss: 0068
Process apache2 (pid: 6397, threadinfo=f767a000 task=f7b945e0)
Stack: f767be1c c0135f9d 00000000 f754f804 f767be34 00000000 00000001 0879a000
f7894084 f7e16ac0 c032fb3c fffffdc2 ffffffff f7894084 78b86067 08799fff
001c0ffa 0879a000 00000000 f767be34 f754f8b4 f7e16ac0 f767be44 c01386e5
Call Trace:
<c0103655> <c01039ae>
<c0103b2a> <c0103cf8>
<c01044c7> <c010328b>
<c0135f9d> <c01386e5>
<c0112c87> <c0115c9b>
<c0116fc3> <c0117665>
<c011e150> <c0101eaf>
<c0102862>
Code: 55 89 e5 89 c2 83 40 08 ff 0f 98 c0 84 c0 75 02 5d c3 8b 42 08 83 c0 01 78 11 ba ff ff ff ff b8
10 00 00 00 e8 fa 3a ff ff 5d c3 <0f> 0b eb eb 55 89 e5 57 56 53 83 ec 14 89 c7 89 d3 89 4d e4 8b
EIP: [<c013b2e4>] SS:ESP 0068:f767bdc4
<1>Fixing recursive fault but reboot is needed!
Bad page state in process 'apache2'
page:c1f170c0 flags:0xc0000014 mapping:00000000 mapcount:-1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
<c01036b4> <c0103dbb>
<c012f8e7> <c0130198>
<c013022e> <c0136a53>
<c0110425> <c010328b>
BUG: unable to handle kernel paging request
|
Can anyone make sense of this? |
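As a first step toward reading a trace like this, the bracketed addresses can be pulled out of the captured netconsole log so they can be looked up against the kernel's symbols. A rough sketch, assuming a grep that supports -o (GNU and BSD grep do); extract_trace_addrs is a hypothetical helper name:

```shell
# Print every 8-digit hex address that appears in angle brackets on
# standard input, one per line. The two-digit <0f> marker on the
# "Code:" line is skipped because it is not 8 digits long; the EIP
# address is included because it uses the same <...> form.
extract_trace_addrs() {
    grep -o '<[0-9a-f]\{8\}>' | tr -d '<>'
}
```

For instance, "extract_trace_addrs < /path/to/log/file | sort -u" gives a de-duplicated list of addresses from the saved log.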
 |