Heissi n00b
Joined: 22 Feb 2008 Posts: 5
Posted: Fri Feb 22, 2008 11:31 am Post subject: Random kernel panics
Recently I have been getting random kernel panics on my server/router.
Some background information:
It is an EPIA PD with a VIA C3 CPU.
I have only installed a few daemons on it, nothing special.
At first, the server just froze and I noticed that the CPU fan wasn't spinning anymore. So I replaced the fan and the hardware seemed to be OK, but then these kernel panics started at random times (10-200 minutes after boot).
The BIOS battery was also low on voltage (the BIOS settings and the system clock had been reset), so I replaced it, but the kernel panics were still there.
Then I tested the CPU with cpuburn and some emerges, but... nope... The panic happened after the emerge process had finished (no CPU load at that point).
The RAM seems to be OK too (memtest86+).
Finally I replaced the hard disk (I cloned the system), but that didn't resolve anything either.
I don't think the kernel is broken, because the system had been running for 90 days without any problems.
What should I do now?
I'm really inexperienced with kernel panics.
Is there a way to trace the problem back to its source (maybe the mainboard)?
I don't know which information about the system is relevant, so if you need anything, just ask.
Thanks.
pathfinder l33t
Joined: 19 Jan 2006 Posts: 731 Location: Barcelona, Spain
Posted: Fri Feb 22, 2008 1:15 pm
Try recompiling the kernel from your config file.
Was your config file changed lately?
Back it up, then cd /usr/src/linux and run make menuconfig.
You can't boot your computer, can you?
Maybe it is due to the clock: as far as I remember, the handbook warns you, at the point where you compile your kernel for the first time, to make sure the date is correct before proceeding. Maybe the wrong date made a huge mess.
I would definitely set the correct date now that the battery has been changed, and then recompile your kernel as it is now, just to see what happens.
Heissi n00b
Joined: 22 Feb 2008 Posts: 5
Posted: Fri Feb 22, 2008 2:42 pm
I upgraded from hardened-sources-2.6.23 to hardened-sources-2.6.23-r7.
While I was testing something I got this message:
Code: | invalid opcode: 0000 [#1]
Modules linked in: thermal button processor
CPU: 0
EIP: 0060:[<c04e0739>] Not tainted VLI
EFLAGS: 00010002 (2.6.23-hardened-r7 #1)
EIP is at elv_rb_add+0x1/0x51
eax: ddf9bd64 ebx: ddf9bd4c ecx: ddf9bd4c edx: d231be84
esi: ddf8f9c0 edi: d231be84 ebp: 00000000 esp: c6887b3c
ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068
Process mconf (pid: 29677, ti=c6886000 task=c4bf8ab0 task.ti=c6886000)
Stack: ddf8f9c0 c04e8772 d231be84 ddf9bd4c ddf8f9c0 c04e9808 d231be84 ddf92ad0
00000008 c04e0b22 ddf92b30 0005ffbe 00000086 d231be84 ddf92ad0 00000008
00000000 c04e3c8e 00000000 00000000 d231be84 c14512a0 c14512a0 00000008
Call Trace:
[<c04e8772>] cfq_add_rq_rb+0x3c/0x74
[<c04e9808>] cfq_insert_request+0x1c/0x3a
[<c04e0b22>] elv_insert+0xa4/0x141
[<c04e3c8e>] __make_request+0x28c/0x2b6
[<c04e3eb0>] generic_make_request+0x17e/0x1ab
[<c046af47>] bio_add_page+0x31/0x37
[<c046dad8>] mpage_end_io_read+0x0/0x5e
[<c04e3f82>] submit_bio+0xa5/0xac
[<c046dad8>] mpage_end_io_read+0x0/0x5e
[<c046dbaf>] mpage_bio_submit+0x19/0x1d
[<c046e104>] mpage_readpages+0x10f/0x11c
[<c04852d4>] ext3_get_block+0x0/0xbe
[<c05f82fd>] io_schedule+0xe/0x16
[<c05f8421>] __wait_on_bit+0x4a/0x51
[<c05f8496>] out_of_line_wait_on_bit+0x6e/0x76
[<c0467d3f>] sync_buffer+0x0/0x2e
[<c0439ffd>] buffered_rmqueue+0xbf/0xd7
[<c043bc31>] read_pages+0x28/0xd3
[<c04852d4>] ext3_get_block+0x0/0xbe
[<c043a1b6>] __alloc_pages+0x51/0x2a4
[<c043bde5>] __do_page_cache_readahead+0x109/0x123
[<c043bef4>] ra_submit+0x20/0x25
[<c043c054>] page_cache_sync_readahead+0x2a/0x2f
[<c043703d>] do_generic_mapping_read+0xda/0x3ff
[<c04375c7>] generic_file_aio_read+0x11f/0x14a
[<c0437362>] file_read_actor+0x0/0xda
[<c044ea5b>] do_sync_read+0xbe/0xfb
[<c0423a42>] autoremove_wake_function+0x0/0x33
[<c04106f4>] do_page_fault+0x2a7/0x5c7
[<c044e282>] nameidata_to_filp+0x23/0x32
[<c044eb21>] vfs_read+0x89/0x104
[<c044edde>] sys_read+0x41/0x67
[<c0403c9d>] sysenter_past_esp+0x66/0x99
[<c0403cb6>] sysenter_past_esp+0x7f/0x99
=======================
Code: 48 04 c7 42 3c 00 00 00 00 c7 43 04 00 00 00 00 eb 0e 8b 42 24 03 42 1c 39 f0 75 04 89 d0 eb 06 89 f8 eb a9 31 c0 5b 5e 5f c3 56 <89> c1 89 c6 53 31 db 83 38 00 74 22 8b 19 8d 4b bc 8b 41 1c 39 |
Then I had to reboot, because the system was screwed up (around 10 defunct processes).
Looks like a memory or CPU error, doesn't it?
But I trust memtest86+, and the CPU heatsink wasn't really hot (why isn't there a sensor on the CPU?), so I removed the thermal paste and applied some fresh paste, just to be sure.
Unfortunately, recompiling the kernel didn't solve the problem.
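When a box produces several oopses, it can help to pull the function names out of each call trace and compare them: crashes that keep landing in unrelated code paths are often taken as a hint towards hardware rather than one specific kernel bug. The following is only a minimal sketch of that idea in Python. The helper name `parse_trace` and the regular expression are assumptions made for illustration, and the sample lines are copied from the trace above.

```python
import re

# Matches oops call-trace entries such as " [<c04e8772>] cfq_add_rq_rb+0x3c/0x74".
# The pattern is an assumption based on the trace format shown in this thread.
TRACE_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_trace(text):
    """Return (address, symbol, offset, size) tuples in the order they appear."""
    return [
        (int(addr, 16), sym, int(off, 16), int(size, 16))
        for addr, sym, off, size in TRACE_RE.findall(text)
    ]


# First four call-trace entries from the oops posted above.
SAMPLE = """\
 [<c04e8772>] cfq_add_rq_rb+0x3c/0x74
 [<c04e9808>] cfq_insert_request+0x1c/0x3a
 [<c04e0b22>] elv_insert+0xa4/0x141
 [<c04e3c8e>] __make_request+0x28c/0x2b6
"""

if __name__ == "__main__":
    for addr, sym, off, size in parse_trace(SAMPLE):
        print(f"{sym:<20} offset {off:#x} of {size:#x} (address {addr:#010x})")
```

Saving the text of each oops to a file and running `parse_trace` on every one makes it easy to see whether the same symbols show up again and again.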
pathfinder l33t
Joined: 19 Jan 2006 Posts: 731 Location: Barcelona, Spain
Posted: Fri Feb 22, 2008 2:50 pm
Well, have you tried another distro? Or Windows?
Just to find out whether it is a hardware problem or software related.
Does cat /proc/cpuinfo give you anything?
Try cat /proc/whatever, just to get some extra info.
dmesg might also say something, and so might /var/log/messages.
I can't really tell you anything else right now.
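Going through dmesg output or /var/log/messages by hand gets tedious, so here is a small Python sketch that picks out lines which look like crash reports. The marker list is an assumption: "invalid opcode" comes from the oops posted in this thread, the others are phrases that commonly appear in Linux crash output. The function names are made up for illustration. Keep in mind that a hard panic may never reach the disk, so an empty result proves nothing.

```python
import re

# Assumed crash markers, matched case-insensitively.
MARKERS = re.compile(r"kernel panic|oops|invalid opcode|BUG:", re.IGNORECASE)


def find_crash_lines(lines):
    """Return (line_number, text) for every line that contains a marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if MARKERS.search(line)
    ]


def scan_file(path="/var/log/messages"):
    """Scan a log file; the default path is the one mentioned in this thread."""
    with open(path, errors="replace") as handle:
        return find_crash_lines(handle)


if __name__ == "__main__":
    sample = [
        "eth0: link up\n",
        "invalid opcode: 0000 [#1]\n",
        "Kernel panic - not syncing: Fatal exception\n",
    ]
    for number, text in find_crash_lines(sample):
        print(number, text)
```

Reading the real log usually needs root, for example by calling `scan_file()` from a root shell.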
Heissi n00b
Joined: 22 Feb 2008 Posts: 5
Posted: Sat Feb 23, 2008 3:21 pm
pathfinder wrote:
Well, have you tried another distro? Or Windows?
Just to find out whether it is a hardware problem or software related.
Does cat /proc/cpuinfo give you anything?
Try cat /proc/whatever, just to get some extra info.
dmesg might also say something, and so might /var/log/messages.
I can't really tell you anything else right now.
I tried to install Windows (I had installed it on this machine before, so it should work) and I got a bluescreen. Some interrupt error (IRQL_NOT_LESS_...).
The kernel panic message was similar to this (interrupt exception).
So I can't do anything but buy a new mini-ITX mainboard, right?
pathfinder l33t
Joined: 19 Jan 2006 Posts: 731 Location: Barcelona, Spain
Posted: Sun Feb 24, 2008 2:16 pm
Well, that looks like a serious hardware failure... :S
I can't really tell you which part it is.
Is your mobo still under warranty? That could be useful here...
gundelgauk n00b
Joined: 01 Oct 2007 Posts: 40
Posted: Sun Feb 24, 2008 5:26 pm
Yes, sounds like faulty hardware. Since you have already ruled out RAM and hard drive, it could be the CPU or the mainboard. You said yourself that the first time your system froze was when the CPU fan died. Maybe the processor took some damage when that happened.
Apart from that: memtest showing no errors cannot guarantee that your RAM is 100% OK. If it does show errors, your RAM is faulty. But it doesn't work the other way round. It might be that your RAM only produces errors when a very specific pattern gets written to (or read from) a very specific address. And if memtest does not test exactly this pattern, no error will show up even though you still have faulty RAM.
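To make the point about patterns concrete, here is a toy simulation in Python. It is not how memtest86+ works internally. `FaultyMemory` is a made-up class with one hypothetical defect: a single cell flips a bit, but only when one particular value is written to it. The address 777 and the trigger value 0x5A are arbitrary choices for the example.

```python
class FaultyMemory:
    """Simulated RAM with one hypothetical defect: a bit in one cell flips,
    but only when a specific trigger value is stored there."""

    def __init__(self, size, bad_addr, trigger, flip_mask=0x01):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.trigger = trigger
        self.flip_mask = flip_mask

    def write(self, addr, value):
        if addr == self.bad_addr and value == self.trigger:
            value ^= self.flip_mask  # the defect corrupts the stored value
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def pattern_test(mem, size, patterns):
    """Write each pattern to every cell, read it back, return failing addresses."""
    failures = set()
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            if mem.read(addr) != pattern:
                failures.add(addr)
    return sorted(failures)


SIZE = 1024
mem = FaultyMemory(SIZE, bad_addr=777, trigger=0x5A)

# A test with a handful of fixed patterns never writes 0x5A, so it passes.
print(pattern_test(mem, SIZE, [0x00, 0xFF, 0xAA, 0x55]))  # -> []
# Only a test that happens to include the trigger value exposes the bad cell.
print(pattern_test(mem, SIZE, [0x5A]))  # -> [777]
```

The first run reports no errors although the simulated RAM is defective, which is exactly the situation described above: a clean test result only tells you that the patterns that were tried did not fail.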