aldimond n00b

Joined: 05 Mar 2006 Posts: 12
|
Posted: Tue Mar 07, 2006 7:20 am Post subject: Diagnosing failures |
|
|
I have recently started getting lockups when I try to compile big packages (problem first manifested itself when emerging gcc, but it's happened also with Linux and with what is perhaps the granddaddy of painful emerges, blas-atlas). During the time the computer is locked up, nothing is running; that is, I have left it sitting there for hours and checked the logs upon reboot, and the normal hourly cron jobs stop showing up after the time of the lockup.
I am having a very hard time diagnosing this issue. I have configured the kernel for "soft lockup detection" and also set up an NMI watchdog in case it's a hard lockup, but I don't have any good way to capture a log of the panic if and when these detectors trigger. The Linux crash-dump-to-disk/swap facilities are not working for me. I've tried "mini-kernel dumps" (which use a method similar to kexec to boot another kernel without resetting the hardware), but the kexec never seems to fire, and I can't tell whether that's because the kernel isn't actually panicking or because the patch is broken.
I think perhaps the best option might be to use netconsole (I have no serial ports or I'd use a serial console); it seems like a much more-tested solution than any of the crash dump saving utilities, and I'm pretty sure it's supported in the mainline kernel, which helps a lot. I have a computer I could use for it, but I'm about 150 miles away from it. Before spending a day on the road just to pick up this computer, I'd like to know if anyone 'round these parts has had any luck with netconsole, or has any tips on how to diagnose kernel lockups with it.
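For anyone weighing the same option, here is a minimal sketch of the receiving side. It assumes netconsole is sending plain UDP text to port 6666 (the default target port in the kernel's netconsole documentation, with 6665 as the default source port). The helper names are hypothetical, and the addresses, interface name and MAC in the usage note are made-up placeholders. `netconsole_param` only formats the `netconsole=` parameter string; the other functions log whatever arrives.

```python
import socket


def netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac):
    """Format the value of the netconsole= parameter:
    [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
    """
    return f"{src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def open_listener(bind_ip="0.0.0.0", port=6666):
    """Open a UDP socket on the logging machine (port is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_ip, port))
    return sock


def receive_line(sock, bufsize=4096):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    return sender_ip, data.decode("utf-8", errors="replace")


def log_forever(sock, log_path):
    """Append every received message to log_path, flushing after each one."""
    with open(log_path, "a") as log:
        while True:
            sender_ip, text = receive_line(sock)
            log.write(f"{sender_ip}: {text}")
            log.flush()
```

On the crashing box the parameter would be passed along the lines of `modprobe netconsole netconsole=` followed by the output of `netconsole_param(6665, "192.168.0.10", "eth0", 6666, "192.168.0.20", "00:11:22:33:44:55")`, and the second machine would run `log_forever(open_listener(), "netconsole.log")`. A plain UDP listener such as netcat would do the same job; the Python version is only there to show how little is needed on the receiving end.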
I guess before I hit the road I should probably also test my hardware more extensively. Actually, this feels a lot like a hardware problem: the first lockup was paired with a compile failing due to a "floating point error" (floating point during a compile? I don't know everything about compilers but that just smells fishy), I never had this problem until recently, and when I downgrade to my previous kernel, under which I never had any problems, I still get the lockups. Still, I would be interested in hearing experiences with netconsole and other ways to retrieve panic info. Thanks for reading this tome.
- Al |
|
|
 |
bunder Bodhisattva

Joined: 10 Apr 2004 Posts: 5956
|
Posted: Tue Mar 07, 2006 7:34 am Post subject: |
|
|
try memtest86 for memory issues? _________________
Neddyseagoon wrote: | The problem with leaving is that you can only do it once and it reduces your influence. |
banned from #gentoo since sept 2017 |
|
|
 |
aldimond n00b

Joined: 05 Mar 2006 Posts: 12
|
Posted: Tue Mar 07, 2006 3:38 pm Post subject: |
|
|
Have done. Left memtest up for 8 hours with no errors (I should also mention that if I'm not doing compiles or running xscreensaver the box is as stable as a table... I can leave it up indefinitely without having to restart, as I did last night, and had for weeks before starting to compile things). I'd probably start by pulling all the PCI cards and disabling the built-in stuff on the motherboard (including sound and my backup NIC). It seems to me that a likely point of failure is the CPU, but I don't have a backup I can throw in there at the moment; I'd also be suspicious of northbridge overheat/malfunction, since I just installed a new northbridge fan, but I'm not sure just how to test that. Well, if I find out anything cool (or not-so-cool) about Linux in the process I'll post it. |
|
|
 |
bunder Bodhisattva

Joined: 10 Apr 2004 Posts: 5956
|
Posted: Tue Mar 07, 2006 10:18 pm Post subject: |
|
|
If you really think it's the CPU, why not try something like prime95 or cpuburn? If they crash your box, the CPU (or its cooling) is the prime suspect. As for the northbridge, does your motherboard have a MB sensor? If you see it lock up, hit the BIOS asap and check.
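To illustrate the idea behind that kind of test (this is not how prime95 or cpuburn are implemented, and it loads the CPU far less than either), here is a rough sketch: repeat a deterministic calculation many times and flag any run whose result differs from the first. On healthy hardware the results should never differ. The function names and iteration counts are made up for the example.

```python
import hashlib
import math


def workload(seed, iterations):
    """Deterministic mix of floating-point and hashing work.

    Returns a hex digest summarising every intermediate value, so a single
    wrong bit anywhere in the run changes the result.
    """
    acc = float(seed)
    digest = hashlib.sha256()
    for i in range(1, iterations + 1):
        acc = math.sin(acc) * math.sqrt(i) + acc + i * 0.5
        digest.update(repr(acc).encode("ascii"))
    return digest.hexdigest()


def consistency_check(rounds, iterations=200_000, seed=1):
    """Run the workload repeatedly and compare against the first result.

    Returns None if every round matched, otherwise the index of the first
    round that produced a different digest.
    """
    reference = workload(seed, iterations)
    for n in range(rounds):
        if workload(seed, iterations) != reference:
            return n
    return None
```

Calling `consistency_check(1000)` in one process per core and leaving it running gives a crude burn-in. A mismatch or a lockup under this load would point at the CPU, its cooling, or its power supply rather than at any particular package being compiled.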
Also, ever tried the magic sysrq keys?
Quote: | If you have this enabled, it can be useful in the case where the system has escaped your control and nothing else is working. The following sequence may be better than just hitting the power button:
Alt+SysRq+s - sync the disk
Alt+SysRq+e - try to nicely kill processes
(wait a little bit here)
Alt+SysRq+i - no more mister nice guy
Alt+SysRq+u - remount disks read-only
(wait a bit here, too)
Alt+SysRq+b - reboot
|
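If the box still answers over SSH but the keyboard is no use, the same sequence can be sent by writing the letters to `/proc/sysrq-trigger` as root. A minimal sketch follows, assuming magic SysRq support is compiled into the kernel. The 5-second pauses are an arbitrary guess at "a little bit", and the function names are made up for the example.

```python
import time

# (key, seconds to wait afterwards), in the order given in the quote above.
SEQUENCE = [
    ("s", 0),  # sync the disks
    ("e", 5),  # ask processes to terminate, then wait a little
    ("i", 0),  # kill whatever is left
    ("u", 5),  # remount filesystems read-only, then wait a bit
    ("b", 0),  # reboot immediately
]


def run_sequence(write_key, sleep=time.sleep, steps=SEQUENCE):
    """Send each SysRq key through write_key, pausing where the list says so."""
    for key, pause in steps:
        write_key(key)
        if pause:
            sleep(pause)


def write_to_trigger(key, path="/proc/sysrq-trigger"):
    """Write one SysRq command letter to the kernel's trigger file (needs root)."""
    with open(path, "w") as trigger:
        trigger.write(key)
```

Running `run_sequence(write_to_trigger)` as root reboots the machine for real, so it is worth trying with `run_sequence(print, sleep=lambda s: None)` first to see what would be sent.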
I had to do this for a while when I was diagnosing system problems (but that was nvidia's drivers). It's also possible that the system is still running, but X or another app is hogging so much CPU that it appears to just sit there doing nothing.
|
|
 |
aldimond n00b

Joined: 05 Mar 2006 Posts: 12
|
Posted: Tue Mar 07, 2006 10:28 pm Post subject: |
|
|
Thanks for the suggestions for CPU testing. I'll try them as soon as I'm home and don't need my box for anything useful. I haven't been able to find much documentation about just which magic sysrq keys do what, so thanks for the synopsis (it's probably somewhere in /usr/src/linux/Documentation, come to think of it...). Just one question: would they still work on a USB keyboard? (I don't have a PS/2 or AT keyboard, or ports for them, or serial ports with which I could send break signals, as I've read can be done.)
Unfortunately my motherboard does not have a temperature sensor for the northbridge, only one for the CPU and one that measures "system temperature". It probably would be a good idea to actually read up on what kind of activity stresses the northbridge rather than just making random wild guesses as I've been doing. |
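One way to see what the CPU and "system temperature" sensors read just before a lockup is to log them to disk every few seconds and look at the last line after rebooting. This is a rough sketch only. It assumes the sensor driver exposes readings as `temp*_input` files (in millidegrees Celsius) directly under `/sys/class/hwmon/hwmon*/`, which depends on the kernel version and on having the right sensor module loaded. The function names are made up for the example.

```python
import glob
import os
import time


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_celsius} for every temp*_input under root."""
    temps = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as sensor:
                # The hwmon sysfs convention reports millidegrees Celsius.
                temps[path] = int(sensor.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # unreadable or malformed sensor file: skip it
    return temps


def log_temps(log_path, root="/sys/class/hwmon"):
    """Append one timestamped line of readings and force it onto the disk."""
    readings = read_temps(root)
    fields = " ".join(
        f"{os.path.relpath(path, root)}={temp:.1f}C"
        for path, temp in readings.items()
    )
    line = time.strftime("%Y-%m-%d %H:%M:%S") + " " + fields + "\n"
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())  # so the line survives a sudden lockup
    return line
```

Calling `log_temps("/var/log/temps.log")` in a loop with `time.sleep(5)` while a big emerge runs would show whether either reading climbs before the lockup. It won't say anything about the northbridge unless one of those sensors happens to sit near it.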
|
|
 |
bunder Bodhisattva

Joined: 10 Apr 2004 Posts: 5956
|
Posted: Tue Mar 07, 2006 10:59 pm Post subject: |
|
|
The magic SysRq key should work with any keyboard attached to the PC.
The system temperature sensor might actually be under the northbridge. Check your manual for the location of the thermistor. Don't hold me to that though, I could be wrong.
|
|
 |
|
|
|
|