Gentoo Forums
Advice troubleshooting seemingly random crashes?
NickDaFish
Tux's lil' helper


Joined: 12 Sep 2002
Posts: 112
Location: Boston, USA

Posted: Mon Jul 28, 2003 6:48 pm    Post subject: Advice troubleshooting seemingly random crashes?

At work I recently found an out-of-use dual PIII500 and decided to play with Gentoo on a dual CPU box.

So far so good, apart from some random crashes. I say random but I'm guessing there has to be some reason. My problem is that I can't work out what the heck it is. There is little load on the machine (headless test server) and I never see an error message. It likes to crash during compiles but never in the same place twice. The only thing I've managed to find is the following from boot:
Code:
Jul 28 10:10:52 [kernel] PCI: Unable to handle 64-bit address for device 00:0e.0
Jul 28 10:10:52 [kernel] PCI: Cannot allocate resource region 0 of device 00:0e.0
Jul 28 10:10:52 [kernel] PCI: Failed to allocate resource 1(0-ffffffff) for 00:0e.0
Jul 28 10:10:52 [kernel] Limiting direct PCI/PCI transfers.


Does anyone know what that's about? Anyone got any good advice or links about troubleshooting this type of crash under Gentoo? Any ideas (no matter how daft) welcome. I'm quite clueless on how to proceed.
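
For anyone chasing similar boot messages, here is a minimal Python sketch (an illustration added for this write-up, not something any poster ran) that pulls the PCI address out of kernel lines like the ones quoted above and groups the complaints per device. The function name `pci_complaints` is made up for the example, and the sample text is simply the four log lines from this post.

```python
import re

# The four boot-log lines quoted in this post.
BOOT_LOG = """\
Jul 28 10:10:52 [kernel] PCI: Unable to handle 64-bit address for device 00:0e.0
Jul 28 10:10:52 [kernel] PCI: Cannot allocate resource region 0 of device 00:0e.0
Jul 28 10:10:52 [kernel] PCI: Failed to allocate resource 1(0-ffffffff) for 00:0e.0
Jul 28 10:10:52 [kernel] Limiting direct PCI/PCI transfers.
"""

# PCI addresses look like bus:slot.function, for example 00:0e.0
PCI_ADDR = re.compile(r"\b[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def pci_complaints(log_text):
    """Group kernel 'PCI:' messages by the device address they mention."""
    found = {}
    for line in log_text.splitlines():
        if "PCI:" not in line:
            continue
        # Only look at the text after 'PCI:' so the timestamp is never matched.
        message = line.split("PCI:", 1)[1].strip()
        match = PCI_ADDR.search(message)
        if match:
            found.setdefault(match.group(0), []).append(message)
    return found


if __name__ == "__main__":
    for address, messages in pci_complaints(BOOT_LOG).items():
        print(address)
        for message in messages:
            print("   ", message)
```

All three complaints point at the same device, 00:0e.0 (bus 0, slot 0x0e, function 0). If pciutils is installed, `lspci -s 00:0e.0` should name the device sitting at that address.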
BonezTheGoon
Bodhisattva


Joined: 14 Jun 2002
Posts: 1375
Location: Albuquerque, NM -- birthplace of Microsoft and Gentoo

Posted: Mon Jul 28, 2003 9:25 pm

You got basically two likely suspects here and a VERY unlikely third.

First and foremost, heat. Heat causes many a crash when compiling on lots of machines. Let's face it, most average computer users don't even put a 30% CPU load on their machines -- and the manufacturers know it!! So lots of systems have sub-par cooling out of the box. Check your cooling setup, try to prove that it could not possibly be a heat related issue. (Besides it's usually more likely with dual CPU systems -- BTW are these Coppermine CPUs or the older Kalamath (sp?))

Second is memory timings and/or bad RAM. Run memtest86!!!!!! If you fail any tests at all after four COMPLETE runs you have some sort of RAM issue. First try less aggressive settings in the BIOS, next try new/different RAM sticks (if you have multiple RAM sticks you can try to just remove one at a time to determine which are good and which are bad.)

Third and damned rare, your CPU could be flakey/funky/borked. Just ignore/discard this idea -- it's so rare there are probably a hundred other reasonable things to check before even entertaining this idea!

There is a good article about this kind of stuff written by Daniel Robbins (of Gentoo fame) posted on the IBM site you might care to read.

Regards,
BonezTheGoon
NickDaFish
Tux's lil' helper


Joined: 12 Sep 2002
Posts: 112
Location: Boston, USA

Posted: Tue Aug 05, 2003 9:39 pm

First off, thanks for the post. It was most helpful. :)
It's taken me a while to be able to do all the tests... unfortunately I'm still stuck :(

You suggested 3 things....

1) Heat....
The machine is in a server room with AC. Just to be on the safe side I installed lm-sensors. Apart from the one sensor reading 208°C that I'm ignoring (the box would be on fire), everything looks fine. I've been doing builds and the temp doesn't rise that much. I've been getting crashes at ~38°C. I would imagine it would have to be a bit warmer to be causing real heat problems. Today I managed to crash the thing 25 secs into its boot. Also, as you asked, the chips are Kalamaths.
Here is the lm_sensors output from a min before the most recent crash....
Code:

w83781d-isa-0290
Adapter: ISA adapter
Algorithm: ISA algorithm
VCore 1:   +2.01 V  (min =  +1.80 V, max =  +2.20 V)
VCore 2:   +2.00 V  (min =  +1.80 V, max =  +2.20 V)
+3.3V:     +3.31 V  (min =  +2.97 V, max =  +3.63 V)
+5V:       +4.97 V  (min =  +4.50 V, max =  +5.48 V)
+12V:     +12.16 V  (min = +10.79 V, max = +13.11 V)
-12V:     -12.06 V  (min = -13.18 V, max = -10.78 V)
-5V:       -5.18 V  (min =  -5.48 V, max =  -4.50 V)
fan1:     4218 RPM  (min = 3000 RPM, div = 2)
fan2:     4115 RPM  (min = 3000 RPM, div = 2)
fan3:        0 RPM  (min = 3000 RPM, div = 2)              ALARM
temp1:       +36°C  (limit =  +60°C)
temp2:     +40.0°C  (limit =  +60°C, hysteresis =  +50°C)
temp3:    +208.0°C  (limit =  +60°C, hysteresis =  +50°C)
vid:      +2.000 V
alarms:
beep_enable:
          Sound alarm disabled
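
A reading dump like the one above can be checked against its own limits mechanically. The following is a rough Python sketch added for illustration (the helper name `out_of_range` is made up, and only a handful of the lines above are copied in as sample data). It flags any value below its `min` or above its `max`/`limit`.

```python
import re

# A few lines copied from the sensors output above.
SENSORS_OUTPUT = """\
VCore 1:   +2.01 V  (min =  +1.80 V, max =  +2.20 V)
+12V:     +12.16 V  (min = +10.79 V, max = +13.11 V)
-12V:     -12.06 V  (min = -13.18 V, max = -10.78 V)
fan1:     4218 RPM  (min = 3000 RPM, div = 2)
fan3:        0 RPM  (min = 3000 RPM, div = 2)              ALARM
temp1:       +36°C  (limit =  +60°C)
temp2:     +40.0°C  (limit =  +60°C, hysteresis =  +50°C)
temp3:    +208.0°C  (limit =  +60°C, hysteresis =  +50°C)
"""

# "label: value ..." at the start of a line.
READING = re.compile(r"^(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)")
# "min = x", "max = x" or "limit = x" inside the parentheses.
BOUND = re.compile(r"\b(min|max|limit)\s*=\s*([+-]?\d+(?:\.\d+)?)")


def out_of_range(text):
    """Return (label, value, bound_name, bound) for every violated bound."""
    flagged = []
    for line in text.splitlines():
        reading = READING.match(line)
        if not reading:
            continue  # lines without a numeric reading, e.g. "alarms:"
        value = float(reading.group("value"))
        for name, raw in BOUND.findall(line):
            bound = float(raw)
            too_low = name == "min" and value < bound
            too_high = name in ("max", "limit") and value > bound
            if too_low or too_high:
                flagged.append((reading.group("label").strip(), value, name, bound))
    return flagged


if __name__ == "__main__":
    for label, value, name, bound in out_of_range(SENSORS_OUTPUT):
        print(f"{label}: {value} violates {name} = {bound}")
```

On this sample only fan3 and temp3 are flagged. Both look like unconnected inputs rather than real faults, but that is a guess; it is worth confirming there really is no third fan that should be spinning.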


2) RAM....
I ran memtest86 for a few days. I'm not sure how many runs that comes down to, but the box has 1.5GB of RAM. I think it did more than four. No problems found.

3) Dodgy Chips.....
Again I find this a little hard to accept. The machine had been a happy Windows box until I got my hands on it. If one of the chips was dodgy I would have thought that the Windows install would have fallen over occasionally for no reason. It never did.

The only other lead I currently have is that ntpd will not stay up. It starts and then just disappears a few mins later.
I also enabled console logging but there is nothing interesting in the logs (on the screen) when the machine locks up.

Any more ideas, anyone?
agent_jdh
Veteran


Joined: 08 Aug 2002
Posts: 1779
Location: Scotland

Posted: Tue Aug 05, 2003 9:49 pm

Just from the kernel boot error, it looks as if a device isn't being recognised (i.e. you've not compiled in a driver for it or there is none). If it's a 64-bit PCI device it _could_ be an ethernet adapter. Have a poke around in /proc/bus and see if you can find anything unusual. You might be able to disable some stuff in the box's BIOS.

What make of box is it? Most server OEMs, especially in the days of 500MHz P3s, use custom (or at least customised) motherboards, so there might be something there that "just doesn't like Linux". Again, try disabling stuff in the BIOS that you don't really need to get up and running and see if you get any joy.

You could try - and I'm expecting flames for this :wink: - sticking the latest Red Hat on it for a bit just to see if some flavour of Linux will work, and taking notes on the devices etc. it finds and what sort of kernel config is in use. Then at least you know you're not banging your head against a wall of duff hardware or a box that just doesn't like Linux.
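
On the "/proc/bus" suggestion: the file /proc/bus/pci/devices has one line per device, and its first two fields are compact hex. As far as I know the layout is bus number plus devfn (slot << 3 | function) in the first field, vendor and device ID in the second, then the IRQ. Below is a small Python sketch of decoding it, added purely as an illustration. The sample line is invented (the IDs `1234abcd` are placeholders, not the real device in this box) and `decode_pci_line` is a made-up helper name.

```python
def decode_pci_line(line):
    """Decode the first three fields of a /proc/bus/pci/devices line."""
    fields = line.split()
    bus_devfn = int(fields[0], 16)   # e.g. "0070": bus 0x00, devfn 0x70
    ids = fields[1]                  # 4 hex digits vendor + 4 hex digits device
    bus = bus_devfn >> 8
    devfn = bus_devfn & 0xFF
    slot = devfn >> 3                # upper 5 bits of devfn
    func = devfn & 0x7               # lower 3 bits of devfn
    return {
        "address": f"{bus:02x}:{slot:02x}.{func}",
        "vendor": ids[:4],
        "device": ids[4:],
        "irq": int(fields[2], 16),
    }


if __name__ == "__main__":
    # Hypothetical line: only the address part (0070) is chosen to match
    # the 00:0e.0 device from the boot log; the IDs and IRQ are made up.
    sample = "0070\t1234abcd\tb"
    print(decode_pci_line(sample))
```

So the device the kernel complains about (00:0e.0) would be the line starting with 0070. The vendor and device IDs on that line can then be looked up in a PCI ID list to find out what the hardware actually is.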
Janne Pikkarainen
Veteran


Joined: 29 Jul 2003
Posts: 1143
Location: Helsinki, Finland

Posted: Wed Aug 06, 2003 11:30 am

You get random hangs - I make random guesses. Fair play, huh? ;-)

- If the network adapter is Intel EtherExpress (Pro) 10/100 and it's currently using eepro100 driver, try using e100 driver. Ok, usually eepro100 driver only causes problems under heavy network load, but you never know...

- Try running cpuburn.

- Have you disabled all the power saving options?

- How about CPU VCore setting in BIOS? My box at home had a tendency to crash every now and then, but then I raised VCore a bit higher and everything's been fine since then. That's with AMD Thunderbird and VIA based mobo.

- Maybe not related, but recently my poor old Amiga 1200T became very crashy. Then I noticed the battery in the battery-backed clock had died, causing the system time to jump back and forth. Nursing the battery cured my beloved Miggy.
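
For the first guess in the list above (eepro100 versus e100), a quick way to see which of the two is loaded as a module is to look at the first column of /proc/modules. A minimal Python sketch, added for illustration only: the sample text is invented, `loaded_intel_nic_drivers` is a made-up name, and a driver compiled straight into the kernel will not show up in /proc/modules at all.

```python
def loaded_intel_nic_drivers(proc_modules_text):
    """Return which of eepro100 / e100 appear as loaded modules."""
    candidates = {"eepro100", "e100"}
    # The module name is the first whitespace-separated field on each line.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(candidates & loaded)


if __name__ == "__main__":
    # Hypothetical /proc/modules content; sizes and use counts are made up.
    sample = "eepro100 17264 1\nmii 2272 0 [eepro100]\n"
    print(loaded_intel_nic_drivers(sample))
    # On a real box one would read the file instead:
    #   with open("/proc/modules") as handle:
    #       print(loaded_intel_nic_drivers(handle.read()))
```

An empty result means neither driver is loaded as a module, which could simply mean it is built into the kernel.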
NickDaFish
Tux's lil' helper


Joined: 12 Sep 2002
Posts: 112
Location: Boston, USA

Posted: Wed Aug 06, 2003 9:45 pm

Quote:
If it's a 64-bit PCI device it _could_ be an ethernet adapter.

I don't think so, there is only one and it's being detected. There is an onboard SCSI card that I've disabled in BIOS. I'm guessing that's what it is......

Quote:
Have a poke around in /proc/bus and see if you can find anything unusual. You might be able to disable some stuff in the box's bios.

Um..... I understand about 5% of what's in there. Any good links to go with this?

Quote:
What make of box is it?

It's a homebrew made of fairly std parts. It's based on a SuperMicro P6DBU (Rev1.1) with BIOS v3.1.

Quote:
You get random hangs - I make random guesses. Fair play, huh?

Cool! Let 'em fly.....

Quote:
Try running cpuburn.

I found this little app yesterday, it's really quite cool (Pun! Ha!). After a few hours of burning CPU0 leveled off at 49c and CPU1 at 61c. The box seemed happy as a clam. No crashing.

Quote:
Have you disabled all the power saving options?

Oh yes.

Quote:
How about CPU VCore setting in BIOS?

Not available I'm afraid.

Quote:
Maybe not related, but recently my poor old Amiga 1200T became very crashy

Yay for Amigas! I had one of the first A1200s to be shipped to the UK. If it hadn't been stolen I doubt I ever would have started *really* using PCs. Anyhow...
I think the clock is fine. I'm not 100% sure how you would tell. I've been running hwclock --show and seeing what it says. Incidentally, I figured out my problems with ntp: I just hadn't set the clock to *near* the right time before I started ntp.
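
The ntp behaviour fits what I understand of ntpd's "panic threshold": by default it gives up and exits when the clock is more than about 1000 seconds off, unless it is started with `-g` or the clock is set roughly right first. Here is a small Python sketch of that check, added as an illustration only. The timestamps are invented, the function name is made up, and the 1000 s figure is the documented default as far as I know, not something measured on this box.

```python
from datetime import datetime

# ntpd's default panic threshold in seconds (assumption: stock configuration).
PANIC_THRESHOLD_S = 1000


def ntpd_would_give_up(local_clock, reference_clock, threshold_s=PANIC_THRESHOLD_S):
    """True if the offset between the two clocks exceeds the panic threshold."""
    offset = abs((reference_clock - local_clock).total_seconds())
    return offset > threshold_s


if __name__ == "__main__":
    # Hypothetical times: a local clock 1 h 45 min behind the reference.
    local = datetime(2003, 8, 6, 20, 0, 0)
    reference = datetime(2003, 8, 6, 21, 45, 0)
    print(ntpd_would_give_up(local, reference))
```

Comparing the `hwclock --show` output with a known-good time source by eye does the same job; the point is only that a clock that looks "close enough" to a human can still be outside what ntpd will correct on its own.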

Quote:
You could try - and I'm expecting flames for this - sticking the latest redhat on it for a bit just to see if some flavour of linux will work

This is in fact my last resort (tomorrow's project). I'm fairly sure that the problem is somewhere in my software at this point, rather than the hardware. I've done everything I can think of to cause a hardware failure. So it's time to look at the software. Problem being.... how do I get Red Hat to crash the box? The biggest *biatch* of this whole thing is that I can't find a repeatable crash. I can never get this sucker to go down twice in the same way. All I can do is set off a big emerge and wait; 9 times out of 10 that does it.

Thanks for all the help so far..... and if anyone else has any daft ideas please do add them to the thread. If nothing else this is an interesting learning experience :wink:
agent_jdh
Veteran


Joined: 08 Aug 2002
Posts: 1779
Location: Scotland

Posted: Wed Aug 06, 2003 11:59 pm

Ahhh .... A1200's ... I too got one of the first ones shipped to the UK, but sold it to get a PC

/wipes tear from eye/

Just as well I satisfied my geeky self by getting one (and an HDD) off eBay about a month ago - it's ACE fun.

Sorry for being OT...
agent_jdh
Veteran


Joined: 08 Aug 2002
Posts: 1779
Location: Scotland

Posted: Thu Aug 07, 2003 12:02 am

Just to get back on topic, remove everything from the box and plug it in again. Maybe there's a dicky connector or wire somewhere, and try different IDE cables as well. Have you tested the drives btw? And tried memtest for a few hours?
NickDaFish
Tux's lil' helper


Joined: 12 Sep 2002
Posts: 112
Location: Boston, USA

Posted: Fri Aug 08, 2003 9:06 pm

I'm now fairly convinced it's something in the kernel. I did a rebuild yesterday on the same hardware using an old 1.4_rc2 liveCD. It was up and compiling for at least 12 hours. Not one crash. As soon as I booted up the system that I'd built and ran a few compiles, it died. So the question shifts slightly... from how I find an unknown problem in the hardware to how I find an unknown problem in the software.....

I'm not by any stretch a kernel hacker, but perhaps there is some form of debugging that I can turn on.... Something that will at least leave me with a meaningful message that I can read off the console or something?

I could just rip off the liveCD's kernel but that wouldn't be any fun! I wanna figure this out....... :twisted:
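
Since the liveCD kernel is stable and the self-built one is not, one low-effort way to narrow things down is to diff the two kernel configs and look only at the options that differ. The following is a rough Python sketch of that idea, added as an illustration. The two config snippets are invented examples (not the real configs from this box), the function names are made up, and where the liveCD keeps its config file is something you would have to find out.

```python
def parse_kernel_config(text):
    """Turn kernel .config text into {option: value}; unset options map to 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" -> CONFIG_FOO
            options[line[2:-len(suffix)]] = "n"
    return options


def config_differences(working_text, failing_text):
    """Return {option: (working_value, failing_value)} for options that differ."""
    working = parse_kernel_config(working_text)
    failing = parse_kernel_config(failing_text)
    differences = {}
    for name in sorted(set(working) | set(failing)):
        pair = (working.get(name, "n"), failing.get(name, "n"))
        if pair[0] != pair[1]:
            differences[name] = pair
    return differences


if __name__ == "__main__":
    # Hypothetical snippets, for illustration only.
    livecd = (
        "CONFIG_SMP=y\n"
        "CONFIG_NOHIGHMEM=y\n"
        "# CONFIG_HIGHMEM4G is not set\n"
        "CONFIG_MAGIC_SYSRQ=y\n"
    )
    custom = (
        "CONFIG_SMP=y\n"
        "# CONFIG_NOHIGHMEM is not set\n"
        "CONFIG_HIGHMEM4G=y\n"
        "# CONFIG_MAGIC_SYSRQ is not set\n"
    )
    for name, (good, bad) in config_differences(livecd, custom).items():
        print(f"{name}: liveCD={good} custom={bad}")
```

Changing the differing options one group at a time and re-running a big emerge after each change is slow, but it turns "unknown problem in the software" into a finite list. On the debugging side, enabling the Magic SysRq key (CONFIG_MAGIC_SYSRQ) is one option that may let the console print something useful after a lockup, assuming the kernel is still alive enough to respond.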
All times are GMT