JustAnother Apprentice

Joined: 23 Sep 2016 Posts: 209
|
Posted: Mon Apr 14, 2025 5:44 pm Post subject: |
|
|
I had forgotten about this, but the cpu in this computer is not a regular cpu. It is a heat-abused cpu.
About 8 years ago I heard a loud bang from what I thought was the computer case, like a bb bouncing off of metal. A few minutes later the computer shut down.
My first mistake was to power it up again. It booted, but shut down again after a while.
OK, so it was the computer case. I opened it, expecting to find maybe a blown up capacitor.
Instead, everything looked ok, except there was a small piece of plastic at the bottom.
Was it there before? I dunno, so I booted again and it shut down again.
At that point a much more careful inspection showed that one of the two small hooks on the plastic guard around the cpu socket,
which hold the cpu cooler, had broken off and gone flying.
The cooler plate had separated on one side, leading to a small tilt between the cpu plate and the cooler plate.
A small tilt, but a total thermal decoupling. It was hard to spot this with all the stuff around there.
So I ordered a new plastic guard piece, and after that no problems. Until now.
How many times did the computer overheat and shut down? Two to four times, but even one time is a big mistake.
Maybe this is where the bill comes due - in unit longevity.
I'm sticking with the heat theory and the intermittent theory. I think the evidence above backs this up. |
|
|
 |
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 10015 Location: almost Mile High in the USA
|
Posted: Mon Apr 14, 2025 8:09 pm Post subject: |
|
|
So it was a hardware problem after all.
I would have thought that after AMD's Athlon and XP chips the IHS on the Athlon64 would have also included a thermal sensor to prevent damage, but perhaps not.
Sounds like that machine is ready for the recycling bin... however I have run my i7 up to 90°C+ for several hours on end and it's still okay, though it does have thermal throttling capability. Unsure if the A64 also had throttling, or if it just shuts down/crashes in an overheat situation. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
|
 |
pjp Administrator


Joined: 16 Apr 2002 Posts: 20609
|
Posted: Tue Apr 15, 2025 3:30 am Post subject: |
|
|
eccerr0r wrote: | I would have thought that after AMD's Athlon and XP chips the IHS on the Athlon64 would have also included a thermal sensor to prevent damage, but perhaps not. |
His CPU seems to be included in that. The Athlon X2 6000+ was a Brisbane. Unless it's a separate issue, but I noticed at least one instance of "faulty temperature sensors."
_________________ Quis separabit? Quo animo? |
|
|
 |
JustAnother Apprentice

Joined: 23 Sep 2016 Posts: 209
|
Posted: Tue Apr 15, 2025 8:53 pm Post subject: |
|
|
One thing that might be an important point is that even one thermal shutdown of a cpu may be a bad thing, because it pushes the cpu beyond just "running hot".
I should have shut down the computer right after a funny sound like that and started asking questions.
It's kind of like a car engine. If the engine is revved up past the red line, the engine may or may not throw a rod, but even if it doesn't throw a rod the parts have been stressed beyond their design points.
By the way, is there a simple way to tell if the northbridge is part of the main cpu package, and if not, how to identify the northbridge chip? |
|
|
 |
JustAnother Apprentice

Joined: 23 Sep 2016 Posts: 209
|
Posted: Tue Apr 15, 2025 10:36 pm Post subject: |
|
|
I got the idea to try a little experiment. I realized there was a simple setup
lying around that compiled two trivial c files and linked them into an executable.
I was playing around with make.
So I wrote a tiny script:
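Code: | #!/bin/bash
# One grind cycle: delete the object files so make has to rebuild everything.
cycle()
{
    rm -f -- *.o
    make
}
# Loop forever, printing the cycle number before each rebuild.
count=0
while true; do
    printf "count: %s\n" "${count}"
    cycle
    count=$(( count + 1 ))
done
|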
The script just grinds on gcc.
To run the script:
Code: | grindme.sh | tee grindme.log
sed -rn '/^count:/p' grindme.log | tail -n1 |
This gets the number of cycles before a crash. For freezes, eyeballing the screen gets the count.
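Since the log can end in a run of NUL bytes after a hard reset, a slightly more defensive way to pull out the last count is to strip the NULs first. This is only a sketch: `last_count` is a made-up helper name, and it assumes the `count: N` line format printed by the script above.

```shell
# last_count FILE: print the last cycle number recorded in FILE.
# Hypothetical helper; assumes lines of the form "count: N".
last_count() {
    # Drop NUL bytes first so a torn tail does not confuse line-oriented tools.
    tr -d '\000' <"$1" | sed -n 's/^count: \([0-9][0-9]*\)$/\1/p' | tail -n 1
}
```

Usage would be `last_count grindme.log`.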
So I made several runs and here are the counts:
#1: 142 (crashed)
#2: 153 (crashed)
#3: 95 (crashed)
#4: 94 (froze)
#5 (after reboot): 167 (crashed)
#6: 185 (froze)
This gets old pretty fast.
After a reboot, the log file shows this:
<SNIP>
count: 141
gcc -c -ggdb -O0 -v -o myprog.o myprog.c
gcc -c -ggdb -O0 -v -o hello.o hello.c
gcc -o myprog myprog.o hello.o
gcc -static -o myprog.x myprog.o hello.o
count: 142
gcc -c -ggdb -O0 -v -o myprog.o myprog.c
<NULs><SNIP>
tee only made it to 142 due to buffering (there is no tee option to run unbuffered).
There are actually several hundred nul characters in there. I snipped 'em.
The cases where the script crashes always said the same thing: ld failed with code 1.
This is the same mystery mentioned above: non-random with respect to specific failure,
although this one is not very specific.
Whatever is going on here, this specific failure mystery is a critical clue. Any ideas here?
And yet when I do ordinary noodling on this computer (editing a file, using the web browser, etc.)
you almost wouldn't know there is a problem.
And yet I did freeze the computer doing ordinary things: I was looking at a bunch of bloated
web pages about computer cases. Which generates more heat.
Firefox does not regularly crash on this computer.
None of this proves that this is heat related, but the chips are slowly lining up in that direction.
Maybe this motherboard is exquisitely balanced right at the edge of total failure. |
|
|
 |
Banana Moderator


Joined: 21 May 2004 Posts: 2009 Location: Germany
|
|
|
 |
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 10015 Location: almost Mile High in the USA
|
Posted: Wed Apr 16, 2025 7:16 am Post subject: |
|
|
I thought the Athlon64 and later had the memory controller on chip, so the northbridge is on the cpu.
On the Intel side, Nehalem/Westmere was the first to have the memory controller on the CPU die.
If you're at the edge, you should try underclocking if your firmware supports user hacking. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
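If the firmware offers no clock settings, a software-side frequency cap is one way to approximate underclocking. A minimal sketch, assuming a cpufreq driver is loaded (on an Athlon 64 that would typically be powernow-k8) and the usual sysfs layout. `cap_cpu_freq` and its optional second argument are hypothetical and exist only so the function can be tried against a scratch directory first.

```shell
# cap_cpu_freq KHZ [CPU_ROOT]: write KHZ into every writable scaling_max_freq.
# Hypothetical helper. CPU_ROOT defaults to the real sysfs tree; needs root there.
cap_cpu_freq() {
    local khz="$1" root="${2:-/sys/devices/system/cpu}" f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        # Skip CPUs without cpufreq support (the glob stays literal if nothing matches).
        [ -w "$f" ] || continue
        printf '%s\n' "$khz" >"$f"
    done
}
```

For example, `cap_cpu_freq 1000000` as root would ask for a 1 GHz ceiling (the value is in kHz). Where the driver provides it, `scaling_available_frequencies` in the same directory lists the usable steps.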
|
|
 |
Chiitoo Administrator


Joined: 28 Feb 2010 Posts: 2783 Location: Here and Away Again
|
Posted: Wed Apr 16, 2025 7:54 am Post subject: |
|
|
I'd probably still test with just one memory stick, if there are more than one, and in different slots.
As also mentioned, memtest is not flawless either.
I had bad RAM once, but in certain circumstances I would get a clean result.
This was when Ryzen was new though, and memtest86+ did not work for me at all so I used the other memtest86 from passmark.
I sent them an e-mail about it and after some exchanges I received debug builds from them and they got the issue sorted eventually.
That is if I remember things right-like. I can't find the e-mails right now for some reason...
With older hardware this kind of bug is of course a lot less likely, but regardless, I think it's a good thing to test. _________________ Kindest of regardses. |
|
|
 |
Josef.95 Advocate

Joined: 03 Sep 2007 Posts: 4755 Location: Germany
|
Posted: Wed Apr 16, 2025 9:44 am Post subject: |
|
|
With a mainboard from 2008, I think the voltage regulator is probably dead.
This was the issue with my good old Abit mainboard (on the 3.3 V line) :-/
Try checking the voltages with sys-apps/lm-sensors (at idle, and under heavy load). |
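A minimal sketch of that idle-versus-load comparison, assuming lm-sensors is installed and its `sensors` command is on the PATH. `snapshot_sensors` and the log file name are made up for illustration.

```shell
# snapshot_sensors LABEL FILE: append a labelled, timestamped sensors dump to FILE.
# Hypothetical helper; it only wraps the stock `sensors` command.
snapshot_sensors() {
    local label="$1" out="$2"
    {
        printf '=== %s %s ===\n' "$label" "$(date '+%F %T')"
        sensors 2>&1 || printf 'sensors failed or is not installed\n'
    } >>"$out"
}
```

Run `snapshot_sensors idle volts.log` on a quiet machine and `snapshot_sensors load volts.log` in the middle of a big compile, then compare the voltage lines between the two dumps. A rail that sags noticeably under load would be the suspicious one.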
|
|
 |
JustAnother Apprentice

Joined: 23 Sep 2016 Posts: 209
|
Posted: Wed Apr 16, 2025 9:54 pm Post subject: |
|
|
Lots of good insight here. I'll try this part first, since I just got this part fixed.
Quote: | Can you add temperature readings to your tests? |
Here is the situation. I've known for a long time that when the computer started showing all
those messages on bootup, it was complaining about some sensor suite. I think it said
that module it87 would not load due to "resource busy", among a few other module errors.
I tried a long time ago to find these messages in dmesg or the logs, and couldn't. So yesterday I started
thinking about this again and looked much harder for the error messages in /var/log. Nothing.
Why would these critical messages be missing? My guess is that the errors go to stderr and
don't make it into the logs. If that is the case, that needs to be fixed to log the errors.
I am going to have to ctrl-S the screen and take a picture to get at those messages.
I let this slide for so long because:
: The computer always worked, and the above situation requires the picture. PITA.
: I was always under the impression that the control system for the fan was within the
motherboard, and that there was no alteration of this from software.
In other words, I thought all the sensor software just did passive monitoring.
Then yesterday I found out about pwmconfig, sensors, and fancontrol. pwmconfig won't do
anything without it87 loaded. sensors shows the cpu temperature, but nothing else.
The fix for it87 is to put this in place:
Code: | cat /etc/modprobe.d/it87.conf
# Local IT87 sensor options
options it87 ignore_resource_conflict=1 |
Without that module option it won't load.
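Before relying on that option, it may be worth confirming that the installed it87 driver actually knows it, since not every kernel version may have it. A small sketch, assuming `modinfo -p` prints one `name:description` line per parameter. `module_has_param` is a hypothetical helper.

```shell
# module_has_param MODULE PARAM: succeed if MODULE advertises a parameter PARAM.
# Hypothetical helper built on modinfo's parameter listing.
module_has_param() {
    modinfo -p "$1" 2>/dev/null | grep -q "^$2:"
}
```

Something like `module_has_param it87 ignore_resource_conflict && echo supported` would then tell you whether the line in it87.conf can take effect. After editing the file, `modprobe -r it87 && modprobe it87` as root reloads the module with the new options.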
So pwmconfig now worked:
Code: | cat /etc/fancontrol
# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=
DEVNAME=
FCTEMPS=
FCFANS=
MINTEMP=
MAXTEMP=
MINSTART=
MINSTOP= |
And sensors shows this sort of thing:
Code: | sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +61.0°C
Core0 Temp: +59.0°C
Core1 Temp: +57.0°C
Core1 Temp: +61.0°C
it8712-isa-0290
Adapter: ISA adapter
in0: 1.38 V (min = +0.26 V, max = +1.02 V) ALARM
in1: 0.00 V (min = +0.00 V, max = +1.63 V) ALARM
in2: 3.31 V (min = +0.13 V, max = +0.00 V) ALARM
+5V: 64.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in4: 3.09 V (min = +0.00 V, max = +0.00 V) ALARM
in5: 1.55 V (min = +0.02 V, max = +0.00 V) ALARM
in6: 2.05 V (min = +0.16 V, max = +2.05 V) ALARM
5VSB: 2.05 V (min = +0.00 V, max = +0.00 V) ALARM
Vbat: 3.33 V
fan1: 4299 RPM (min = 1753 RPM)
fan3: 2606 RPM (min = 20 RPM)
temp1: +54.0°C (low = +10.0°C, high = +1.0°C) ALARM sensor = thermistor
temp2: +40.0°C (low = +1.0°C, high = +0.0°C) ALARM sensor = thermistor
temp3: -128.0°C (low = +0.0°C, high = +4.0°C)
pwm1: 0% (freq = 375000 Hz)
pwm2: 0% (freq = 375000 Hz)
pwm3: 0% (freq = 375000 Hz)
cpu0_vid: +1.550 V
intrusion0: ALARM |
I hope there's nothing ALARMing about those ALARM's.
Anyway, I now have the ability to dump these results after each cycle of the grinder script,
so I'll paste that soon.
By the way, the computer is still working fine after idling all night. |
|
|
 |
JustAnother Apprentice

Joined: 23 Sep 2016 Posts: 209
|
Posted: Thu Apr 17, 2025 3:42 am Post subject: |
|
|
Here is a run that froze at 151 grind cycles:
https://pastebin.com/82dTT2fs
I changed the script to append a file on each cycle, but e.g. the freeze whacked a file with 164 results down to 116.
Then I changed the script to put each run cycle log into a separate file and sync it. That catches all the results.
Without the sync, many of the individual result files wound up with zero bytes. So the sync is critical. Worth remembering that.
Note from this result and the above results that when the grinder script is run on an idling computer, it repeatably takes about
150 cycles to freeze or crash. But in the cases where it crashes and I could restart the script quickly, it takes about 90 cycles,
almost as if something has not had time to fully reequilibrate its temperature. But I could be fooling myself about this.
I'm afraid all these freezes will mess up the file system at some point. So no more runs for now. |
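For anyone repeating this, the per-cycle logging with sync could look roughly like the sketch below. It is not the exact script used for the run above: `grind`, `run_cycle`, `LOGDIR`, and `BUILD_CMD` are names invented here, and the `sensors` call assumes lm-sensors is installed.

```shell
#!/bin/bash
# Sketch: one log file per grind cycle, flushed to disk before the next cycle.
# All names below (LOGDIR, BUILD_CMD, run_cycle, grind) are hypothetical.
LOGDIR="${LOGDIR:-./grindlogs}"
BUILD_CMD="${BUILD_CMD:-make}"

run_cycle() {
    local n="$1" log
    log="$(printf '%s/cycle-%05d.log' "$LOGDIR" "$n")"
    {
        printf 'count: %s\n' "$n"
        date '+%F %T'
        # Dump temperatures if lm-sensors is available; skip quietly otherwise.
        if command -v sensors >/dev/null 2>&1; then sensors; fi
        rm -f -- *.o
        $BUILD_CMD
        printf 'exit status: %s\n' "$?"
    } >"$log" 2>&1
    # Push the log out of the page cache so a freeze is less likely to lose it.
    sync
}

# grind [MAX]: run MAX cycles, or loop until interrupted if MAX is 0 or omitted.
grind() {
    local max="${1:-0}" count=0
    mkdir -p "$LOGDIR"
    while [ "$max" -eq 0 ] || [ "$count" -lt "$max" ]; do
        run_cycle "$count"
        count=$(( count + 1 ))
    done
}
```

Calling `grind` with no argument loops until the machine gives up, and the highest-numbered non-empty file in `LOGDIR` is then the last cycle that completed. One file per cycle means a freeze can at worst damage the file being written, not the earlier results.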
|
|
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 55218 Location: 56N 3W
|
Posted: Thu Apr 17, 2025 2:59 pm Post subject: |
|
|
JustAnother,
The voltage outputs are not really useful.
/etc/sensors3.conf says
Code: | chip "it87-*" "it8712-*" "it8716-*" "it8718-*" "it8720-*"
label in8 "Vbat" |
It needs to be configured for your motherboard.
All of the input voltages to the sensor chip must be scaled (on the motherboard) to fit within the range 0v to 3.3v, or the chip will be destroyed.
That's two resistors in a divider for each input.
Having done that, the readings can be scaled (by the sensors program) to reverse the scaling applied by the resistive dividers, so that the outputs reflect the actual voltage values.
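As a worked example of undoing a divider: with a top resistor R1 and a bottom resistor R2, the chip sees V_pin = V_rail * R2 / (R1 + R2), so the real rail is V_pin * (R1 + R2) / R2. The helper below is just that arithmetic. `real_voltage` is a made-up name and the resistor values in the usage line are placeholders, not taken from any particular board. A `compute` line in the sensors configuration expresses the same scaling.

```shell
# real_voltage V_PIN R1 R2: undo a two-resistor divider (R1 on top, R2 to ground).
# Hypothetical helper; the resistor values must come from the board, not from here.
real_voltage() {
    awk -v v="$1" -v r1="$2" -v r2="$3" \
        'BEGIN { printf "%.3f\n", v * (r1 + r2) / r2 }'
}
```

For instance, `real_voltage 2.00 10 10` prints `4.000`: a pin reading of 2.00 V behind two equal resistors corresponds to 4 V on the rail.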
Now the tricky bit ... which input is which?
That varies from motherboard to motherboard.
It's also possible to configure the alarm levels.
Code: | +5V: 64.00 mV (min = +0.00 V, max = +0.00 V) ALARM |
That 64mV looks like an unused input rather than the 5v.
_________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
 |
JustAnother Apprentice

Joined: 23 Sep 2016 Posts: 209
|
Posted: Mon Apr 21, 2025 10:10 pm Post subject: |
|
|
I got it, kind of.
Things were going downhill, so as a last ditch effort I decided to replace the thermal paste. That's when I realized that the same part (fan bracket) that broke (see above) in 2019 had failed again, only this time the failure was more subtle.
There is a small plastic hook which in 2019 fractured and went flying. This time it fractured except at the edge, so the edge acted as a hinge and the hook rotated up, which released most of the spring stress holding the fan plate to the cpu plate, but not all of it. So the fan was contacting the cpu just well enough to allow the cpu to dump heat if it was idling. Anything beyond an idle would overheat the cpu.
This was a tricky one because it is possible to rotate the cpu fan assembly slightly back and forth in a nominal setup - after all there are two flat plates with some grease between them, and a non-rigid mechanical assembly. But with the "hinge" still at work, the twist test seemed ok at first. It "felt" almost right.
I made a tiny sheet steel clasp and screwed the broken piece back in place. Barely enough space to put screw heads in there. That will buy time to get the new part.
So portage ran for 12 hours and caught up.
But there was another mysterious reboot when I started to put the metal plate back on the case. So this may not be out of the woods yet.
So it looks like this motherboard was indeed sitting right at the precipice of an outright failure.
It also looks like this was a cpu overheat situation, but there is another factor here - due to the uneven forces, there may have been a thermal gradient across the cpu, just to make things a little more interesting.
I read about the voltage regulator module (VRM); it is a heat producer and is also a prime candidate for this sort of thing.
In hindsight the lack of randomness in the type of software failure may well have been the tipoff that this was not a dimm problem. If a dimm has a fault at a specific point, wouldn't the type of instruction failure still be randomized? |
|
|
 |
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 10015 Location: almost Mile High in the USA
|
Posted: Mon Apr 21, 2025 11:14 pm Post subject: |
|
|
Not sure why you're writing off RAM errors. Again, because there is so much of it everywhere you have to blame it until proven otherwise. The only way to tell is doing targeted testing which you finally did. There's no way to tell otherwise.
RAM errors are also not random despite it being part of the name. Also well designed CPUs should not produce random results if overheating... There were some people who did tests on multiple CPUs with heatsinks knocked off, some just slowed down a lot (best outcome), some outright hung (second best), some produced random errors (arguably the worst outcome) and some let out the magic smoke (at least it didn't corrupt your data.) _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
|
 |