Gentoo Forums
Removed PYTHON_TARGETS, and computer freezes
JustAnother
Apprentice

Joined: 23 Sep 2016
Posts: 209

Posted: Mon Apr 14, 2025 5:44 pm

I had forgotten about this, but the cpu in this computer is not a regular cpu. It is a heat-abused cpu.

About 8 years ago I heard a loud bang from what I thought was the computer case, like a BB bouncing off metal. A few minutes later the computer shut down.

My first mistake was to power it up again. It booted, but shut down again after a while.
OK, so it was the computer case. I opened it, expecting to find maybe a blown up capacitor.
Instead, everything looked ok, except there was a small piece of plastic at the bottom.
Was it there before? I dunno, so I booted again and it shut down again.
At that point a much more careful inspection showed that the plastic guard around the cpu socket, which has two small hooks
for the cpu cooler, had lost one of its hooks. The hook had broken off and gone flying.

The cooler plate had separated on one side, leading to a small tilt between the cpu plate and the cooler plate.
A small tilt, but a total thermal decoupling. It was hard to spot this with all the stuff around there.

So I ordered a new plastic guard piece, and after that no problems. Until now.

How many times did the computer overheat and shut down? Two to four times, but even one time is a big mistake.
Maybe this is where the bill comes due - in unit longevity.

I'm sticking with the heat theory and the intermittent theory. I think the evidence above backs this up.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 10015
Location: almost Mile High in the USA

Posted: Mon Apr 14, 2025 8:09 pm

So it was a hardware problem after all.

I would have thought that after AMD's Athlon and XP chips the IHS on the Athlon64 would have also included a thermal sensor to prevent damage, but perhaps not.

Sounds like that machine is ready for the recycling bin... however, I have run my i7 up to 90°C+ for several hours on end and it's still okay, though it does have thermal throttling capability. Unsure if the A64 also had throttling, or if it just shuts down/crashes in an overheat situation.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
pjp
Administrator

Joined: 16 Apr 2002
Posts: 20609

Posted: Tue Apr 15, 2025 3:30 am

eccerr0r wrote:
I would have thought that after AMD's Athlon and XP chips the IHS on the Athlon64 would have also included a thermal sensor to prevent damage, but perhaps not.
His CPU seems to be included in that. The Athlon X2 6000+ was a Brisbane. Unless it's a separate issue, but I noticed at least one instance of "faulty temperature sensors."
_________________
Quis separabit? Quo animo?
JustAnother
Apprentice

Joined: 23 Sep 2016
Posts: 209

Posted: Tue Apr 15, 2025 8:53 pm

One thing that might be an important point is that even one thermal shutdown of a cpu may be a bad thing, because it pushes the cpu beyond just "running hot".
I should have shut down the computer right after a funny sound like that and started asking questions.

It's kind of like a car engine. If the engine is revved up past the red line, the engine may or may not throw a rod, but even if it doesn't throw a rod the parts have been stressed beyond their design points.

By the way, is there a simple way to tell if the northbridge is part of the main cpu package, and if not, how to identify the northbridge chip?
JustAnother
Apprentice

Joined: 23 Sep 2016
Posts: 209

Posted: Tue Apr 15, 2025 10:36 pm

I got the idea to try a little experiment. I realized there was a simple setup
lying around that compiled two trivial C files and linked them into an executable.
I was playing around with make.

So I wrote a tiny script:

Code:
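#!/bin/bash
# grindme.sh: rebuild the small project in an endless loop to load gcc.

# Remove the object files so that make rebuilds everything.
# -f keeps rm quiet on the first pass, when no .o files exist yet.
cycle()
{
 rm -f -- *.o
 make
}

count=0
while true; do
 printf "count: %s\n" "${count}"
 cycle
 count=$(( count + 1 ))
done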

The script just grinds on gcc.

To run the script:
Code:
grindme.sh | tee grindme.log
sed -rn '/^count:/p' grindme.log | tail -n1


This gets the number of cycles before a crash. For freezes, eyeballing the screen gets the count.

So I made several runs and here are the counts:
#1: 142 (crashed)
#2: 153 (crashed)
#3: 95 (crashed)
#4: 94 (froze)
#5 (after reboot): 167 (crashed)
#6: 185 (froze)
This gets old pretty fast.

After a reboot, the log file shows this:

<SNIP>
count: 141
gcc -c -ggdb -O0 -v -o myprog.o myprog.c
gcc -c -ggdb -O0 -v -o hello.o hello.c
gcc -o myprog myprog.o hello.o
gcc -static -o myprog.x myprog.o hello.o
count: 142
gcc -c -ggdb -O0 -v -o myprog.o myprog.c
<NULs><SNIP>

tee only made it to 142 due to buffering (there is no tee option to run unbuffered).
There are actually several hundred nul characters in there. I snipped 'em.
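
For pulling the last count out of a log that ends in a run of NULs, something like the following might help. It is only a sketch: last_count is a made-up helper name, and it assumes the log has the "count:" lines shown above. The NUL bytes are deleted first so that the text tools downstream do not treat the file as binary.

```shell
# last_count is a hypothetical helper, not part of the original script.
# It deletes the NUL bytes left behind by the crash, then prints the
# last "count:" line that actually made it to disk.
last_count() {
    tr -d '\000' < "$1" | sed -n '/^count:/p' | tail -n 1
}
```

It would be called as last_count grindme.log after the reboot.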

The cases where the script crashes always said the same thing: ld failed with code 1.
This is the same mystery mentioned above: non-random with respect to specific failure,
although this one is not very specific.

Whatever is going on here, this specific failure mystery is a critical clue. Any ideas here?

And yet when I do ordinary noodling on this computer (editing a file, using the web browser, etc.)
you almost wouldn't know there is a problem.

And yet I did freeze the computer doing ordinary things: I was looking at a bunch of bloated
web pages about computer cases. Which generates more heat.
Firefox does not regularly crash on this computer.

None of this proves that this is heat related, but the chips are slowly lining up in that direction.
Maybe this motherboard is exquisitely balanced right at the edge of total failure.
Banana
Moderator

Joined: 21 May 2004
Posts: 2009
Location: Germany

Posted: Wed Apr 16, 2025 6:48 am

Can you add temperature readings to your tests? (If you somewhere mentioned it is not possible, forgive me and ignore it)
_________________
Forum Guidelines

PFL - Portage file list - find which package a file or command belongs to.
My delta-labs.org snippets do expire
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 10015
Location: almost Mile High in the USA

Posted: Wed Apr 16, 2025 7:16 am

Athlon64's and onwards I thought had the memory controller on chip, so northbridge is on the cpu.

On the Intel side, Nehalem/Westmere was the first to have the memory controller on the CPU die.

If you're at the edge, you should try underclocking if your firmware supports user hacking.
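
If the firmware offers nothing, the kernel's cpufreq interface may be a software-side alternative, assuming a cpufreq driver is actually loaded for this CPU (not confirmed anywhere in this thread). A rough sketch: cap_cpu_freq is a made-up name, and the sysfs directory is a parameter only so the function can be tried against a scratch directory first.

```shell
# cap_cpu_freq KHZ [SYSFS_CPU_DIR]   (hypothetical helper)
# Writes KHZ into scaling_max_freq for every CPU that exposes cpufreq.
# Needs root when pointed at the real /sys tree.
cap_cpu_freq() {
    khz=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_max_freq; do
        [ -e "$f" ] || continue    # this entry has no cpufreq support
        printf '%s\n' "$khz" > "$f"
    done
}
```

For example, cap_cpu_freq 1000000 would ask for a 1 GHz ceiling. Whether the driver accepts that value depends on the frequencies the CPU supports.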
Chiitoo
Administrator

Joined: 28 Feb 2010
Posts: 2783
Location: Here and Away Again

Posted: Wed Apr 16, 2025 7:54 am

I'd probably still test with just one memory stick, if there is more than one, and in different slots.

As also mentioned, memtest is not flawless either.

I had bad RAM once, but in certain circumstances I would get a clean result.

This was when Ryzen was new though, and memtest86+ did not work for me at all so I used the other memtest86 from passmark.

I sent them an e-mail about it and after some exchanges I received debug builds from them and they got the issue sorted eventually.

That is if I remember things right-like. I can't find the e-mails right now for some reason...

With older hardware this kind of bug is of course a lot less likely, but regardless, I think it's a good thing to test.
_________________
Kindest of regardses.
Josef.95
Advocate

Joined: 03 Sep 2007
Posts: 4755
Location: Germany

Posted: Wed Apr 16, 2025 9:44 am

With a mainboard from 2008, I think the voltage regulator is probably dead.
This was the issue with my good old Abit mainboard (on the 3.3 V line) :-/
Try checking the voltages with sys-apps/lm-sensors (at idle and under heavy load).
JustAnother
Apprentice

Joined: 23 Sep 2016
Posts: 209

Posted: Wed Apr 16, 2025 9:54 pm

Lots of good insight here. I'll try this part first, since I just got this part fixed.

Quote:
Can you add temperature readings to your tests?


Here is the situation. I've known for a long time that when the computer started showing all
those messages on bootup, it was complaining about some sensor suite. I think it said
that module it87 would not load due to "resource busy", among a few other module errors.

I tried a long time ago to find these messages in dmesg or the logs, and couldn't. So yesterday I started
thinking about this again and looked much harder for the error messages in /var/log. Nothing.

Why would these critical messages be missing? My guess is that the errors go to stderr and
don't make it into the logs. If that is the case, that needs to be fixed to log the errors.
I am going to have to ctrl-S the screen and take a picture to get at those messages.

I let this slide for so long because:
: The computer always worked, and the above situation requires the picture. PITA.
: I was always under the impression that the control system for the fan was within the
motherboard, and that there was no alteration of this from software.
In other words, I thought all the sensor software just did passive monitoring.

Then yesterday I found out about pwmconfig, sensors, and fancontrol. pwmconfig won't do
anything without it87 loaded. sensors shows the cpu temperature, but nothing else.

The fix for it87 is to put this in place:

Code:
cat /etc/modprobe.d/it87.conf
# Local IT87 sensor options
options it87 ignore_resource_conflict=1


Without that module option it won't load.
So pwmconfig now worked:

Code:
cat /etc/fancontrol
# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=
DEVNAME=
FCTEMPS=
FCFANS=
MINTEMP=
MAXTEMP=
MINSTART=
MINSTOP=


And sensors shows this sort of thing:

Code:
sensors
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp:   +61.0°C 
Core0 Temp:   +59.0°C 
Core1 Temp:   +57.0°C 
Core1 Temp:   +61.0°C 

it8712-isa-0290
Adapter: ISA adapter
in0:           1.38 V  (min =  +0.26 V, max =  +1.02 V)  ALARM
in1:           0.00 V  (min =  +0.00 V, max =  +1.63 V)  ALARM
in2:           3.31 V  (min =  +0.13 V, max =  +0.00 V)  ALARM
+5V:          64.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:           3.09 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:           1.55 V  (min =  +0.02 V, max =  +0.00 V)  ALARM
in6:           2.05 V  (min =  +0.16 V, max =  +2.05 V)  ALARM
5VSB:          2.05 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
Vbat:          3.33 V 
fan1:        4299 RPM  (min = 1753 RPM)
fan3:        2606 RPM  (min =   20 RPM)
temp1:        +54.0°C  (low  = +10.0°C, high =  +1.0°C)  ALARM  sensor = thermistor
temp2:        +40.0°C  (low  =  +1.0°C, high =  +0.0°C)  ALARM  sensor = thermistor
temp3:       -128.0°C  (low  =  +0.0°C, high =  +4.0°C)
pwm1:              0%  (freq = 375000 Hz)
pwm2:              0%  (freq = 375000 Hz)
pwm3:              0%  (freq = 375000 Hz)
cpu0_vid:    +1.550 V
intrusion0:  ALARM


I hope there's nothing ALARMing about those ALARM's.
Anyway, I now have the ability to dump these results after each cycle of the grinder script,
so I'll paste that soon.
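
One way the per-cycle dump could be kept compact is to boil the sensors output down to a single line of temperatures. This is a sketch only: temp_summary is a made-up name, and it assumes the "label: value" layout shown in the output above.

```shell
# temp_summary is a hypothetical filter. It reads `sensors` output on
# stdin and prints every temperature reading on one line as label=value.
temp_summary() {
    awk -F: '/°C/ {
        split($2, v, " ")             # first field after the colon is the reading
        out = out sep $1 "=" v[1]
        sep = " "
    } END { print out }'
}

# Intended use inside the grind loop (needs lm-sensors installed):
#   sensors | temp_summary
```

Limits in parentheses and the ALARM flags are dropped on purpose, so each cycle adds just one short line to the log.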

By the way, the computer is still working fine after idling all night.
JustAnother
Apprentice

Joined: 23 Sep 2016
Posts: 209

Posted: Thu Apr 17, 2025 3:42 am

Here is a run that froze at 151 grind cycles:

https://pastebin.com/82dTT2fs

I changed the script to append a file on each cycle, but e.g. the freeze whacked a file with 164 results down to 116.
Then I changed the script to put each run cycle log into a separate file and sync it. That catches all the results.
Without the sync, many of the individual result files wound up with zero bytes. So the sync is critical. Worth remembering that.
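
The one-file-per-cycle idea can be sketched roughly like this. It is not the exact script used for the run above: log_cycle and the cycle-NNNN.log naming are invented for illustration.

```shell
# log_cycle DIR N   (hypothetical helper)
# Reads one cycle's output on stdin, writes it to its own file in DIR,
# and flushes it to disk before the next cycle starts.
log_cycle() {
    cycle_file=$(printf '%s/cycle-%04d.log' "$1" "$2")
    cat > "$cycle_file"
    sync    # push the data out of the page cache before a possible freeze
}

# Intended use inside the grind loop:
#   cycle 2>&1 | log_cycle logs "$count"
```

Calling sync with no arguments flushes everything, which is heavy-handed but leaves the least room for a freeze to eat the newest file.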

Note from this result and the above results that when the grinder script is run on an idling computer, it repeatably takes about
150 cycles to freeze or crash. But in the cases where it crashes and I could restart the script quickly, it takes about 90 cycles,
almost as if something has not had time to fully reequilibrate its temperature. But I could be fooling myself about this.

I'm afraid all these freezes will mess up the file system at some point. So no more runs for now.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 55218
Location: 56N 3W

Posted: Thu Apr 17, 2025 2:59 pm

JustAnother,

Code:
it8712-isa-0290

The voltage outputs are not really useful.

/etc/sensors3.conf says
Code:
chip "it87-*" "it8712-*" "it8716-*" "it8718-*" "it8720-*"

    label in8 "Vbat"

It needs to be configured for your motherboard.
All of the input voltages to the sensor chip must be scaled (on the motherboard) to fit within the range 0v to 3.3v, or the chip will be destroyed.
That's two resistors in a divider for each input.
Having done that, the readings can be scaled (by the sensors program) to reverse the scaling applied by the resistive dividers, so that the outputs reflect the actual voltage values.
Now the tricky bit ... which input is which?
That varies from motherboard to motherboard.
It's also possible to configure the alarm levels.
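
To make the scaling step concrete, here is a small sketch of the arithmetic only. The resistor values and the pin reading are invented for illustration; the real ones depend on the motherboard, and unscale is a made-up helper name.

```shell
# unscale V_PIN R_TOP R_BOTTOM   (hypothetical helper)
# r_top sits between the supply rail and the chip pin, r_bottom between
# the pin and ground. The rail voltage is then:
#   v_rail = v_pin * (r_top + r_bottom) / r_bottom
unscale() {
    awk -v v="$1" -v rt="$2" -v rb="$3" \
        'BEGIN { printf "%.2f\n", v * (rt + rb) / rb }'
}

# Illustrative call, assuming a 6.8k/10k divider and 3.00 V at the pin:
#   unscale 3.00 6.8 10
```

The same multiplying factor is what a compute line for that input in /etc/sensors3.conf would apply, once it is known which input is wired to which rail.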

Code:
+5V:          64.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
That 64mV looks like an unused input rather than the 5v
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
JustAnother
Apprentice

Joined: 23 Sep 2016
Posts: 209

Posted: Mon Apr 21, 2025 10:10 pm

I got it, kind of.

Things were going downhill, so as a last ditch effort I decided to replace the thermal paste. That's when I realized that the same part (fan bracket) that broke (see above) in 2019 had failed again, only this time the failure was more subtle.

There is a small plastic hook which in 2019 fractured and went flying. This time it fractured except at the edge, so the edge acted as a hinge and the hook rotated up, which released most of the spring stress holding the fan plate to the cpu plate, but not all of it. So the fan was contacting the cpu just well enough to allow the cpu to dump heat if it was idling. Anything beyond an idle would overheat the cpu.

This was a tricky one because it is possible to rotate the cpu fan assembly slightly back and forth in a nominal setup - after all there are two flat plates with some grease between them, and a non-rigid mechanical assembly. But with the "hinge" still at work, the twist test seemed ok at first. It "felt" almost right.

I made a tiny sheet steel clasp and screwed the broken piece back in place. Barely enough space to put screw heads in there. That will buy time to get the new part.

So portage ran for 12 hours and caught up.

But there was another mysterious reboot when I started to put the metal plate back on the case. So this may not be out of the woods yet.

So it looks like this motherboard was indeed sitting right at the precipice of an outright failure.

It also looks like this was a cpu overheat situation, but there is another factor here - due to the uneven forces, there may have been a thermal gradient across the cpu, just to make things a little more interesting.

I read about the voltage regulator module (VRM); it is a heat producer and is also a prime candidate for this sort of thing.

In hindsight the lack of randomness in the type of software failure may well have been the tipoff that this was not a dimm problem. If a dimm has a fault at a specific point, wouldn't the type of instruction failure still be randomized?
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 10015
Location: almost Mile High in the USA

Posted: Mon Apr 21, 2025 11:14 pm

Not sure why you're writing off RAM errors. Again, because there is so much of it everywhere you have to blame it until proven otherwise. The only way to tell is doing targeted testing which you finally did. There's no way to tell otherwise.

RAM errors are also not random despite it being part of the name. Also, well-designed CPUs should not produce random results if overheating... There were some people who did tests on multiple CPUs with heatsinks knocked off: some just slowed down a lot (best outcome), some outright hung (second best), some produced random errors (arguably the worst outcome) and some let out the magic smoke (at least it didn't corrupt your data).