Removed PYTHON_TARGETS, and computer freezes

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

I had forgotten about this, but the cpu in this computer is not a regular cpu. It is a heat-abused cpu.

About 8 years ago I heard a loud bang from what I thought was the computer case, like a BB bouncing off metal. A few minutes later the computer shut down.

My first mistake was to power it up again. It booted, but shut down again after a while.
OK, so it was the computer case. I opened it, expecting to find maybe a blown up capacitor.
Instead, everything looked ok, except there was a small piece of plastic at the bottom.
Was it there before? I dunno, so I booted again and it shut down again.
At that point a much more careful inspection showed that the plastic guard around the cpu socket, which has two small hooks
for the cpu cooler, had lost one of those hooks. That was the piece that went flying.

The cooler plate had separated on one side, leading to a small tilt between the cpu plate and the cooler plate.
A small tilt, but a total thermal decoupling. It was hard to spot this with all the stuff around there.

So I ordered a new plastic guard piece, and after that no problems. Until now.

How many times did the computer overheat and shut down? Two to four times, but even one time is a big mistake.
Maybe this is where the bill comes due - in unit longevity.

I'm sticking with the heat theory and the intermittent theory. I think the evidence above backs this up.

eccerr0r · Posted: Mon Apr 14, 2025 8:09 pm

So it was a hardware problem after all.

I would have thought that after AMD's Athlon and XP chips the IHS on the Athlon64 would have also included a thermal sensor to prevent damage, but perhaps not.

Sounds like that machine is ready for the recycling bin... however I have run my i7 up to 90°C+ for several hours on end and it's still okay, though it does have thermal throttling capability. Unsure if the A64 also had throttling, or if it just shuts down/crashes in an overheat situation.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

pjp · Administrator Joined: 16 Apr 2002 Posts: 20609

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

One thing that might be an important point is that even one thermal shutdown of a cpu may be a bad thing, because it pushes the cpu beyond just "running hot".
I should have shut down the computer right after a funny sound like that and started asking questions.

It's kind of like a car engine. If the engine is revved up past the red line, the engine may or may not throw a rod, but even if it doesn't throw a rod the parts have been stressed beyond their design points.

By the way, is there a simple way to tell if the northbridge is part of the main cpu package, and if not, how to identify the northbridge chip?

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

I got the idea to try a little experiment. I realized there was a simple setup
lying around that compiled two trivial c files and linked them into an executable.
I was playing around with make.

So I wrote a tiny script:

Banana · Posted: Wed Apr 16, 2025 6:48 am

Can you add temperature readings to your tests? (If you mentioned somewhere that this is not possible, forgive me and ignore this.)
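
One way to log temperatures alongside each test cycle is to read the kernel's hwmon sysfs files directly. This is only a sketch: it assumes a Linux system whose sensor driver is loaded and exposes temp*_input files (millidegrees Celsius) under /sys/class/hwmon. The function name read_temps and its hwmon_root parameter are made up for illustration, and which sensor corresponds to the cpu depends on the board.

```python
from pathlib import Path


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"chip/file": degrees Celsius} for every readable temp*_input file."""
    temps = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            # The kernel reports temperatures in millidegrees Celsius.
            millideg = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor missing, unreadable, or returning junk
        name_file = path.parent / "name"
        try:
            chip = name_file.read_text().strip()
        except OSError:
            chip = path.parent.name  # fall back to e.g. "hwmon0"
        temps[f"{chip}/{path.name}"] = millideg / 1000.0
    return temps
```

Calling read_temps() once per test cycle and writing the result into that cycle's log would show whether the freezes follow a temperature climb.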
_________________
Forum Guidelines

PFL - Portage file list - find which package a file or command belongs to.
My delta-labs.org snippets do expire

eccerr0r · Posted: Wed Apr 16, 2025 7:16 am

I thought the Athlon64 and later chips had the memory controller on chip, so the northbridge is on the cpu.

On the Intel side, Nehalem/Westmere was the first to have the memory controller on the CPU die.

If you're at the edge, you should try underclocking if your firmware supports user hacking.

Chiitoo · Posted: Wed Apr 16, 2025 7:54 am

I'd probably still test with just one memory stick, if there is more than one, and in different slots.

As also mentioned, memtest is not flawless either.

I had bad RAM once, but in certain circumstances I would get a clean result.

This was when Ryzen was new though, and memtest86+ did not work for me at all, so I used the other memtest86 from PassMark.

I sent them an e-mail about it and after some exchanges I received debug builds from them and they got the issue sorted eventually.

That is if I remember things right-like. I can't find the e-mails right now for some reason...

With older hardware this kind of bug is of course a lot less likely, but regardless, I think it's a good thing to test.
_________________
Kindest of regardses.

Josef.95 · Advocate Joined: 03 Sep 2007 Posts: 4755 Location: Germany

With a mainboard from 2008, I think the voltage regulator is probably dead.
This was the issue with my good old Abit mainboard (on the 3.3 V line) :-/
Try checking the voltages with sys-apps/lm-sensors (at idle, and under heavy load).
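
The idle versus load comparison can also be scripted against the same hwmon sysfs files that lm-sensors reads. This is a rough sketch, not a replacement for the sensors command: it assumes the board's sensor chip exposes in*_input files (millivolts) under /sys/class/hwmon, and the names read_voltages and percent_change are invented here. The raw values are pin voltages and may lack the per-board scaling that the lm-sensors configuration applies, so only the relative change between the two snapshots is meaningful.

```python
from pathlib import Path


def read_voltages(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/inM_input": volts} for every readable in*_input file."""
    volts = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/in*_input")):
        try:
            # The kernel reports voltages in millivolts.
            millivolts = int(path.read_text().strip())
        except (OSError, ValueError):
            continue
        volts[f"{path.parent.name}/{path.name}"] = millivolts / 1000.0
    return volts


def percent_change(idle, load):
    """Return {name: percent change from idle to load} for inputs present in both."""
    return {
        name: (load[name] - idle[name]) / idle[name] * 100.0
        for name in idle
        if name in load and idle[name] != 0
    }
```

Take one snapshot with read_voltages() at idle and a second during a heavy compile, then pass both to percent_change(). A rail that sags by several percent under load would point toward the regulator.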

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

Lots of good insight here. I'll try this part first, since I just got this part fixed.

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

Here is a run that froze at 151 grind cycles:

https://pastebin.com/82dTT2fs

I changed the script to append to a file on each cycle, but a freeze can truncate that file: in one case a file with 164 results was whacked down to 116.
Then I changed the script to put each run cycle log into a separate file and sync it. That catches all the results.
Without the sync, many of the individual result files wound up with zero bytes. So the sync is critical. Worth remembering that.
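
The one-file-per-cycle-plus-sync approach can be sketched as follows. This is not the original script, which is not shown in the thread. It is an illustration in Python, with made-up names (run_cycle, grind, the cycle-NNNN.log file pattern), that flushes each cycle's log to disk with os.fsync before starting the next cycle.

```python
import os
import subprocess
from pathlib import Path


def run_cycle(cycle, command, log_dir):
    """Run one build cycle and durably write its result to its own log file."""
    result = subprocess.run(command, capture_output=True, text=True)
    log_path = Path(log_dir) / f"cycle-{cycle:04d}.log"
    with open(log_path, "w") as log:
        log.write(f"cycle {cycle} exit {result.returncode}\n")
        log.write(result.stdout)
        log.write(result.stderr)
        log.flush()
        os.fsync(log.fileno())  # force the file contents out to disk
    # Sync the directory as well so the new file's name survives a freeze.
    dir_fd = os.open(log_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return result.returncode


def grind(command, log_dir, max_cycles):
    """Repeat the command; return the first failing cycle number, or None."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    for cycle in range(1, max_cycles + 1):
        if run_cycle(cycle, command, log_dir) != 0:
            return cycle
    return None
```

A call such as grind(["sh", "-c", "make clean && make"], "grind-logs", 500) would repeat the small build until it fails or reaches 500 cycles. After a hard freeze, the highest-numbered log file shows how many cycles completed.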

Note from this result and the above results that when the grinder script is run on an idling computer, it repeatably takes about
150 cycles to freeze or crash. But in the cases where it crashes and I could restart the script quickly, it takes about 90 cycles,
almost as if something has not had time to fully reequilibrate its temperature. But I could be fooling myself about this.

I'm afraid all these freezes will mess up the file system at some point. So no more runs for now.

NeddySeagoon · Posted: Thu Apr 17, 2025 2:59 pm

JustAnother,

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

I got it, kind of.

Things were going downhill, so as a last ditch effort I decided to replace the thermal paste. That's when I realized that the same part (fan bracket) that broke (see above) in 2019 had failed again, only this time the failure was more subtle.

There is a small plastic hook which in 2019 fractured and went flying. This time it fractured except at the edge, so the edge acted as a hinge and the hook rotated up, which released most of the spring stress holding the fan plate to the cpu plate, but not all of it. So the fan was contacting the cpu just well enough to allow the cpu to dump heat if it was idling. Anything beyond an idle would overheat the cpu.

This was a tricky one because it is possible to rotate the cpu fan assembly slightly back and forth in a nominal setup - after all there are two flat plates with some grease between them, and a non-rigid mechanical assembly. But with the "hinge" still at work, the twist test seemed ok at first. It "felt" almost right.

I made a tiny sheet steel clasp and screwed the broken piece back in place. Barely enough space to put screw heads in there. That will buy time to get the new part.

So portage ran for 12 hours and caught up.

But there was another mysterious reboot when I started to put the metal plate back on the case. So this may not be out of the woods yet.

So it looks like this motherboard was indeed sitting right at the precipice of an outright failure.

It also looks like this was a cpu overheat situation, but there is another factor here - due to the uneven forces, there may have been a thermal gradient across the cpu, just to make things a little more interesting.

I read about the voltage regulator module (VRM). It is a heat producer and is also a prime candidate for this kind of thing.

In hindsight the lack of randomness in the type of software failure may well have been the tipoff that this was not a dimm problem. If a dimm has a fault at a specific point, wouldn't the type of instruction failure still be randomized?

eccerr0r · Posted: Mon Apr 21, 2025 11:14 pm

Not sure why you're writing off RAM errors. Again, because there is so much of it everywhere you have to blame it until proven otherwise. The only way to tell is doing targeted testing which you finally did. There's no way to tell otherwise.

RAM errors are also not random despite it being part of the name. Also well designed CPUs should not produce random results if overheating... There were some people who did tests on multiple CPUs with heatsinks knocked off, some just slowed down a lot (best outcome), some outright hung (second best), some produced random errors (arguably the worst outcome) and some let out the magic smoke (at least it didn't corrupt your data.)