Removed PYTHON_TARGETS, and computer freezes

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

I read the eselect section about PYTHON_TARGETS.
I decided (on 04-07) to go ahead after a fresh update and remove any PYTHON_*
from make.conf and update.
Emerge said it had to rebuild 290 packages, even packages that seemed to
have nothing to do with python.
I said go ahead.
Then the computer started hard freezing every few hours. No ssh, ping, no nothing.
No log messages. Frozen.
So I'm having to reboot and emerge --resume every few hours to try to
get through this update cycle.

I did the same process with another laptop computer (with an nvidia gpu),
and it is still working on llvm after two days, but it is still running.
Q: why does llvm depend on PYTHON_TARGETS?

Before this situation happened, there were occasional hard freezes, not very
recently, and not nearly as often.

Is anybody else having issues like this or have any insights?

Banana · Posted: Thu Apr 10, 2025 5:49 am

Zucca · Posted: Thu Apr 10, 2025 8:52 am

Hm. I'm not sure if this can affect, but why do you have -ggdb enabled globally?
_________________
..: Zucca :..

Hu · Administrator Joined: 06 Mar 2007 Posts: 23449

If I recall correctly, -ggdb will greatly increase the memory/disk requirements, due to all the debug symbols. The requirement is notably worse for template-heavy C++ programs, relative to plain C programs.

It is not normal for a system to ever suffer a "hard freeze", so if this system has been intermittently failing like that even before this last round of updates, I would start with the idea that the system has an underlying fault and that the load of these updates is provoking that fault more frequently.

eccerr0r · Posted: Thu Apr 10, 2025 4:19 pm

At 4GiB RAM, running -j2 for MAKEOPTS, and having X running at the same time, you're probably really stressing your swap, and it's possible that makes it look like the machine hangs. Keeping the gdb symbols around probably exacerbates the issue (and I thought portage strips binaries by default (FEATURES=nostrip || FEATURES=splitdebug?) so -ggdb basically gets thrown away?)
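
To make the RAM-versus-jobs trade-off concrete, here is a minimal sketch. It assumes the commonly cited rule of thumb of roughly 2 GiB of RAM per compile job; the function name, the `reserved_gib` parameter, and the 2-thread CPU are hypothetical, since the thread does not state the CPU, and debug-heavy C++ builds can need more than 2 GiB per job.

```python
def suggest_jobs(ram_gib, cpu_threads, gib_per_job=2.0, reserved_gib=0.0):
    """Suggest a MAKEOPTS -j value limited by CPU threads and by RAM per job.

    gib_per_job=2.0 is a rule-of-thumb assumption, not a measured figure.
    reserved_gib is RAM set aside for everything else (X, browser, ...).
    """
    usable = max(ram_gib - reserved_gib, 0.0)
    by_ram = int(usable // gib_per_job)
    # Never suggest fewer than one job.
    return max(1, min(cpu_threads, by_ram))


# 4 GiB machine, assumed 2 CPU threads, nothing else running.
print(suggest_jobs(4, 2))
# Same machine with 1 GiB set aside for a running X session.
print(suggest_jobs(4, 2, reserved_gib=1.0))
```

With RAM reserved for a desktop session, the estimate for a 4 GiB machine drops to a single job, which is consistent with the swap pressure described above.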
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

I'll try to describe this in more detail, since I left out a few things, and
there are ongoing changes.

I'll deal with the -ggdb issue shortly and fix that. But first, this.

Up until Monday, I had very occasional freezes, starting a few months ago. I just cussed and rebooted.

Here is what I do every week:

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

Update: clang crashed. See https://pastebin.com/gaK2bK0Q

The last time it crashed at file 969. This time it crashed at file 1258.
The computer has been working on this for ~ 10 hours and there is no freeze.

I'm going to try

eccerr0r · Posted: Fri Apr 11, 2025 6:00 am

-ggdb should not cause your computer to crash.

Looks like some of your pastebin expired already, but since you're getting random behavior, likely it is hardware related.

How is your cooling? How old is the PSU? I'd expect bad RAM to give you exceptions, but it's also worth testing.

Actually you should go check ram with memtest86+ or something. Getting errors like

Hu · Administrator Joined: 06 Mar 2007 Posts: 23449

I concur with eccerr0r regarding register %b15, but that is interesting. %r15 is a valid register on amd64. Per man ascii, b is 0x62, and r is 0x72. Thus, they are only one bit away in representation. If an r was stored into a RAM cell that changed bit 4 from on to off, you would change that r into b, and get the reported error.

Likewise, t changing bit 4 to off would produce d, hence quiet becomes quied. This is further supported by how the compiler's own output shows the string was quiet, yet the error text complains about quied.
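
The single-bit argument above can be checked directly. This is a small illustrative sketch; the helper names are made up for the example, and bits are counted from 0 at the least significant end, matching the "bit 4" wording.

```python
def flip_bit(ch, bit):
    """Return the character obtained by toggling one bit of ch's ASCII code."""
    return chr(ord(ch) ^ (1 << bit))


def single_bit_difference(a, b):
    """Return the bit index if a and b differ in exactly one bit, else None."""
    diff = ord(a) ^ ord(b)
    # A power of two has exactly one bit set.
    if diff != 0 and (diff & (diff - 1)) == 0:
        return diff.bit_length() - 1
    return None


print(hex(ord("r")), hex(ord("b")))     # 0x72 0x62
print(single_bit_difference("r", "b"))  # 4
print(single_bit_difference("t", "d"))  # 4
print("quie" + flip_bit("t", 4))        # quied
print("%" + flip_bit("r", 4) + "15")    # %b15
```

Both corruptions clear the same bit position (value 0x10), which is what one would expect from a recurring fault rather than from unrelated typos.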

eccerr0r · Posted: Fri Apr 11, 2025 5:02 pm

Funny... I've never had a computer with memory bad enough to cause human-noticeable errors while the computer still runs...

Usually it's so subtle that I lose bits only when copying large files, or so bad the computer constantly segfaults. But this time it's bad enough to be visible while working with strings (and silent data corruption too) but doesn't constantly segfault.

So yeah, check your ram. Might need to replace or stop overclocking or underclock RAM to see if it helps. At 4GiB RAM I wouldn't recommend blocking off bad blocks but that was an option I had when dealing with bad RAM because the machine had more than necessary (blocking off 512MB RAM on 64GiB is no big deal, but 512MB on 4GB is a huge deal.)

Then I did use memtest86+ to notice a bad resistor and bad clock lines on a few DIMMs I had once (apparently I bought mishandled DIMMs), had to do DIMM surgery to fix them. The test patterns gave a really good sign at what to look for and I found them!

NeddySeagoon · Posted: Fri Apr 11, 2025 5:08 pm

/me puts a few shillings 'on the nose' for a hardware related problem.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

The points about the bit flips are well taken. But there is still a cloud of confusion here.

But first an update. I let the computer idle last night. It was still up after ~15 hours.
So I got the idea to do an emaint sync and a regular update, excluding clang, inkscape, and wireshark,
which have caused crashes and freezes.

The computer rebooted this time. Thus far in this saga emerge has either failed or frozen the computer.

I ran a quick memtest which showed nothing, but a good memtest must run overnight, so I will do that.

Anyway I get the memo about this computer needing a date with the glue factory. I let this slide
because of the hassles of researching the hardware to prevent this kind of frustration, and my case is 3" taller
than the new ones. I need the same height.

As for the PSU, memory, etc. the motherboard is ~2008 and the PSU is ~2016, and the hard drive is ~2017, so
everything is, ahem, "mature", like a patient over 60.

I have dealt with bad power supplies, seen bad memory, bad motherboards, but this is weird.

But this situation raises several tough questions. To reiterate:

: Occasional freezes and reboots over the last few months.
: A switch of PYTHON_TARGETS greatly accelerates the freezes.
: A downgrade of mesa greatly attenuates the freezes, but not completely.
: Packages fail in a way that suggests bit flips in source code.
: But the package failures are not random. Some packages seem much more prone.
The same errors (quied, bogus ` characters, %b15, etc.) occur, but at seemingly random places in the sequence of files to build.

If this is a pure hardware problem, it is a weird one.
What scenario could explain this contradictory set of randomness and non-randomness?
I think there is still a chance there is some bug involved with this, but that chance appears slim.

By the way has anybody ever seen any evidence that the screensaver has any relationship to freezes?
Just asking.

Anyway, I'll get about a week to play with this while a new computer is in transit.
I'll update after a long memtest.

pjp · Administrator Joined: 16 Apr 2002 Posts: 20609

NeddySeagoon · Posted: Sat Apr 12, 2025 10:12 am

With a motherboard of that vintage, the capacitors on the 12v CPU regulators need to be looked at.
Domed, tilted or leaking examples mean that they all need to be replaced with low ESR parts.

It's not too bad a job if you already have moderate skills with a soldering iron.

It all works when things are stable, or change slowly.
As soon as the CPU does a big speed change, which equals a big power step, the capacitors can't cope, voltages go low and anything can happen.

eccerr0r · Posted: Sat Apr 12, 2025 1:34 pm

I've also had bad disk controllers and bad chipsets but nowadays those are integrated onto the motherboard and warranted replacement....

pjp · Administrator Joined: 16 Apr 2002 Posts: 20609

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

Update: I ran a full memtest last night and a few hours of memtest today. It passed.

More attempts at emerge --resume froze it.

Just to fully rule out the portage build process itself, I just happen to have the build process for openwrt on the
computer, which exercises things pretty hard. So I tried that, and the computer froze. Oddly, if I just do ordinary
stuff like web browsing and file editing, the computer is stable.

So this does indeed look like a hardware problem. As for capacitors, this motherboard circa 2009 post-dates the
capacitor debacle of ~ 2003, but they do go bad, and other parts go bad.

I have an old ~2004 hp laptop sitting around waiting for a funeral. It boots erratically. I think the gridballs on the nvidia gpu
are the problem - they got sued for this, but it could be the capacitors. I'll eyeball the caps before I toss the thing. I'm still fond of it.

Never let a good piece of junk go to waste.

The ultimate scapegoat for failing motherboards is of course the word electromigration. In the early 1960s the IC industry freaked
out over this. But it turned out that a small doping amount of copper in the aluminum strips would slow down the problem quite a bit.
Until recently, when the currents are much larger.

eccerr0r · Posted: Sun Apr 13, 2025 6:42 am

Perhaps try a different disk controller? (USB vs PATA vs SATA?)

I have a computer that constantly corrupts stuff over ethernet but not wifi...

I've had very few boards fail to EM. Most were GPUs of all sorts. I had one CPU fail due to overclocking (and overvolting). And had one chipset on a m/b that kept corrupting stuff as it shuffled data through, but that board was acquired second hand so unsure of its history. But all in all, RAM was the most likely culprit of errors, but most were acquired bad versus failed over time.

In any case I'll need to check all my machines once in a while with a thorough system test... but knock on wood, no recent failures other than power supplies and cooling fans... And have yet to have a hard drive return corrupted stuff (other than the bad disk controller) - hard drives have so far only given me what I wrote or nothing at all.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

NeddySeagoon · Posted: Sun Apr 13, 2025 9:26 am

JustAnother,

My money is on power supply (not always the metal box) transient response.
When the CPU goes from idle to flat out, there is a huge step change in the input current to the CPU. 100A or more, in a CPU clock cycle. The PSU has to cope with this step and keep the voltage stable within a few mV.
As parts age, particularly capacitors, the PSU transient response gets worse. You don't need the infamous 2003 capacitor problem.
It's not something you can test at home.

Try running prime95. That's a horrible CPU stress test. In turn that will stress the CPU PSU.

If prime95 crashes on start, you have a pointer. If not, we have found another thing that it isn't.
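
For illustration only, the idea behind a self-checking stress test can be sketched as repeating a deterministic computation and flagging any pass whose result differs. This is not prime95 and loads the CPU far less; the function names, the hash-chain workload, and the default parameters are assumptions made for the example.

```python
import hashlib
import time


def hash_chain(seed: bytes, rounds: int) -> str:
    """Deterministic busy work: hash the previous digest `rounds` times."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def soak(seconds: float, rounds: int = 100_000, seed: bytes = b"soak") -> int:
    """Repeat the same computation for `seconds`; count passes that disagree.

    On healthy hardware every pass must match the first result, so any
    non-zero return value points at the machine rather than the software.
    """
    reference = hash_chain(seed, rounds)
    mismatches = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if hash_chain(seed, rounds) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Short demo run; a real soak would run for hours, one copy per core.
    print("mismatching passes:", soak(5.0))
```

A hard freeze or reboot during such a run is as informative as a mismatch, since the script cannot report anything once the machine is gone.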

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 209

Here's another attempt to fit these pieces together.

To reiterate again my personal take on this:
: Portage seems to be the problem, freezing and crashing the computer, but it really isn't.
: Ordinary usage of the computer is not a problem, but that isn't quite the case.
: Within portage, the failures are:
:   non-random with respect to packages;
:   for packages that do cause freeze/fail problems, random with respect to
    file number but non-random with respect to the specific failure.
: memtest catches nothing on the time scale of a day.

So what is going on here? Is there one pattern behind all of this?

Ask the following simple question: what does portage actually "do"?
The answer: portage grinds on gcc and copies a bunch of files all over the place.

Another question: what are the two types of packages on a build system?
The answer: one type grinds on gcc and copies a bunch of files all over the place,
and the other type just copies a bunch of files all over the place.

Take all these things mentioned above and put them on a table, and then ask:
can the "things" be separated into two sets that have one key difference?

I think the answer is yes, and the key difference is the amount of heat being
generated within the substrate of the cpu and northbridge, and to a lesser extent
the substrates within the dimm's.

In other words, a conditional intermittent connection within a chip substrate,
with the condition being the thermal stress on the chip substrate -- i.e. the temperature.

This explains two of the portage mysteries. Packages like clang and inkscape that
grind on gcc put more thermal stress on the silicon and once the temperature of the
substrate hits a certain level the chance of an intermittent rises dramatically.
Packages like firefox-bin don't grind on gcc, and the thermal stress is much lower,
so the packages have a much higher chance of building.

Once a grinder package induces a high substrate temperature and activates the
condition, a failure (mostly) is statistical and only a matter of time.

As for the other part about the specific failures: not obvious.

One thing I have noticed over time is that the sounds from the computer cpu fan
are a decent (kind of) indication of the thermal stress on the cpu.
In the morning, if I see the hard drive light on and the computer is quiet, I know
the computer is wrapped around the axle with the swap file.
And what is the computer doing? It is copying a bunch of files all over the place -- low thermal stress.
If the computer is making noise, I know it is grinding on gcc and making real progress -- high thermal stress.

Once I got memtest running, I noticed that the fan was making some noise (and uniformly over time),
but not a whole lot. Kind of like a stove on medium low - a low simmer, but not a boil.

If any of this funny business about intermittent connections sounds strange,
every TV repairman used to have to deal with stuff like this.
I went through this with the wiring harness on a '75 Volvo.

What I am saying is that if all the stuff mentioned in this topic is seen within the context
of thermal stress, it all fits together better.

Consider this question: if this scenario has any merit, who is the likely culprit --
the cpu, the northbridge, or the dimm's?
There are discussions about this. People warn not to directly compare specs between cpu's
and dimm's.
DDR2's (like mine) seem to not get much notice. DDR4 gets complaints about the need for heatsinks.
DDR5 is faster but is more power efficient, so there are fewer complaints.
This is being actively published, but the papers seem to mostly figure out what the
designers already know.
People point out that with dimm's the heat diffusion area is spread out over 8 chips.
With a cpu the heat is coming out of an area significantly smaller than the area of the cpu package.
In other words, if you have very little in the way of hard facts and have to start pointing
the finger, the cpu/northbridge may be the better bet.

Concerning memtest, another important dimension to the whole process comes into play: current
fluctuations on various time scales, and their relationship to thermal fluctuations.
What does memtest do? It is one process that generates a lot of very fast current
fluctuations, but on a longer time scale generates zero fluctuations, because it is in a
tight loop doing just one thing, kind of like a workout that only exercises a
couple of muscles. Portage on the other hand is more like a full body workout, and it is
bringing down the machine in minutes.

Memtest should be changed to have another test that exercises the memory but adjusts the
overall thermal loading over time to approximate some power spectral density, with 1/f being a
good candidate.

And there is a lot more to consider, but I'll stop here for now.

eccerr0r · Posted: Sun Apr 13, 2025 11:53 pm

That is a problem with most people in the world, they don't have an appreciation of what's actually in a computer and how it all needs to work together to get a seamless system. This goes for software too, as well as the hardware-software interface.

Portage is merely a python script that runs gcc among lots of other stuff. It's software just like any other software.

Computers nowadays try to use less power when they're not doing anything. When one starts doing something, it demands power, and if the power isn't there, it will likely do something unintended. That's why we have to blame power supplies, but again everything works together as a system and you can't always instantly blame one part or another.

I still run machines with DDR2 and DDR3. I recently got one machine with DDR4. I've used all sorts of ram from SRAM to the original DRAM, FPM, EDO, SDRAM, DDR. I have not used DDR5 or RAMBUS. However they all at a software level are the same: write data in, expect to read data back out. Main reason why RAM is blamed first: statistically, most of the transistors on your computer are used for RAM.

Same with pretty much any hardware: no matter what software is thrown at it, it should be reliable within the constraints of the hardware. One of those is clock rate, which you have to change deliberately in firmware or board settings; it is usually protected from random control because going too far can cause crashes too. And as said earlier, there's a whole bunch of parts that need to work together to get you a sane running environment. If you break any part of it, the whole thing comes crashing down, like your computer.

That DDR2 machine of mine is a Core2 Quad and I run it 24/7. The machine is probably 15+ years old now, yeah it is newer than an Athlon 64, but no matter - I verified the machine is stable and that's why it runs and still runs fine...

logrusx · Advocate Joined: 22 Feb 2018 Posts: 2985

I had core2quad like eccerr0r. It could fit max 8GB on 4 slots. At one point it was hard to find 2GB pieces, so I got what I found. It had a defect and it only manifested itself during emerge because it was at the end of the range.

How many passes did memtest do? Mine needed a lot. I don't remember well but it might well have been over 8, which is the recommended minimum. It took maybe 11 hours and generated a lot of heat. Especially on the memory chips.

I doubt the problem is heat because you would have noticed it way before things get heated. The system generates heat at idle and any increase in temperature would be enough, once such issues start manifesting.

Best Regards,
Georgi

NeddySeagoon · Posted: Mon Apr 14, 2025 8:37 am

JustAnother,

Run with one stick of memory at a time until you have tested all your RAM, one stick at a time.
Memtest is not flawless.
This means that you will 'wipe the contacts' on the RAM as a bonus. That's been known to fix problems too.

Run prime95 to stress your CPU, or even cpuburn. The latter is in the ::gentoo repo.
Keep an eye on your CPU temperatures. Both drive your CPU (and motherboard VRM) hard.

eccerr0r · Posted: Mon Apr 14, 2025 9:38 am

I had a Pentium II motherboard whose RAM ran fine but if I passed a lot of data through ATAPI it eventually would come up with errors while copying data.
I threw that board away. Kept CPU for no real reason. CPU, hdd, and RAM worked fine on other boards. Detection was md5summing copied data.
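
The detection method mentioned above, checksumming copied data, can be sketched roughly as follows. The function names are hypothetical, and the sketch carries one known caveat: a re-read immediately after the copy may be served from the page cache instead of the disk path, so a more thorough check would use files larger than RAM or drop caches between copy and verify.

```python
import hashlib
import shutil


def file_digest(path, algo="md5", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files do not need to fit in RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst, algo="md5"):
    """Copy src to dst, then return True if both files hash identically."""
    shutil.copyfile(src, dst)
    return file_digest(src, algo) == file_digest(dst, algo)


# Example use (paths are placeholders):
#   ok = copy_and_verify("/path/to/big.iso", "/mnt/other/big.iso")
#   print("copy intact" if ok else "copy CORRUPTED")
```

Repeating the copy many times with a large file is what makes intermittent corruption visible; a single passing run proves little.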

Another weird issue. I had a 8GB DDR3 DIMM I got for free and stuck it into a machine with another 8GB and four 4GB. A few of the memory locations readily and consistently showed up as bad on memtest86+ so I noted them. I then swapped out all the 4GB DIMMs and filled the rest of the slots with 8GB DIMMs for 64GiB. Retested the RAM and I still see the bad locations. Then a few months later I swapped/upgraded to a new CPU and retested the RAM... the bad RAM disappeared! Unsure if it was the CPU or firmware as I did have to reset the CMOS settings which I did not do prior to this last change...

NeddySeagoon · Posted: Mon Apr 14, 2025 10:07 am

eccerr0r,

DDR3 and later is a nest of vipers.

The CPU (memory controller) sets up the timings and signal drive strengths by trial and error.
It's called training. As things drift with temperature changes, the training may not be quite right any more.