OldTango l33t


Joined: 21 Feb 2004 Posts: 737
|
Posted: Mon Jun 23, 2025 9:20 pm Post subject: Random emerge failures on updates [SOLVED] |
|
|
What I get during major package emerges (system updates) are segfaults, general protection faults, and "waiting for unfinished jobs" messages. Sometimes I even get an out-of-memory error, which seems very unlikely in my case.
The issues always happen when a large package such as LLVM, gcc, or clang is in the mix of package updates. That is why I have added --keep-going to my emerge commands: I can get through most of the packages, then check the logs for the failed packages and the errors involved. I can recover from the failed emerges only by doing a system reboot and restarting the emerge process. Rebooting solves the problem until it surfaces again. I have not been able to pin down the problem yet, but it seems like the RAM is not being cleared properly.
I am not sure what is causing this ongoing issue but I am looking into the possible causes.
It may be due to heat buildup; however, my system never exceeds 75C. My cooler is an older (about 4 years) Corsair 360 AIO, and it barely keeps my system from throttling.
I do not have a swap drive. I use a tmpfs, and it's possible the settings are insufficient. Code: | tmpfs /var/tmp/portage tmpfs size=16G,uid=portage,gid=portage,mode=775,noatime,nosuid,nodev 0 0 |
What I do know is that I can run emerge without issues on small updates of, say, fewer than 50 packages, but when I have 100-plus packages with large packages included, I start having problems. I manually configure and build my own kernels using stable gentoo-sources.
My Gentoo System
MSI MEG X570 ACE
AMD Ryzen 9 5950X
G.SKILL TridentZ Royal (F4-3200C14Q-128GTRS) DDR4-3200 32GBx4
Corsair HX1000 80 PLUS PLATINUM Certified (ATX12V v2.4 and EPS 2.92)
Kernel Linux 6.12.21-gentoo x86_64
In this case a system reboot allows me to continue the emerge process and complete the system updates. The problem only seems to happen on huge compiles that tax the system for long periods of time. The fact that a reboot resolves it (as a workaround) seems to all but eliminate hardware issues.
Just for reference, here is my Gentoo server (which was a working Windows 10 system that I converted to Gentoo). This system has never experienced these issues whatsoever, even though both machines run similar setups.
MSI X470 Gaming Pro Carbon
AMD Ryzen 7 2700X
Corsair VENGEANCE LPX 32GB (2 x 16GB) DDR4
Seasonic Prime TX-850W
Kernel Linux 6.12.21-gentoo x86_64
TIA Tango 
Last edited by OldTango on Mon Jul 07, 2025 6:25 pm; edited 1 time in total |
|
|
 |
Josef.95 Advocate

Joined: 03 Sep 2007 Posts: 4839 Location: Germany
|
Posted: Mon Jun 23, 2025 10:01 pm Post subject: |
|
|
Hm, for a first check, I would try it with only two memory modules.
If it works, then check it with the other two memory modules too. |
|
|
 |
OldTango l33t


Joined: 21 Feb 2004 Posts: 737
|
Posted: Mon Jun 23, 2025 11:06 pm Post subject: |
|
|
Josef.95 wrote: | Hm, for a first check, I would try it with only two memory modules.
If it works, then check it with the other two memory modules too. | This has been going on for a long time now. I have tried that in the past, but there appeared to be no change in the random failures. I am not sure they are truly random; they may be a result of how long the system has been up and how heavily it has been used.
I almost never reboot the system; it runs 24/7. It seems the longer I go between reboots, the more I have the problem.
TIA Tango  |
|
|
 |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23671
|
Posted: Tue Jun 24, 2025 12:24 am Post subject: |
|
|
This sounds like hardware failure to me. In my opinion, you should use a cooler that can do better than "barely keeps [ ... ] from throttling."
Rebooting probably rearranges the used memory not to use the failing module, though it could also be that the reboot in some way brings the system temperature down enough to stabilize the CPU.
An improper tmpfs configuration cannot cause this.
What diagnostics have you done to rule out hardware failure? |
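One inexpensive diagnostic, before pulling any hardware, is to look for crash reports in the kernel log. The following is only a sketch: the function names `count_fault_lines` and `show_fault_lines` are made up for illustration, and the patterns are assumptions about how segfaults, general protection faults, and machine check events are typically worded in the kernel log, so they may need adjusting.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Gentoo tool.
# Both read kernel-log text from standard input.
# The patterns below are assumptions about common kernel message wording.
FAULT_PATTERN='segfault at|general protection|machine check|hardware error'

# Count the lines that look like crash or hardware-error reports.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| :".
count_fault_lines() {
    grep -c -i -E "$FAULT_PATTERN" || :
}

# Print the matching lines themselves for a closer look.
show_fault_lines() {
    grep -i -E "$FAULT_PATTERN" || :
}
```

Typical use would be `dmesg | show_fault_lines` (reading the kernel log may require root). Faults that hit unrelated programs at unrelated addresses tend to point toward RAM or CPU instability more than toward a bug in a single package.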
|
|
 |
pietinger Moderator

Joined: 17 Oct 2006 Posts: 5932 Location: Bavaria
|
Posted: Tue Jun 24, 2025 12:25 am Post subject: |
|
|
OldTango,
have you set EMERGE_DEFAULT_OPTS="--jobs X" in your make.conf?
(see more here: https://wiki.gentoo.org/wiki/User:Pietinger/Tutorials/Optimize_compile_times#Using_EMERGE_DEFAULT_OPTS )
(maybe show us your settings in make.conf?)
If so, 16 GB for /var/tmp/portage might not be sufficient. True, no single package really needs 16 GB:
https://wiki.gentoo.org/wiki/Portage_TMPDIR_on_tmpfs#Considering_tmpfs_size
... but if you emerge several packages at the same time, you can reach this 16 GB limit.
If you have a 128 GB machine, you can safely set a higher value. Don't worry: the kernel will allocate this memory only when it is needed. See my "df":
Code: | tmpfs 24G 0 24G 0% /var/tmp/portage |
When I am not running an emerge, this directory is of course empty and these 24 GB are not used at all. In other words, the fstab entry only specifies a maximum; nothing is set aside from the kernel's main memory in advance. As long as these 24 GB are not actually in use, they remain available to all applications.
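To see how close a running emerge comes to the tmpfs limit, the "df" check can be wrapped in a small script. This is a sketch only: `parse_use_pct` and `check_build_dir` are hypothetical names, and the 90% default threshold is an arbitrary assumption, not a Portage setting.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Portage.

# Extract the Capacity column (without the "%") from `df -P` output on stdin.
parse_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether the filesystem holding a directory is at or above a fill
# threshold. Arguments: directory (default /var/tmp/portage) and threshold
# in percent (default 90, an arbitrary assumption).
# Returns 0 if below the threshold, 1 if at or above it, 2 if unreadable.
check_build_dir() {
    dir=${1:-/var/tmp/portage}
    limit=${2:-90}
    pct=$(df -P "$dir" 2>/dev/null | parse_use_pct)
    case "$pct" in
        ''|*[!0-9]*)
            echo "cannot read usage for $dir" >&2
            return 2
            ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $dir is ${pct}% full"
        return 1
    fi
    echo "OK: $dir is ${pct}% full"
}
```

Running `check_build_dir` in a second terminal while a large package builds would show whether the limit is ever approached. A full tmpfs normally shows up as "No space left on device" in the build log, so this check is more useful for ruling the tmpfs in or out than for explaining segfaults.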
P.S.: Of course it can be a hardware problem also, as @Hu and @Josef.95 assume. _________________ https://wiki.gentoo.org/wiki/User:Pietinger |
|
|
 |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23671
|
Posted: Tue Jun 24, 2025 1:50 am Post subject: |
|
|
At one time, with certain CFLAGS, clang needed ~30GiB of space in the build directory. |
|
|
 |
OldTango l33t


Joined: 21 Feb 2004 Posts: 737
|
Posted: Tue Jun 24, 2025 3:14 am Post subject: |
|
|
Hu wrote: | This sounds like hardware failure to me. In my opinion, you should use a cooler that can do better than "barely keeps [ ... ] from throttling." | True, but short of a custom water cooling system (not in my budget), I am using the best cooler that was available to me at the time I built the system. A newer version is available now that has a much larger cold plate and an improved pump. I am testing it on my Windows-based gaming machine (AMD Ryzen 9 3950X), whose cooler had reached the end of its effective life. Test results are looking great, but at 200 a pop, it will be a couple of months before I can acquire another one.
Hu wrote: | What diagnostics have you done to rule out hardware failure? | Aside from what Josef.95 suggested, I have removed the side cover and used a very large fan to help with cooling, which did not really help. It kept the temps a little cooler but did not stop the emerge failures. It's hard to test this because after a shutdown to reconfigure hardware, or a system reboot, I cannot reproduce the errors. I took a newly configured gentoo-sources kernel and ran make -j32 on it, expecting that to fail. Nope: it built the kernel in 46 seconds, but I don't fully understand that process.
pietinger wrote: | OldTango,
have you set EMERGE_DEFAULT_OPTS="--jobs X" in your make.conf? | No, I have not. Sorry I didn't post my make.conf earlier.
Code: | CFLAGS="-march=native -O2 -pipe"
CXXFLAGS="${CFLAGS}"
CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sha sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3"
USE="aac aalib alsa cairo caja cdda cddb cups dbus elogind encode exif ffmpeg flac gif gnome-keyring gnutls gstreamer gtk guile ipv6 java jpeg jpeg2k lame libnotify mad mp3 multilib ogg openal opengl oss png policykit pulseaudio python sdl spell sqlite startup-notification svg tiff truetype udev usb v4l vorbis vpx X xattr xinerama x264 xv xvid"
#GENTOO_MIRRORS="ftp://gentoo.mirrors.tds.net/gentoo http://distfiles.gentoo.org http://www.ibiblio.org/pub/Linx/distributions/gentoo http://mirror.usu.edu/mirrors/gentoo ftp://ftp.gtlib.gatech.edu/pub/gentoo"
ALSA_CARDS=""
VIDEO_CARDS="nvidia"
INPUT_DEVICES="libinput"
PORTAGE_ELOG_CLASSES="warn error log"
PORTAGE_ELOG_SYSTEM="save"
LINGUAS="en en_US"
ACCEPT_LICENSE="-* @FREE"
DISTDIR="/var/cache/distfiles"
MAKEOPTS="-j32" |
I had been waiting a while to do system updates, but today was the right day. Running Code: | emerge -avuND --exclude sys-kernel/* --with-bdeps=y @world | showed that 95 packages needed updates, and one was llvm-core/llvm. 55 packages built before llvm failed to compile, so I ran Code: | emerge --resume --keep-going | after which I received the message Code: | * The following 12 packages have failed to build, install, or execute
* postinst:
*
* (llvm-core/llvm-20.1.7:20/20.1::gentoo, ebuild scheduled for merge), Log file:
* '/var/tmp/portage/llvm-core/llvm-20.1.7/temp/build.log'
* (llvm-core/llvm-toolchain-symlinks-20-r1:20/20::gentoo, ebuild scheduled for merge)
* (llvm-core/llvmgold-20:0/0::gentoo, ebuild scheduled for merge)
* (media-libs/mesa-25.0.7:0/0::gentoo, ebuild scheduled for merge)
* (llvm-core/clang-20.1.7:20/20.1::gentoo, ebuild scheduled for merge)
* (llvm-runtimes/compiler-rt-sanitizers-20.1.7:20/20::gentoo, ebuild scheduled for merge)
* (llvm-core/clang-runtime-20.1.7:20/20::gentoo, ebuild scheduled for merge)
* (llvm-runtimes/compiler-rt-20.1.7:20/20::gentoo, ebuild scheduled for merge)
* (llvm-core/clang-toolchain-symlinks-20:20/20::gentoo, ebuild scheduled for merge)
* (app-crypt/gcr-3.41.2-r1:0/1::gentoo, ebuild scheduled for merge), Log file:
* '/var/tmp/portage/app-crypt/gcr-3.41.2-r1/temp/build.log'
* (games-fps/worldofpadman-1.6-r3:0/0::gentoo, ebuild scheduled for merge), Log file:
* '/var/tmp/portage/games-fps/worldofpadman-1.6-r3/temp/build.log'
* (www-client/seamonkey-2.53.20:0/0::gentoo, ebuild scheduled for merge), Log file:
* '/var/tmp/portage/www-client/seamonkey-2.53.20/temp/build.log' | of which 10 were related to llvm failing. I didn't look at the build errors because I didn't expect any new insights into the problem.
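Even when little new insight is expected, the build logs listed above can be triaged quickly, because a full build directory, an out-of-memory kill, and a compiler crash usually leave different messages. The sketch below is illustrative only: `classify_build_log` is a hypothetical name, and the patterns are assumptions about typical compiler and system error wording, so a result of "unknown" just means none of them matched.

```shell
#!/bin/sh
# Hypothetical helper, not part of Portage: make a rough guess at why a
# build failed, based on typical error strings in its build.log.
# The patterns are assumptions and are certainly not exhaustive.
classify_build_log() {
    log=$1
    if grep -q -i 'no space left on device' "$log"; then
        echo "build-dir-full"
    elif grep -q -i -E 'killed signal terminated program|out of memory|cannot allocate memory' "$log"; then
        echo "out-of-memory"
    elif grep -q -i -E 'internal compiler error|segmentation fault' "$log"; then
        echo "compiler-crash"
    else
        echo "unknown"
    fi
}
```

For example, `classify_build_log /var/tmp/portage/llvm-core/llvm-20.1.7/temp/build.log` would inspect the first log named in the message. Compiler crashes that move to a different source file on every attempt are a classic hint of unstable hardware, whereas a crash that always hits the same file more likely points to a software bug.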
I did a system reboot and ran the emerge command again, and all but "games-fps/worldofpadman-1.6-r3" built without failure.
I was monitoring the system, and llvm-core/llvm-20.1.7 was maxing out all 32 threads at 100% and consuming 16 to 18 GB of RAM.
llvm-core/clang-20.1.7 was also maxing out all threads, but it was consuming almost 32 GB of RAM.
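As a sanity check on MAKEOPTS, a commonly cited rule of thumb is to use the smaller of the CPU thread count and the RAM in GiB divided by 2. The sketch below assumes that rule; `suggest_jobs` is a hypothetical helper, and the 2 GiB-per-job figure is an approximation that heavy C++ builds can exceed.

```shell
#!/bin/sh
# Hypothetical helper: suggest a -j value as the smaller of the thread count
# and (RAM in GiB / 2). The 2 GiB-per-job figure is a rule of thumb
# (an assumption here), not a guarantee.
# Arguments: number of CPU threads, RAM in GiB (both whole numbers).
suggest_jobs() {
    threads=$1
    ram_gib=$2
    by_ram=$((ram_gib / 2))
    if [ "$by_ram" -lt 1 ]; then
        by_ram=1
    fi
    if [ "$by_ram" -lt "$threads" ]; then
        echo "$by_ram"
    else
        echo "$threads"
    fi
}
```

With the figures from this thread, `suggest_jobs 32 128` prints 32, which matches the MAKEOPTS="-j32" shown earlier, so by this rule the job count is not oversized for the installed RAM. The tmpfs build directory also lives in RAM, so the real headroom is somewhat smaller than the rule suggests.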
I have had this problem with llvm for a long time. For a short while everything ran smoothly, but now the ugly is back.
TIA Tango  |
|
|
 |
Hu Administrator

Joined: 06 Mar 2007 Posts: 23671
|
Posted: Thu Jun 26, 2025 3:20 pm Post subject: |
|
|
In my opinion, if you're using a cooler that was near top end when new, and it's properly installed and used in a good environment, then it ought to be doing better than yours seems to be doing. However, before modifying hardware, I would rule out faulty RAM. Run a memory test. If you find errors, they are not necessarily defective RAM sticks, but they are a sign of a serious problem. Properly operating systems should be able to run memtest indefinitely with no errors reported. |
|
|
 |
niderecha n00b

Joined: 10 Nov 2024 Posts: 66
|
Posted: Fri Jun 27, 2025 2:58 am Post subject: |
|
|
If we talk about possibly failing memory modules, did you stress test the memory with something like memtest or memtester? |
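For a quick test from the running system, memtester takes an amount of memory and an optional loop count. The sketch below picks that amount from the memory currently available; `memtester_size` is a hypothetical helper, and the default of 80% is an arbitrary assumption chosen to leave the system some headroom.

```shell
#!/bin/sh
# Hypothetical helper: print a size argument for memtester (for example
# "819M") covering a percentage of the currently available memory.
# Arguments: a /proc/meminfo-style file (default /proc/meminfo) and a
# percentage (default 80, an arbitrary assumption).
memtester_size() {
    meminfo=${1:-/proc/meminfo}
    percent=${2:-80}
    awk -v pct="$percent" '/^MemAvailable:/ { printf "%dM\n", $2 / 1024 * pct / 100 }' "$meminfo"
}
```

It could then be used as `memtester "$(memtester_size)" 3`, run as root so the memory can be locked. A userspace tester cannot reach memory that the kernel and other programs are using, so a clean result here does not replace a boot-time memtest run.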
|
|
 |
OldTango l33t


Joined: 21 Feb 2004 Posts: 737
|
Posted: Mon Jun 30, 2025 8:10 pm Post subject: |
|
|
Hu wrote: | In my opinion, if you're using a cooler that was near top end when new, and it's properly installed and used in a good environment, then it ought to be doing better than yours seems to be doing. However, before modifying hardware, I would rule out faulty RAM. Run a memory test. If you find errors, they are not necessarily defective RAM sticks, but they are a sign of a serious problem. Properly operating systems should be able to run memtest indefinitely with no errors reported. | Sorry for the late replies; I have been busy with some other projects.
The cooler was near top end when it was new 4 years ago and has performed well in that time. At the moment the system temp sits at 33C with just a few apps running. When it was new, portage updates could push the temps up to 65C to 68C; now the temps can run 70C to 78C under major loads, still within the 95C max range of the CPU.
I just got the latest versions of memtest and the Ubuntu live DVD. Ubuntu may have memtest on it as well. I thought I would run memtest first, then do a complete fsck to rule out file system and drive issues.
The only hardware problem I have found so far is a USB-C port not working, possibly due to a bad cable.
TIA, Tango  |
|
|
 |
OldTango l33t


Joined: 21 Feb 2004 Posts: 737
|
Posted: Mon Jul 07, 2025 6:24 pm Post subject: |
|
|
After 4 days of testing my RAM, I have found 2 bad sticks: one with a single but repeatable error, and one that memtest86 can't even complete a full pass on without aborting. I am hoping the RMA gets approved, but until then I’ll be running with limited RAM.
Thanks to all.
Tango  |
|
|
 |