mcnutty Tux's lil' helper
Joined: 29 Dec 2009 Posts: 120
|
Posted: Thu Oct 21, 2021 3:14 pm Post subject: Random computer freeze - need help diagnosing |
|
|
I've got an old computer I'm using as a database server (without a graphical interface such as X). Unfortunately, every so often (roughly every 3-7 days) it freezes up. I lose all network connections to it, and trying to access it directly with a keyboard and mouse does not work either. Nothing shows up in /var/log/messages as far as I can tell, and I'm not sure where else to look. Any suggestions on how to troubleshoot this?
Here is my /var/log/messages from the last time it booted until my computer froze today around 7:00am on Oct 21. |
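When a hard freeze leaves nothing in /var/log/messages, a small heartbeat logger can at least narrow down when the machine stopped. The following is only a minimal sketch, assuming Python 3 is installed on the server; the function names and the log path you pass in are hypothetical and not taken from this thread.

```python
import os
import time


def write_heartbeat(path: str) -> str:
    """Append one timestamped line to `path` and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(stamp + " alive\n")
        fh.flush()
        # fsync asks the kernel to push the data to the storage device,
        # so the last line has a good chance of surviving a hard freeze.
        os.fsync(fh.fileno())
    return stamp


def run(path: str, interval_s: float, beats: int) -> None:
    """Write `beats` heartbeats, sleeping `interval_s` seconds after each."""
    for _ in range(beats):
        write_heartbeat(path)
        time.sleep(interval_s)
```

Calling something like `run("/var/log/heartbeat.log", 60, 10**6)` from a service or a screen session would leave a last timestamp within about a minute of the freeze, which can then be compared against cron jobs, backups, or other scheduled activity.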
|
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Thu Oct 21, 2021 4:54 pm Post subject: |
|
|
A few random thoughts:
- Update your BIOS/UEFI to the latest version, if you haven't done it yet
- Check your BIOS/UEFI settings. Make sure you use default values. No overclocking.
- Install memtest86-bin and let it run for at least 2 days. Any memory errors?
- What about your power supply? How old is it?
- Choose a stable longterm kernel (5.10.x)
- Make sure that all modules and options needed for your mainboard and processor are enabled in your kernel.
- Use only stable filesystems: ext4. Btrfs looks suspicious - it has a long history of strange errors - although most of them do NOT end in inaccessible machines.
BTW (not related to the issue you posted): Your random number generator is initialized too late:
Code: | [ 14.500022] random: crng init done |
Either enable (one of) your hardware random number generators in the kernel config or install and enable a package like sys-apps/haveged. See: http://www.issihosts.com/haveged/.
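To check whether a change actually moves the crng initialization earlier, the timestamp can be pulled out of the kernel log after each boot. This is a rough sketch, assuming the log line has the same shape as the one quoted above; the function name is hypothetical.

```python
import re
from typing import Optional

# Matches kernel log lines such as "[   14.500022] random: crng init done".
CRNG_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+random: crng init done")


def crng_init_seconds(log_text: str) -> Optional[float]:
    """Return seconds since boot at which the crng finished initializing.

    Returns None if no matching line is found in `log_text`.
    """
    for line in log_text.splitlines():
        match = CRNG_RE.search(line)
        if match:
            return float(match.group(1))
    return None
```

Feeding it the output of `dmesg` (or the relevant part of /var/log/messages) before and after enabling a hardware RNG or haveged gives two numbers to compare.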
Last edited by mike155 on Sat Oct 23, 2021 11:46 am; edited 1 time in total |
|
|
|
mcnutty Tux's lil' helper
Joined: 29 Dec 2009 Posts: 120
|
Posted: Thu Oct 21, 2021 6:53 pm Post subject: |
|
|
Thanks for the tips.
1) I'll check that out. I don't think it's massively out of date, but it's been a while since I checked.
2) It should basically be at default, although maybe an XMP profile is set. I can try disabling that.
3) Good idea.
4) The power supply is relatively new, less than a year old I think. I also used a tester when installing it to make sure it was roughly to spec, although that's no guarantee.
5) I'll try downgrading.
6) Everything should be enabled, but I can always double check.
7) Only using stable filesystems on / and /boot. Although I'm using zfs on a separate partition.
I'll also try to take care of the crng issue.
As a side side note... there are several network drive mount failures in the log:
Code: | Oct 21 07:53:58 green kernel: [ 11.563656] CIFS: VFS: cifs_mount failed w/return code = -101
Oct 21 07:53:58 green kernel: [ 11.566012] CIFS: Attempting to mount \\truenas\Backup |
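The return code in that message is a negated errno value. On Linux, errno 101 is ENETUNREACH ("Network is unreachable"), which fits a mount attempted about 11 seconds into boot, before the network is up. A small sketch for decoding such lines follows; the helper names are hypothetical, and the errno names come from whatever platform runs the script, so it should be run on Linux.

```python
import errno
import os
import re
from typing import Optional

# Matches the CIFS failure message format seen in the log above.
RC_RE = re.compile(r"cifs_mount failed w/return code = (-?\d+)")


def cifs_return_code(line: str) -> Optional[int]:
    """Extract the numeric return code from a cifs_mount failure line."""
    match = RC_RE.search(line)
    return int(match.group(1)) if match else None


def describe(code: int) -> str:
    """Translate a (negated) kernel return code into an errno name and text."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{name}: {os.strerror(number)}"
```

For the line in the log, `cifs_return_code(...)` gives -101, and `describe(-101)` shows the symbolic name for the local platform.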
I was having a similar problem as reported here. I followed the advice to create a dhcpcd-hook, which does allow me to mount the drives once the network is available. However, despite adding noauto to the mount options in /etc/fstab, I still get the errors above.
Example line in /etc/fstab
Code: | //truenas/Pictures /mnt/truenas/pictures cifs vers=3.0,noauto,iocharset=utf8,file_mode=0777,dir_mode=0777,rsize=1049000,wsize=1049000,nofail 0 0 |
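With this many options on one line, it is easy to miss whether noauto really ended up in the options field of every CIFS entry. The sketch below splits an fstab line into its fields and options so that can be checked mechanically. It is a simplified parser with hypothetical names: it assumes whitespace-separated fields and does not handle escaped spaces (`\040`) or comment lines.

```python
from typing import Dict, NamedTuple, Optional


class FstabEntry(NamedTuple):
    device: str
    mountpoint: str
    fstype: str
    options: Dict[str, Optional[str]]
    dump: int
    passno: int


def parse_fstab_line(line: str) -> FstabEntry:
    """Parse one non-comment fstab line into its fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fstab fields")
    options: Dict[str, Optional[str]] = {}
    for opt in fields[3].split(","):
        # Options are either bare flags (noauto) or key=value pairs (vers=3.0).
        key, sep, value = opt.partition("=")
        options[key] = value if sep else None
    # The fifth and sixth fields are optional and default to 0.
    dump = int(fields[4]) if len(fields) > 4 else 0
    passno = int(fields[5]) if len(fields) > 5 else 0
    return FstabEntry(fields[0], fields[1], fields[2], options, dump, passno)
```

Running it over every cifs line and checking `"noauto" in entry.options` shows whether any share is still eligible for automatic mounting at boot.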
|
|
|
|
figueroa Advocate
Joined: 14 Aug 2005 Posts: 2963 Location: Edge of marsh USA
|
Posted: Sat Oct 23, 2021 3:38 am Post subject: |
|
|
Random freezes without proximate cause are almost always hardware failure or heat. New doesn't mean it isn't failing. _________________ Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi |
|
|
|
szatox Advocate
Joined: 27 Aug 2013 Posts: 3137
|
Posted: Sat Oct 23, 2021 8:30 am Post subject: |
|
|
figueroa wrote: | Random freezes without proximate cause are almost always hardware failure or heat. New doesn't mean it isn't failing. |
I've had similar issues with some particular kernel versions (which made it to gentoo's stable tree). It was a few years ago and I'm still using the same hardware, and the issue was gone after an update. And then returned with another update. And then was gone forever after another update. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54237 Location: 56N 3W
|
Posted: Sat Oct 23, 2021 10:03 am Post subject: |
|
|
szatox,
Forever is a very long time ...
Quote: | Absence of evidence is not evidence of absence |
or, it's not possible to prove the absence of something.
e.g. We have not found SETI yet, but there has been a lot of looking.
That does not mean that we are alone in the universe. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
mcnutty Tux's lil' helper
Joined: 29 Dec 2009 Posts: 120
|
Posted: Sat Oct 23, 2021 2:45 pm Post subject: |
|
|
figueroa wrote: | Random freezes without proximate cause are almost always hardware failure or heat. New doesn't mean it isn't failing. |
I have almost no doubt it is some kind of hardware failure, but what evidence is there that it's the PSU, vs some other component, other than sometimes that's the problem? I'm trying to avoid buying new hardware on a hunch until I have (hopefully) solved my problem.
I also sort of suspect it might be the network card. Are there any diagnostics that can help me decide if it's one rather than the other without having to buy a new PSU and network card before I even know that's the problem? |
|
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Sat Oct 23, 2021 3:43 pm Post subject: |
|
|
What is the result of the memtest86-bin test? It is reasonable to start with that test. Let it run for at least 24 hours. |
|
|
|
figueroa Advocate
Joined: 14 Aug 2005 Posts: 2963 Location: Edge of marsh USA
|
Posted: Sat Oct 23, 2021 3:56 pm Post subject: |
|
|
Preventive maintenance may go a long way toward solving problems. Remove and reinstall all components and reseat all plugs. Blow out dust. Refresh thermal paste between CPU and heat sink.
Unplug and remove everything not absolutely needed. (Floppy drives, optical drives, USB peripherals)
Stress test the best you can. Transfer groups of very large files over the network. Compile a series of large programs.
If you have parts, start swapping in components. (Power supply, removable cards.)
Open the box and put a fan blowing on it. If failures stop, it's heat. _________________ Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi |
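To test the heat theory while stress testing, the temperatures the kernel already exposes can be sampled and logged. The sketch below reads the Linux hwmon sysfs files, where each `temp*_input` value is reported in millidegrees Celsius. The function name and the key format are hypothetical, and which chips show up depends on the sensor drivers enabled in the kernel.

```python
import glob
import os
from typing import Dict


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> Dict[str, float]:
    """Return {"hwmonN:chip:tempM": degrees Celsius} for every readable sensor."""
    temps: Dict[str, float] = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        hw = os.path.basename(chip_dir)
        try:
            with open(os.path.join(chip_dir, "name")) as fh:
                chip = fh.read().strip()
        except OSError:
            chip = "unknown"
        for path in sorted(glob.glob(os.path.join(chip_dir, "temp*_input"))):
            try:
                with open(path) as fh:
                    milli_c = int(fh.read().strip())
            except (OSError, ValueError):
                # Some sensors are present but not readable; skip them.
                continue
            sensor = os.path.basename(path)[: -len("_input")]
            temps[f"{hw}:{chip}:{sensor}"] = milli_c / 1000.0
    return temps
```

Printing the result every few seconds during a compile or a large network transfer, and writing it to a file that is synced to disk, would show whether temperatures climb shortly before a freeze.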
|
|
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21635
|
Posted: Sat Oct 23, 2021 4:24 pm Post subject: |
|
|
If this system is not running a GUI, why does it have the nVidia proprietary modules installed and loaded? There's no evidence they are the cause, but if you don't need a GUI, I suggest removing the nVidia modules and, if possible, the graphics card that motivated installing the drivers. If nothing else, you will probably get at least a bit of power savings by not having a discrete graphics card in the system.
Along the line of removing optional components, if you suspect the network card, you could try removing it temporarily. If you don't have a spare, then this would inhibit the system's use as a database server, which may or may not be too heavy a burden on your other uses. Still, as a temporary measure, it might be worth trying for the sake of excluding culprits if you want to avoid speculative hardware purchases. |
|
|
|
mcnutty Tux's lil' helper
Joined: 29 Dec 2009 Posts: 120
|
Posted: Mon Oct 25, 2021 8:15 pm Post subject: |
|
|
Hu wrote: | If this system is not running a GUI, why does it have the nVidia proprietary modules installed and loaded? There's no evidence they are the cause, but if you don't need a GUI, I suggest removing the nVidia modules and, if possible, the graphics card that motivated installing the drivers. If nothing else, you will probably get at least a bit of power savings by not having a discrete graphics card in the system. |
I was occasionally using the machine for machine learning. I'm using it less and less for that and haven't actually used it since dropping the GUI several weeks ago. I could potentially switch to the nouveau driver if I decide to give up using it for machine learning altogether, but it would be hard to get rid of the card altogether for emergency system maintenance.
Hu wrote: | Along the line of removing optional components, if you suspect the network card, you could try removing it temporarily. If you don't have a spare, then this would inhibit the system's use as a database server, which may or may not be too heavy a burden on your other uses. Still, as a temporary measure, it might be worth trying for the sake of excluding culprits if you want to avoid speculative hardware purchases. |
Since the last time I posted, I was able to try a different network card and the system still froze, so that's looking less likely than I thought. I also ran memtest86-bin, which passed without errors. I've also run mprime for several hours without issue. The Tctl topped out at about 84 (with Tdie @ 64), so I don't think it's an overheating issue. The only other hardware installed is an M.2 SSD. smartmontools reports that the drive is in good health and idling around 40°C. I'm sure it could theoretically be the cause, but it seems less likely.
Thanks for all the input from everyone. At least I seem to be narrowing it down. |
|
|
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Mon Oct 25, 2021 8:34 pm Post subject: |
|
|
Does your machine freeze while you're running memtest86? If it does, it's a real hardware problem and you can rule out anything related to Linux or software. |
|
|
|
mcnutty Tux's lil' helper
Joined: 29 Dec 2009 Posts: 120
|
Posted: Mon Oct 25, 2021 8:37 pm Post subject: |
|
|
mike155 wrote: | Does your machine freeze while you're running memtest86? If it does, it's a real hardware problem and you can rule out anything related to Linux or software. |
Nope, no crashes while running memtest86 overnight. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54237 Location: 56N 3W
|
Posted: Mon Oct 25, 2021 9:27 pm Post subject: |
|
|
mcnutty,
Try prime95 and watch the temperature.
It's a CPU and cooling stress test.
If it fails, it's certainly hardware related, but it need not be faulty parts.
memtest86 is a good RAM subsystem test, as long as you booted into it, so that's a good sign. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
|