Gentoo Forums
Random computer freeze - need help diagnosing
mcnutty
Tux's lil' helper


Joined: 29 Dec 2009
Posts: 120

Posted: Thu Oct 21, 2021 3:14 pm    Post subject: Random computer freeze - need help diagnosing

I've got an old computer I'm using as a database server (without any graphical interface, e.g. X). Unfortunately, every so often (I'd say every 3-7 days) it freezes up. I lose all network connections to it, and trying to access it directly with a keyboard and mouse does not work either. Nothing shows up in /var/log/messages as far as I can tell, and I'm not sure where else to look. Any suggestions on how to troubleshoot this?

Here is my /var/log/messages from the last time it booted until my computer froze today around 7:00am on Oct 21.
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Thu Oct 21, 2021 4:54 pm

A few random thoughts:
  1. Update your BIOS/UEFI to the latest version, if you haven't done it yet
  2. Check your BIOS/UEFI settings. Make sure you use default values. No overclocking.
  3. Install memtest86-bin and let it run for at least 2 days. Any memory errors?
  4. What about your power supply? How old is it?
  5. Choose a stable longterm kernel (5.10.x)
  6. Make sure that all modules and options needed for your mainboard and processor are enabled in your kernel.
  7. Use only stable filesystems: ext4. Btrfs looks suspicious - it has a long history of strange errors - although most of them do NOT end in inaccessible machines.


BTW (not related to the issue you posted): Your random number generator is initialized too late:
Code:
[   14.500022] random: crng init done

Either enable (one of) your hardware random number generators in the kernel config or install and enable a package like sys-apps/haveged. See: http://www.issihosts.com/haveged/.
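
To see whether a change actually moves the "crng init done" timestamp earlier, the kernel log can be scanned for that line. The following is a minimal sketch, assuming the log keeps the bracketed seconds-since-boot prefix shown above; the function name is made up for illustration.

```python
import re
from typing import Iterable, Optional

# Matches kernel log lines such as "[   14.500022] random: crng init done".
CRNG_DONE = re.compile(r"\[\s*(\d+\.\d+)\]\s+random: crng init done")


def crng_init_seconds(lines: Iterable[str]) -> Optional[float]:
    """Return the seconds since boot at which the CRNG was ready, or None."""
    for line in lines:
        match = CRNG_DONE.search(line)
        if match:
            return float(match.group(1))
    return None
```

Feeding it the lines of /var/log/messages (or saved dmesg output) before and after enabling a hardware RNG or haveged should show whether the value drops.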


Last edited by mike155 on Sat Oct 23, 2021 11:46 am; edited 1 time in total
mcnutty
Tux's lil' helper


Joined: 29 Dec 2009
Posts: 120

Posted: Thu Oct 21, 2021 6:53 pm

Thanks for the tips.

1) I'll check that out. I don't think it's massively out of date, but it's been a while since I checked.
2) It should basically be at default, although maybe an XMP profile is set. I can try disabling that.
3) Good idea.
4) The power supply is relatively new, less than a year old I think. I also used a tester when installing it to make sure it was roughly to spec, although that's no guarantee.
5) I'll try downgrading.
6) Everything should be enabled, but I can always double check.
7) Only using stable filesystems on / and /boot. Although I'm using zfs on a separate partition.

I'll also try to take care of the crng issue.

As a side side note... there are several network drive mount failures in the log:
Code:
Oct 21 07:53:58 green kernel: [   11.563656] CIFS: VFS: cifs_mount failed w/return code = -101
Oct 21 07:53:58 green kernel: [   11.566012] CIFS: Attempting to mount \\truenas\Backup


I was having a similar problem as reported here. I followed the advice to create a dhcpcd-hook, which does allow me to mount the drives after the network is available. However, despite adding noauto to the mount options in /etc/fstab, I still get the errors above.

Example line in /etc/fstab
Code:
//truenas/Pictures  /mnt/truenas/pictures  cifs vers=3.0,noauto,iocharset=utf8,file_mode=0777,dir_mode=0777,rsize=1049000,wsize=1049000,nofail  0 0
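
For reference, on Linux a return code of -101 corresponds to ENETUNREACH ("Network is unreachable"), which fits a mount being attempted before the network is up. The sketch below is only an illustration, with made-up helper names: it splits an fstab line into its fields so that options such as noauto and nofail can be checked on every CIFS entry, and it translates a negative kernel return code into its errno name.

```python
import errno
from typing import Dict, NamedTuple, Optional


class FstabEntry(NamedTuple):
    source: str
    mountpoint: str
    fstype: str
    options: Dict[str, Optional[str]]


def parse_fstab_line(line: str) -> Optional[FstabEntry]:
    """Parse one fstab line; return None for blank lines and comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = text.split()
    if len(fields) < 3:
        return None
    options: Dict[str, Optional[str]] = {}
    if len(fields) > 3:
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flags like "noauto" have no value; "vers=3.0" does.
            options[key] = value if sep else None
    return FstabEntry(fields[0], fields[1], fields[2], options)


def errno_name(kernel_return_code: int) -> str:
    """Map a negative kernel return code (e.g. -101) to its errno name."""
    return errno.errorcode.get(abs(kernel_return_code), "unknown")
```

Running parse_fstab_line over each line of /etc/fstab and listing the cifs entries whose options lack noauto is one way to spot a share that is still being mounted at boot.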
figueroa
Advocate


Joined: 14 Aug 2005
Posts: 2963
Location: Edge of marsh USA

Posted: Sat Oct 23, 2021 3:38 am

Random freezes without proximate cause are almost always hardware failure or heat. New doesn't mean it isn't failing.
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3137

Posted: Sat Oct 23, 2021 8:30 am

figueroa wrote:
Random freezes without proximate cause are almost always hardware failure or heat. New doesn't mean it isn't failing.

I've had similar issues with some particular kernel versions (which made it into Gentoo's stable tree). It was a few years ago and I'm still using the same hardware. The issue was gone after an update, then returned with another update, and then was gone forever after yet another update.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54237
Location: 56N 3W

Posted: Sat Oct 23, 2021 10:03 am

szatox,

Forever is a very long time ...
Quote:
Absence of evidence is not evidence of absence


or, it's not possible to prove the absence of something.
e.g. We have not found SETI yet but there has been a lot of looking.
That does not mean that we are alone in the universe.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
mcnutty
Tux's lil' helper


Joined: 29 Dec 2009
Posts: 120

Posted: Sat Oct 23, 2021 2:45 pm

figueroa wrote:
Random freezes without proximate cause are almost always hardware failure or heat. New doesn't mean it isn't failing.


I have almost no doubt it is some kind of hardware failure, but what evidence is there that it's the PSU, vs some other component, other than sometimes that's the problem? I'm trying to avoid buying new hardware on a hunch until I have (hopefully) solved my problem.

I also sort of suspect it might be the network card. Are there any diagnostics that can help me decide if it's one rather than the other without having to buy a new PSU and network card before I even know that's the problem?
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Sat Oct 23, 2021 3:43 pm

What is the result of the memtest86-bin test? It is reasonable to start with that test. Let it run for at least 24 hours.
figueroa
Advocate


Joined: 14 Aug 2005
Posts: 2963
Location: Edge of marsh USA

Posted: Sat Oct 23, 2021 3:56 pm

Preventive maintenance may go a long way to solving problems. Remove and reinstall all components and reseat all plugs. Blow out dust. Refresh thermal paste between CPU and heat sink.

Unplug and remove everything not absolutely needed. (Floppy drives, optical drives, USB peripherals)

Stress test the best you can. Transfer groups of very large files over the network. Compile a series of large programs.

If you have parts, start swapping in components. (Power supply, removable cards.)

Open the box and put a fan blowing on it. If failures stop, it's heat.
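
Since a hard freeze leaves nothing in the logs, it can help to at least record when the machine was last alive, so the freeze time can be matched against stress runs or scheduled jobs. Below is a minimal sketch of such a heartbeat writer; the function names, the interval and any log path passed to it are assumptions for illustration, not something from this thread.

```python
import os
import time
from datetime import datetime, timezone
from typing import Optional


def write_heartbeat(path: str) -> str:
    """Append one UTC timestamp to `path` and force it to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(stamp + "\n")
        handle.flush()
        # fsync so the last line survives a hard freeze or power cut.
        os.fsync(handle.fileno())
    return stamp


def run_heartbeat(path: str, interval_seconds: float = 60.0,
                  beats: Optional[int] = None) -> None:
    """Write a heartbeat every `interval_seconds`; loop forever if beats is None."""
    written = 0
    while beats is None or written < beats:
        write_heartbeat(path)
        written += 1
        time.sleep(interval_seconds)
```

After the next freeze, the last line in the file gives the time of death to within one interval. Calling run_heartbeat with a path of your choosing from a background service is one way to use it.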
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21635

Posted: Sat Oct 23, 2021 4:24 pm

If this system is not running a GUI, why does it have the nVidia proprietary modules installed and loaded? There's no evidence they are the cause, but if you don't need a GUI, I suggest removing the nVidia modules and, if possible, removing the graphics card that motivated installing the drivers. If nothing else, you will probably get at least a bit of power savings by not having a discrete graphics card in the system.

Along the line of removing optional components, if you suspect the network card, you could try removing it temporarily. If you don't have a spare, then this would inhibit the system's use as a database server, which may or may not be too heavy a burden on your other uses. Still, as a temporary measure, it might be worth trying for the sake of excluding culprits if you want to avoid speculative hardware purchases.
mcnutty
Tux's lil' helper


Joined: 29 Dec 2009
Posts: 120

Posted: Mon Oct 25, 2021 8:15 pm

Hu wrote:
If this system is not running a GUI, why does it have the nVidia proprietary modules installed and loaded? There's no evidence they are the cause, but if you don't need a GUI, I suggest removing the nVidia modules and, if possible, removing the graphics card that motivated installing the drivers. If nothing else, you will probably get at least a bit of power savings by not having a discrete graphics card in the system.

I was occasionally using the machine for machine learning. I'm using it less and less for that and haven't actually used it since dropping the GUI several weeks ago. I could potentially switch to the nouveau driver if I decide to give up using it for machine learning altogether, but it would be hard to get rid of the card altogether for emergency system maintenance.

Hu wrote:
Along the line of removing optional components, if you suspect the network card, you could try removing it temporarily. If you don't have a spare, then this would inhibit the system's use as a database server, which may or may not be too heavy a burden on your other uses. Still, as a temporary measure, it might be worth trying for the sake of excluding culprits if you want to avoid speculative hardware purchases.

Since the last time I posted, I was able to try a different network card and the system still froze, so that's looking less likely than I thought. I also ran memtest86-bin, which passed without errors. I've also run mprime for several hours without issue. The Tctl topped out at about 84°C (with Tdie at 64°C), so I don't think it's an overheating issue. The only other hardware installed is an M.2 SSD. smartmontools reports that the drive is in good health and idling around 40°C. I'm sure it could theoretically be the cause, but it seems less likely.

Thanks for all the input from everyone. At least I seem to be narrowing it down.
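
For anyone wanting to log readings like Tctl and Tdie over a longer stress run instead of watching them by hand, the kernel exposes sensor values under /sys/class/hwmon in millidegrees Celsius. The following is a rough sketch under that assumption; the function names are made up, and which chips and labels show up depends on the drivers loaded on a given machine.

```python
import glob
import os
from typing import Dict, Optional


def _read_text(path: str) -> Optional[str]:
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    except OSError:
        return None


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> Dict[str, float]:
    """Collect all temp*_input readings as {"chip/label": degrees Celsius}."""
    temps: Dict[str, float] = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        chip = _read_text(os.path.join(chip_dir, "name")) or os.path.basename(chip_dir)
        for input_path in sorted(glob.glob(os.path.join(chip_dir, "temp*_input"))):
            raw = _read_text(input_path)
            if raw is None:
                continue
            try:
                millidegrees = int(raw)
            except ValueError:
                continue
            stem = input_path[: -len("_input")]
            # Fall back to "temp1", "temp2", ... when no label file exists.
            label = _read_text(stem + "_label") or os.path.basename(stem)
            temps[f"{chip}/{label}"] = millidegrees / 1000.0
    return temps
```

Calling read_hwmon_temps() once a minute and appending the result to a file gives a temperature history that can be checked after the next freeze.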
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Mon Oct 25, 2021 8:34 pm

Does your machine freeze while you're running memtest86? If it does, it's a real hardware problem and you can rule out anything related to Linux or software.
mcnutty
Tux's lil' helper


Joined: 29 Dec 2009
Posts: 120

Posted: Mon Oct 25, 2021 8:37 pm

mike155 wrote:
Does your machine freeze while you're running memtest86? If it does, it's a real hardware problem and you can rule out anything related to Linux or software.

Nope, no crashes while running memtest86 overnight.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54237
Location: 56N 3W

Posted: Mon Oct 25, 2021 9:27 pm

mcnutty,

Try prime95 and watch the temperature.
It's a CPU and cooling stress test.

If it fails, it's certainly hardware related, but it need not be faulty parts.

memtest86 is a good RAM subsystem test, as long as you booted into it, so that's a good sign.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
All times are GMT