Gentoo Forums
Method to test hardware functionality of crashing system?
RayDude
Veteran


Joined: 29 May 2004
Posts: 1676
Location: San Jose, CA

Posted: Thu Sep 10, 2020 7:56 pm    Post subject: Method to test hardware functionality of crashing system?

I have a three year old server that is hanging every few days overnight.

It's done it three times in the last week or so.

The strange thing is: it's not totally dead. The screen freezes and I can't switch to console, but I can log in remotely. When I do, I can't kill any application that's hung which includes X, kde, plasma. I can't even get it to reboot.

I have to hard reset or power cycle it.

This feels like a hardware issue to me. Does anyone have any tricks to figuring out if it's hardware or if I somehow botched the software so badly that X is hanging beyond kill -9?
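A process that survives kill -9 is usually stuck in uninterruptible sleep (state D) inside the kernel, so the signal is not acted on until the kernel call returns. Whether that is what happened here is an assumption; the post does not say. A minimal sketch for checking from a remote login, reading /proc directly (the function names are made up for illustration):

```python
import os

def parse_stat_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm sits in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    left, right = stat_line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"),
                      errors="replace") as fh:
                comm, state = parse_stat_state(fh.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited mid-scan, or unexpected format
        if state == "D":
            found.append((int(entry), comm))
    return found

if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If X or plasma shows up in that list during a hang, the process is waiting on something in the kernel (a driver or I/O) rather than ignoring the signal.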

I was hoping to put off upgrading this system until next year... I'm going to start looking for motherboard and CPU deals...

Thanks in advance.
_________________
Some day there will only be free software.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 46069
Location: 56N 3W

Posted: Thu Sep 10, 2020 8:20 pm

RayDude,

Can you read the logs, or get logs off it?
dmesg would be good.

What does smartctl -x say about the HDD?

Boot into a few cycles of memtest86
A fail does not always mean a RAM fail.

Take out half the RAM. Does it still hang?
Now try with only the half of the RAM that was out.

Put all the RAM back ... what happens now?
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
RayDude

Posted: Fri Sep 11, 2020 12:40 am

Thanks Neddy!

NeddySeagoon wrote:
RayDude,

Can you read the logs, or get logs off it?
dmesg would be good.



I did not think of this. So obvious. Next time it happens I will definitely check both.


NeddySeagoon wrote:
What does smartctl -x say about the HDD?


I know the hard drives / ssd are okay. I keep tabs on them.


NeddySeagoon wrote:

Boot into a few cycles of memtest86
A fail does not always mean a RAM fail.

Take out half the RAM. Does it still hang?
Now try with only the half of the RAM that was out.

Put all the RAM back ... what happens now?


I will consider this.

The first failures happened after I installed a new BIOS and tried to run the memory faster. It worked great, until it didn't.

But then I slowed everything back down to below stock (this is a Ryzen 5 1600): I put the memory at 2133, and I've never changed the CPU frequency or voltage from stock. It is still happening.

It makes me wonder if the BIOS upgrade broke something. I'll check whether Gigabyte has released another BIOS to fix this one...

Thanks again. I really appreciate you taking the time to respond.
RayDude

Posted: Fri Sep 11, 2020 2:11 am

Update: there was a new BIOS released last month. It contains the AGESA 1.0.0.6 update. I'm hoping that helps.

I'm leaving everything stock to see if it fails again.

I'm crossing my fingers...
RayDude

Posted: Sun Sep 13, 2020 12:31 am

I updated the BIOS, which reset everything to BIOS defaults, and I left it that way.

I did an emerge -DNuq @world yesterday and things went south again. Again, X windows died. Black screen, no activity.

But I was able to log in remotely and check the system. Here is the end of dmesg. Keep in mind that some of the nvidia driver activity is me trying, and failing, to get X to restart.

Code:
[29781.707378] elogind-daemon[2003]: New session c17 of user man.
[29782.656892] elogind-daemon[2003]: Removed session c17.
[54302.920055] elogind-daemon[2003]: New session 6 of user XXXX.
[54311.100668] elogind-daemon[2003]: Removed session 6.
[54319.280085] elogind-daemon[2003]: New session 7 of user XXXX.
[54441.427526] TCP: request_sock_TCP: Possible SYN flooding on port 56190. Sending cookies.  Check SNMP counters.
[76583.096115] fuse: init (API version 7.31)
[84602.043020] elogind-daemon[2003]: New session 8 of user XXXX.
[86195.821445] udevd[805]: invalid key/value pair in file /lib/udev/rules.d/60-steam-input.rules on line 42, starting at character 82 ('u')
[86723.913499] elogind-daemon[2003]: Removed session 7.
[87027.491302] traps: ThreadPoolSingl[4325] trap int3 ip:563acaf0f594 sp:7fd05b513f50 error:0 in chrome (deleted)[563ac804b000+7bf1000]
[87028.245102] elogind-daemon[2003]: Removed session 3.
[87092.876276] elogind-daemon[2003]: New session 9 of user root.
[87094.733875] elogind-daemon[2003]: Removed session 9.
[87103.959788] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[87103.959886] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[87104.611948] elogind-daemon[2003]: New session 10 of user mythtv.
[87158.604771] elogind-daemon[2003]: Removed session 10.
[87171.388304] elogind-daemon[2003]: New session 11 of user root.
[87780.826265] [drm] [nvidia-drm] [GPU ID 0x00000800] Unloading driver
[87780.836313] nvidia-modeset: Unloading
[87780.845269] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246
[87823.682060] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[87823.682484] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[87823.882322] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.66  Wed Aug 12 19:42:48 UTC 2020
[87824.142123] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[87824.142224] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[93754.509148] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246


I ran /etc/init.d/xdm stop and it said sddm stopped, but X didn't.

I tried to remove the nvidia drivers and the system wouldn't do it, even with modprobe -r -f because "module in use".

ps -ef | grep plasma showed that plasma was still running. I killed it.

ps -ef | grep kde (I think) showed that something of kde was still running and I killed it.

Then I could unload the nvidia modules.
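Rather than grepping ps for likely suspects, the processes keeping the module busy can be looked up directly. A minimal sketch, assuming the holders are processes with /dev/nvidia* device nodes open (I have not confirmed that is the only thing that pins the module; dependent modules such as nvidia_drm and nvidia_modeset also count as users). The function name is made up, and it needs root to see other users' processes:

```python
import os

def holders_of(prefix="/dev/nvidia", proc_root="/proc"):
    """Map pid -> sorted device paths starting with `prefix` that it has open."""
    holders = {}
    if not os.path.isdir(proc_root):
        return holders
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # not permitted, or the process has exited
        targets = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target.startswith(prefix):
                targets.add(target)
        if targets:
            holders[int(entry)] = sorted(targets)
    return holders

if __name__ == "__main__":
    for pid, devices in sorted(holders_of().items()):
        print(pid, " ".join(devices))
```

Any pid it prints has to exit (or be killed) before modprobe -r can succeed.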

Then I tried to start sddm (xdm) again and nothing. The nvidia driver didn't even load. I loaded it by hand and drm didn't load, only the nvidia module loaded and some of the output in dmesg is what it had to say...

Nothing I did could get video to recover.

But CTRL-ALT-F1 did get me to a working console.

I'm starting to think the video card or motherboard is going bad. I wonder if there's dust caked on the video card. I didn't study it closely the last time I had the case open. I should probably check. I wonder if things are overheating. Ever since the thermal monitor for KDE stopped working I haven't been paying attention to the temps. I should set up a script and watch the next time I emerge @world...
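A minimal sketch of such a temperature logging script, assuming the kernel exposes sensors under /sys/class/hwmon (for this CPU that would likely mean having the k10temp driver enabled, which is an assumption). The function names and the temps.log file name are made up for illustration. The GPU may not show up here under the proprietary nvidia driver, so this mostly covers the CPU and board sensors:

```python
import glob
import os
import time

def millideg_to_celsius(raw):
    """hwmon temp*_input files hold millidegrees Celsius as text."""
    return int(raw.strip()) / 1000.0

def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"chip/tempN": degrees_C} for every readable sensor."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as fh:
                chip = fh.read().strip()
            with open(path) as fh:
                value = millideg_to_celsius(fh.read())
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip it
        label = os.path.basename(path)[: -len("_input")]
        temps[f"{chip}/{label}"] = value
    return temps

def log_forever(interval_s=10, logfile="temps.log"):
    """Append one timestamped line per interval until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        line = " ".join(f"{k}={v:.1f}C" for k, v in read_temps().items())
        with open(logfile, "a") as fh:
            fh.write(f"{stamp} {line}\n")
        time.sleep(interval_s)
```

Calling log_forever() in a second terminal before starting the emerge leaves a log on disk, so the last readings before a hang are still there after the reset.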

I did something wacky in BIOS for the next experiment. I turned the PCIe ports down to PCIe Gen 1 to see if it makes a difference. I'll do an emerge @world next Friday and see what happens.

It's funny, I had a very similar problem in the system this system replaced a couple years ago and I'm pretty sure the video card died in very similar ways. I still have it, can't throw out a Geforce, I might need it in a pinch, but it's just cooking in the garage summer heat for the last several years.

Man I want this thing to survive until Zen 3 comes out and doesn't bust a wallet.

Thanks for listening. I'll keep posting status because it helps me organize my thoughts.

PS. I wonder if the syn flooding is a symptom of the crash...

Edit: I have a huge SHM for zoneminder on this machine. This feels like it might be a memory management issue that affects X and plasma. I've been thinking about doubling my RAM to 32 GB. DRAM prices are in freefall at the moment, should bottom out by the end of the year, maybe first quarter as manufacturers scale production back. But for now DRAM and SSDs are getting cheaper by the week. But I hesitate to buy new RAM when a new system might want DDR5. I'll have to check to see if Zen 3 supports DDR5... I suspect not...
RayDude

Posted: Wed Sep 16, 2020 10:36 pm

I don't know if anyone can help me, but I got an oops.

Code:

[408991.171083] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[408991.171092] CPU: 0 PID: 3377 Comm: Xorg Tainted: P           O    T 5.8.8-gentoo #1
[408991.171094] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[408991.171095] Call Trace:
[408991.171104]  dump_stack+0x6d/0x90
[408991.171109]  warn_alloc.cold+0x74/0xdb
[408991.171113]  ? __alloc_pages_direct_compact+0x11d/0x140
[408991.171117]  __alloc_pages_slowpath.constprop.0+0xb53/0xb90
[408991.171121]  ? wake_up_q+0x90/0x90
[408991.171124]  ? prep_new_page+0xbd/0xc0
[408991.171127]  __alloc_pages_nodemask+0x210/0x240
[408991.171131]  kmalloc_order+0x1b/0x60
[408991.171148]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]
[408991.171168]  _nv002653kms+0x16/0x30 [nvidia_modeset]
[408991.171185]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]
[408991.171200]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171202]  ? __alloc_pages_nodemask+0x11b/0x240
[408991.171216]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[408991.171219]  ? kmalloc_order+0x57/0x60
[408991.171232]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171245]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[408991.171259]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[408991.171273]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[408991.171449]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]
[408991.171452]  ? ksys_ioctl+0x82/0xc0
[408991.171454]  ? __x64_sys_ioctl+0x11/0x20
[408991.171457]  ? do_syscall_64+0x3e/0xb0
[408991.171460]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[408991.171475] Mem-Info:
[408991.171482] active_anon:2641871 inactive_anon:301721 isolated_anon:0
                 active_file:492415 inactive_file:210014 isolated_file:0
                 unevictable:24 dirty:34 writeback:0
                 slab_reclaimable:142143 slab_unreclaimable:30998
                 mapped:1689062 shmem:1608888 pagetables:19906 bounce:0
                 free:176615 free_pcp:0 free_cma:0
[408991.171486] Node 0 active_anon:10567484kB inactive_anon:1206884kB active_file:1969660kB inactive_file:840056kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6756248kB dirty:136kB writeback:0kB shmem:6435552kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
[408991.171490] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[408991.171491] lowmem_reserve[]: 0 3468 15940 15940
[408991.171498] DMA32 free:611724kB min:14688kB low:18360kB high:22032kB reserved_highatomic:0KB active_anon:1428252kB inactive_anon:427880kB active_file:202688kB inactive_file:452276kB unevictable:0kB writepending:16kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:4492kB pagetables:12700kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[408991.171498] lowmem_reserve[]: 0 0 12472 12472
[408991.171505] Normal free:78848kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:9139232kB inactive_anon:779004kB active_file:1766972kB inactive_file:387780kB unevictable:96kB writepending:120kB present:13094400kB managed:12776556kB mlocked:96kB kernel_stack:13828kB pagetables:66924kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[408991.171505] lowmem_reserve[]: 0 0 0 0
[408991.171507] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB
[408991.171517] DMA32: 26229*4kB (UME) 18227*8kB (UME) 17805*16kB (UME) 2039*32kB (UME) 188*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 612892kB
[408991.171525] Normal: 5695*4kB (UMEH) 2019*8kB (UMEH) 2080*16kB (UMEH) 225*32kB (UMEH) 33*64kB (MEH) 3*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 82164kB
[408991.171536] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[408991.171537] 2376843 total pagecache pages
[408991.171540] 65526 pages in swap cache
[408991.171541] Swap cache stats: add 2813577, delete 2748166, find 1460048/1950362
[408991.171542] Free swap  = 5640700kB
[408991.171542] Total swap = 8388604kB
[408991.171543] 4181834 pages RAM
[408991.171544] 0 pages HighMem/MovableOnly
[408991.171544] 79482 pages reserved
[408991.171553] BUG: unable to handle page fault for address: 0000000000007980
[408991.171557] #PF: supervisor read access in kernel mode
[408991.171559] #PF: error_code(0x0000) - not-present page
[408991.171561] PGD 0 P4D 0
[408991.171564] Oops: 0000 [#1] PREEMPT SMP NOPTI
[408991.171568] CPU: 0 PID: 3377 Comm: Xorg Tainted: P           O    T 5.8.8-gentoo #1
[408991.171569] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[408991.171593] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[408991.171601] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[408991.171603] RSP: 0018:ffffb099811c3ce8 EFLAGS: 00010202
[408991.171606] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[408991.171608] RDX: ffff98aceb7e9348 RSI: 0000000000007980 RDI: ffff98ace71d1008
[408991.171610] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[408991.171611] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[408991.171613] R13: 0000000000007980 R14: ffff98ace71d1008 R15: 0000000000000001
[408991.171616] FS:  00007f9eaf52d8c0(0000) GS:ffff98ad0e800000(0000) knlGS:0000000000000000
[408991.171618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[408991.171620] CR2: 0000000000007980 CR3: 00000003ef2be000 CR4: 00000000003406f0
[408991.171622] Call Trace:
[408991.171641]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]
[408991.171655]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171660]  ? __alloc_pages_nodemask+0x11b/0x240
[408991.171674]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[408991.171678]  ? kmalloc_order+0x57/0x60
[408991.171693]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171708]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[408991.171723]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[408991.171738]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[408991.171911]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]
[408991.171915]  ? ksys_ioctl+0x82/0xc0
[408991.171918]  ? __x64_sys_ioctl+0x11/0x20
[408991.171921]  ? do_syscall_64+0x3e/0xb0
[408991.171925]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[408991.171929] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter fuse nvidia_drm(PO) nvidia_modeset(PO) hid_logitech_hidpp nvidia(PO) input_leds hid_logitech_dj r8169 realtek libphy
[408991.171941] CR2: 0000000000007980
[408991.171944] ---[ end trace 816cbc84fb70ef20 ]---
[408991.171966] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[408991.171970] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[408991.171972] RSP: 0018:ffffb099811c3ce8 EFLAGS: 00010202
[408991.171974] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[408991.171975] RDX: ffff98aceb7e9348 RSI: 0000000000007980 RDI: ffff98ace71d1008
[408991.171977] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[408991.171978] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[408991.171980] R13: 0000000000007980 R14: ffff98ace71d1008 R15: 0000000000000001
[408991.171982] FS:  00007f9eaf52d8c0(0000) GS:ffff98ad0e800000(0000) knlGS:0000000000000000
[408991.171984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[408991.171986] CR2: 0000000000007980 CR3: 00000003ef2be000 CR4: 00000000003406f0
[409016.172216] GpuWatchdog[4352]: segfault at 0 ip 000055e936015a02 sp 00007f004d0e2850 error 6 in chrome[55e93177b000+7bf3000]
[409016.172225] Code: 89 de e8 c1 8e 6f ff 80 7d c7 00 79 09 48 8b 7d b0 e8 42 e9 6b fe 41 8b 84 24 e0 00 00 00 89 45 b0 48 8d 7d b0 e8 ce df 9c fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 48 5b 41 5c 41 5d 41 5e


Can anyone make sense of this? I'll pore through it after I get the PC rebooted.
RayDude

Posted: Wed Sep 16, 2020 11:43 pm

I found a thread on the internet that implies that the first error message:

Code:
Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),


is caused by the nvidia driver crashing. If that's the case, then maybe an older driver would fix it, or perhaps the video card really is dying.
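For what it's worth, the free lists in the dump can be tallied directly. A minimal sketch, assuming 4 kB pages, so that order:5 means 32 contiguous pages, i.e. 128 kB. The helper name free_blocks_at_least is made up for illustration, and the three strings are the DMA, DMA32 and Normal free-list lines from the oops in my previous post with the timestamps stripped:

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages, as on x86_64

def free_blocks_at_least(zone_line, order):
    """Count free blocks of at least 2**order pages in one free-list line."""
    need_kb = PAGE_KB * (2 ** order)
    return sum(int(count)
               for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", zone_line)
               if int(size_kb) >= need_kb)

# Free-list lines copied from the dump (timestamps removed).
ZONES = {
    "DMA": "0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) "
           "1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) "
           "= 15888kB",
    "DMA32": "26229*4kB (UME) 18227*8kB (UME) 17805*16kB (UME) "
             "2039*32kB (UME) 188*64kB (UME) 0*128kB 0*256kB 0*512kB "
             "0*1024kB 0*2048kB 0*4096kB = 612892kB",
    "Normal": "5695*4kB (UMEH) 2019*8kB (UMEH) 2080*16kB (UMEH) "
              "225*32kB (UMEH) 33*64kB (MEH) 3*128kB (H) 1*256kB (H) "
              "0*512kB 0*1024kB 0*2048kB 0*4096kB = 82164kB",
}

if __name__ == "__main__":
    for zone, line in ZONES.items():
        # Number of free blocks big enough for an order-5 request.
        print(zone, free_blocks_at_least(line, 5))
```

If I'm reading the output right, DMA32 has no free block of 128 kB or larger, and the only ones in Normal carry the (H) flag, which I understand to mean the high-atomic reserve that an ordinary GFP_KERNEL request is not normally given. That would make the allocation failure a fragmentation symptom rather than the machine being out of memory, though it does not settle whether the driver or the card is at fault.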

This crash didn't even happen under stress. I had just woken up the display from blanking when this crash happened.

The fact that reloading the driver didn't fix the problem before makes me think this is a hardware failure...

Dangit.
All times are GMT