Gentoo Forums
X desktop infrequently locks up, Nvidia related? [solved]
Longcast
n00b

Joined: 25 Nov 2018
Posts: 27
Location: I'm in the system mainframe blockchain cloud deep-learning code-wall! Watch out!

Posted: Tue Jan 29, 2019 3:21 am    Post subject: X desktop infrequently locks up, Nvidia related? [solved]

Video Card: GTX 950
My version of nvidia-drivers: 415.18

My X desktop operates normally outside of this problem. The freeze can strike anywhere from minutes to several hours into a session, and when it does the desktop freezes in place and I'm locked out of doing anything about it without ssh'ing in. The fact that I can still ssh in suggests to me that this is a video card problem, perhaps? I've tried several versions of the proprietary Nvidia drivers, but they don't seem to make a difference. What am I missing?

My dmesg:
Code:
[ 6247.436594] X: Corrupted page table at address 7f5cbc2e1aa0
[ 6247.436597] PGD 3fa4cd067 P4D 3fa4cd067 PUD 3fa5b6067 PMD 250402cfd067
[ 6247.436602] BAD
[ 6247.436604] Bad pagetable: 001d [#1] SMP NOPTI
[ 6247.436606] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO)
[ 6247.436611] CPU: 6 PID: 3227 Comm: X Tainted: P           O    4.14.83-gentoo #33
[ 6247.436612] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P1.10 06/19/2018
[ 6247.436614] task: ffff9ae04bc5de80 task.stack: ffff9f52c24d0000
[ 6247.436616] RIP: 0033:0x7f5cbc2e1aa0
[ 6247.436617] RSP: 002b:00007ffce7a492b8 EFLAGS: 00013202
[ 6247.436618] RAX: 00007f5cbc2e1aa0 RBX: 0000557c83a963f0 RCX: 0000000000000000
[ 6247.436619] RDX: 0000557c83c298d0 RSI: 0000000000004000 RDI: 0000557c84266e50
[ 6247.436620] RBP: 0000557c84266e50 R08: 0000557c84472c90 R09: 0000000000000000
[ 6247.436621] R10: 0000000000000000 R11: 0000000000000000 R12: 0000557c83a963f0
[ 6247.436622] R13: 0000000000004000 R14: 0000557c84474cc0 R15: 000000000360007c
[ 6247.436623] FS:  00007f5cc0c878c0(0000) GS:ffff9ae05ed80000(0000) knlGS:0000000000000000
[ 6247.436624] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6247.436625] CR2: ffffbfe042cfd708 CR3: 00000003f9b42000 CR4: 00000000003406e0
[ 6247.436627] RIP: 0x7f5cbc2e1aa0 RSP: 00007ffce7a492b
[ 6247.436628] ---[ end trace 67d1eb901d68009f ]---
[ 6247.436643] X (3227) used greatest stack depth: 11616 bytes left

_________________
Body by Nautilus, Brain by Mattel.


Last edited by Longcast on Thu Jan 31, 2019 4:57 am; edited 2 times in total

Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

Posted: Tue Jan 29, 2019 3:33 am

"Corrupted page table" sounds like a sign something has gone very wrong. Any other details besides those lines?

Longcast
n00b

Joined: 25 Nov 2018
Posts: 27
Location: I'm in the system mainframe blockchain cloud deep-learning code-wall! Watch out!

Posted: Tue Jan 29, 2019 3:40 am

Ant P. wrote:
"Corrupted page table" sounds like a sign something has gone very wrong. Any other details besides those lines?

Not too many. This block in dmesg is pretty much by itself. How wrong are we talking about here?

This is the output of dmesg|grep nvidia:
Code:
[    8.381335] nvidia: loading out-of-tree module taints kernel.
[    8.381342] nvidia: module license 'NVIDIA' taints kernel.
[    8.396311] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[    8.396555] nvidia 0000:23:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    9.185530] caller _nv001095rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
[    9.910111] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  415.18  Thu Nov 15 21:35:37 CST 2018
[   10.322497] [drm] [nvidia-drm] [GPU ID 0x00002300] Loading driver
[   10.322499] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:23:00.0 on minor 0
[   62.438035] nvidia-smi (2338) used greatest stack depth: 13000 bytes left
[   62.464804] caller _nv001095rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
[ 6247.436606] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) #this is the panic, also note the gap in boot time is due to an unrelated issue about iwlwifi drivers loading


Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

Posted: Tue Jan 29, 2019 3:51 am

Trying to figure out how hard it's crashed — when you ssh to the machine, does `top` show Xorg stuck in "D" state at all? Can you kill (or kill -9) it and regain control of the screen? If not, does `chvt 1` have any effect?
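
A minimal sketch of the "D" state check, run over ssh on the frozen machine. The helper name `x_in_d_state` is made up for illustration, and it assumes the X server's command name is `X` or `Xorg` (the dmesg above shows `Comm: X`):

```shell
# List X server processes stuck in uninterruptible sleep ("D" state).
# Reads "PID STAT COMMAND" lines on stdin, as printed by:
#   ps -eo pid=,stat=,comm=
# x_in_d_state is a hypothetical helper name, not a standard tool.
x_in_d_state() {
    awk '$2 ~ /^D/ && ($3 == "X" || $3 == "Xorg") { print $1, $3 }'
}

# Usage on the affected machine:
#   ps -eo pid=,stat=,comm= | x_in_d_state
```

If this prints nothing, X is not in "D" state and `kill` (then `kill -9`) on its PID is worth trying; if that does not give the screen back, `chvt 1` as root is the next thing to test.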

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21624

Posted: Tue Jan 29, 2019 4:59 am

As for "how bad is this": best case, the nVidia driver has a bug that corrupts process page tables (but only those, no other forms of kernel memory are in danger of corruption). Worst case, depending on perspective: the nVidia driver has a bug that corrupts arbitrary memory and, left alone, may corrupt something that survives a reboot. Or, you could say the worst case is that you have a hardware fault and the nVidia driver is an innocent bystander as the hardware fault causes corruption which, again, might eventually corrupt persisted data (like pages written to a filesystem). Either way, when memory corruption is involved, "worst case" can become very bad, very quickly.

Can you reproduce the fault in an untainted kernel?

Longcast
n00b

Joined: 25 Nov 2018
Posts: 27
Location: I'm in the system mainframe blockchain cloud deep-learning code-wall! Watch out!

Posted: Tue Jan 29, 2019 8:51 pm

In trying to replicate the problem, I've come across another problem (similar to the first one) in dmesg that didn't crash my desktop, but I think is worth looking at.

Code:
[60347.173573] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[60347.173582] IP: snd_ctl_notify.part.7+0x6c/0x190
[60347.173583] PGD 0 P4D 0
[60347.173586] Oops: 0000 [#1] SMP NOPTI
[60347.173588] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO)
[60347.173594] CPU: 0 PID: 17969 Comm: kworker/0:1 Tainted: P           O    4.14.83-gentoo #33
[60347.173595] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P1.10 06/19/2018
[60347.173599] Workqueue: events process_unsol_events
[60347.173601] task: ffff8a8440935e80 task.stack: ffff97650283c000
[60347.173603] RIP: 0010:snd_ctl_notify.part.7+0x6c/0x190
[60347.173605] RSP: 0018:ffff97650283fda8 EFLAGS: 00010012
[60347.173606] RAX: ffff8a844a667360 RBX: ffff8a84494044e0 RCX: ffff8a843cc66358
[60347.173608] RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffff8a843cc66340
[60347.173609] RBP: ffff8a8449404000 R08: 0000000000000002 R09: ffff8a84495adb10
[60347.173610] R10: 0000000000000001 R11: ffff8a84495ad800 R12: ffff8a843cc66340
[60347.173612] R13: 0000000000000202 R14: 0000000000000010 R15: ffff8a843cc66300
[60347.173613] FS:  0000000000000000(0000) GS:ffff8a845ec00000(0000) knlGS:0000000000000000
[60347.173615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60347.173616] CR2: 0000000000000010 CR3: 00000003ffad8000 CR4: 00000000003406f0
[60347.173617] Call Trace:
[60347.173620]  hdmi_present_sense+0x2e5/0x760
[60347.173623]  check_presence_and_report+0x82/0xc0
[60347.173625]  process_unsol_events+0x5d/0x70
[60347.173628]  process_one_work+0x1c9/0x3d0
[60347.173630]  worker_thread+0x26/0x3c0
[60347.173632]  ? trace_event_raw_event_workqueue_execute_start+0x80/0x80
[60347.173634]  kthread+0x115/0x130
[60347.173636]  ? kthread_create_on_node+0x40/0x40
[60347.173639]  ret_from_fork+0x22/0x40
[60347.173640] Code: fb 0f 84 02 01 00 00 41 8b 47 50 85 c0 74 ec 4d 8d 67 40 4c 89 e7 e8 04 e6 28 00 49 89 c5 49 8b 47 58 49 8d 4f 58 48 39 c8 74 1e <41> 8b 16 3b 50 10 75 0e e9 ed 00 00 00 39 50 10 0f 84 e4 00 00
[60347.173661] RIP: snd_ctl_notify.part.7+0x6c/0x190 RSP: ffff97650283fda8
[60347.173662] CR2: 0000000000000010
[60347.173663] ---[ end trace a890a060cc8d54cb ]---


Quote:
the worst case is that you have a hardware fault and the nVidia driver is an innocent bystander as the hardware fault causes corruption which, again, might eventually corrupt persisted data (like pages written to a filesystem). Either way, when memory corruption is involved, "worst case" can become very bad, very quickly.

I have a feeling this might be worst-case if things are happening that don't exactly crash the X server. I've also had issues now with the kernel panicking at boot about one tenth of the time, which is worrisome. Nearly all of the parts in my PC are new (<5 months old), but my video card is a few years old. I've started backing up important data on a separate drive, though.

EDIT: In regular operation, dmesg reports corrupted page tables every so often, similar to this:
Code:
[ 2714.810616] panel-7-cpugrap: Corrupted page table at address 7f676e3ffc08
[ 2714.810621] PGD 404baf067 P4D 404baf067 PUD 4019d6067 PMD 250400aef067
[ 2714.810626] BAD
[ 2714.810630] Bad pagetable: 000d [#1] SMP NOPTI
[ 2714.810632] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO)
[ 2714.810637] CPU: 4 PID: 3401 Comm: panel-7-cpugrap Tainted: P           O    4.14.83-gentoo #33
[ 2714.810639] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P1.10 06/19/2018
[ 2714.810641] task: ffffa264cc435e80 task.stack: ffffa70bc2760000
[ 2714.810643] RIP: 0033:0x7f676c15b4a8
[ 2714.810645] RSP: 002b:00007ffd6637bd28 EFLAGS: 00010246
[ 2714.810646] RAX: 0000000000000000 RBX: 0000558b0ffa6210 RCX: 00007f676c1405f3
[ 2714.810648] RDX: 00000000000000fa RSI: 0000000000000003 RDI: 0000000000000000
[ 2714.810649] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000558b0fff9f70
[ 2714.810650] R10: 0000558b10031340 R11: 0000000000000293 R12: 0000558b100230a0
[ 2714.810652] R13: 00000000000000fa R14: 00007f676c68baa0 R15: 0000000000000003
[ 2714.810654] FS:  00007f676e3ff900(0000) GS:ffffa264ded00000(0000) knlGS:0000000000000000
[ 2714.810655] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2714.810656] CR2: ffffc764c0aefff8 CR3: 00000003fc25c000 CR4: 00000000003406e0
[ 2714.810658] RIP: 0x7f676c15b4a8 RSP: 00007ffd6637bd28
[ 2714.810660] ---[ end trace 5877417b2ee255d7 ]---
[ 2714.810664] panel-7-cpugrap: Corrupted page table at address 7f676e3ffbe0
[ 2714.810665] PGD 404baf067 P4D 404baf067 PUD 4019d6067 PMD 250400aef067
[ 2714.810667] BAD
[ 2714.810669] Bad pagetable: 0009 [#2] SMP NOPTI
[ 2714.810669] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO)
[ 2714.810672] CPU: 4 PID: 3401 Comm: panel-7-cpugrap Tainted: P      D    O    4.14.83-gentoo #33
[ 2714.810673] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P1.10 06/19/2018
[ 2714.810674] task: ffffa264cc435e80 task.stack: ffffa70bc2760000
[ 2714.810678] RIP: 0010:__get_user_8+0x21/0x2b
[ 2714.810679] RSP: 0000:ffffa70bc2763e88 EFLAGS: 00050206
[ 2714.810680] RAX: 00007f676e3ffbe7 RBX: ffffa264c1193c00 RCX: 00000000000002b0
[ 2714.810681] RDX: ffffffffffffffff RSI: ffffa264c1193c00 RDI: ffffa264cc435e80
[ 2714.810682] RBP: 00007f676e3ffbe0 R08: ffffa264ded24a20 R09: ffffa264c1b54290
[ 2714.810683] R10: 0000000000000246 R11: ffffffffb55ff80d R12: 0000000000000000
[ 2714.810684] R13: 0000000000000000 R14: ffffa264c1193c00 R15: ffffa264cc435e80
[ 2714.810685] FS:  00007f676e3ff900(0000) GS:ffffa264ded00000(0000) knlGS:0000000000000000
[ 2714.810686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2714.810687] CR2: ffffc764c0aefff8 CR3: 00000003fc25c000 CR4: 00000000003406e0
[ 2714.810688] Call Trace:
[ 2714.810692]  exit_robust_list+0x2b/0x110
[ 2714.810695]  ? __delayacct_add_tsk+0x148/0x170
[ 2714.810698]  mm_release+0xde/0x120
[ 2714.810701]  do_exit+0x141/0xb40
[ 2714.810703]  ? SyS_poll+0x6b/0x100
[ 2714.810705]  ? SyS_poll+0x6b/0x100
[ 2714.810708]  rewind_stack_do_exit+0x17/0x20
[ 2714.810709] Code: 0f 01 ca c3 66 0f 1f 44 00 00 48 83 c0 07 72 25 65 48 8b 14 25 40 4d 01 00 48 3b 82 d8 09 00 00 73 13 48 19 d2 48 21 d0 0f 01 cb <48> 8b 50 f9 31 c0 0f 01 ca c3 31 d2 48 c7 c0 f2 ff ff ff 0f 01
[ 2714.810730] RIP: __get_user_8+0x21/0x2b RSP: ffffa70bc2763e88
[ 2714.810731] ---[ end trace 5877417b2ee255d8 ]---
[ 2714.810732] Fixing recursive fault but reboot is needed!

This is all still with the tainted kernel, however. I will try without the nVidia driver loading.
Am I reasonably going to be able to recover from this without having to rebuild the system? It seems there are a lot of factors here.
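
To see how often the "Corrupted page table" reports recur and which processes they hit, something like the following could tally them from the kernel log. This is only a sketch: the function name `count_pt_corruption` is invented here, and it assumes the log lines have the `[timestamp] name: Corrupted page table at address ...` shape seen in the dumps above.

```shell
# Count "Corrupted page table" reports per process name.
# Reads kernel log lines on stdin (for example the output of dmesg).
# count_pt_corruption is a hypothetical helper name.
count_pt_corruption() {
    awk '/: Corrupted page table at address / {
             name = $0
             sub(/^\[[ 0-9.]*\] */, "", name)           # drop the "[ 1234.567890]" timestamp
             sub(/: Corrupted page table.*/, "", name)  # keep only the process name
             count[name]++
         }
         END { for (n in count) print count[n], n }'
}

# Usage:
#   dmesg | count_pt_corruption
```

If several unrelated processes show up in the tally, that would fit the "arbitrary memory corruption" scenarios described earlier better than a bug confined to X.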

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21624

Posted: Wed Jan 30, 2019 3:02 am

If you haven't written any corrupted files, recovery should be as easy as eliminating the origin of the corruption and rebooting. Since you have problems in early boot, that suggests the nVidia driver may not be involved. Have you run a memtest on this system recently?

Longcast
n00b

Joined: 25 Nov 2018
Posts: 27
Location: I'm in the system mainframe blockchain cloud deep-learning code-wall! Watch out!

Posted: Thu Jan 31, 2019 4:55 am

Hu wrote:
Have you run a memtest on this system recently?

The memtest checks out: memtest86+ says it's fine.

I browsed my kernel config and eventually got a long-running X session (~7 hours without dmesg complaining) by disabling CONFIG_DRM in the kernel (a setting recommended by the nvidia-drivers wiki page, oops. I'm not sure how the driver ran with this enabled).
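
For anyone wanting to check the same option, here is a rough sketch of how the state of CONFIG_DRM could be read from a kernel config file. The helper name `kconfig_state` is made up, and the paths in the usage comment are assumptions (`/proc/config.gz` only exists when the kernel was built with CONFIG_IKCONFIG_PROC):

```shell
# Report how a kernel option is set in a plain-text kernel config.
# $1 = option name (e.g. CONFIG_DRM), $2 = path to the config file.
# kconfig_state is a hypothetical helper name.
kconfig_state() {
    if grep -q "^$1=" "$2"; then
        grep "^$1=" "$2"                 # e.g. CONFIG_DRM=y
    elif grep -q "^# $1 is not set" "$2"; then
        echo "$1 is not set"
    else
        echo "$1 not found"
    fi
}

# Usage (paths are assumptions, adjust for your system):
#   kconfig_state CONFIG_DRM /usr/src/linux/.config
#   zcat /proc/config.gz > /tmp/running.config && kconfig_state CONFIG_DRM /tmp/running.config
```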

Around that 7 hour mark, I did get a weird report related to my iwlwifi driver, and I'm not sure whether it's related to
Hu wrote:
If you haven't written any corrupted files, recovery should be as easy as eliminating the origin of the corruption and rebooting

or not. I don't want to make the thread a blog about my issues with my system, though, so I'm marking the thread solved since my initial problem seems to be fixed.

Code:
[25378.067711] ------------[ cut here ]------------
[25378.067716] WARNING: CPU: 7 PID: 1436 at drivers/net/wireless/intel/iwlwifi/mvm/rs.c:1231 iwl_mvm_rs_tx_status+0xc8/0x1fe0
[25378.067717] Modules linked in: nvidia_modeset(PO) nvidia(PO) nvidia_drm(PO)
[25378.067721] CPU: 7 PID: 1436 Comm: irq/42-iwlwifi Tainted: P           O    4.14.83-gentoo #44
[25378.067721] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P1.10 06/19/2018
[25378.067722] task: ffff91ad49bd3600 task.stack: ffffb768c8dec000
[25378.067724] RIP: 0010:iwl_mvm_rs_tx_status+0xc8/0x1fe0
[25378.067725] RSP: 0018:ffffb768c8defc20 EFLAGS: 00010282
[25378.067726] RAX: 00000000ffffffea RBX: ffff91acc4152820 RCX: 0000000000000000
[25378.067726] RDX: ffffb768c8defc9c RSI: 0000000000000000 RDI: 0000000000000000
[25378.067727] RBP: ffffb768c8defd60 R08: ffffffff8e2e9149 R09: 00000000000000ff
[25378.067728] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[25378.067728] R13: 0000000000000000 R14: ffff91ad3b769548 R15: ffff91acc4152b70
[25378.067729] FS:  0000000000000000(0000) GS:ffff91ad5edc0000(0000) knlGS:0000000000000000
[25378.067730] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25378.067730] CR2: 00007f5ed38c4000 CR3: 00000003ff698000 CR4: 00000000003406e0
[25378.067731] Call Trace:
[25378.067736]  iwl_mvm_tx_reclaim+0x30d/0x390
[25378.067738]  iwl_mvm_rx_ba_notif+0x16a/0x230
[25378.067740]  iwl_pcie_rx_handle+0x22c/0x930
[25378.067741]  iwl_pcie_irq_handler+0x5f9/0x970
[25378.067744]  ? irq_forced_thread_fn+0x70/0x70
[25378.067745]  ? irq_thread_dtor+0x90/0x90
[25378.067746]  irq_thread_fn+0x1c/0x60
[25378.067748]  ? irq_thread_dtor+0x90/0x90
[25378.067749]  irq_thread+0x11c/0x160
[25378.067751]  ? wake_threads_waitq+0x30/0x30
[25378.067753]  kthread+0x115/0x130
[25378.067754]  ? kthread_create_on_node+0x40/0x40
[25378.067756]  ret_from_fork+0x22/0x40
[25378.067757] Code: 48 89 83 80 03 00 00 8b 83 ec 03 00 00 0f b6 75 04 89 c7 89 44 24 10 e8 57 df ff ff 85 c0 89 44 24 18 74 6c 0f 0b e9 7c ff ff ff <0f> 0b e9 75 ff ff ff 49 c7 c0 a8 8a 6f 8e 49 8b 3e 48 c7 c1 90
[25378.067775] ---[ end trace 1765f7a1b5cea7f9 ]---

This did break my wifi driver and I couldn't connect to the internet unless I rebooted. It might be a bad assumption that this is somehow related, but "nvidia" is mentioned(?) This hasn't happened before, either, so I'm not sure how to replicate this.

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21624

Posted: Fri Feb 01, 2019 3:02 am

nVidia earned a dishonorable mention there for being an out-of-tree proprietary driver. Historically, such drivers have tended to be lower quality and the origin of weird bugs, so kernel problems with such drivers loaded are clearly marked as a warning to those who might try to debug the issue. The call stack looks to be unrelated to nVidia in this case though.
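
The "clearly marked" part refers to the taint letters in the traces above (`Tainted: P O`, and `P D O` after the first oops). As a rough illustration, the same information is available as a bitmask in `/proc/sys/kernel/tainted`; the sketch below decodes only the three bits that appear in this thread, and the function name `decode_taint` is made up:

```shell
# Decode the taint bits matching the letters seen in this thread's traces.
# Only P (bit 0), D (bit 7) and O (bit 12) are handled; other bits are ignored.
# decode_taint is a hypothetical helper name.
decode_taint() {
    t=$1
    if [ $((t & 1)) -ne 0 ]; then echo "P: proprietary module loaded"; fi
    if [ $((t & 128)) -ne 0 ]; then echo "D: kernel has oopsed or hit a BUG"; fi
    if [ $((t & 4096)) -ne 0 ]; then echo "O: out-of-tree module loaded"; fi
    if [ "$t" -eq 0 ]; then echo "not tainted"; fi
}

# Usage:
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
```

With only the nVidia modules loaded and no oops yet, the value would be expected to be 4097 (P plus O).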

ExecutorElassus
Veteran

Joined: 11 Mar 2004
Posts: 1435
Location: Berlin, Germany

Posted: Thu Mar 07, 2019 1:45 pm

Can I chime in here, even though the thread is closed?

I have a similar setup (GTX580, CONFIG_DRM is set =y in the kernel config), and I also get these weird, random freezes from time to time. It most often happens while playing a game (EVE Online, which is rather graphics-intensive), but just now it happened when EVE wasn't running. I've tried furmark on the card and 'stress' on the CPU, and didn't run into any problems.

Is it possible that this is also my issue? Is having CONFIG_DRM set in the kernel going to cause problems with the nvidia driver?

Cheers,

EE

All times are GMT