X desktop infrequently locks up, Nvidia related? [solved]

Longcast · Last edited by Longcast on Thu Jan 31, 2019 4:57 am; edited 2 times in total

Video Card: GTX 950
My version of nvidia-drivers: 415.18

My X desktop operates normally apart from this problem. It can happen over a wide time-frame (minutes to several hours), but when it does, the desktop freezes in place and I'm locked out of doing anything about it without ssh'ing in. The fact that I can still ssh in suggests to me that this is a video card problem, perhaps? I've tried several versions of the proprietary Nvidia drivers, but they don't seem to make a difference. What am I missing?

My dmesg:

Ant P. · Watchman Joined: 18 Apr 2009 Posts: 6920

"Corrupted page table" sounds like a sign something has gone very wrong. Any other details besides those lines?

Longcast · Posted: Tue Jan 29, 2019 3:40 am

Ant P. · Watchman Joined: 18 Apr 2009 Posts: 6920

Trying to figure out how hard it's crashed — when you ssh to the machine, does `top` show Xorg stuck in "D" state at all? Can you kill (or kill -9) it and regain control of the screen? If not, does `chvt 1` have any effect?
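If `top` is awkward to read over ssh, the same state letter can be pulled straight from `/proc`. This is only a minimal sketch: the helper names `parse_stat` and `find_processes` are made up for illustration, and it assumes the X server's process name is `Xorg` (it may be `X` on some setups).

```python
from pathlib import Path


def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def find_processes(name="Xorg", proc_root="/proc"):
    """List (pid, state) for every process whose comm equals `name`."""
    results = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a per-process directory
        try:
            pid, comm, state = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
        if comm == name:
            results.append((pid, state))
    return results


if __name__ == "__main__":
    # "Xorg" is an assumption; pass name="X" if that is what top shows.
    for pid, state in find_processes("Xorg"):
        if state == "D":
            print(pid, state, "uninterruptible sleep, signals are not acted on yet")
        else:
            print(pid, state, "not in D state")
```

A process that stays in "D" across several runs is blocked inside the kernel, which would explain why kill -9 appears to do nothing.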

Hu · Moderator Joined: 06 Mar 2007 Posts: 21624

As for "how bad is this": best case, the nVidia driver has a bug that corrupts process page tables (but only those, no other forms of kernel memory are in danger of corruption). Worst case, depending on perspective: the nVidia driver has a bug that corrupts arbitrary memory and, left alone, may corrupt something that survives a reboot. Or, you could say the worst case is that you have a hardware fault and the nVidia driver is an innocent bystander as the hardware fault causes corruption which, again, might eventually corrupt persisted data (like pages written to a filesystem). Either way, when memory corruption is involved, "worst case" can become very bad, very quickly.

Can you reproduce the fault in an untainted kernel?

Longcast · Posted: Tue Jan 29, 2019 8:51 pm

In trying to replicate the problem, I've come across another problem (similar to the first one) in dmesg that didn't crash my desktop, but I think is worth looking at.

Hu · Moderator Joined: 06 Mar 2007 Posts: 21624

If you haven't written any corrupted files, recovery should be as easy as eliminating the origin of the corruption and rebooting. Since you have problems in early boot, that suggests the nVidia driver may not be involved. Have you run a memtest on this system recently?

Longcast · Posted: Thu Jan 31, 2019 4:55 am

Hu · Moderator Joined: 06 Mar 2007 Posts: 21624

nVidia earned a dishonorable mention there for being an out-of-tree proprietary driver. Historically, such drivers have tended to be lower quality and the origin of weird bugs, so kernel problems with such drivers loaded are clearly marked as a warning to those who might try to debug the issue. The call stack looks to be unrelated to nVidia in this case though.
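For anyone who wants to see what the kernel has recorded without waiting for an oops, the taint state is exposed as a bitmask in `/proc/sys/kernel/tainted`. The following is a rough sketch for decoding it; `decode_taint` and `read_taint` are names invented here, and the table covers only a handful of the documented bits, so anything else is reported as unknown.

```python
# Partial table of kernel taint bits: bit number -> (letter, meaning).
# Only a subset is listed here; see the kernel's tainted-kernels docs
# for the full list.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    4: ("M", "machine check exception occurred"),
    5: ("B", "bad page referenced or unexpected page flags"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    9: ("W", "kernel issued a warning"),
    12: ("O", "out-of-tree module was loaded"),
    13: ("E", "unsigned module was loaded"),
}


def decode_taint(value):
    """Return a list of (letter, meaning) for each bit set in `value`."""
    flags = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            # Bits missing from the partial table are marked with "?".
            flags.append(TAINT_FLAGS.get(bit, ("?", "bit %d (not in this table)" % bit)))
        bit += 1
    return flags


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask as an integer."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    mask = read_taint()
    if mask == 0:
        print("kernel is not tainted")
    for letter, meaning in decode_taint(mask):
        print(letter, meaning)
```

A value of 0 means an untainted kernel. A mask with bits 0 and 12 set (4097) decodes to P and O, which is the marking described above for a proprietary out-of-tree module.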

ExecutorElassus · Posted: Thu Mar 07, 2019 1:45 pm

Can I chime in here, even though the thread is closed?

I have a similar setup (GTX580, CONFIG_DRM is set =y in the kernel config), and I also get these weird, random freezes from time to time. It most often happens while playing a game (EVE Online, which is rather graphics-intensive), but it just now happened when EVE wasn't running. I've tried furmark on the card and 'stress' on the CPU, and didn't run into any problems.

Is it possible that this is also my issue? Is having CONFIG_DRM set in the kernel going to cause problems with the nvidia driver?

Cheers,

EE