[Solved]nvidia-drivers freezing system with GTX 650

RichardGv · Last edited by RichardGv on Thu Aug 07, 2014 2:22 pm; edited 3 times in total

Update: After I moved my memory sticks to two other slots, the problem hasn't appeared in two weeks. Presumably solved. Thanks to ville.aakko for the suggestion!

Environment:
Gentoo ~amd64, pf-sources-3.15_p2 (Unsupported kernel, but is this problem actually related to it?)
GTX 650
gcc-4.8.3[hardened]
Update: The problem occurs with vanilla-sources-3.15.6 compiled with vanilla GCC_SPECS as well.

Problem:
After starting X, the system often freezes, so frequently that it makes X unusable. Usually, all of a sudden, everything displayed on the X screen gets stuck; then the content on the screen is sometimes updated a few times, very slowly, and then it gets entirely stuck. The cursor sometimes still works and sometimes doesn't. Alt-SysRq keys sometimes work and sometimes don't. Ctrl-Alt-F{1..6} almost never works.
The card is a GTX 650. I'm using nvidia-drivers-340.24 primarily, but downgrading to 337.25 doesn't help. When the kernel got stuck, I would almost always find several lines related to nvidia-drivers in kern.log:

krinn · Watchman Joined: 02 May 2003 Posts: 7470

RichardGv

ville.aakko · Posted: Mon Jul 21, 2014 6:58 am Post subject:

Hi,

I had a very similar issue (don't remember if I had the syslog messages, but the symptoms were exactly the same). I also had a lot of black windows if I enabled compositing.

EDIT: I have a GTX 660 Ti PE

The funny thing is, I'm not sure what fixed it (I don't have the problem anymore). I thought it might have been some poorly inserted RAM (but I could compile away / do anything as long as I didn't use the graphics card); OTOH, the problems started after an upgrade, and the previous upgrade had been a long while before that.

I had some packages in @preserved-rebuild that kept re-compiling / re-listing, and did a revdep-rebuild (which did find something unrelated, but still rebuilt something); IIRC I upgraded the kernel and rebuilt some system packages (glibc or similar), and the problem went away! I was quite frustrated and did several things at the same time. I know, not the right way of fixing things - I used a bash-root-hammer

My guess is that some library had a bug / incompatibility with the nvidia-drivers, or some (system) library was compiled against a different version of some other library and portage did not notice it for some reason. Try running revdep-rebuild.

Cheers!
_________________
- Ville

Randy Andy

Hi Folks,

I have had similar trouble too, but only with my better Nvidia cards, so I came to the following conclusion: the better/more performant the Nvidia hardware is, the worse the nvidia-driver is.
I never had this trouble with my low-cost Nvidia consumer cards before, only with my Quadro FX 4800, which has a Tesla chipset (not Kepler like yours).

It works relatively well with the nouveau driver, but I missed some important features, and that was the reason I searched for a long time for a working proprietary driver.

The only nvidia-driver that works well for this card is the so-called legacy series, which is currently version ~304.123 (supports xorg-server 1.16 now) or the stable one, +304.121, up to xorg-server 1.15.

So try one of these versions to get rid of your problems, hopefully. :wink:

Much success, Andy.
_________________
If you want to see a Distro done right, compile it yourself!

pa1983 · Tux's lil' helper Joined: 09 Jan 2004 Posts: 101

Randy Andy.

Tesla series GPUs are no longer supported. Nvidia dropped support not long ago. So legacy drivers are the only way in your case.
_________________
NAS: i3 4360 3.7Ghz, 20Gb ram, 256Gb SSD, 65Tb HDD, NIC: Intel 2x1Gbit, Realtek 2.5Gbit
ROUTER: J1900 2Ghz, 8Gb ram, 128Gb SSD, NIC: 2x1Gbit, WIFI: Atheros AR9462 and AR5005G

Randy Andy · Posted: Mon Jul 21, 2014 12:20 pm Post subject:

RichardGv · Posted: Wed Jul 23, 2014 7:28 am Post subject:

Summary of the new methods I've tried and their outcomes:

Compile pf-sources-3.15_p4 with a new configuration modified from Arch Linux .config. Still freezes.
Downgrade to nvidia-drivers-304.123. Still freezes. Log is provided below.
Move my 2 memory sticks to other slots. I'm still testing. No freeze so far.

By the way, the GPU temperature is moderately low.

@ville.aakko:

programmist11180 · n00b Joined: 05 Aug 2014 Posts: 2

Hello, comrades.
I have similar problem (on Debian, not Gentoo).

krinn · Watchman Joined: 02 May 2003 Posts: 7470

People report that Xid errors are hardware errors, many just from heat but some caused by a bad hardware part.
At least try this little script, it will do wonders for your debugging: https://code.google.com/p/nvidia-fanspeed/
(you'll get the temperature and can set the fan throttle based on it, so if the system freezes you will see whether it froze at a certain temperature...)
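If you only want the temperature record and not the fan control, a minimal sketch in Python could look like the one below. It is not the nvidia-fanspeed script linked above. It assumes nvidia-smi is installed and that your driver version answers the temperature.gpu query; the log file name gpu-temp.log and the function names are made up for illustration. Each reading is flushed and synced to disk so the last value before a freeze has a chance of surviving.

```python
import os
import subprocess
import time


def parse_temperature(raw: str) -> int:
    """Parse nvidia-smi CSV output (one number per GPU) and return the first GPU's value."""
    return int(raw.strip().splitlines()[0])


def read_gpu_temperature() -> int:
    # Assumption: nvidia-smi is on PATH and the driver supports this query field.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_temperature(out)


def format_entry(timestamp: float, temp_c: int) -> str:
    """Build one log line: local time followed by the temperature in Celsius."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return f"{stamp} {temp_c}\n"


def log_forever(path: str = "gpu-temp.log", interval: float = 2.0) -> None:
    """Append a reading every `interval` seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            log.write(format_entry(time.time(), read_gpu_temperature()))
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible freeze
            time.sleep(interval)
```

Call log_forever() in a terminal before starting X, and after a freeze and reboot look at the last lines of the log file to see how hot the card was.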

programmist11180 · n00b Joined: 05 Aug 2014 Posts: 2

Xid errors documentation http://docs.nvidia.com/deploy/xid-errors/index.html
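To look up your errors in that document you first need the Xid numbers from the kernel log. Here is a rough Python sketch that counts them. The line shape in the regular expression ("NVRM: Xid (<pci address>): <number>, ...") is an assumption based on how the driver usually prints these messages, and /var/log/kern.log is just the default path on my kind of setup; adjust both if yours differ.

```python
import re
from collections import Counter

# Assumed message shape: "... NVRM: Xid (<pci address>): <number>, <details>"
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def count_xids(lines):
    """Return a Counter mapping each Xid number to how often it appears."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


def count_xids_in_file(path="/var/log/kern.log"):
    """Count Xid messages in a log file (the path is an assumption)."""
    with open(path, errors="replace") as log:
        return count_xids(log)
```

count_xids_in_file().most_common() then lists the Xid numbers to look up, most frequent first.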

shazeal · Posted: Wed Aug 06, 2014 7:33 pm Post subject:

RichardGv · Posted: Thu Aug 07, 2014 1:21 pm Post subject:

Thanks for the new suggestions! The good thing is I have not been able to reproduce the issue since July 23rd, with either pf-sources-3.15_p4 or the new gentoo-sources-3.16.0, so it should be pretty safe to say the problem is solved for me -- at least right now. (I moved to gentoo-sources after I found uksm causing kernel freezes and some random kernel errors.) I'm not completely sure if it's related to my moving of the memory sticks, though, since the problem mysteriously disappeared once before as well. Thanks again for the advice from ville.aakko!

F1r31c3r · Posted: Fri Dec 12, 2014 6:36 am Post subject: This is a wierd issue

This has been happening on and off for the past few months after an update came in.

I can not trace down exactly the culprit. Someone changed something to cause the issue.
When I get a chance I am going to try rolling back the kernel and see what happens. It does not do this all the time, so it is not reliably reproducible from what I can see, but it sure as hell does happen at totally unexpected times.

As is usually the case, something got a bug fix and most likely the nvidia drivers did not get updated to the bugfix. Finding it is not easy and while nvidia drivers are closed source it makes it even harder.

Puked out messages for interested parties...

RichardGv · Posted: Mon Dec 15, 2014 12:50 pm Post subject: Re: This is a wierd issue

I have never spotted the problem again since July 23, 2014. It just disappeared after I moved the memory sticks -- or maybe it's the weather or something else. I still have no idea what was causing the issue. I'm upgrading the kernel and the drivers normally.

F1r31c3r · Posted: Mon Dec 15, 2014 1:13 pm Post subject: Re: This is a wierd issue

F1r31c3r · Posted: Tue Dec 16, 2014 9:52 pm Post subject: Kernel Voluntary preempt

So I changed the preempt model from low latency desktop (forced preempt) to desktop (voluntary preempt) and it would seem that the errors have gone away for now.

Considering the error message said 'atomic or interrupt context' this would make some sort of sense at least.
Usually with graphics card binary drivers they never install or work with anything less than low latency forced preempt. For those that don't know, the preempt is the way the kernel deals with scheduled processes.
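For anyone who wants to check which preemption model their own kernel was built with, here is a small Python sketch. It assumes a kernel of this era, where exactly one of CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_VOLUNTARY or CONFIG_PREEMPT is set to y, and it assumes /proc/config.gz exists (that needs CONFIG_IKCONFIG_PROC); otherwise feed it the text of /usr/src/linux/.config. The function names and the description strings are mine, not anything official.

```python
import gzip

# Kconfig symbols for the three preemption models, least to most aggressive.
PREEMPT_OPTIONS = {
    "CONFIG_PREEMPT_NONE": "no forced preemption (server)",
    "CONFIG_PREEMPT_VOLUNTARY": "voluntary preemption (desktop)",
    "CONFIG_PREEMPT": "preemptible kernel (low-latency desktop)",
}


def preempt_model(config_text):
    """Return a description of the preemption model found in kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options look like "# CONFIG_FOO is not set" and are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in PREEMPT_OPTIONS and value == "y":
            enabled.add(name)
    for name, description in PREEMPT_OPTIONS.items():
        if name in enabled:
            return description
    return "unknown"


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC is enabled)."""
    with gzip.open(path, "rt") as config:
        return config.read()
```

print(preempt_model(read_running_config())) shows the model of the running kernel, which is handy for confirming that the change actually made it into the kernel you booted.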

we shall see in the near future how and if it is any better.

UPDATE:

The yield CPU errors crash kwin, so I turned off 'Suspend 3D effects when apps in full screen' and dropped OpenGL 3.1 down to OpenGL 2.0 to test further; it seems to be more stable. My idea was that when exiting a graphics-demanding app, kwin tries to re-enable the 3D effects and causes problems. With 'suspending 3D effects for full screen applications' disabled, kwin should no longer try to yield the CPU at that specific time.
Well that is the theory anyway. At least voluntary preempt helped in recovering from this error rather than locking everything up and sending the whole screen corrupted.

If it happens again I shall compile kwin with debug and run it to try to get more output and see what is going on.

Anyone got any other feedback feel free to post it... :lol:

_________________
A WikI, A collection of mass misinformation based on opinion and manipulation by a deception of freedom.
If we know the truth, then we should be free from deception (John 8:42-47 )

gentoorockerfr · Apprentice Joined: 25 May 2012 Posts: 203

Same problem here, but only with the 3.19-pf kernel!
gentoo64, nvidia GTX 650
I will try changing the RAM positions. All 4 of my slots are populated.

F1r31c3r · Posted: Fri Mar 06, 2015 10:52 am Post subject: