GeForce GTX 980M: Power level 0 crashes the system

segmentation-fault · Tux's lil' helper Joined: 11 Oct 2016 Posts: 99

A WARNING from my own painful experience with a GeForce GTX 980M: any PowerMizer mode other than "Performance" drives the power consumption down - unfortunately DOWN TOO MUCH! I experienced complete system crashes (blank screen with a non-blinking underscore cursor in the upper left corner, nothing responds, only a reboot helps) with both the "Adaptive" and "Auto" settings. The crashes came at erratic times, whether I was typing or doing nothing at all, with kernels 4.19.x and 5.4.y, with nvidia-drivers 440.x and 460.y, and with all other kinds of power management disabled (USB autosuspend, NVidia "Dynamic" Power Management...). What gives me a rock-stable system is "Performance" - at the cost of 25W more power consumption.
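For anyone who wants to apply the "Performance" workaround without clicking through the GUI, here is a minimal sketch that builds the matching nvidia-settings call. It assumes the GpuPowerMizerMode attribute uses 0 for "Adaptive", 1 for "Prefer Maximum Performance" and 2 for "Auto"; please check these values against your driver version. The names POWERMIZER_MODES, powermizer_command and main are made up for illustration.

```python
import subprocess

# Assumed mapping of PowerMizer modes to GpuPowerMizerMode values.
# Verify against your own driver version before relying on it.
POWERMIZER_MODES = {"adaptive": 0, "performance": 1, "auto": 2}


def powermizer_command(mode, gpu=0):
    """Build the nvidia-settings argument list that assigns a PowerMizer mode."""
    value = POWERMIZER_MODES[mode]
    return ["nvidia-settings", "-a", f"[gpu:{gpu}]/GpuPowerMizerMode={value}"]


def main():
    # Needs a running X session, because nvidia-settings talks to the X driver.
    subprocess.run(powermizer_command("performance"), check=True)
```

Calling main() from a session startup script is one way to use it. As far as I know the setting is not remembered across X restarts, so it has to be reapplied each time.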

FWIW: I do have KMS (Kernel-Mode-Setting) and

Hu · Moderator Joined: 06 Mar 2007 Posts: 21619

Is this reproducible with an untainted kernel? Does the system actually crash, or is this just a display problem? That is, does the system remain accessible over the network? Does the kernel write any useful logs before it dies? If not, can you get anything out via a serial console or netconsole?
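To check whether the running kernel is tainted, and why, you can decode the bitmask in /proc/sys/kernel/tainted. A minimal sketch follows; it names only a handful of taint bits from the kernel documentation and reports the rest by number. The helper names decode_taint and main are hypothetical. Note that loading the proprietary nvidia module always sets the "proprietary module" and "out-of-tree module" bits, so an untainted test means running without it.

```python
# A few taint bits as documented for the Linux kernel; other bits are
# reported by number only.
TAINT_FLAGS = {
    0: "proprietary module loaded (P)",
    1: "module force-loaded (F)",
    7: "kernel oops or BUG occurred (D)",
    9: "kernel warning issued (W)",
    12: "out-of-tree module loaded (O)",
    13: "unsigned module loaded (E)",
}


def decode_taint(value):
    """Return a list of reasons for every bit set in the taint bitmask."""
    reasons = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            reasons.append(TAINT_FLAGS.get(bit, f"taint bit {bit}"))
        bit += 1
    return reasons


def main():
    with open("/proc/sys/kernel/tainted") as handle:
        value = int(handle.read().strip())
    print(value, decode_taint(value) or "not tainted")
```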

segmentation-fault · Tux's lil' helper Joined: 11 Oct 2016 Posts: 99

Hu · Moderator Joined: 06 Mar 2007 Posts: 21619

segmentation-fault · Tux's lil' helper Joined: 11 Oct 2016 Posts: 99

My primary goal was to warn others about the exact constellation where this happens: NVidia GeForce GTX card, ASUS laptop, nvidia-drivers 440 and 460, kernels 4.19 and 5.4, and a "power level 0" that is automatically the level the card is driven to by both "Auto" and "Adaptive" modes in "PowerMizer" settings of nvidia-settings.

The problem is, we actually don't know how exactly this "power level 0" is achieved by PowerMizer. Let's say you choose "Adaptive". This is supposed to adapt the power level according to the current needs. If you choose "Auto", it does the same "automatically". Both will drive power level to 0 (you can see the power level falling "live" from 3 to 2 to 1 to 0 in PowerMizer), as soon as the card has nothing to do. However, that's what PowerMizer says. We don't know what the card's true power level is - we cannot debug the driver.
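One way to at least record what the driver reports right before a crash is to poll the performance level and sync every sample to disk. A rough sketch, assuming the nvidia-settings attribute GPUCurrentPerfLevel is the same number PowerMizer displays and that the `-t` (terse) flag prints just the value. The log file name and all function names are invented for this example.

```python
import os
import subprocess
import time

# Assumed query for the level shown in the PowerMizer page.
QUERY = ["nvidia-settings", "-t", "-q", "[gpu:0]/GPUCurrentPerfLevel"]


def parse_level(output):
    """Return the performance level as an int, or None if it is not a number."""
    text = output.strip()
    return int(text) if text.isdigit() else None


def format_record(timestamp, level):
    """One log line: Unix time in whole seconds plus the reported level."""
    return f"{timestamp:.0f} level={level}\n"


def append_synced(path, line):
    """Append a line and force it to disk so it survives a hard reset."""
    with open(path, "a") as out:
        out.write(line)
        out.flush()
        os.fsync(out.fileno())


def main(path="perflevel.log", interval=1.0):
    while True:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
        append_synced(path, format_record(time.time(), parse_level(result.stdout)))
        time.sleep(interval)
```

After a crash, the last lines of the log show which level the driver claimed to be in. This still only records what the driver says, not what the card really does.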

We also don't know if the problem is power level 0 itself, or that the driver fails to drive the power consumption fast enough to a higher level, as soon as some part of the system needs some more graphics functionality from the card. So maybe it's the driver that cannot respond to needed "power spikes" due to processes that suddenly kick in. It's also possible that it's some OpenGL problem, with the NVidia OpenGL driver...

The type of crash is also very characteristic of a (NVidia?) graphics driver failure: blank (dark) screen with a sole NON-blinking underscore in the upper left corner. This is actually what you see when you start X and the nvidia graphics driver initializes itself: just before the screen becomes totally blank and the mouse cursor appears, you see exactly this non-blinking underscore in the upper left. Seeing it also after the crash indicates to me that the graphics driver crashed.

Now this would not be *that* bad (I would at least have the consoles, I could restart X and possibly read some kernel messages in the logs - and, above all, I wouldn't have to reboot and recheck all my disks), but the system crashes hard, in the sense that only a hard reset and reboot will help. And this indicates that the crashed graphics driver took the kernel with it through Kernel Mode Setting. Because, if the console graphics mode is now also controlled through an nvidia module, then it's clear that the console will also be unresponsive in case nvidia crashes.
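For the netconsole route mentioned above, the receiving side can be as small as a UDP listener on a second machine, so that kernel messages which never reach the local disk are still captured. A minimal sketch, assuming the crashing laptop sends to UDP port 6666 (netconsole's default target port, as far as I know). The function names and the log file name are made up for illustration.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on which netconsole datagrams can be received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def log_datagrams(sock, path, limit=None):
    """Append each received datagram to path, unbuffered; return the count.

    With limit=None the loop runs until the process is interrupted.
    """
    received = 0
    with open(path, "ab", buffering=0) as out:
        while limit is None or received < limit:
            data, _sender = sock.recvfrom(65535)
            out.write(data)
            received += 1
    return received


def main():
    log_datagrams(open_listener(), "netconsole.log")
```

On the laptop, the netconsole module then has to be loaded with the address of the listening machine as its target. The exact parameter string depends on your network setup.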

On the other hand, I am not willing to revert to VESA modes for the console just because PowerMizer cannot drive the card in and out of power level 0 correctly - NVidia should fix its drivers, it's as simple as that. But I have studied the changelog of nvidia-drivers and my impression is that the software controls NVidia, not NVidia its software. (BTW NVidia is incapable of putting /usr/share/doc/nvidia-drivers-460.91.03-r1/NVIDIA_Changelog.bz2 into some web location, so you have to install the driver in order to read its changelog. That defeats the purpose of a changelog in case you desperately need to see if some version X solved some bug that had to do with your specific problem... :roll: Very frustrating, I will rather not post my notes on that! :evil:)