Gentoo Forums
GeForce GTX 980M: Power level 0 crashes the system
segmentation-fault
Tux's lil' helper


Joined: 11 Oct 2016
Posts: 99

Posted: Fri Jan 28, 2022 12:31 am    Post subject: GeForce GTX 980M: Power level 0 crashes the system

:arrow: A WARNING from my own painful experience with a GeForce GTX 980M: any PowerMizer mode other than "Performance" drives the power consumption down - unfortunately TOO FAR down! I experienced complete system crashes (blank screen with a non-blinking underscore cursor in the upper left corner, nothing responds, only a reboot helps) with both the "Adaptive" and "Auto" settings. The crashes came at erratic times, whether I was typing or doing nothing at all, with kernels 4.19.x and 5.4.y, with nvidia-drivers 440.x and 460.y, and with all other kinds of power management disabled (USB autosuspend, NVidia "Dynamic" Power Management...). What gives me a rock-stable system is "Performance" - at the cost of 25W more power consumption.

FWIW: I do have KMS (Kernel-Mode-Setting) and
Code:
nvidia_drm.modeset=1
as one of my boot kernel parameters.

Since then I have kept the following in my /etc/crontab file (I use vixie-cron, so check the syntax for your cron):

Code:

# Setting NVidia's Power Management "Power Mizer Mode" to "Prefer Maximum Performance".
# THIS IS *ABSOLUTELY NECESSARY*, AS ANY OTHER MODE *WILL* EVENTUALLY *CRASH* THE SYSTEM!
# NOTE: nvidia-settings needs DISPLAY to run properly.
# But DISPLAY is not available inside cron, unless we pass it as shown here.
# NOTE: nvidia-settings not only needs DISPLAY, it also *needs* XAUTHORITY too!
# But XAUTHORITY also is not available inside cron, unless we pass it as shown here.
*/10  *  * * *   your-user-name   DISPLAY=$(w -f $(id -un) | awk 'NF > 7 && $2 ~ /pts\/[0-9]+/ {print $3; exit}') XAUTHORITY=/home/$(id -un)/.Xauthority /usr/bin/nvidia-settings -a "GPUPowerMizerMode=1" >/dev/null 2>&1


This will set it every 10 minutes - this may seem overkill to you, but better safe than sorry.

This is "poor man's settings persistence" - I will not start yet another ("persistence") server just for this.
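
For readers who would rather call a small script from cron than keep everything on one crontab line, here is a minimal sketch of the same idea. It is not the poster's method, only an illustration: the function name ensure_max_performance, the NVIDIA_SETTINGS variable and the [gpu:0] target are assumptions, and DISPLAY and XAUTHORITY are expected to be exported by the caller exactly as in the crontab line above.

```shell
#!/bin/sh
# Sketch only: re-apply "Prefer Maximum Performance" when it is not already set.
# ASSUMPTIONS: the card is addressable as [gpu:0]; DISPLAY and XAUTHORITY are
# already exported by the caller. NVIDIA_SETTINGS and ensure_max_performance
# are illustrative names, not part of any NVidia tool.
NVIDIA_SETTINGS=${NVIDIA_SETTINGS:-/usr/bin/nvidia-settings}

ensure_max_performance() {
    # -q queries the attribute, -t prints only the bare value.
    current=$("$NVIDIA_SETTINGS" -t -q "[gpu:0]/GPUPowerMizerMode" 2>/dev/null) || return 2
    if [ "$current" = "1" ]; then
        echo "unchanged"
        return 0
    fi
    # -a assigns the attribute; 1 is the value used in the crontab line above.
    "$NVIDIA_SETTINGS" -a "[gpu:0]/GPUPowerMizerMode=1" >/dev/null 2>&1 || return 3
    echo "changed"
}
```

Calling ensure_max_performance from the cron job prints "changed" only when the mode had drifted away from 1, which gives a rough idea of how often something resets it.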

P.S. This is all about desktop operation. "Performance" is all-important for games, but I am not a gamer.
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21619

Posted: Fri Jan 28, 2022 2:28 am

Is this reproducible with an untainted kernel? Does the system actually crash, or is this just a display problem? That is, does the system remain accessible over the network? Does the kernel write any useful logs before it dies? If not, can you get anything out via a serial console or netconsole?
segmentation-fault
Tux's lil' helper


Joined: 11 Oct 2016
Posts: 99

Posted: Fri Jan 28, 2022 9:07 am

Hu wrote:

Is this reproducible with an untainted kernel?


I didn't try any other driver than the closed-source nvidia-drivers 440.x and 460.y, for various x and y, if that is what you mean by "untainted kernel". It wouldn't make much sense either, because "level 0" is a setting of "PowerMizer" in nvidia-settings. And it is exactly this "level 0" that I am referring to.

Hu wrote:

Does the system actually crash, or is this just a display problem?


It crashes hard. As I wrote, neither keyboard, nor mouse work anymore.


Hu wrote:
That is, does the system remain accessible over the network?


No. I tried that many times.

Hu wrote:
Does the kernel write any useful logs before it dies?


No. I have various .log files in /var/log and I checked them all. There is absolutely nothing crash-related there. It just stops. I even have some cron jobs that run every minute. In the cron log I could see that the system had executed the job in one minute and then, from one minute to the next, there were no more cron messages. Very frustrating! Of course I also looked at Xorg.log and, if a virtual machine was running, at its log as well. I even disabled PCI Express Link State Power Management in the Windows guest, just to make sure it was not interfering.


Hu wrote:
If not, can you get anything out via a serial console or netconsole?


I didn't try this and, since I have had more than 300 days of uptime with the "Performance" setting, I am reluctant to use anything else now. I spent 2 months in 2020 and 3(!) months in 2021 trying to hunt this down, during which time I was unable to do long-term work because of crashes happening every 2-8 days. But, given that everything (including cron, consoles and ssh over the network) died this way, why should a serial console survive?

The only thing that was alive sometimes, was ping. But by far not always.

No, forget all this. It's probably a combination of a nvidia-drivers/nvidia-settings bug, kernel modesetting with nvidia, the specific card and the hardware it runs on (an ASUS G752VY laptop).

It could of course, theoretically, be some problem with the card memory (I already checked RAM with memtest86). I did a rudimentary test with memtestCL, but I would have to go to some lengths to test thoroughly and push it to the limit, so I just stopped there.

However, I am not alone. I have seen people saying that setting the "Performance" mode in PowerMizer (nvidia-settings) eliminated sudden crashes for them - again, not in games, but during desktop work - but I can't find the link right now.
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21619

Posted: Fri Jan 28, 2022 4:36 pm

segmentation-fault wrote:
Hu wrote:
Is this reproducible with an untainted kernel?
I didn't try any other driver than the closed-source nvidia-drivers 440.x and 460.y, for various x and y, if that is what you mean by "untainted kernel".

Loading a proprietary module taints the kernel (specifically, it sets TAINT_PROPRIETARY_MODULE). The upstream kernel developers generally respond to problem reports with a question like mine, specifically because proprietary modules have a long history of causing serious problems that mysteriously go away when the proprietary module is not used.
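
For anyone who wants to check this on their own machine: the kernel exposes its taint state as a bitmask in /proc/sys/kernel/tainted, and bit 0 (value 1, letter "P") is the proprietary-module flag. The following is a minimal sketch for decoding that one bit; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: decode the proprietary-module bit of the kernel taint mask.
# has_proprietary_taint and report_taint are illustrative names.

# Succeeds when bit 0 (value 1, taint letter "P") is set in the given mask.
has_proprietary_taint() {
    [ $(( $1 & 1 )) -ne 0 ]
}

# Reads the running kernel's taint mask and prints a one-line summary.
report_taint() {
    value=$(cat /proc/sys/kernel/tainted) || return 1
    if has_proprietary_taint "$value"; then
        echo "tainted=$value: proprietary module flag (P) is set"
    else
        echo "tainted=$value: proprietary module flag (P) is not set"
    fi
}
```

Running report_taint with the nvidia module loaded should report the flag as set. Other bits in the mask (out-of-tree module, unsigned module and so on) are ignored by this sketch.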

segmentation-fault wrote:
It wouldn't make much sense either, because "level 0" is a setting of "PowerMizer" in nvidia-settings. And it is exactly this "level 0" that I am referring to.

Wouldn't level 0 be a property of the hardware, which nvidia-settings merely lets you request? If so, then in theory an open driver should also be able to request the hardware go to that same level 0.

segmentation-fault wrote:
Hu wrote:
Does the system actually crash, or is this just a display problem?
It crashes hard. As I wrote, neither keyboard, nor mouse work anymore.

You did not originally write that neither keyboard nor mouse work anymore. In fact, neither "key" nor "mouse" appears anywhere in this thread before the response to which I am replying.

segmentation-fault wrote:
Hu wrote:
That is, does the system remain accessible over the network?
No. I tried that many times.

This is useful to know. It suggests the system's problem is deeper than just video, which is odd since the problem is triggered by a directive from the module meant to operate the video card.

segmentation-fault wrote:
Hu wrote:
Does the kernel write any useful logs before it dies?
No. <snip>

That is disappointing, but not surprising.

segmentation-fault wrote:
Hu wrote:
If not, can you get anything out via a serial console or netconsole?
But, given that everything (including cron, consoles and ssh over the network) died this way, why should a serial console survive?

The serial console will not survive, but it is a very simple device, and it has the possibility of sending the kernel's last messages out over serial to a host that survives, and that host can record them for you to see later. All the other mechanisms you have tried are vastly more complex and require the dying system to maintain a much higher level of functionality in order to save the last messages before death. Of course, this assumes there are such messages and that they would be useful. Neither of those is guaranteed.
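
For readers without a serial port, netconsole follows the same principle: the kernel sends its messages as UDP packets to another machine. A minimal sketch of building the parameter string follows, assuming the format from the kernel's netconsole documentation (source-port@source-ip/device,target-port@target-ip/target-mac). The helper name, the addresses and the interface name are placeholders, and the ports are the documented defaults.

```shell
#!/bin/sh
# Sketch only: build a netconsole= parameter string.
# netconsole_param is an illustrative name; all addresses are placeholders.
# Usage: netconsole_param SRC_IP DEVICE TARGET_IP [TARGET_MAC]
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=${4:-}   # optional; left empty when not given
    # 6665 and 6666 are the default source and target ports.
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}
```

The resulting string can be appended to the kernel command line or passed as a module parameter when loading netconsole, while the surviving host listens for UDP on port 6666 (the exact netcat flags differ between netcat variants). netconsole generally needs a network driver with netpoll support, which in practice usually means wired Ethernet.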

segmentation-fault wrote:
The only thing that was alive sometimes, was ping. But by far not always.

That is interesting. If ping works, and sshd does not, then the kernel is alive (albeit only barely), but not running user-space programs (like sshd). This could indicate that some highly critical lock is wedged, so the scheduler cannot move a user process onto the CPU, but the network stack in the kernel can still respond to simple events like ICMP echo request. Unfortunately, that still leaves a fair amount of area for problems to hide, but it does say that the fault did not completely halt the CPU or system bus. That it is intermittent could point to the presence of multiple problems, or it could mean that whatever gets wedged will reliably break user-space, but only intermittently wedge core kernel locks.

segmentation-fault wrote:
No, forget all this. It's probably a combination of a nvidia-drivers/nvidia-settings bug, kernel modesetting with nvidia, the specific card and the hardware it runs on (an ASUS G752VY laptop).

Yes. That is why I wanted to rule out the nVidia module, by seeing the problem reproduced on an open system.

segmentation-fault wrote:
It could of course, theoretically, be some problem with the card memory (I already checked RAM with memtest86).

Maybe, but memory faults usually manifest as data corruption, not as automatic whole-system crashes.

segmentation-fault wrote:
However, I am not alone. I have seen people saying that setting the "Performance" mode in PowerMizer (nvidia-settings) eliminated sudden crashes for them - again, not in games, but during desktop work - but I can't find the link right now.

That would further support that this is a bug, not a defect in the card. I don't use anything like this, but if you find the link, please do post it for the benefit of other readers who may find this thread.
segmentation-fault
Tux's lil' helper


Joined: 11 Oct 2016
Posts: 99

Posted: Fri Jan 28, 2022 8:02 pm

My primary goal was to warn others about the exact constellation where this happens: NVidia GeForce GTX card, ASUS laptop, nvidia-drivers 440 and 460, kernels 4.19 and 5.4, and a "power level 0" that is automatically the level the card is driven to by both "Auto" and "Adaptive" modes in "PowerMizer" settings of nvidia-settings.

The problem is, we actually don't know how exactly this "power level 0" is achieved by PowerMizer. Let's say you choose "Adaptive". This is supposed to adapt the power level according to the current needs. If you choose "Auto", it does the same "automatically". Both will drive power level to 0 (you can see the power level falling "live" from 3 to 2 to 1 to 0 in PowerMizer), as soon as the card has nothing to do. However, that's what PowerMizer says. We don't know what the card's true power level is - we cannot debug the driver.
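
One way to at least record what PowerMizer reports, so that the last reported level before a crash survives on disk, is to log it periodically and flush after every line. This is a minimal sketch under stated assumptions: the card is addressable as [gpu:0], DISPLAY and XAUTHORITY are set in the environment, and log_perf_level and NVIDIA_SETTINGS are illustrative names. It only captures the level the driver reports, which, as noted above, is not necessarily the card's true state.

```shell
#!/bin/sh
# Sketch only: append the reported performance level to a log file.
# ASSUMPTIONS: GPU is [gpu:0]; DISPLAY and XAUTHORITY are already exported.
# log_perf_level and NVIDIA_SETTINGS are illustrative names.
NVIDIA_SETTINGS=${NVIDIA_SETTINGS:-/usr/bin/nvidia-settings}

# Usage: log_perf_level LOGFILE
log_perf_level() {
    logfile=$1
    # -q queries the attribute, -t prints only the bare value.
    level=$("$NVIDIA_SETTINGS" -t -q "[gpu:0]/GPUCurrentPerfLevel" 2>/dev/null) || level="query-failed"
    printf '%s perf_level=%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$level" >> "$logfile"
    # Flush to disk so the entry has a chance of surviving a hard lockup.
    sync
}
```

Called in a loop with a short sleep, this leaves a timestamped trail that can be compared with the time of the last cron message after a crash.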

We also don't know if the problem is power level 0 itself, or that the driver fails to drive the power consumption fast enough to a higher level, as soon as some part of the system needs some more graphics functionality from the card. So maybe it's the driver that cannot respond to needed "power spikes" due to processes that suddenly kick in. It's also possible that it's some OpenGL problem, with the NVidia OpenGL driver...

The type of crash is also very characteristic of an (NVidia?) graphics driver failure: blank (dark) screen with a sole NON-blinking underscore in the upper left corner. This is actually what you see when you start X and the nvidia graphics driver initializes itself: just before the screen becomes totally blank and the mouse cursor appears, you see exactly this non-blinking underscore in the upper left. Seeing it also after the crash indicates to me that the graphics driver crashed.

Now this would not be *that* bad (I would at least have the consoles, I could restart X and possibly read some kernel messages in the logs - and, above all, I wouldn't have to reboot and recheck all my disks), but the system crashes hard, in the sense that only a hard reset and reboot will help. And this indicates that the crashed graphics driver took the kernel with it through Kernel Mode Setting. Because, if the console graphics mode is now also controlled through an nvidia module, then it's clear that the console will also be unresponsive in case nvidia crashes.

On the other hand, I am not willing to revert to VESA modes for the console, just because PowerMizer cannot drive the card in and out of power level 0 correctly - NVidia should fix its drivers, it's as simple as that. But I have studied the changelog of nvidia-drivers and my impression is that the software controls NVidia, not NVidia its software (BTW NVidia is incapable of putting /usr/share/doc/nvidia-drivers-460.91.03-r1/NVIDIA_Changelog.bz2 into some web location, so you have to install the driver in order to read its changelog, which defeats the purpose of a changelog when you desperately need to see whether some version X solved a bug related to your specific problem... :roll: - very frustrating, I will rather not post my notes :evil: on that! :lol: ).
All times are GMT