Gentoo Forums
Unusual nvidia-smi CPU usage and seems hindering X start.
sbbg
n00b
Joined: 02 Feb 2013
Posts: 16

Posted: Tue Apr 04, 2017 4:19 am    Post subject: Unusual nvidia-smi CPU usage and seems hindering X start.

Hi, Gentoo users & devs.

I am currently installing Gentoo on a Mi Notebook Air
(http://www.notebookcheck.net/Xiaomi-Mi-Air-13-3-inch-Notebook-Review.180561.0.html),
which happens to be an NVIDIA Optimus machine.
I followed the https://wiki.gentoo.org/wiki/NVIDIA/Optimus guide carefully.

After finishing most of the setup and rebooting,
I found that most of the time nvidia-smi uses 100% of a single CPU core for no apparent reason.
In this state I cannot run startx or xdm normally. The screen hangs, black, with only a non-blinking cursor in the top-left corner.
No Xorg.0.log is left behind, which is very disturbing and frustrating.

Do you have any idea why this happens randomly?
Is there anything I should check again?

Thank you.
Roman_Gruber
Advocate
Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Tue Apr 04, 2017 5:09 pm

Are you using the ~ branch? Brand-new hardware benefits the most from the software updates in the ~ branch.

Dated hardware, three years or older, can usually run on the stable branch just fine.

Do you see the command line / shell? => init 3?

How do you start the X server?

At the very least I would update the following components to the latest available in Portage (~ marked): gentoo-sources, mesa, the Intel GPU drivers, the whole X server stack, and the whole init stack (e.g. openrc + eudev).
sbbg
n00b
Joined: 02 Feb 2013
Posts: 16

Posted: Wed Apr 05, 2017 2:27 am

Roman_Gruber wrote:
Are you using the ~ branch? Brand-new hardware benefits the most from the software updates in the ~ branch.

Dated hardware, three years or older, can usually run on the stable branch just fine.

Do you see the command line / shell? => init 3?

How do you start the X server?

At the very least I would update the following components to the latest available in Portage (~ marked): gentoo-sources, mesa, the Intel GPU drivers, the whole X server stack, and the whole init stack (e.g. openrc + eudev).

Sir,
first of all, thank you for the reply.

I'm actually using the ~amd64 branch at the moment.

Sorry, I'm not sure what you mean by init 3.
But this laptop does start all services in the boot, sysinit, nonetwork, and default runlevels without any problem,
so I also get the normal [local] service result and the shell prompt.

There are two ways I have tried to launch X:
(1) Via XDM, which is configured to start SDDM; the xrandr commands required by the Optimus guide are specified in /usr/share/sddm/scripts/Xsetup.
(2) By running "startx" as root. The xrandr commands are in .xinitrc in root's home directory.
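
For reference, the xrandr commands in question are usually the two below, as given in NVIDIA's driver README for RandR 1.4 display offloading. This is only a sketch: the provider names "modesetting" and "NVIDIA-0" are assumptions and should be checked against the output of "xrandr --listproviders" on the actual machine.

```shell
#!/bin/sh
# Route the NVIDIA GPU's rendered output through the display that is driven
# by the integrated GPU (provider names are assumptions, see above).
xrandr --setprovideroutputsource modesetting NVIDIA-0
# Enable all connected outputs at their preferred modes.
xrandr --auto
```

The same two lines would go into either place mentioned above (the SDDM Xsetup script or .xinitrc), before the session itself is started.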

The packages you mentioned are at the following versions:
* I only mask the 4.10 kernel for compatibility with nvidia-drivers;
Kernel: gentoo-sources 4.9.20 (it looks like the current nvidia-drivers does not get along well with kernel 4.10)
Mesa: mesa-17.0.3 USE="classic dri3 egl gallium gbm nptl vdpau wayland"
Intel GPU: xf86-video-intel is somehow NOT installed by the xorg-drivers meta package. I'm not sure whether this is correct, given that I specified VIDEO_CARDS="intel i965 nvidia" in make.conf
nvidia-drivers: 378.13
OpenRC: openrc-0.24.2

If you have further suggestions, please let me know.
I have scratched my head so hard that I'm running out of hair. Thank you.
sbbg
n00b
Joined: 02 Feb 2013
Posts: 16

Posted: Wed Apr 05, 2017 6:16 am    Post subject: Further details concerning init

Hi,

I finally have more details and clues.

I just found out that when I choose "Recovery mode" in GRUB2, then enter the password and remount the root FS,
nvidia-smi no longer burns CPU, and I can start X normally with both startx and XDM.

So I can't help thinking that something is wrong with a service that runs in a normal boot but not in recovery mode.
Here is the service status in both cases.

Normal boot:
Code:
Runlevel: shutdown
 savecache                                                         [  stopped  ]
 killprocs                                                         [  stopped  ]
 mount-ro                                                          [  stopped  ]
Runlevel: sysinit
 sysfs                                                             [  started  ]
 devfs                                                             [  started  ]
 udev                                                              [  started  ]
 dmesg                                                             [  started  ]
 kmod-static-nodes                                                 [  started  ]
 udev-trigger                                                      [  started  ]
Runlevel: default
 dbus                                                              [  started  ]
 bluetooth                                                         [  started  ]
 wicd                                                              [  started  ]
 local                                                             [  started  ]
Runlevel: nonetwork
 local                                                             [  started  ]
Runlevel: boot
 modules                                                           [  started  ]
 hwclock                                                           [  started  ]
 hostname                                                          [  started  ]
 fsck                                                              [  started  ]
 root                                                              [  started  ]
 mtab                                                              [  started  ]
 localmount                                                        [  started  ]
 urandom                                                           [  started  ]
 sysctl                                                            [  started  ]
 bootmisc                                                          [  started  ]
 termencoding                                                      [  started  ]
 keymaps                                                           [  started  ]
 loopback                                                          [  started  ]
 procfs                                                            [  started  ]
 binfmt                                                            [  started  ]
Dynamic Runlevel: hotplugged
Dynamic Runlevel: needed/wanted
 modules-load                                                      [  started  ]
Dynamic Runlevel: manual


Recovery mode:
Code:

Runlevel: shutdown
 savecache                                                         [  stopped  ]
 killprocs                                                         [  stopped  ]
 mount-ro                                                          [  stopped  ]
Runlevel: sysinit
 sysfs                                                             [  started  ]
 devfs                                                             [  started  ]
 udev                                                              [  started  ]
 dmesg                                                             [  started  ]
 kmod-static-nodes                                                 [  started  ]
 udev-trigger                                                      [  started  ]
Runlevel: default
 dbus                                                              [  stopped  ]
 bluetooth                                                         [  stopped  ]
 wicd                                                              [  stopped  ]
 local                                                             [  stopped  ]
Runlevel: nonetwork
 local                                                             [  stopped  ]
Runlevel: boot
 modules                                                           [  stopped  ]
 hwclock                                                           [  stopped  ]
 hostname                                                          [  stopped  ]
 fsck                                                              [  stopped  ]
 root                                                              [  stopped  ]
 mtab                                                              [  stopped  ]
 localmount                                                        [  stopped  ]
 urandom                                                           [  stopped  ]
 sysctl                                                            [  stopped  ]
 bootmisc                                                          [  stopped  ]
 termencoding                                                      [  stopped  ]
 keymaps                                                           [  stopped  ]
 loopback                                                          [  stopped  ]
 procfs                                                            [  stopped  ]
 binfmt                                                            [  stopped  ]
Dynamic Runlevel: hotplugged
Dynamic Runlevel: needed/wanted
Dynamic Runlevel: manual
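
A quick way to compare two listings like the ones above, sketched under the assumption that each was saved to a file with "rc-status --all > FILE" in the respective boot. The function name rc_status_diff and the file arguments are made up for illustration.

```shell
#!/bin/sh
# rc_status_diff NORMAL RECOVERY
# Compare two saved `rc-status --all` dumps and print every service whose
# state differs between them.
rc_status_diff() {
    awk '
        # Service lines end in a bracketed state, e.g. " dbus   [  started  ]".
        /\[ *[a-z]+ *\] *$/ {
            name = $1
            state = $0
            sub(/^.*\[ */, "", state)
            sub(/ *\] *$/, "", state)
            if (FILENAME == ARGV[1]) {
                first[name] = state          # state in the first dump
            } else if ((name in first) && first[name] != state && !(name in seen)) {
                seen[name] = 1               # "local" is listed in two runlevels
                printf "%s: %s -> %s\n", name, first[name], state
            }
        }
    ' "$1" "$2"
}
```

Run on the two listings above, this reports the services of the default and boot runlevels as "started -> stopped". A service that appears in only one listing (modules-load here) is not reported by this sketch.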


Can anyone help me investigate further where the problem could be?
Thank you.
Roman_Gruber
Advocate
Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Wed Apr 05, 2017 8:55 am

What are the contents of grub.cfg for both boot entries, please: the working one and the bugged one?

E.g. something like this (starts with menuentry and ends with }):

Code:
menuentry 'System BIOS setup' --users "" {
   echo   'Entering system BIOS setup ...'
   fwsetup
}
sbbg
n00b
Joined: 02 Feb 2013
Posts: 16

Posted: Wed Apr 05, 2017 11:06 am

Roman_Gruber wrote:
What are the contents of grub.cfg for both boot entries, please: the working one and the bugged one?

E.g. something like this (starts with menuentry and ends with }):

Code:
menuentry 'System BIOS setup' --users "" {
   echo   'Entering system BIOS setup ...'
   fwsetup
}


Hi, thank you for reaching out again.

Here are the grub.cfg menu entries you asked for:

Normal boot:
Code:

menuentry 'Gentoo GNU/Linux' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-7f39c493-22cd-4e55-8b40-71527175e0f9' {
        load_video
        insmod gzio
        insmod part_gpt
        insmod ext2
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root  7f39c493-22cd-4e55-8b40-71527175e0f9
        else
          search --no-floppy --fs-uuid --set=root 7f39c493-22cd-4e55-8b40-71527175e0f9
        fi
        echo    'Loading Linux 4.9 ...'
        linux   /boot/kernel-4.9 root=/dev/nvme0n1p5 ro
}


Recovery mode:
Code:

        menuentry 'Gentoo GNU/Linux, with Linux 4.9 (recovery mode)' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-4.9-recovery-7f39c493-22cd-4e55-8b40-71527175e0f9' {
                load_video
                insmod gzio
                insmod part_gpt
                insmod ext2
                if [ x$feature_platform_search_hint = xy ]; then
                  search --no-floppy --fs-uuid --set=root  7f39c493-22cd-4e55-8b40-71527175e0f9
                else
                  search --no-floppy --fs-uuid --set=root 7f39c493-22cd-4e55-8b40-71527175e0f9
                fi
                echo    'Loading Linux 4.9 ...'
                linux   /boot/kernel-4.9 root=/dev/nvme0n1p5 ro single
        }


I didn't change anything that grub2-mkconfig generated.
BTW, a really bad, partial workaround I tried is deleting nvidia-smi.
But of course I would like to know how this can actually happen.
Thank you.
Roman_Gruber
Advocate
Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Wed Apr 05, 2017 1:26 pm

Your recovery entry boots into single-user mode.
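
The two menu entries posted above differ only in the trailing "single" on the "linux" line. A small sketch for confirming which mode a running system was booted in; the function name cmdline_has_single is made up for illustration.

```shell
#!/bin/sh
# cmdline_has_single CMDLINE
# Succeed if the given kernel command line contains the bare word "single".
cmdline_has_single() {
    # Pad with spaces so that only a whole word matches, not e.g. "singleton".
    case " $1 " in
        *" single "*) return 0 ;;
    esac
    return 1
}

# Usage on a live system:
#   cmdline_has_single "$(cat /proc/cmdline)" && echo "booted with 'single'"
```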

Quote:
/boot/kernel-4.9


Is this 4.9.0? Please post the output of:

uname -a

The latest available version is 4.9.20; the latest stable amd64 version is 4.9.16.

A few years ago the devs simply dropped buggy kernel versions without notice. Several times an upgrade suddenly fixed such problems.

Quote:
OpenRC: openrc-0.24.2


Do you use openrc with eudev?

Code:
ASUS-G75VW roman # qlist -Iv openrc eudev
app-admin/openrc-settingsd-1.0.1
sys-apps/openrc-0.24.2
sys-fs/eudev-3.2.1-r1
ASUS-G75VW roman #



I still wonder why your box hangs.
sbbg
n00b
Joined: 02 Feb 2013
Posts: 16

Posted: Thu Apr 06, 2017 7:02 am

Hi,

Yes, I'm using the 4.9.20 gentoo-sources kernel;
here is the uname output:
Code:

Linux Pexus_Mobile 4.9.20-gentoo #3 SMP Tue Apr 4 06:19:24 CST 2017 x86_64 Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz GenuineIntel GNU/Linux


And yes, it uses eudev + openrc.
The output of the qlist command follows:
Code:

sys-apps/openrc-0.24.2
sys-fs/eudev-3.2.1-r1


Please feel free to ask for any more details, even an account/password to log in with if that helps.
Thank you again.
Roman_Gruber
Advocate
Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Thu Apr 06, 2017 7:46 pm

google: nvidia-smi hangs

https://devtalk.nvidia.com/default/topic/929282/nvidia-smi-hangs-cannot-be-killed-even-by-sigkill/?offset=1

http://unix.stackexchange.com/questions/255658/nvidia-smi-hangs-indefinitely-what-could-be-the-issue

http://www.linuxquestions.org/questions/slackware-14/nvidia-drivers-some-hard-freezes-on-current-4175452311/

Quote:
Thanks for that Sairum. So far I have:

Installed NVidia 310.32
Disabled Hardware Acceleration in Flash
Disabled compositing in xorg
Upgraded kernel to 3.8.1
Appended 'irqpoll' in lilo.conf
Added '/usr/bin/nvidia-smi -pm 1' to rc.local


So now I will sit tight and see if I have beaten the freezes...


I have no idea about the irqpoll kernel option,
and the same goes for "Added '/usr/bin/nvidia-smi -pm 1' to rc.local".

https://bugs.archlinux.org/task/23679

Quote:
On reboot doing a nvidia-smi -q is printed a bunch of traces. Had to roll back to the current stable version to get X running (reinstalling the previous new beta didn't help).


---

Summary: this problem seems to be fairly common.

Suggestions, worth a try:

- Try an older version of the binary nvidia-drivers package.
- Research the "irqpoll" kernel option: what it does and whether it might help.
- Research "Added '/usr/bin/nvidia-smi -pm 1' to rc.local" and whether that gives any useful hint.
- It may be worth running nvidia-smi -q on the buggy setup to see whether it prints "a bunch of traces", as in the Arch bug report.
- Ask on the NVIDIA support forum. Maybe they can give you a hint.
alikasundara
n00b
Joined: 10 Nov 2011
Posts: 5

Posted: Wed Apr 12, 2017 7:56 pm

Hi sbbg,

Have you had any luck with resolving that? I am facing the same issue, the kernel version is 4.9.6.

Thanks.
Borodux
n00b
Joined: 21 May 2017
Posts: 1

Posted: Sun May 21, 2017 6:40 pm

Hi,

This problem arose for me after updating nvidia-drivers to 381.22. After several boot attempts the X server may start successfully, but most of the time nvidia-smi eats 100% CPU and the process is totally unkillable.

However, downgrading to 378.13 didn't solve the problem. I didn't note which version I had before, but all the distfiles are new now, so the old version might have changed too.

I use:
- 4.9.16-gentoo i686
- framebuffer (vesafb) for splash theme and console decorations
- 01:00.0 VGA compatible controller: NVIDIA Corporation GK106 [GeForce GTX 650 Ti] (rev a1)

Disabling console decorations and the framebuffer solved the problem. I also couldn't find anything written about this new behaviour.
Yamakuzure
Advocate
Joined: 21 Jun 2006
Posts: 2273
Location: Bardowick, Germany

Posted: Mon May 22, 2017 8:00 am

I have the same issue if the 'nvidia' kernel module is loaded instead of 'nvidia-uvm'. (I am using BumbleBee, though)

When I run something through primusrun/optirun, it looks like this, and everything works just fine:
Code:
 ~ # lsmod | grep nvidia
nvidia_modeset        775279  1
nvidia_uvm            560959  0
nvidia              11458879  40 nvidia_modeset,nvidia_uvm
 ~ # nvidia-smi
Mon May 22 10:03:28 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K2100M       Off  | 0000:01:00.0     Off |                  N/A |
| N/A   32C    P8    N/A /  N/A |      9MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     13563    G   Xorg                                             7MiB |
+-----------------------------------------------------------------------------+
This problem arose when I updated to kernel 4.9, so I do not think that it is some hard bug in nvidia-drivers. It seems that 'nvidia_uvm' is a must-have now.
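
To compare an affected system against the lsmod output above, something like the following sketch could be used. It reports which of the three modules shown there are not loaded. The function name missing_nvidia_modules is made up for illustration, and whether a missing nvidia_uvm is really the cause on a given machine remains an assumption.

```shell
#!/bin/sh
# missing_nvidia_modules
# Read /proc/modules-style text on stdin (module name in column 1) and print
# each of nvidia, nvidia_modeset and nvidia_uvm that is not listed.
missing_nvidia_modules() {
    awk '
        { loaded[$1] = 1 }
        END {
            n = split("nvidia nvidia_modeset nvidia_uvm", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in loaded)) print want[i]
        }'
}

# Usage on a live system:
#   missing_nvidia_modules < /proc/modules
```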
_________________
Important German:
  1. "Aha" - German reaction to pretend that you are really interested while giving no f*ck.
  2. "Tja" - German reaction to the apocalypse, nuclear war, an alien invasion or no bread in the house.
Zephyrus
Apprentice
Joined: 01 Sep 2004
Posts: 204

Posted: Sat Oct 14, 2017 9:45 am

I have a slightly different setup: I am using bumblebee on a W530, with the Intel integrated graphics driving the main laptop screen and bumblebee-Nvidia driving an external display.
However, I recently started to have the same issue, with nvidia-smi hanging at boot when run by udev. This did not happen at every boot but seemingly at random, which made me suspect some kind of race condition at boot.
Indeed, I discovered that by adding a small sleep (sleep 1.5 below) to /lib/udev/nvidia-udev.sh, the script that runs nvidia-smi at boot,
Code:

                #hopefully this prevents infinite loops like bug #454740
                if lsmod | grep -iq nvidia; then
                        sleep 1.5
                        /opt/bin/nvidia-smi > /dev/null
                fi


the issue, at least on my system, disappears!
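
A possible variation on this workaround, offered only as an untested assumption: instead of a fixed delay, poll for some readiness condition and give up after a timeout. The helper wait_for_path below is hypothetical, and nothing in this thread establishes which path (if any) signals that the driver is ready. /dev/nvidiactl in the usage comment is only a guess.

```shell
#!/bin/sh
# wait_for_path PATH [TRIES]
# Poll until PATH exists, sleeping 0.1 s between tries; give up after TRIES
# attempts (default 30, i.e. roughly 3 seconds). Fractional sleep is not
# POSIX, but the script above already relies on it with "sleep 1.5".
wait_for_path() {
    path=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 0.1
        tries=$((tries - 1))
    done
    return 1
}

# Hypothetical use in place of the fixed sleep:
#   wait_for_path /dev/nvidiactl && /opt/bin/nvidia-smi > /dev/null
```

The fixed sleep is simpler and is what was actually reported to work here; polling only pays off if a reliable readiness signal can be identified.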
Jimini
Guru
Joined: 31 Oct 2006
Posts: 581
Location: Germany

Posted: Sat Apr 28, 2018 6:40 pm

I can confirm this behavior for gentoo-sources-4.9.95 and nvidia-drivers-390.42. After running startx, the black screen only shows a non-blinking prompt, and nvidia-smi generates 100% CPU load. top / ps also show status "D", which means "uninterruptible sleep", IIRC.
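
That recollection matches the ps manual, which lists state D as uninterruptible sleep (usually I/O). A process in that state does not act on signals until it leaves it, which fits the reports of an unkillable nvidia-smi. A minimal sketch for listing such processes; the function name d_state_procs is made up for illustration.

```shell
#!/bin/sh
# d_state_procs
# Read "PID STAT COMMAND" lines on stdin and print the PID and command of
# every process whose state starts with D (uninterruptible sleep).
d_state_procs() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage on a live system (the trailing "=" suppresses the column headers):
#   ps -eo pid=,stat=,comm= | d_state_procs
```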

Zephyrus' solution works in my case, so I could finally start the gui :-)

Best regards,
Jimini
_________________
"The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." (H.P. Lovecraft: The Call of Cthulhu)
Satarsa
n00b
Joined: 21 Sep 2005
Posts: 70
Location: Russia, St.-Petersburg

Posted: Thu Jan 24, 2019 3:24 pm

I hit the same issue, but Zephyrus's solution does not work for me.
The only thing that works is not allowing the nvidia kernel module to autoload.
In my case I added this to /etc/modprobe.d/blacklist.conf:
Code:
blacklist nvidia
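
Note that a blacklist entry only stops the module from being autoloaded through its aliases; it can still be loaded explicitly with "modprobe nvidia". A small sketch for checking whether such an entry is in place; the function name is_blacklisted is made up for illustration.

```shell
#!/bin/sh
# is_blacklisted MODULE [DIR]
# Succeed if some *.conf file in DIR (default /etc/modprobe.d) contains a
# "blacklist MODULE" line, optionally followed by a comment.
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*(#.*)?\$" "$dir"/*.conf
}

# Usage on a live system:
#   is_blacklisted nvidia && echo "nvidia will not be autoloaded"
```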
All times are GMT