Gentoo Forums
[SOLVED - bad RAM] Nvidia Xid errors
KLarsen
n00b


Joined: 30 Dec 2005
Posts: 61
Location: Spain

Posted: Fri Nov 16, 2018 5:05 pm    Post subject: [SOLVED - bad RAM] Nvidia Xid errors

For about two weeks I've been getting NVRM Xid errors when gaming. I have a GTX 660 running the latest stable nvidia-drivers (396.54). Around the time the errors started, I installed two new RAM sticks, doubling the amount of memory.
After a LOT of testing, the culprit seems to be the new RAM sticks: with them installed I get errors, and with them removed I can play games for hours with no errors (though that might just be a coincidence...).
However, I get no errors in memtest86, even after running it for over 20 hours. For now the vendor won't RMA the memory without an error showing in memtest (I'm going to pressure them on this, though).
I've also tried downgrading nvidia-drivers and the kernel, recompiling nvidia-drivers several times, putting the GPU in the other PCI-E slot, shuffling the RAM around...
The only thing I haven't tried yet is downgrading xorg-server and xorg-drivers, which were upgraded around the time the errors started happening. I will do this tomorrow.

These are the errors that appear in the log:
Code:
Nov 4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c
Nov 4 12:57:30 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:57:30 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0030, Class 0000a097, Offset 00001c80, Data 40000000, ErrorCode 0000000c
Nov 4 23:53:41 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000
Nov 5 22:44:50 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000050 intr 00040000
Nov 5 22:54:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 2faac600
Nov 5 23:11:48 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 31, Ch 00000050, engmask 00000101, intr 10000000
Nov 6 23:48:26 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001b00, Data 00004100, ErrorCode 0000000c
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: EXTRA_MACRO_DATA
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x404490=0x80000002
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ChID 0058, Class 0000a097, Offset 00001b00, Data 00004100
Nov 6 23:48:35 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000058 beef9097 0000a097 00001414 00000000
Nov 6 23:51:10 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001418, Data 00000004, ErrorCode 0000000c
Nov 7 13:02:52 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 7 13:02:52 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff
Nov 7 13:06:37 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff

I usually get Xid 69, which according to https://docs.nvidia.com/deploy/xid-errors/index.html is either a hardware error or a driver error. None of these errors points to a RAM problem.
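
To see which Xid codes actually dominate, the log lines can be tallied by code. The following is a minimal Python sketch, assuming the log format shown above; the function name tally_xids and the regular expression are illustrative assumptions, not part of any NVIDIA tool. The sample uses a few lines copied from the log in this post.

```python
import re
from collections import Counter

# Matches the "NVRM: Xid (PCI:...): <code>," part of a kernel log line.
# Assumption: the format is the one seen in the log excerpt in this thread.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def tally_xids(lines):
    """Count how often each Xid code appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:  # lines without an Xid (e.g. "GPU at PCI:...") are skipped
            counts[int(match.group(2))] += 1
    return counts


# A few lines copied from the log excerpt in this post.
sample = [
    "Nov 4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed",
    "Nov 4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c",
    "Nov 4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000",
    "Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: EXTRA_MACRO_DATA",
    "Nov 6 23:51:10 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001418, Data 00000004, ErrorCode 0000000c",
]

if __name__ == "__main__":
    # Print the codes from most to least frequent.
    for code, count in tally_xids(sample).most_common():
        print(f"Xid {code}: {count}")
```

On a real system the same function could be fed the kernel log line by line (for example from a saved copy of the dmesg output) instead of the hard-coded sample.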

I can run the Unigine benchmark through several passes without errors; only Steam games give me problems. Also, once in a while the KDE/Plasma compositor stops unexpectedly (no errors in the log, though). In general the system is totally stable: it runs 24/7 and I can reliably compile with no errors.

So, can anyone help me and suggest something else to try and pinpoint the problem? Is the GPU going bad? Is the system RAM really the culprit? Any help would be much appreciated.


Last edited by KLarsen on Sat Nov 17, 2018 12:40 pm; edited 1 time in total
bunder
Bodhisattva


Joined: 10 Apr 2004
Posts: 5934

Posted: Fri Nov 16, 2018 7:50 pm

I see you tried reseating the card... are you overclocking the card at all? How good is your case cooling? Power supply rails?
_________________
Neddyseagoon wrote:
The problem with leaving is that you can only do it once and it reduces your influence.

banned from #gentoo since sept 2017
KLarsen
n00b


Joined: 30 Dec 2005
Posts: 61
Location: Spain

Posted: Fri Nov 16, 2018 8:06 pm

The card is factory overclocked.
Cooling should be good: neither the GPU nor the CPU gets above 60°C with the case closed. I still get errors with the case open.
I do have another PSU I can check, I'll do so tomorrow.
KLarsen
n00b


Joined: 30 Dec 2005
Posts: 61
Location: Spain

Posted: Sat Nov 17, 2018 12:39 pm

I finally got errors in memtest86. I left it running overnight for the third time, and this morning it had found 64 errors. Time for an RMA.
All times are GMT