KLarsen n00b
Joined: 30 Dec 2005 Posts: 61 Location: Spain
|
Posted: Fri Nov 16, 2018 5:05 pm Post subject: [SOLVED - bad RAM] Nvidia Xid errors |
|
|
For about 2 weeks now I've been getting NVRM Xid errors when gaming. I have a GTX660 running the latest stable nvidia-drivers (396.54). Around the time the errors started I installed 2 new RAM sticks, doubling the amount of memory.
After a LOT of testing, the culprit seems to be the new RAM sticks: with them installed I get errors, and with them removed I can play games for hours with no errors (though that might just be a coincidence...).
However, I get no errors in memtest86, even after running it for over 20 hours. The vendor won't initially RMA the memory without some error in memtest (I'm going to pressure them on this though).
I've also tried downgrading nvidia-drivers and the kernel, recompiled the nvidia-drivers several times, put the GPU in the other PCI-E socket, shuffled the RAM around...
The only thing I haven't tried yet is downgrading xorg-server and xorg-drivers, which did get upgraded around the time the errors started happening. I will do this tomorrow.
These are the errors that appear in the log:
Code: | Nov 4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c
Nov 4 12:57:30 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:57:30 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0030, Class 0000a097, Offset 00001c80, Data 40000000, ErrorCode 0000000c
Nov 4 23:53:41 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000
Nov 5 22:44:50 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000050 intr 00040000
Nov 5 22:54:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 2faac600
Nov 5 23:11:48 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 31, Ch 00000050, engmask 00000101, intr 10000000
Nov 6 23:48:26 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001b00, Data 00004100, ErrorCode 0000000c
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: EXTRA_MACRO_DATA
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x404490=0x80000002
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ChID 0058, Class 0000a097, Offset 00001b00, Data 00004100
Nov 6 23:48:35 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000058 beef9097 0000a097 00001414 00000000
Nov 6 23:51:10 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001418, Data 00000004, ErrorCode 0000000c
Nov 7 13:02:52 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 7 13:02:52 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff
Nov 7 13:06:37 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff
|
I usually get Xid 69, which according to https://docs.nvidia.com/deploy/xid-errors/index.html is either a hardware error or driver error. None of these errors point to a RAM problem.
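To see at a glance which Xid codes turn up most often in a log like the one above, something along these lines might help. This is only a rough sketch: the function name `count_xids`, the regular expression and the embedded sample lines are illustrative assumptions and not part of any nvidia tooling. In practice you would feed it lines from `dmesg` or the syslog file instead of the sample.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   "... kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ..."
# and captures the numeric Xid code. Lines like "NVRM: GPU at PCI:..." do not match.
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+),")


def count_xids(lines):
    """Return a Counter mapping each Xid code (int) to how often it appears."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the log in this post, used only as sample input.
SAMPLE = """\
Nov 4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c
Nov 4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000
Nov 5 22:44:50 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000050 intr 00040000
""".splitlines()

if __name__ == "__main__":
    # Print the codes from most to least frequent.
    for code, n in count_xids(SAMPLE).most_common():
        print(f"Xid {code}: {n}")
```

The tally only says which codes occur; what each code means still has to be looked up in the NVIDIA Xid documentation linked above.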
I can run the Unigine benchmark through several passes without errors; only Steam games give me problems. Also, once in a while the KDE/Plasma compositor stops unexpectedly (no errors in the log though). In general the system is totally stable, running 24/7, and I can reliably compile with no errors.
So, can anyone help me and suggest something else to try and pinpoint the problem? Is the GPU going bad? Is the system RAM really the culprit? Any help would be much appreciated.
Last edited by KLarsen on Sat Nov 17, 2018 12:40 pm; edited 1 time in total |
|
|
|
bunder Bodhisattva
Joined: 10 Apr 2004 Posts: 5934
|
Posted: Fri Nov 16, 2018 7:50 pm Post subject: |
|
|
I see you tried reseating the card... are you overclocking the card at all? How good is your case cooling? Power supply rails? _________________
Neddyseagoon wrote: | The problem with leaving is that you can only do it once and it reduces your influence. |
banned from #gentoo since sept 2017 |
|
|
|
KLarsen n00b
Joined: 30 Dec 2005 Posts: 61 Location: Spain
|
Posted: Fri Nov 16, 2018 8:06 pm Post subject: |
|
|
The card is factory overclocked.
Cooling should be good: neither the GPU nor the CPU gets above 60°C with the case closed. With the case open I still get errors.
I do have another PSU I can test with; I'll do so tomorrow. |
|
|
|
KLarsen n00b
Joined: 30 Dec 2005 Posts: 61 Location: Spain
|
Posted: Sat Nov 17, 2018 12:39 pm Post subject: |
|
|
I finally got errors in memtest86: I left it running overnight for the third time, and this morning it had found 64 errors. Time for an RMA. |
|