After a LOT of testing, the culprit seems to be the new RAM sticks: when they are installed I get errors, and when I take them out I can play games for hours with no errors (though that might just be a coincidence...).
However, I get no errors in memtest86, even after running it for over 20 hours. For now the vendor won't RMA the memory unless memtest shows an error (I'm going to pressure them on this though).
I've also tried downgrading nvidia-drivers and the kernel, recompiled the nvidia-drivers several times, put the GPU in the other PCI-E socket, shuffled the RAM around...
The only thing I haven't tried yet is downgrading xorg-server and xorg-drivers, which did get upgraded around the time the errors started happening. I will do this tomorrow.
These are the errors that appear in the log:
Nov 4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c
Nov 4 12:57:30 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 12:57:30 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0030, Class 0000a097, Offset 00001c80, Data 40000000, ErrorCode 0000000c
Nov 4 23:53:41 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000
Nov 5 22:44:50 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000050 intr 00040000
Nov 5 22:54:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 2faac600
Nov 5 23:11:48 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 31, Ch 00000050, engmask 00000101, intr 10000000
Nov 6 23:48:26 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001b00, Data 00004100, ErrorCode 0000000c
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: EXTRA_MACRO_DATA
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x404490=0x80000002
Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ChID 0058, Class 0000a097, Offset 00001b00, Data 00004100
Nov 6 23:48:35 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000058 beef9097 0000a097 00001414 00000000
Nov 6 23:51:10 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001418, Data 00000004, ErrorCode 0000000c
Nov 7 13:02:52 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov 7 13:02:52 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff
Nov 7 13:06:37 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff
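To make the pattern in the log easier to see, here is a rough sketch of a script that tallies how often each Xid number shows up. The function name `count_xids` and the `SAMPLE` list are just made up for illustration (the sample lines are copied from the log above), and the regex assumes the lines keep the same "NVRM: Xid (PCI:...): N," shape as shown here.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ..."
# The "NVRM: GPU at PCI:..." lines carry no Xid number and are skipped.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xids(lines):
    """Return a Counter mapping (pci_address, xid_number) to occurrences."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# A few lines copied from the log above, used only as sample input.
SAMPLE = [
    "Nov 4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed",
    "Nov 4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c",
    "Nov 5 22:44:50 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000050 intr 00040000",
    "Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: EXTRA_MACRO_DATA",
    "Nov 6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x404490=0x80000002",
]

if __name__ == "__main__":
    # Print one line per (device, Xid) pair, sorted by device then Xid number.
    for (pci, xid), n in sorted(count_xids(SAMPLE).items()):
        print(f"{pci} Xid {xid}: {n}")
```

The same function should work on the full kernel log if its lines are passed in instead of `SAMPLE`, for example by iterating over an open log file; comparing the counts with and without the new RAM installed might show whether the mix of Xid numbers actually changes.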
I can run the Unigine benchmark through several passes without errors; only Steam games give me problems. Also, once in a while the KDE/Plasma compositor stops unexpectedly (no errors in the log though). Otherwise the system is totally stable: it runs 24/7 and I can reliably compile with no errors.
So, can anyone help me and suggest something else to try and pinpoint the problem? Is the GPU going bad? Is the system RAM really the culprit? Any help would be much appreciated.

