Gentoo Forums
Graphics Card ... or Something Else [Understood]
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Sun Nov 26, 2017 9:52 pm    Post subject: Graphics Card ... or Something Else [Understood]

Team,

I recently replaced my nVidia 980GT graphics card with a Radeon RX 460. The idea is that I'll move the Radeon to a new system 'real soon'.
Amazon Warehouse deals had a good price on the card and it dropped £30 over the weekend while I watched.

The system is a Phenom(tm) II X6 1090T on a M4A78T-E motherboard.

I suspect I've disturbed the 9 year old sediment because all is not well.

After the hardware swap and adding x11-drivers/xf86-video-amdgpu all seemed well - briefly.

At boot, the BIOS warned of a CPU fan issue. All the fans are spinning and all the temperatures are OK.
That warning continues.

Intermittently there are lockups.
Most of the time, the system will not even respond to the reset button.
When it comes back after power cycling, it's on 5 cores instead of 6.
There is nothing in any logs and ssh is unresponsive.

So far, I've updated everything graphics software related. That's the kernel, at 4.14.0-gentoo and x11-drivers/xf86-video-amdgpu, which is now -9999.
It's early days with -9999.

I don't see a graphics card causing hard lockups, and the CPU booting with a core missing points to the CPU.
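
For anyone wanting to log the core count after each restart, here is a minimal Python sketch. It assumes the kernel exposes the online CPU list at /sys/devices/system/cpu/online in the usual "0-5" range format, and that 6 cores are expected for this CPU. The function names are made up for illustration. It cannot tell which physical core was dropped, only that fewer CPUs came up than expected.

```python
from pathlib import Path

EXPECTED_CORES = 6  # assumption: Phenom II X6; adjust for other CPUs


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-2,4-5' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def missing_cores(online_text, expected=EXPECTED_CORES):
    """Return the CPU numbers in 0..expected-1 that are not listed as online."""
    online = set(parse_cpu_list(online_text))
    return [cpu for cpu in range(expected) if cpu not in online]


if __name__ == "__main__":
    online_file = Path("/sys/devices/system/cpu/online")
    if online_file.exists():
        text = online_file.read_text()
        print("online:", text.strip(), "missing:", missing_cores(text))
```

Run from a boot script, this would leave a record of which restarts came back short of a core.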

Next, I'll try a serial console. I used to use X-Modem to an HP iPaq, so the bits are still in the chassis.
After that, I'll swap the graphics card back.

If you have any other ideas for the investigation, please post.

-- edit --

I'm not pushing the card very hard:
Code:
$ sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Password:
Clock Gating Flags Mask: 0x37bcf
   Graphics Medium Grain Clock Gating: On
   Graphics Medium Grain memory Light Sleep: On
   Graphics Coarse Grain Clock Gating: On
   Graphics Coarse Grain memory Light Sleep: On
   Graphics Coarse Grain Tree Shader Clock Gating: Off
   Graphics Coarse Grain Tree Shader Light Sleep: Off
   Graphics Command Processor Light Sleep: On
   Graphics Run List Controller Light Sleep: On
   Graphics 3D Coarse Grain Clock Gating: Off
   Graphics 3D Coarse Grain memory Light Sleep: Off
   Memory Controller Light Sleep: On
   Memory Controller Medium Grain Clock Gating: On
   System Direct Memory Access Light Sleep: Off
   System Direct Memory Access Medium Grain Clock Gating: On
   Bus Interface Medium Grain Clock Gating: Off
   Bus Interface Light Sleep: On
   Unified Video Decoder Medium Grain Clock Gating: On
   Video Compression Engine Medium Grain Clock Gating: On
   Host Data Path Light Sleep: Off
   Host Data Path Medium Grain Clock Gating: On
   Digital Right Management Medium Grain Clock Gating: Off
   Digital Right Management Light Sleep: Off
   Rom Medium Grain Clock Gating: On
   Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
   300 MHz (MCLK)
   214 MHz (SCLK)
   0.127 W (VDDC)
   0.18 W (VDDCI)
   5.50 W (max GPU)
   5.145 W (average GPU)

GPU Temperature: 29 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled
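
To watch these figures over time rather than reading them by eye, a rough Python sketch follows. It assumes the amdgpu_pm_info layout shown above (other kernel versions may print it differently), and parse_pm_info is a hypothetical helper name, not part of any tool. The sample text is copied from the output above.

```python
import re

# Sample copied from the amdgpu_pm_info output posted above.
SAMPLE = """\
GFX Clocks and Power:
   300 MHz (MCLK)
   214 MHz (SCLK)
   0.127 W (VDDC)
   0.18 W (VDDCI)
   5.50 W (max GPU)
   5.145 W (average GPU)

GPU Temperature: 29 C
GPU Load: 0 %
"""

# Matches lines such as "   214 MHz (SCLK)" or "   5.145 W (average GPU)".
READING = re.compile(r"^\s*([\d.]+)\s+(MHz|W)\s+\((.+)\)\s*$")


def parse_pm_info(text):
    """Return {label: (value, unit)} for clocks, power, temperature and load."""
    readings = {}
    for line in text.splitlines():
        m = READING.match(line)
        if m:
            value, unit, label = m.groups()
            readings[label] = (float(value), unit)
        elif line.startswith("GPU Temperature:"):
            readings["temperature"] = (float(line.split(":")[1].split()[0]), "C")
        elif line.startswith("GPU Load:"):
            readings["load"] = (float(line.split(":")[1].split()[0]), "%")
    return readings


if __name__ == "__main__":
    for label, (value, unit) in parse_pm_info(SAMPLE).items():
        print(f"{label}: {value} {unit}")
```

Feeding it the real debugfs file needs root, as in the sudo command above.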

_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.


Last edited by NeddySeagoon on Sun Dec 31, 2017 1:55 pm; edited 1 time in total

Jaglover
Watchman

Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Sun Nov 26, 2017 10:54 pm

It sounds like a messed up BIOS. Have you tried a hard reset? The next thing would be checking all the voltages under normal load. I wouldn't trust the M/B sensors; I'd use a real voltmeter.
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Sun Nov 26, 2017 11:31 pm

Jaglover

I have not measured voltages directly - I don't trust the motherboard sensor either.
Good idea. I'll do that.

A power off reset works. The only time I have seen it power up on 5 cores instead of 6 cores is after a lock up that will not respond to the reset button.
That's hard wired to the CPU reset pin.

Why do you say it's a messed up BIOS?
After the kernel has got control, the BIOS is no longer used.

I did wonder about the GPU firmware. As the kernel part is built in, that's a kernel rebuild. I'm following 4.14 fairly regularly.
I'll be sure to check for firmware updates before I build the kernel, not after.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Jaglover
Watchman

Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Mon Nov 27, 2017 12:03 am

I meant CMOS reset like removing the battery or setting the reset jumper. The OS may get wrong ideas about hardware if it is messed up.
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!

krinn
Watchman

Joined: 02 May 2003
Posts: 7470

Posted: Mon Nov 27, 2017 9:49 am

I would look at cat /proc/interrupts, making sure the card is using MSI edge and is alone on its IRQ. If not, try to isolate the card by checking PCI slot IRQ sharing in the m/b manual, or adjusting it if the BIOS allows.
(while you look, check whether the Thermal event interrupts count is > 0)

But I'm more scared about the CPU fan issue, the lockups and the lost core.
I would put the good old card back and see if the lockup, fan and core problems go away with it.
ps: yeah, in my mind, it doesn't really smell good NeddySeagoon
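
The /proc/interrupts checks above could be scripted roughly like this in Python. It is only a sketch: the sample text is invented for illustration (it is not from NeddySeagoon's machine), the helper names are hypothetical, and it assumes the usual x86 layout where shared IRQs list their handlers separated by commas and the thermal row is labelled TRM.

```python
# Invented sample in the usual x86 /proc/interrupts layout (two CPUs).
SAMPLE = """\
           CPU0       CPU1
  0:         46          0   IO-APIC    2-edge      timer
 18:        120          3   IO-APIC   18-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3
 30:       5120        907   PCI-MSI 524288-edge      amdgpu
TRM:          0          0   Thermal event interrupts
"""


def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} for each row."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row names one column per CPU
    rows = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are per-CPU counts; the rest is description.
        while fields and len(counts) < ncpu and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        rows[label.strip()] = (sum(counts), " ".join(fields))
    return rows


def check_device(rows, driver="amdgpu"):
    """Report whether the driver's IRQ uses MSI and whether it is shared."""
    for irq, (_count, desc) in rows.items():
        if driver in desc:
            return {"irq": irq, "msi": "MSI" in desc, "shared": "," in desc}
    return None


def thermal_events(rows):
    """Total of the TRM (thermal event) row, or 0 if the row is absent."""
    return rows.get("TRM", (0, ""))[0]


if __name__ == "__main__":
    rows = parse_interrupts(SAMPLE)
    print(check_device(rows), "thermal events:", thermal_events(rows))
```

On a real system, replace SAMPLE with the contents of /proc/interrupts.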

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Mon Nov 27, 2017 1:51 pm

Jaglover,

The voltages measured with a DVM are OK. Sensors says
Code:
$ sensors
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage:        +1.34 V  (min =  +0.85 V, max =  +1.70 V)
 +3.3 Voltage:        +3.29 V  (min =  +2.97 V, max =  +3.63 V)
 +5 Voltage:          +5.03 V  (min =  +4.50 V, max =  +5.50 V)
 +12 Voltage:        +12.44 V  (min = +10.20 V, max = +13.80 V)
CPU FAN Speed:       1002 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN Speed:    618 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN 2 Speed:  414 RPM  (min =  600 RPM, max = 7200 RPM)
CPU Temperature:      +35.0°C  (high = +60.0°C, crit = +95.0°C)
MB Temperature:       +34.0°C  (high = +45.0°C, crit = +75.0°C)

amdgpu-pci-0100
Adapter: PCI adapter
fan1:             N/A
temp1:        +26.0°C  (crit =  +0.0°C, hyst =  +0.0°C)

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +22.5°C  (high = +70.0°C)
                       (crit = +90.0°C, hyst = +85.0°C)
which is in good agreement with the DVM. I can't measure Vcore.
I have not checked the ripple. That's harder. Even though the PSU is 9 years old, it's had an easy life. It's an 850W Corsair unit, so it's been well derated.
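
As a cross-check on reading the sensors table by eye, here is a small Python sketch that compares each voltage and fan reading with the min/max limits printed beside it. It assumes the atk0110 line format shown above, and out_of_range is a made-up helper name. The sample lines are copied from the output above.

```python
import re

# Voltage and fan lines copied from the sensors output posted above.
SAMPLE = """\
Vcore Voltage:        +1.34 V  (min =  +0.85 V, max =  +1.70 V)
 +3.3 Voltage:        +3.29 V  (min =  +2.97 V, max =  +3.63 V)
 +5 Voltage:          +5.03 V  (min =  +4.50 V, max =  +5.50 V)
 +12 Voltage:        +12.44 V  (min = +10.20 V, max = +13.80 V)
CPU FAN Speed:       1002 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN Speed:    618 RPM  (min =  600 RPM, max = 7200 RPM)
CHASSIS FAN 2 Speed:  414 RPM  (min =  600 RPM, max = 7200 RPM)
"""

LINE = re.compile(
    r"^\s*(?P<name>.+?):\s+(?P<value>[+-]?[\d.]+)\s+(?P<unit>V|RPM)\s+"
    r"\(min =\s*(?P<min>[+-]?[\d.]+)\s+\w+, max =\s*(?P<max>[+-]?[\d.]+)\s+\w+\)"
)


def out_of_range(text):
    """Return (name, value, min, max, unit) for readings outside their limits."""
    problems = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # temperature lines and headers use a different layout
        value, low, high = (float(m[k]) for k in ("value", "min", "max"))
        if not low <= value <= high:
            problems.append((m["name"], value, low, high, m["unit"]))
    return problems


if __name__ == "__main__":
    for name, value, low, high, unit in out_of_range(SAMPLE):
        print(f"{name}: {value} {unit} outside {low}..{high} {unit}")
```

On the readings posted above, the only line it flags is CHASSIS FAN 2 (414 RPM against a 600 RPM minimum). Whether that has anything to do with the BIOS fan warning is not established here.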

Taking some of my own advice, I've done a visual on the Vcore regulator. There's nothing nasty there.

I'll do a CMOS reset before I revert graphics cards. Maybe even fit a new battery. The CMOS battery has never been replaced but it only ever does anything during power cuts as the 5v STBY is always there.

krinn,

The graphics card is using MSI and has an interrupt to itself. There is nothing nasty in /proc/interrupts.

This morning, the system started normally. If it would break and stay broken, it would be easy to fix.


I'm tempted to turn off three or four cores in the BIOS and see if that has any effect, besides slowing down emerge.
That's the first step to identifying a potentially faulty core.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

P.Kosunen
Guru

Joined: 21 Nov 2005
Posts: 309
Location: Finland

Posted: Mon Nov 27, 2017 6:01 pm

What power supply is it, and how old is it?

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Mon Nov 27, 2017 6:08 pm

P.Kosunen,

The PSU is an 850W Corsair from 2009.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

1clue
Advocate

Joined: 05 Feb 2006
Posts: 2569

Posted: Mon Nov 27, 2017 7:46 pm

Take a piece of raw chicken, lay it on top of your old nVidia card. Put it in the oven at 350F for 20 minutes, be careful that the pci pins point to the magnetic north.

Just guessing that this might be a satisfactory substitute for a whole, living chicken. Or it might really tick off the dark nVidia gods.

Seriously, I have no real help. I've never successfully switched video cards like that, across brands. Mostly this post is to subscribe me to the thread, to make it easier to find out how this turned out.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Mon Nov 27, 2017 7:58 pm

1clue,

Welcome :)

This motherboard has a built in Radeon of some sort. I only used it for a few days while I was waiting for the nVidia card to arrive.
It's a real Radeon too - none of this sharing main memory for the pixel buffer.
Even by the standards of 2009, it was a poor card but there's no AGP slot so it was that or nothing for a while.
Now back to ATI/AMD, so I've done the switch both ways.

Last thing last night I emerged linux-firmware, updated the kernel to 4.14.2 and discovered that I've had the AMD microcode updater in the kernel since 2009 but never added the microcode. Oops.
Fixed that too.
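
One way to confirm the microcode really is being applied is to compare the "microcode" field in /proc/cpuinfo before and after the change, and check that every core reports the same revision. A minimal Python sketch follows. The helper name is hypothetical and the revision values in the sample are placeholders, not real Phenom II revisions.

```python
# Placeholder sample: the 0x1 revisions are invented, not real AMD values.
SAMPLE = """\
processor\t: 0
model name\t: AMD Phenom(tm) II X6 1090T Processor
microcode\t: 0x1

processor\t: 1
model name\t: AMD Phenom(tm) II X6 1090T Processor
microcode\t: 0x1
"""


def microcode_revisions(cpuinfo_text):
    """Map each processor number to the microcode revision string it reports."""
    revisions = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "microcode" and current is not None:
            revisions[current] = value
    return revisions


if __name__ == "__main__":
    revs = microcode_revisions(SAMPLE)
    print(revs, "consistent:", len(set(revs.values())) == 1)
```

On a live system, pass it the contents of /proc/cpuinfo instead of SAMPLE.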

So far today it's been OK. Maybe I shouldn't say that.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

Posted: Mon Nov 27, 2017 9:40 pm

Don't worry over the CPU missing a core - I get that with my own Phenom II (slightly older 720 model) sometimes, restarting makes it clear up. I guess it's a common problem with the microcode/BIOS/whatever.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Mon Nov 27, 2017 10:06 pm

Ant P.

It's something that has only just started to happen. With the hard lockups, not even responding to reset, I was wondering about a failing CPU core.
I'm glad it's a feature and not a fault.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

bunder
Bodhisattva

Joined: 10 Apr 2004
Posts: 5934

Posted: Tue Nov 28, 2017 12:03 am

did you ever overclock it? could be an electromigration problem... although that's usually theorized to be a 30 year problem, not a 9 year problem. :lol:

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Tue Nov 28, 2017 12:20 am

bunder,

Nope. I'm an Electronics Engineer (retired), so I know better.
Even working hard it only reaches 60 C when the heatsink is clogged.
That's how I know to clean it.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

krinn
Watchman

Joined: 02 May 2003
Posts: 7470

Posted: Tue Nov 28, 2017 12:48 pm

NeddySeagoon wrote:
I'm glad it's a feature and not a fault.

:D

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Tue Nov 28, 2017 1:14 pm

Well, no improvement yet.

It ran all day yesterday. After 10 min this morning, it locked up and a reset brought it back on 5 cores.
I'll let it run like that to see if that leads to any improvement.

The new kernel and CPU microcode seem to have made it less crash prone but it's early days.
One day is not statistically significant.
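
To put a rough number on that: if lockups are treated as a Poisson process with an assumed old rate r per day (the thread only says "intermittently", so r is a guess), the chance of a clean stretch of t days by luck alone is exp(-r t). Requiring that chance to fall below a significance level alpha gives t >= ln(1/alpha) / r. A Python sketch of the arithmetic, with a made-up function name:

```python
import math


def clean_days_needed(old_rate_per_day, alpha=0.05):
    """Lockup-free days needed before a clean run is unlikely to be luck.

    Assumes lockups follow a Poisson process at the old rate, so the
    probability of t clean days is exp(-rate * t).
    """
    return math.ceil(math.log(1 / alpha) / old_rate_per_day)


if __name__ == "__main__":
    for rate in (1.0, 0.5):  # assumed lockups per day, not measured values
        print(f"rate {rate}/day: {clean_days_needed(rate)} clean days")
```

With an assumed rate of one lockup a day, a single clean day happens by chance about 37% of the time, and it takes 3 clean days to get below 5%. At one lockup every two days it takes 6.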
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

P.Kosunen
Guru

Joined: 21 Nov 2005
Posts: 309
Location: Finland

Posted: Tue Nov 28, 2017 3:47 pm

NeddySeagoon wrote:
The PSU is an 850W Corsair from 2009.

Depending on the model, it might be the cause of the issues. Models with all-Japanese caps should still be good, but if it has some inferior caps, those might be going bad. (IIRC Corsair made both: good ones with all-Japanese caps, and not-so-good ones.)

John R. Graham
Administrator

Joined: 08 Mar 2005
Posts: 10587
Location: Somewhere over Atlanta, Georgia

Posted: Tue Nov 28, 2017 4:43 pm

NeddySeagoon wrote:
Well, no improvement yet.
Have you tried just swapping back to the old card yet? Just to confirm that symptoms go away, that is.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Tue Nov 28, 2017 5:13 pm

John R. Graham,

That's the next step. If normality returns,
it will only confirm that it's a system issue. That's not the same as saying it's a graphics card problem.
Ask any early Ryzen adopter. :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

krinn
Watchman

Joined: 02 May 2003
Posts: 7470

Posted: Tue Nov 28, 2017 5:24 pm

NeddySeagoon wrote:
John R. Graham,
That's the next step. If normality returns,
it will only confirm that it's a system issue. That's not the same as saying it's a graphics card problem.
Ask any early Ryzen adopter. :)


No NeddySeagoon, it would confirm whether it is really a system issue and answer:
- does the new card do this (an incompatibility somewhere, or just some parameters to tweak),
- or has your system gone bad, either while you were working on it or because the new card has borked it?

You had no lockups with the nvidia; all was fine. Going back to the nvidia would tell you whether you still need to fight with parameters, or whether your system is now damaged (in which case fighting with parameters would do nothing).
For now you are battling against the card/kernel/whatever on a system that might have an issue that has nothing to do with the card itself.

You really should just put the nvidia card back and see if all is fine; if yes, it would confirm you're not fighting against the wind.

Jaglover
Watchman

Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Tue Nov 28, 2017 5:39 pm

Maybe the new card draws too much power from m/b. Does it have a separate power connector?
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Tue Nov 28, 2017 6:01 pm

Jaglover,

No, it's motherboard powered. It's a fanless XFX RX 460.
For the time being it's in a PCIe 2.0 slot but it's supposed to be backwards compatible.
I've never seen the card draw more than 7W.

The card it replaced had a 6 pin connector for 12v.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Jaglover
Watchman

Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Tue Nov 28, 2017 6:17 pm

Totally off topic, but I'm curious. Why do the British disrespect Alessandro Volta? Because he is not one of them, a foreigner? All unit symbols named after a person are uppercase, yet the British write v for volts, but W for watts - named after James Watt. :P
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54216
Location: 56N 3W

Posted: Tue Nov 28, 2017 7:02 pm

Jaglover,

We write A for Ampere and André-Marie Ampère was one of the old enemy ;)

I would write mV for millivolts, not mv, and kV for kilovolts. Remember CRT EHT power supplies?
I wonder if there is some ambiguity about the symbol V alone?
Other than it being a Roman numeral, none comes to mind.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

1clue
Advocate

Joined: 05 Feb 2006
Posts: 2569

Posted: Tue Nov 28, 2017 7:12 pm

Jaglover wrote:
Totally off topic, but I'm curious. Why do the British disrespect Alessandro Volta? Because he is not one of them, a foreigner? All unit symbols named after a person are uppercase, yet the British write v for volts, but W for watts - named after James Watt. :P


I'm curious why the lack of capitalization denotes disrespect as opposed to lack of knowledge or maybe as just plain laziness?

Speaking as a lazy American who has been using v for volts and a for amps without ever knowing or wondering about capitalization for the past 40 years.

All times are GMT
Page 1 of 3