Gentoo Forums
System hangs randomly but only when using amdgpu [solved]
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Sun Feb 11, 2018 4:29 pm    Post subject: System hangs randomly but only when using amdgpu [solved]

This is a fresh Linux install. The screen randomly fills with a single color (blue, white, yellow, etc.) and the system is unusable until a restart. I'm not even sure this is a hang, because sometimes when it happens I can still hear the audio from a YouTube video.

It's weird, as it never happens in Windows 8.1 but randomly hits me when I use Gentoo, SysRescueCD, SUSE, Mint, or FreeBSD. In short, it happens on any non-Windows OS.

I don't think it's a video issue because it happens even in pure CLI. The kernel log and dmesg don't show any error at the time of the hang. Do you guys have any suggestions on what else I could check to fix this?

Here's the output of lspci; this is an ASUS H81M-CS motherboard:

Code:

00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:02.0 Display controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #2 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation C220 Series Chipset Family H81 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Oland XT [Radeon HD 8670 / R7 250/350]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
03:00.0 Network controller: Qualcomm Atheros AR9485 Wireless Network Adapter (rev 01)
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)


uname -an

Code:

Linux vivalarev 4.9.76-gentoo-r1 #3 SMP Sun Feb 11 18:08:13 IST 2018 x86_64 Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz GenuineIntel GNU/Linux

_________________
Ant P. wrote:
The enterprise distros sell their binaries. Canonical sells their users.


Also... Be ignorant... Be happy! :)


Last edited by Amity88 on Sat Oct 09, 2021 5:35 pm; edited 2 times in total
Jaglover
Watchman
Watchman


Joined: 29 May 2005
Posts: 8291
Location: Saint Amant, Acadiana

Posted: Sun Feb 11, 2018 4:49 pm

Try swapping RAM modules if you have more than one. (Maybe Microsoft's dream is finally fulfilled: a computer that runs only Windows.)
_________________
My Gentoo installation notes.
Please learn how to denote units correctly!
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Tue Feb 13, 2018 6:29 am

I ran memtest86 over the past 24 hours and didn't get any errors or blank screens.

Currently, I suspect that the AMD GPU driver (amdgpu; the card is an R7 250, Southern Islands, GCN 1.0) is causing the issue. To debug this, I'm going to try the following:

1. Try the older radeon driver and see if the issue persists.
2. If that doesn't work, try the onboard Intel GPU.
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Thu Feb 15, 2018 5:06 pm    Post subject: System hangs randomly but only when using amdgpu/radeon

(changing the subject to better reflect the actual issue)

So, I was able to narrow the issue down to the graphics driver.

1. The screen blanks out randomly when I use the amdgpu driver.
2. It's a lot worse when I use the radeon driver.
3. The only thing that worked in the past was the old fglrx driver, a year ago. I can't use it anymore though, because they've dropped support :(
4. The onboard Intel GPU driver is stable. This is what I'm using now.

I'm not sure how to fix this. If any of you have the AMD R7 250 (Southern Islands) working without random hangs, please let me know.
Zucca
Moderator
Moderator


Joined: 14 Jun 2007
Posts: 3339
Location: Rasi, Finland

Posted: Thu Feb 15, 2018 5:17 pm

Do you get anything in dmesg/logs?

Have you tried other kernel versions?

I have
VGA:
VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO [Radeon HD 7750/8740 / R7 250E]
on my server. So I could start poking around. I just need to attach a monitor to it. :P
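
For anyone checking the logs as asked above, here is a minimal sketch of a filter that keeps only GPU-related lines from kernel log text. The function name `filter_gpu_lines` and the keyword list are assumptions for illustration, not an official set of patterns, and a hard hang may of course leave nothing to find.

```shell
#!/bin/sh
# Sketch: keep only GPU-related lines from kernel log text read on stdin.
# The keyword list is a guess at useful terms, not an authoritative list.
filter_gpu_lines() {
    grep -i -E 'amdgpu|radeon|drm|gpu fault|ring .* timeout'
}

# Typical use (dmesg may need root; the log file path depends on your syslog setup):
#   dmesg | filter_gpu_lines
#   filter_gpu_lines < /var/log/kern.log
```

Reading the on-disk log after a reboot is the more useful case here, since dmesg only covers the current boot.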
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!


Last edited by Zucca on Thu Feb 15, 2018 7:55 pm; edited 1 time in total
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Thu Feb 15, 2018 5:44 pm

I didn't find anything in the logs/dmesg when I booted into SysRescueCD after an incident.

As for other kernel versions: this system used to work fine with the fglrx drivers. Things only got messy after AMD moved over to the amdgpu drivers.

Also, it's good to know that you actually have something close in design. I think mine is GCN 1.1 and yours is probably GCN 1.2 :)

I haven't really started using this build, so I'm willing to experiment if you have anything you want me to try.
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54220
Location: 56N 3W

Posted: Thu Feb 15, 2018 6:19 pm

Amity88,

I have a similar issue with an R450.

I've tried different motherboard slots, the old and new amdgpu drivers, and turning off Message Signalled Interrupts (it's a command line option).
Memtest finds nothing and there is nothing in the kernel logs.

The incident halts the CPU, as it won't even respond to the reset button, which is probably why there is nothing in the logs.
After a power cycle, the system often restarts with one core missing.

A restart can take a couple of hours too. It's left my RAID5 'dirty' a few times, so it does a resync.

So far, I've only tried the two amdgpu drivers, but a few other non-accelerated drivers should work.
That may help determine whether it's a hardware or a software problem.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Thu Feb 15, 2018 7:06 pm

Hey there Neddy :)

I've ruled out any hardware issues because I dual boot with Windows 8.1 and it runs pretty stably.

The symptoms on Linux are very similar to what you experience. The system doesn't respond to reset key combinations, and at times even the actual reset button doesn't work. Restarts don't take much time though.

The non-accelerated drivers would do software rendering, right? As crappy as it is, I figured that the Intel GPU is better than software rendering. Maybe I should try pulling in fglrx or amdgpu-pro.
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54220
Location: 56N 3W

Posted: Thu Feb 15, 2018 7:27 pm

Amity88,

Yes - there would be no acceleration at all. I had in mind vesa or fbdev.
The GPU does nothing and the CPU does all the drawing. Performance will be terrible.

I've gone back to my 9-year-old nVidia card meanwhile, as I need to address Meltdown/Spectre everywhere and having random lockups doesn't help.
thumper
Guru
Guru


Joined: 06 Dec 2002
Posts: 552
Location: Venice FL

Posted: Tue Feb 20, 2018 2:03 am

Have you checked your logs for kernel crash dumps?

I had these:
Code:
amdgpu 0000:24:00.0: swiotlb buffer is full (sz: 2097152 bytes)
swiotlb: coherent allocation failed for device 0000:24:00.0 size=2097152
CPU: 0 PID: 5149 Comm: Compositor Tainted: G           OE    4.15.3-gentoo #1


And it would eventually hard lock the machine.

After some research I added this to my kernel command line:
Code:
swiotlb=65536


I did that last week and have not crashed since. Still, time will tell; it could be a coincidence.

George
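
For readers wanting to try the same parameter, here is a minimal sketch, assuming a GRUB2 setup (the thread does not say which bootloader is in use). As far as I understand the kernel's parameter documentation, the integer form of `swiotlb=` is a count of I/O TLB slabs rather than a size in bytes. The function name `has_kernel_param` and the `CMDLINE_FILE` override are illustrative only.

```shell
#!/bin/sh
# Assumed GRUB2 workflow (adjust for your bootloader):
#   1. In /etc/default/grub, add the parameter to GRUB_CMDLINE_LINUX, e.g.
#        GRUB_CMDLINE_LINUX="swiotlb=65536"
#   2. Regenerate the config: grub-mkconfig -o /boot/grub/grub.cfg
#   3. Reboot, then confirm the parameter is active with the check below.

# CMDLINE_FILE is overridable only so the function can be tried on a sample file.
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

# has_kernel_param NAME: succeeds if NAME or NAME=value is on the command line.
has_kernel_param() {
    for word in $(cat "$CMDLINE_FILE"); do
        case "$word" in
            "$1"|"$1"=*) return 0 ;;
        esac
    done
    return 1
}

if has_kernel_param swiotlb; then
    echo "swiotlb is set on the kernel command line"
else
    echo "swiotlb is not set"
fi
```

Checking `/proc/cmdline` after the reboot avoids the common mistake of editing the GRUB defaults without regenerating the config.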
PrSo
Tux's lil' helper
Tux's lil' helper


Joined: 01 Jun 2017
Posts: 136

Posted: Tue Feb 20, 2018 9:20 am

thumper,
those messages in the log are harmless and shouldn't be the cause of the hard lock; please see this bug report and this patch on LKML. So this _is_ a coincidence, although the messages could be a symptom.

Amity88,
is there any special reason that you are on 4.9.76-gentoo-r1 kernel?
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Tue Feb 20, 2018 2:40 pm

PrSo wrote:

Amity88,
is there any special reason that you are on 4.9.76-gentoo-r1 kernel?


I just use this version because it was the latest stable kernel. Do you feel that a newer kernel would fix the problem?
Mimamau
Apprentice
Apprentice


Joined: 11 Jun 2002
Posts: 160
Location: Germany

Posted: Tue Feb 20, 2018 2:48 pm

As in my other thread, there seem to be problems with Southern Islands GPUs.
I only get a slow 2D desktop; everything else gives me a blank screen or crashes the system completely.

Even the amdgpu-pro drivers don't work on supported distributions. AMD support wrote:

"I apologize for the delay. I was waiting for feedback from the subject matter experts.
Unfortunately, it appears the HD7870 series has not been qualified with our latest drivers.
The recommendation is to use the inbox drivers or an open source driver, available here: https://www.x.org/wiki/RadeonFeature/#index10h2
If you experience issues with the open source drivers, please file a report at the link above. I have been informed that our engineers monitor and investigate reports listed there.
In order to update this service request, please respond, leaving the service request reference intact.
Best regards,
AMD Global Customer Care"
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54220
Location: 56N 3W

Posted: Tue Feb 20, 2018 2:58 pm

Amity88,

There is a new amdgpu driver in the 4.15 kernel.
It's worth a try.
Tony0945
Watchman
Watchman


Joined: 25 Jul 2006
Posts: 5127
Location: Illinois, USA

Posted: Tue Feb 20, 2018 3:22 pm

Amity88 wrote:
I just use this version because it was the latest stable kernel. Do you feel that a newer kernel would fix the problem?

4.9.82 is in the tree.

I have problems with 4.4.x and 4.9.x, where the motherboard module nct6775 fails to load. No problem with 4.14.x. Running 'meld' on the relevant kernel source, I see that 4.4 and 4.9 are identical but 4.14 has tables with an extra entry. Undoubtedly that line supports my mobo, which is a new AM4 board.

NeddySeagoon wrote:
Amity88,

There is a new amdgpu driver in the 4.15 kernel.
It's worth a try.

Based on Neddy's input, I would try 4.14 or 4.15 (which has some Spectre mitigation) or, depending on your comfort level, try backporting the driver to 4.9.
I think I'll try that, just for fun.
EDIT: backporting the driver worked fine. I couldn't find where on kernel.org to file a bug. I may just file a bug against gentoo-sources.


Last edited by Tony0945 on Tue Aug 14, 2018 12:52 am; edited 3 times in total
PrSo
Tux's lil' helper
Tux's lil' helper


Joined: 01 Jun 2017
Posts: 136

Posted: Tue Feb 20, 2018 4:12 pm

Amity88 wrote:

Do you feel that a newer kernel would fix the problem?


Just like Neddy said, you should try 4.15.4 (change to ~amd64 or unmask gentoo-sources).
There is a big improvement in the amdgpu driver, plus the new AMD DC (Display Core) code, though I am not sure whether your card family, Oland, is supported. BTW, SI = GCN 1.0.

I have one machine with GCN 1.1 (R4 APU, CIK), and 4.15 is the first mainline kernel where things work quite well with the amdgpu driver (it is still _experimental_ for SI and CIK though).

One more thing, do you have dual gpu enabled, Intel and AMD?

Code:
00:02.0 Display controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54220
Location: 56N 3W

Posted: Tue Feb 20, 2018 4:57 pm

Having done the Spectre updates, I've gone back to my RX450 card.
As the 4.15 kernel didn't fix my lockups, I'm trying 4.16-rc1.

Watch this space.
Zucca
Moderator
Moderator


Joined: 14 Jun 2007
Posts: 3339
Location: Rasi, Finland

Posted: Tue Feb 20, 2018 5:38 pm

NeddySeagoon wrote:
Watch this space.

I have stalled all kernel and amdgpu updates. Now waiting eagerly.

I really don't want my server to lock up. I have exactly one spare GPU, an AMD HD 7850, and I think it's affected too. I think the one currently in my server is as well: Cape Verde PRO R7 250E.
My desktop has a Fiji-based R9 Nano... I think I'm safe there...
gcyoung
Apprentice
Apprentice


Joined: 04 Jul 2007
Posts: 170
Location: England

Posted: Mon Aug 13, 2018 9:32 pm    Post subject: Amdgpu Radeon R7-240 card and ryzen3 processor

I am also getting intermittent screen and wireless keyboard freezes. While it works, the amdgpu module seems better than radeonsi. I don't know if it is connected, but my dmesg output at login contains the message "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by hw".

I note that the Arch web site also contains references to problems with the combination of amdgpu and Ryzen processors.

I have ssh'd (without X) into the computer from another machine, and found it still responding normally to commands.
It's a pity, since I like the result before it freezes!

PS: I am using kernel-4.17.6, which is not listed as stable, but I found the same problem with an earlier stable kernel.
Goverp
Veteran
Veteran


Joined: 07 Mar 2007
Posts: 1999

Posted: Tue Aug 14, 2018 8:39 am

This is probably no help, but I'm running an HP laptop with /proc/cpuinfo model name "AMD A9-9420 RADEON R5, 5 COMPUTE CORES 2C+3G". It's a STONEY graphics thingy.
It also has an rtl8723de modem, which meant I needed a very recent kernel (and an external module), so I ran kernel 4.16 originally and 4.17.1 now.
I've never had any problems like those described in this thread, nor any issues from using a recent kernel.
AFAIK (I read the Phoronix summaries), AMDGPU support features regularly in the kernel change logs.

I currently have:
Code:
/etc/portage/make.conf
VIDEO_CARDS="amdgpu radeonsi"
and, to reduce kernel churn:
Code:
/etc/portage/package.keywords
<=sys-kernel/gentoo-sources-4.17.1 ~amd64

I read today that 4.18 has more AMDGPU stuff.
_________________
Greybeard
gcyoung
Apprentice
Apprentice


Joined: 04 Jul 2007
Posts: 170
Location: England

Posted: Wed Aug 15, 2018 8:13 pm    Post subject: Amdgpu Freezes

It may be of interest to others with this problem that, since my last post, I have followed a suggestion given under the heading 'dpm' on https://wiki.gentoo.org/wiki/AMDGPU#Hardware_detection:

Code:
echo performance > /sys/class/drm/card0/device/power_dpm_state
echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level

Since making these settings I have had no further "freezes", except when I made only the first setting. Since then I have used the computer for about twenty hours, including one five-hour MythTV frontend session. Previously, I regularly failed to get through a fairly standard film viewing, say about two hours, without needing a reboot.

Unfortunately the settings disappear when I log out, although I suppose I can write a small script to apply them on login. If there is any way to pass the settings as options to the module, I'd be glad to hear of it; or possibly there is a kernel setting which would do the trick.

If I don't return with a message that I've had another freeze-up, it can be assumed that these settings have solved my difficulty at least, although they might not work in other cases.
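
One way to make the two sysfs writes above persistent is a boot-time script. The following is a minimal sketch, assuming OpenRC's `local` service, which runs executable `*.start` files from /etc/local.d at boot. The file name `amdgpu-dpm.start`, the function names, and the `DEV` override are hypothetical, and `card0` is taken from the post rather than detected.

```shell
#!/bin/sh
# Hypothetical /etc/local.d/amdgpu-dpm.start (must be executable; runs as root at boot).
# Assumes the GPU is card0, as in the post above; adjust DEV if yours differs.
DEV="${DEV:-/sys/class/drm/card0/device}"

# set_attr FILE VALUE: write VALUE to $DEV/FILE if that file is writable.
set_attr() {
    if [ -w "$DEV/$1" ]; then
        echo "$2" > "$DEV/$1"
    else
        echo "skipping $1: $DEV/$1 is not writable" >&2
        return 1
    fi
}

# Apply both settings; the post reports that the first one alone was not enough.
apply_dpm_settings() {
    set_attr power_dpm_state performance &&
    set_attr power_dpm_force_performance_level high
}

apply_dpm_settings || echo "DPM settings were not fully applied" >&2
```

Running it at boot rather than at login means the settings are in place before X starts, and it needs no per-user configuration.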
Amity88
Apprentice
Apprentice


Joined: 03 Jul 2010
Posts: 260
Location: Third planet from the Sun

Posted: Sat Oct 09, 2021 5:34 pm

@gcyoung,
I think you have solved it! The same solution worked for me as well; I came back here to update the thread. Basically, we need to disable the dynamic power management (dpm) of this GPU.
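
To confirm which values are actually in effect, the same sysfs files can be read back. This is a minimal sketch under the same assumption as earlier in the thread, namely that the card is `card0`; the function name `show_dpm` and the `DEV` override are illustrative.

```shell
#!/bin/sh
# Sketch: print the current DPM-related sysfs values for one card.
# Assumes card0, as in the earlier post; override DEV for another card.
DEV="${DEV:-/sys/class/drm/card0/device}"

show_dpm() {
    for f in power_dpm_state power_dpm_force_performance_level; do
        if [ -r "$DEV/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$DEV/$f")"
        else
            printf '%s=unavailable\n' "$f"
        fi
    done
}

show_dpm
```

If a value has reverted after a reboot, the settings were only applied to the running session.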
Zucca
Moderator
Moderator


Joined: 14 Jun 2007
Posts: 3339
Location: Rasi, Finland

Posted: Sun Oct 10, 2021 8:00 pm

Thanks. Gotta poke those settings too.
Sadly, it looks like power consumption will increase. :|