Gentoo Forums
AMD GPU RX 6800 Random display freeze
MorgothSauron
Tux's lil' helper

Joined: 24 Sep 2020
Posts: 75

Posted: Mon Nov 13, 2023 6:38 pm    Post subject: AMD GPU RX 6800 Random display freeze

Hello,
Since early September I've been experiencing random display freezes with my AMD GPU. I say it is related to the GPU because /var/log/messages contains kernel messages related to amdgpu. It has happened at least 5 times since I started troubleshooting. It might be pure coincidence, but the issue started around the time I started to use kernel 6.5.

I was able to identify a pattern for this issue, but I'm not able to trigger the problem on purpose. I have to wait for the issue to happen to collect any data for troubleshooting.

The freeze follows this pattern:
- Firefox (~amd64) is playing a YouTube video
- The video freezes like it is buffering but the audio is still working
- The display is not refreshing anymore. I can't Alt+Tab and the mouse cursor is not moving.
- I cannot switch to a different console (e.g. Ctrl+Alt+F1)
- The audio stops after about 5 minutes and the screen goes black with a non-blinking cursor at the top left. No text at all.

At this point I have no other option than a power reset.

I was not sure whether the system was completely frozen or not, so I enabled SSH to give myself a chance to attempt a recovery (e.g. a clean reboot).

I was able to connect with SSH the next time the issue happened. At least the system was still working to some extent. I tried a reboot but it didn't work. My SSH session terminated and my PC was still responding to ping after 5 minutes. I had no way to know what was happening and had to force a power reset. I know the ping response was not from a system in boot process because I have LUKS enabled and I have to enter a passphrase.

It never happened while playing a game on Linux. I do get a driver timeout from time to time when I start a specific game on Windows, but this could be a problem with the game itself and not the GPU.

I tried to search on different forums and I couldn't find much information using some keywords from the log.

I did find this https://bugzilla.kernel.org/show_bug.cgi?id=201957 but it didn't help. With kernel 6.5 the default for amdgpu.mcbp is indeed -1 compared to 6.4 where the default is 0. I tried to set the value to 0 but I still encountered the same issue. I know this post is for a different issue, but I decided to give it a try anyway.

I created /etc/modprobe.d/amdgpu.conf to configure mcbp=0
Code:
#
options amdgpu mcbp=0
#
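
A file like this only has an effect if modprobe actually reads it (it looks in /etc/modprobe.d/, and a module loaded from an initramfs sees the copy packed into that initramfs). One way to check whether the option reached the driver is to read the value the loaded module reports under /sys/module. The following is a minimal sketch, not a definitive tool: `read_module_param` is a hypothetical helper name, and it assumes the parameter is exported read-only through sysfs, which is the case for amdgpu's mcbp as far as I know.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return the value a loaded module reports for `param`, or None.

    None means the module is not loaded or does not export the parameter.
    `sysfs_root` is only a parameter so the function can be tested
    against a temporary directory.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


# Report what the running kernel sees for amdgpu.mcbp.
value = read_module_param("amdgpu", "mcbp")
if value is None:
    print("amdgpu.mcbp not visible (module not loaded or parameter not exported)")
else:
    print(f"amdgpu.mcbp = {value}")
```

If the printed value is still -1 after a reboot, the option file was probably not read at the time the module was loaded.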


I searched the AMD GPU GitLab (https://gitlab.freedesktop.org/drm/amd/) without luck. I'm checking here before opening an issue there.

The PC itself is located in a well-ventilated space. I clean the inside of the case with a dust blower every month, taking care not to let any fan spin while I do so. I checked that the GPU is properly "seated" in the PCI slot. The GPU fans are working and will speed up under load. I didn't notice any temperature issue using nvtop. This is a brand new GPU purchased from a reputable store in April 2023.

No overclocking (CPU and GPU).

/var/log/messages contains the following messages:

Code:
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 4758 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8242 amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables bpfilter bridge stp llc vfat fat joydev snd_hda_codec_realtek snd_hda_codec_generic amdgpu snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils wireguard snd_soc_skl libchacha20poly1305 snd_soc_sst_ipc chacha_x86_64 snd_soc_sst_dsp poly1305_x86_64 snd_hda_ext_core ip6_udp_tunnel snd_soc_acpi_intel_match udp_tunnel snd_soc_acpi ledtrig_audio ipv6 snd_soc_core snd_hda_codec_hdmi snd_compress snd_pcm_dmaengine ac97_bus crc_ccitt drm_suballoc_helper intel_rapl_msr amdxcp snd_hda_intel intel_rapl_common mfd_core x86_pkg_temp_thermal snd_intel_dspcfg drm_buddy curve25519_x86_64 intel_powerclamp gpu_sched libcurve25519_generic snd_hda_codec libchacha crct10dif_pclmul
kernel:  drm_display_helper snd_hda_core ghash_clmulni_intel it87 cec snd_hwdep sha512_ssse3 drm_ttm_helper hwmon_vid snd_pcm ee1004 ttm rapl intel_cstate drm_kms_helper mei_hdcp snd_timer wmi_bmof intel_wmi_thunderbolt coretemp i2c_i801 intel_uncore pcspkr efi_pstore drm i2c_smbus snd mei_me hid_logitech_hidpp soundcore mei video backlight acpi_pad wmi intel_pch_thermal efivarfs dm_crypt trusted asn1_encoder dm_mod hid_logitech_dj sr_mod sd_mod cdrom crc32_pclmul xhci_pci crc32c_intel e1000e ahci xhci_hcd libahci
kernel: CPU: 2 PID: 4758 Comm: X Not tainted 6.5.11-gentoo-x86_64 #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO/Z390 AORUS PRO-CF, BIOS F12 11/05/2021
kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: Code: 40 fd ff ff 48 8d 95 94 fd ff ff 48 8b 85 50 fd ff ff 48 8b b6 50 01 00 00 48 8b b8 78 f4 03 00 e8 11 88 20 00 e9 87 f9 ff ff <0f> 0b e9 44 f0 ff ff 49 8b 4d 28 49 39 4b 28 0f 95 85 a0 fc ff ff
kernel: RSP: 0018:ffff986ec25ab8c8 EFLAGS: 00010002
kernel: RAX: 0000000000000286 RBX: 0000000000000286 RCX: 0000000000000019
kernel: RDX: 0000000000000001 RSI: 0000000000000297 RDI: 0000000000000002
kernel: RBP: ffff986ec25abc60 R08: 0000000000000001 R09: 0000000000000000
kernel: R10: ffff8b1f40795118 R11: ffff986ec25ab82c R12: ffff8b1f40795000
kernel: R13: ffff8b1f07d80010 R14: ffff8b218a2c3400 R15: 0000000000000000
kernel: FS:  00007fea15738900(0000) GS:ffff8b269dc80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fc645976b6c CR3: 00000001068ca003 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel:  ? __warn+0x7d/0x130
kernel:  ? amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel:  ? report_bug+0x16d/0x1a0
kernel:  ? handle_bug+0x3a/0x70
kernel:  ? exc_invalid_op+0x13/0x60
kernel:  ? asm_exc_invalid_op+0x16/0x20
kernel:  ? amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel:  ? amdgpu_dm_atomic_commit_tail+0x28bc/0x3930 [amdgpu]
kernel:  ? __wake_up_klogd.part.0+0x3c/0x60
kernel:  ? vprintk_emit+0x17f/0x200
kernel:  commit_tail+0x91/0x130 [drm_kms_helper]
kernel:  drm_atomic_helper_commit+0x116/0x140 [drm_kms_helper]
kernel:  drm_atomic_commit+0x93/0xc0 [drm]
kernel:  ? __pfx___drm_printfn_info+0x10/0x10 [drm]
kernel:  drm_mode_obj_set_property_ioctl+0x146/0x3a0 [drm]
kernel:  ? __pfx_drm_mode_obj_set_property_ioctl+0x10/0x10 [drm]
kernel:  drm_ioctl_kernel+0xbe/0x160 [drm]
kernel:  drm_ioctl+0x258/0x4d0 [drm]
kernel:  ? __pfx_drm_mode_obj_set_property_ioctl+0x10/0x10 [drm]
kernel:  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
kernel:  __x64_sys_ioctl+0x90/0xd0
kernel:  do_syscall_64+0x38/0x90
kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
kernel: RIP: 0033:0x7fea15cbe3fb
kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
kernel: RSP: 002b:00007fff02c67fd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fea15cbe3fb
kernel: RDX: 00007fff02c68060 RSI: 00000000c01864ba RDI: 000000000000000c
kernel: RBP: 00007fff02c68060 R08: 0000000000000093 R09: 0000000000001000
kernel: R10: 000000000ffaf041 R11: 0000000000000246 R12: 00000000c01864ba
kernel: R13: 000000000000000c R14: 000055c1aad58460 R15: 0000000000000fff
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---


That specific block repeats multiple times with little difference. It appeared 20 times the last time the issue happened. I can provide a full copy of the log if necessary.
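
Counting the repeated blocks by hand gets tedious, so here is a rough sketch of how they could be tallied. It assumes each block carries one "WARNING: CPU: ... at file:line function+0x..." line like the one in the trace above; `count_warnings` and the regular expression are my own illustration, not an existing tool.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g.
# "WARNING: CPU: 2 PID: 4758 at path/file.c:8242 some_function+0x3884/0x3930 [amdgpu]"
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+) (\S+?)\+0x")


def count_warnings(lines: Iterable[str]) -> Counter:
    """Count kernel WARNING lines, keyed by (function, source line number)."""
    counts: Counter = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            _path, lineno, function = match.groups()
            counts[(function, int(lineno))] += 1
    return counts


# Small demo on two lines taken from the trace above; the RIP line
# mentions the same function but is not counted as a warning.
SAMPLE = [
    "kernel: WARNING: CPU: 2 PID: 4758 at drivers/gpu/drm/amd/amdgpu/../display/"
    "amdgpu_dm/amdgpu_dm.c:8242 amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]",
    "kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]",
]
for (function, lineno), n in count_warnings(SAMPLE).most_common():
    print(f"{n:4d}  {function} (source line {lineno})")
```

To run it on the real log, pass an open file instead of SAMPLE, for example `count_warnings(open("/var/log/messages", errors="replace"))`.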

System details (inxi -F)

Code:
System:
  Host: morgoth Kernel: 6.5.11-gentoo-x86_64 arch: x86_64 bits: 64
    Desktop: KDE Plasma v: 5.27.8 Distro: Gentoo Base System release 2.14
Machine:
  Type: Desktop System: Gigabyte product: Z390 AORUS PRO v: N/A
    serial: <superuser required>
  Mobo: Gigabyte model: Z390 AORUS PRO-CF serial: <superuser required>
    UEFI: American Megatrends v: F12 date: 11/05/2021
CPU:
  Info: 8-core model: Intel Core i7-9700K bits: 64 type: MCP cache: L2: 2 MiB
  Speed (MHz): avg: 800 min/max: 800/4900 cores: 1: 800 2: 800 3: 800 4: 800
    5: 800 6: 800 7: 800 8: 800
Graphics:
  Device-1: AMD Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] driver: amdgpu
    v: kernel
  Display: x11 server: X.org v: 1.21.1.9 with: Xwayland v: 23.2.2 driver: X:
    loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi gpu: amdgpu
    resolution: 2560x1440~144Hz
  API: OpenGL v: 4.6 Mesa 23.1.8 renderer: AMD Radeon RX 6800 (navi21 LLVM
    16.0.6 DRM 3.54 6.5.11-gentoo-x86_64)
Audio:
  Device-1: Intel Cannon Lake PCH cAVS driver: snd_hda_intel
  Device-2: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel
  API: ALSA v: k6.5.11-gentoo-x86_64 status: kernel-api
  Server-1: PulseAudio v: 16.1 status: active
Network:
  Device-1: Intel Ethernet I219-V driver: e1000e
  IF: eno1 state: up speed: 1000 Mbps duplex: full mac: 18:c0:4d:2d:b3:7e
  IF-ID-1: virbr0 state: down mac: 52:54:00:0a:95:c4
Drives:
  Local Storage: total: 4.99 TiB used: 3.55 TiB (71.1%)
  ID-1: /dev/nvme0n1 vendor: LDLC model: F8+M.2 480 size: 447.13 GiB
  ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 970 EVO Plus 1TB
    size: 931.51 GiB
  ID-3: /dev/sda vendor: Western Digital model: WD40EZRZ-22GXCB0
    size: 3.64 TiB
Partition:
  ID-1: / size: 844.04 GiB used: 467.86 GiB (55.4%) fs: btrfs dev: /dev/dm-0
  ID-2: /boot size: 487.2 MiB used: 142.7 MiB (29.3%) fs: ext4
    dev: /dev/nvme1n1p5
  ID-3: /home size: 844.04 GiB used: 467.86 GiB (55.4%) fs: btrfs
    dev: /dev/dm-0
  ID-4: /var size: 844.04 GiB used: 467.86 GiB (55.4%) fs: btrfs
    dev: /dev/dm-0
Swap:
  ID-1: swap-1 type: file size: 7.98 GiB used: 0 KiB (0.0%)
    file: /var/swapfile
Sensors:
  System Temperatures: cpu: 30.0 C pch: 46.0 C mobo: N/A gpu: amdgpu
    temp: 44.0 C
  Fan Speeds (RPM): cpu: 811 fan-2: 0 fan-3: 0 gpu: amdgpu fan: 0
Info:
  Processes: 359 Uptime: 21m Memory: available: 31.27 GiB
  used: 5.54 GiB (17.7%) Shell: Zsh inxi: 3.3.27


System information (neofetch --off)
Code:

OS: Gentoo Linux x86_64
Host: Z390 AORUS PRO
Kernel: 6.5.11-gentoo-x86_64
Uptime: 42 mins
Packages: 1408 (emerge)
Shell: zsh 5.9
Resolution: 2560x1440
DE: Plasma 5.27.8
WM: KWin
Theme: Breeze Light [Plasma], Breeze [GTK2/3]
Icons: [Plasma], breeze [GTK2/3]
Terminal: kitty
CPU: Intel i7-9700K (8) @ 4.900GHz
GPU: AMD ATI Radeon RX 6800/6800 XT / 6900 XT
Memory: 4659MiB / 32024MiB



I have the following firmware configured for AMD GPU in /etc/portage/savedconfig/sys-kernel/linux-firmware-20231030. sienna is for my current GPU and I kept navi14 for my old GPU (AMD RX 5500 XT)

Code:
amdgpu/sienna_cichlid_vcn.bin
amdgpu/sienna_cichlid_ta.bin
amdgpu/sienna_cichlid_sos.bin
amdgpu/sienna_cichlid_smc.bin
amdgpu/sienna_cichlid_sdma.bin
amdgpu/sienna_cichlid_rlc.bin
amdgpu/sienna_cichlid_pfp.bin
amdgpu/sienna_cichlid_mec2.bin
amdgpu/sienna_cichlid_mec.bin
amdgpu/sienna_cichlid_me.bin
amdgpu/sienna_cichlid_dmcub.bin
amdgpu/sienna_cichlid_ce.bin
amdgpu/navi14_ta.bin
amdgpu/navi14_vcn.bin
amdgpu/navi14_sos.bin
amdgpu/navi14_smc.bin
amdgpu/navi14_sdma1.bin
amdgpu/navi14_sdma.bin
amdgpu/navi14_rlc.bin
amdgpu/navi14_pfp_wks.bin
amdgpu/navi14_pfp.bin
amdgpu/navi14_mec2_wks.bin
amdgpu/navi14_mec2.bin
amdgpu/navi14_mec_wks.bin
amdgpu/navi14_mec.bin
amdgpu/navi14_me_wks.bin
amdgpu/navi14_me.bin
amdgpu/navi14_gpu_info.bin
amdgpu/navi14_ce_wks.bin
amdgpu/navi14_ce.bin
amdgpu/navi14_asd.bin


Any suggestions?

jpsollie
Apprentice

Joined: 17 Aug 2013
Posts: 291

Posted: Wed Nov 15, 2023 8:18 pm

MorgothSauron,

let's try to isolate the issue first:
Firefox may be using a software renderer and opengl / vulkan to render the image,
or may be using hardware video decoding. I think the former is true.

Can you use youtube downloader and play the video with eg VLC or MPV to see whether it works in a hardware accelerated environment?
_________________
The power of Gentoo optimization (not overclocked): https://www.passmark.com/baselines/V10/images/503714802842.png

MorgothSauron
Tux's lil' helper

Joined: 24 Sep 2020
Posts: 75

Posted: Thu Nov 16, 2023 5:27 pm

jpsollie wrote:
MorgothSauron,

let's try to isolate the issue first:
Firefox may be using a software renderer and opengl / vulkan to render the image,
or may be using hardware video decoding. I think the former is true.

Can you use youtube downloader and play the video with eg VLC or MPV to see whether it works in a hardware accelerated environment?


Is there a way to check what Firefox is using for rendering?

One thing I remembered after your post is that I added the hwaccel USE flag to Firefox back in May. That's still a few months before the first appearance of the issue I currently have.

I will try your suggestion and download the YouTube video for local playback with VLC or MPV. However, this approach assumes that a given video would trigger the problem each time.

To be honest, I never tried to play the same video a second time to see what happens. I have nothing to lose by trying your suggestion. It can only provide more information to continue troubleshooting.

Right now the issue is still unpredictable. I watch YouTube for a few hours every day. The issue can take days or even weeks to happen again. I know because I write down when it happens and make a copy of /var/log/messages.

CooSee
Veteran

Joined: 20 Nov 2004
Posts: 1441
Location: Earth

Posted: Thu Nov 16, 2023 8:53 pm

Quote:
Is there a way to check what Firefox is using for rendering ?

Code:
about:support

e.g.
Code:
Window Protocol   wayland

that's what i get on my only hyprland system - xwayland disabled.
Quote:
One thing I remembered after your post is that I added the hwaccel USE flag to Firefox

i don't use the hwaccel USE flag! - no glitches - no freezes, but i use a very old RX590

have you tried with other desktop environment, e.g. gnome or maybe hyprland ?

8)
_________________
" Reality is an illusion that arises from a lack of honest communication "
---
" Man is curious by nature; what remains in the end is greed "

MorgothSauron
Tux's lil' helper

Joined: 24 Sep 2020
Posts: 75

Posted: Mon Nov 20, 2023 4:51 pm

Code:
about:support


I'm using X11 (Window Protocol = x11) since I built this system 2 years ago. I'll check if I can "transition" to wayland.

Quote:
i don't use hwaccel USE flag! - no glitches - no freezes


I only experience rare display freezes. I know I'm playing with words, but what is being displayed is glitch-free. It just stops refreshing. No screen tearing, no visual artifacts.

Quote:
have you tried with other desktop environment, e.g. gnome or maybe hyprland ?


I haven't tried another desktop environment. I have only had KDE Plasma installed from the beginning and I'd like to keep it that way. I only install what I really need and I usually do a test installation in a VM first (yes, I have a Gentoo VM that I maintain separately).

I'll try removing the hwaccel flag from Firefox and see how it goes in the long run. I'll post back when I have something new to share.

logrusx
Veteran

Joined: 22 Feb 2018
Posts: 1535

Posted: Mon Nov 20, 2023 6:18 pm    Post subject: Re: AMD GPU RX 6800 Random display freeze

MorgothSauron wrote:

- I cannot switch to a different console (e.g. Ctrl+Alt+F1)


Try pressing ALT+PrtSc/SysRq+R prior to attempting to switch to a different VT.

Best Regards,
Georgi

CooSee
Veteran

Joined: 20 Nov 2004
Posts: 1441
Location: Earth

Posted: Thu Nov 23, 2023 7:56 pm

@MorgothSauron

if it's not much to ask - can you try Gentoo Live Gui - to make sure that this is not a hardware issue!

8)

MorgothSauron
Tux's lil' helper

Joined: 24 Sep 2020
Posts: 75

Posted: Wed Dec 06, 2023 6:22 pm

CooSee wrote:
@MorgothSauron

if it's not much to ask - can you try Gentoo Live Gui - to make sure that this is not a hardware issue!

8)


I'm not sure I understand how booting from a Live ISO will help identify a hardware issue. I could try to stress test the CPU or run a memory check (e.g. memtest). Not sure how to test the GPU.


The issue happened again last Sunday. I spent the whole day gaming on Windows (an old EA game that doesn't work at all on Linux) without any kind of issue. Sure, it's Windows, but it is the same hardware, except that Windows is booted from an external drive. In September / October I was playing Baldur's Gate 3 for hours on my Gentoo system without any issue. No frame drops, no freezes, no visual glitches. It just worked.

I booted Gentoo after my gaming session on Sunday and a freeze happened within an hour. Same pattern as before.

Kernel 6.5.13. Firefox ~amd64 without hwaccel. I removed the hwaccel flag about 2 weeks ago.

I still wasn't able to switch to a console. I noticed that some keyboard shortcuts were still working to some extent (e.g. increase / decrease volume).

Trying ALT+PrtSc/SysRq+R does nothing, even when the system is working normally. CONFIG_MAGIC_SYSRQ is enabled. Maybe I'm missing something here or I'm not doing it correctly.

I was able to play back the same video multiple times from start to finish without issue. I did this test 2 times in Firefox and 2 times with VLC (MP4 downloaded using youtube-dl).

Firefox itself was in a clean state. By this I mean that the browser cache, cookies, and history are cleared each time I quit Firefox. I had only two tabs open: YouTube and web mail.

It remains completely random, at least based on what I could find so far.

Not sure what else I can do besides testing the hardware. Also not sure it is a good idea to report a bug on the AMDGPU Gitlab considering it is unpredictable.

Edit:
Pressing ALT+PrtSc/SysRq+R does cause messages to be logged (dmesg), but nothing else.

Code:
[  +5.729001] sysrq: Keyboard mode set to system default
[  +2.238008] sysrq: Keyboard mode set to system default
[Dec 6 18:57] sysrq: Keyboard mode set to system default
[ +10.729998] sysrq: Keyboard mode set to system default
[Dec 6 18:59] sysrq: Keyboard mode set to system default
[Dec 6 19:00] sysrq: Emergency Sync
[  +7.821371] Emergency Sync complete
[Dec 6 19:25] sysrq: Keyboard mode set to system default
[  +6.525996] sysrq: Keyboard mode set to system default
[ +11.579997] sysrq: Keyboard mode set to system default
[  +8.704000] sysrq: Keyboard mode set to system default
[  +9.415007] sysrq: Keyboard mode set to system default
[  +4.569995] sysrq: Keyboard mode set to system default
[Dec 6 19:27] sysrq: Keyboard mode set to system default
[ +21.819045] sysrq: Keyboard mode set to system default
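
Besides CONFIG_MAGIC_SYSRQ, the kernel also has a runtime mask in /proc/sys/kernel/sysrq that decides which SysRq functions are allowed. The "Keyboard mode set to system default" and "Emergency Sync" lines above suggest that at least the keyboard and sync functions are permitted here. Below is a small sketch that decodes the mask, using the bit values from the kernel's SysRq documentation; the helper names `decode_sysrq_mask` and `read_sysrq_mask` are made up for this example.

```python
from typing import List, Optional

# Bit values as listed in the kernel's admin-guide/sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq_mask(mask: int) -> List[str]:
    """Translate a kernel.sysrq value into the list of allowed functions."""
    if mask == 0:
        return []                      # SysRq disabled completely
    if mask == 1:
        return ["all functions"]       # exactly 1 enables everything
    return [name for bit, name in SYSRQ_BITS.items() if mask & bit]


def read_sysrq_mask(path: str = "/proc/sys/kernel/sysrq") -> Optional[int]:
    """Read the current mask, or return None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


mask = read_sysrq_mask()
if mask is None:
    print("could not read the SysRq mask")
else:
    print(f"kernel.sysrq = {mask}:", ", ".join(decode_sysrq_mask(mask)) or "disabled")
```

SysRq+R belongs to the keyboard group (bit 4) and SysRq+S to the sync group (bit 16), so a mask that lacks those bits would explain a key combination that does nothing at all.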

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21635

Posted: Wed Dec 06, 2023 6:49 pm

MorgothSauron wrote:
CooSee wrote:
if it's not much to ask - can you try Gentoo Live Gui - to make sure that this is not a hardware issue!

I'm not sure I understand how booting from a Live ISO will help identify a hardware issue. I could try to stress test the CPU or run a memory check (e.g. memtest). Not sure how to test the GPU.

I believe CooSee wanted to prove it was not a hardware issue, by having you run from presumed-good software. If the problem ceased to occur when using the presumed-good ISO, that would suggest your installed Gentoo system is at fault. If the problem persisted even in the ISO, that would suggest the hardware is at fault. The next paragraph of your response looks to me like an equivalent test. The general successes with Windows and the October session on Gentoo suggest that the hardware is not fundamentally broken. I cannot offer advice on how to debug the software though.

MorgothSauron
Tux's lil' helper

Joined: 24 Sep 2020
Posts: 75

Posted: Wed Dec 06, 2023 7:02 pm

Hu wrote:
MorgothSauron wrote:
CooSee wrote:
if it's not much to ask - can you try Gentoo Live Gui - to make sure that this is not a hardware issue!

I'm not sure I understand how booting from a Live ISO will help identify a hardware issue. I could try to stress test the CPU or run a memory check (e.g. memtest). Not sure how to test the GPU.

I believe CooSee wanted to prove it was not a hardware issue, by having you run from presumed-good software. If the problem ceased to occur when using the presumed-good ISO, that would suggest your installed Gentoo system is at fault. If the problem persisted even in the ISO, that would suggest the hardware is at fault. The next paragraph of your response looks to me like an equivalent test. The general successes with Windows and the October session on Gentoo suggest that the hardware is not fundamentally broken. I cannot offer advice on how to debug the software though.


Understood. It didn't cross my mind that the point would be to check the software stack. I could try doing this for a while. The main "issue" is the unpredictability. Using a Live ISO would be great if I had a way to trigger the problem, which is not the case at the moment.

In the meantime I will investigate to understand why ALT+PrtSc/SysRq+R doesn't seem to do anything special.

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21635

Posted: Wed Dec 06, 2023 7:16 pm

SysRq+R takes the keyboard out of raw mode (the mode Xorg puts it in), so that Alt+F1 is processed directly by the kernel (and switches you to tty1) even if Xorg is frozen. However, SysRq+R only changes the keyboard mode. You still need to press keys the kernel handles to get a useful result afterward. Once the display hangs, press SysRq+R, then try different ttys to see which, if any, you can switch to. I suggest this since your first attempt might be for the tty on which Xorg is running, in which case getting no result is expected.

logrusx
Veteran

Joined: 22 Feb 2018
Posts: 1535

Posted: Wed Dec 06, 2023 8:09 pm

@Hu, it is CTRL+ALT+F1 :)

@MorgothSauron, when my system starts showing artifacts, it's usually after wake-up, and switching through a framebuffer console to the main graphical console (where the login screen is, usually #1) and back to my session (usually the next one, in my case #2) fixes the issue for me. It goes through some kind of graphics re-initialization. However, sometimes the keyboard is grabbed by X/Wayland and doesn't respond, so I first need to take the keyboard out of raw mode so that, as Hu mentioned, the kernel processes those key presses directly.

If it works for you, you'll at least be able to collect logs.

Best Regards,
Georgi

Hu
Moderator

Joined: 06 Mar 2007
Posts: 21635

Posted: Wed Dec 06, 2023 8:46 pm

logrusx wrote:
@Hu, it is CTRL+ALT+F1 :)

When Xorg is in control, yes. When the kernel is in control, plain Alt+F1 will suffice.

logrusx
Veteran

Joined: 22 Feb 2018
Posts: 1535

Posted: Thu Dec 07, 2023 7:50 am

Hu wrote:
logrusx wrote:
@Hu, it is CTRL+ALT+F1 :)
When Xorg is in control, yes. When the kernel is in control, plain Alt+F1 will suffice.


:oops: Didn't know that, thanks : )

MorgothSauron
Tux's lil' helper

Joined: 24 Sep 2020
Posts: 75

Posted: Wed Dec 13, 2023 8:02 pm

So it happened again. Same pattern.

Kernel 6.5.13-r1

Pressing ALT+PrtSc/SysRq+R didn't help. I could see things being logged in /var/log/messages, but I still couldn't switch to a console.

Code:
Dec 13 20:14:57 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:15:48 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:16:09 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:18:51 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:19:31 myhost kernel: sysrq: Keyboard mode set to system default


Luckily I could connect using SSH to try to gather some data. I still had to reset the power because I couldn't do a clean poweroff (SSH disconnected, no display, and the machine still responded to ping).

These messages appeared in the output of dmesg (I later found them in /var/log/kern.log as well):

Code:
amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:03:00.0: [drm] *ERROR* [CONNECTOR:99:DP-1] commit wait timed out
amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out


I searched the previous kern.log files for one of these lines and found the same message going back to September, when this issue started to happen.

Code:
Sep 29 18:56:54 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct  1 19:04:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct  1 19:04:41 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct  1 19:05:21 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct  1 19:06:01 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 13 19:38:37 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 13 19:39:44 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 31 20:46:28 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 31 20:47:39 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:48:17 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:48:47 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:49:48 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:50:28 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:50:58 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:51:38 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:52:08 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:52:38 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:53:08 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:53:38 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec  3 19:29:59 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec  3 19:31:15 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:15:19 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:15:59 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:16:39 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:17:19 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:17:49 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:18:29 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:19:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:19:30 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:20:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out


A freeze happened on each of the days on which that message was logged.
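
To line the error dates up against my own notes, the messages can be grouped per day. This is only a sketch under the assumption that the log uses the classic syslog prefix shown above ("Mon DD HH:MM:SS host ..."); `errors_per_day` is a name I made up for it.

```python
from collections import Counter
from typing import Iterable

MARKER = "commit wait timed out"


def errors_per_day(lines: Iterable[str], marker: str = MARKER) -> Counter:
    """Count lines containing `marker`, keyed by syslog date such as "Oct 1"."""
    counts: Counter = Counter()
    for line in lines:
        if marker not in line:
            continue
        fields = line.split()          # split() also copes with "Oct  1"
        if len(fields) >= 2:
            counts[f"{fields[0]} {fields[1]}"] += 1
    return counts


# Demo on three lines copied from the excerpt above.
SAMPLE = [
    "Sep 29 18:56:54 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out",
    "Oct  1 19:04:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out",
    "Oct  1 19:04:41 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out",
]
for day, n in errors_per_day(SAMPLE).items():
    print(f"{day}: {n} timeout message(s)")
```

Passing an open kern.log instead of SAMPLE should give one entry per freeze day, which can then be compared with the dates written down by hand.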

Similar messages were logged before that, but they are not marked *ERROR* and they occurred every day without any display issue:

Code:
Sep  1 18:50:49 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  2 10:07:42 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  3 10:21:38 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  4 17:37:54 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  5 17:43:23 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  6 09:52:37 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  7 18:43:41 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  8 19:06:20 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep  9 18:12:06 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 10 10:50:25 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 11 17:47:41 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 11 18:08:34 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 12 18:09:13 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 12 19:06:17 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 13 17:58:06 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 13 19:47:15 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 14 17:55:44 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 15 18:27:58 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 16 09:34:54 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 17 11:30:27 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 18 17:29:53 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 19 17:42:08 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update


I didn't switch to kernel 6.6 yet because genkernel is broken with that kernel version and I need an initramfs (luks, btrfs). I need to "test" dracut to make sure it can create the correct initramfs to boot my system.

The good thing is that this time I found some new messages. I'll have to investigate a little bit using the latest messages I could find.

CooSee
Veteran

Joined: 20 Nov 2004
Posts: 1441
Location: Earth

Posted: Wed Dec 13, 2023 9:25 pm

Code:
amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:03:00.0: [drm] *ERROR* [CONNECTOR:99:DP-1] commit wait timed out
amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out

seems to be a very old, recurring amdgpu bug!

https://gitlab.freedesktop.org/drm/amd/-/issues/?sort=created_date&state=opened&search=%5Bdrm%5D%20%2aERROR%2a%20flip_done%20timed%20out&first_page_size=20

https://gitlab.freedesktop.org/drm/amd/-/issues/2950

https://www.phoronix.com/news/AMDGPU-Fix-For-5.19-Bug

8)