MorgothSauron Tux's lil' helper
Joined: 24 Sep 2020 Posts: 75
|
Posted: Mon Nov 13, 2023 6:38 pm Post subject: AMD GPU RX 6800 Random display freeze |
|
|
Hello,
Since early September I have been experiencing random display freezes with my AMD GPU. I say it is related to the GPU because /var/log/messages contains kernel messages related to amdgpu. It has happened at least 5 times since I started troubleshooting. It might be pure coincidence, but the issue started around the time I started to use kernel 6.5.
I was able to identify a pattern for this issue, but I'm not able to trigger the problem on purpose; I have to wait for the issue to happen to collect any data for troubleshooting.
The freeze follows this pattern:
- Firefox (~amd64) is playing a Youtube video
- The video freezes like it is buffering but the audio is still working
- The display is not refreshing anymore. I can't Alt+Tab and the mouse cursor is not moving.
- I cannot switch to a different console (e.g. Ctrl+Alt+F1)
- The audio stops after about 5 minutes and the screen goes black with a non-blinking cursor at the top left. No text at all.
At this point I have no other option than a power reset.
I was not sure if the system was completely frozen or not, so I enabled SSH to give myself a chance to attempt a recovery (e.g. a clean reboot).
I was able to connect with SSH the next time the issue happened, so at least the system was still working to some extent. I tried a reboot but it didn't work. My SSH session terminated and my PC was still responding to ping after 5 minutes. I had no way to know what was happening and had to force a power reset. I know the ping response was not from a system in the boot process because I have LUKS enabled and I would have had to enter a passphrase.
It never happened while playing a game on Linux. I do get a driver timeout from time to time when I start a specific game on Windows, but this could be a problem with the game itself and not the GPU.
I tried to search on different forums and I couldn't find much information using some keywords from the log.
I did find this https://bugzilla.kernel.org/show_bug.cgi?id=201957 but it didn't help. With kernel 6.5 the default for amdgpu.mcbp is indeed -1 compared to 6.4 where the default is 0. I tried to set the value to 0 but I still encountered the same issue. I know this post is for a different issue, but I decided to give it a try anyway.
I created /etc/modprobe/amdgpu.conf to configure mcbp=0
Code: | #
options amdgpu mcbp=0
# |
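A quick way to confirm that the option actually took effect is to read the value the loaded module reports. modprobe normally reads option files from /etc/modprobe.d/, so a file placed elsewhere may be ignored. The following is only a minimal sketch: it assumes amdgpu exposes mcbp under /sys/module/amdgpu/parameters/, and the helper name read_module_param is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return the live value of a kernel module parameter, or None.

    None means the module is not loaded or does not expose the parameter.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("amdgpu", "mcbp")
    if value is None:
        print("amdgpu.mcbp is not exposed (module not loaded or parameter hidden)")
    else:
        print(f"amdgpu.mcbp = {value}")
```

If this prints -1 after a reboot, the option file was probably not picked up.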
I searched AMD GPU Gitlab (https://gitlab.freedesktop.org/drm/amd/) without luck. I'm checking here before trying to open a problem there.
The PC itself is located in a well-ventilated space. I clean the inside of the case with a dust blower every month, and I take care not to let any fan spin while I use it. I checked that the GPU is properly seated in the PCI slot. The GPU fans are working and speed up under load. I didn't notice any temperature issue using nvtop. This is a brand new GPU purchased from a reputable store in April 2023.
No overclocking (CPU and GPU).
/var/log/messages contains the following messages:
Code: | kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 2 PID: 4758 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8242 amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables bpfilter bridge stp llc vfat fat joydev snd_hda_codec_realtek snd_hda_codec_generic amdgpu snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils wireguard snd_soc_skl libchacha20poly1305 snd_soc_sst_ipc chacha_x86_64 snd_soc_sst_dsp poly1305_x86_64 snd_hda_ext_core ip6_udp_tunnel snd_soc_acpi_intel_match udp_tunnel snd_soc_acpi ledtrig_audio ipv6 snd_soc_core snd_hda_codec_hdmi snd_compress snd_pcm_dmaengine ac97_bus crc_ccitt drm_suballoc_helper intel_rapl_msr amdxcp snd_hda_intel intel_rapl_common mfd_core x86_pkg_temp_thermal snd_intel_dspcfg drm_buddy curve25519_x86_64 intel_powerclamp gpu_sched libcurve25519_generic snd_hda_codec libchacha crct10dif_pclmul
kernel: drm_display_helper snd_hda_core ghash_clmulni_intel it87 cec snd_hwdep sha512_ssse3 drm_ttm_helper hwmon_vid snd_pcm ee1004 ttm rapl intel_cstate drm_kms_helper mei_hdcp snd_timer wmi_bmof intel_wmi_thunderbolt coretemp i2c_i801 intel_uncore pcspkr efi_pstore drm i2c_smbus snd mei_me hid_logitech_hidpp soundcore mei video backlight acpi_pad wmi intel_pch_thermal efivarfs dm_crypt trusted asn1_encoder dm_mod hid_logitech_dj sr_mod sd_mod cdrom crc32_pclmul xhci_pci crc32c_intel e1000e ahci xhci_hcd libahci
kernel: CPU: 2 PID: 4758 Comm: X Not tainted 6.5.11-gentoo-x86_64 #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO/Z390 AORUS PRO-CF, BIOS F12 11/05/2021
kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: Code: 40 fd ff ff 48 8d 95 94 fd ff ff 48 8b 85 50 fd ff ff 48 8b b6 50 01 00 00 48 8b b8 78 f4 03 00 e8 11 88 20 00 e9 87 f9 ff ff <0f> 0b e9 44 f0 ff ff 49 8b 4d 28 49 39 4b 28 0f 95 85 a0 fc ff ff
kernel: RSP: 0018:ffff986ec25ab8c8 EFLAGS: 00010002
kernel: RAX: 0000000000000286 RBX: 0000000000000286 RCX: 0000000000000019
kernel: RDX: 0000000000000001 RSI: 0000000000000297 RDI: 0000000000000002
kernel: RBP: ffff986ec25abc60 R08: 0000000000000001 R09: 0000000000000000
kernel: R10: ffff8b1f40795118 R11: ffff986ec25ab82c R12: ffff8b1f40795000
kernel: R13: ffff8b1f07d80010 R14: ffff8b218a2c3400 R15: 0000000000000000
kernel: FS: 00007fea15738900(0000) GS:ffff8b269dc80000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fc645976b6c CR3: 00000001068ca003 CR4: 00000000003706e0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel: <TASK>
kernel: ? amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: ? __warn+0x7d/0x130
kernel: ? amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: ? report_bug+0x16d/0x1a0
kernel: ? handle_bug+0x3a/0x70
kernel: ? exc_invalid_op+0x13/0x60
kernel: ? asm_exc_invalid_op+0x16/0x20
kernel: ? amdgpu_dm_atomic_commit_tail+0x3884/0x3930 [amdgpu]
kernel: ? amdgpu_dm_atomic_commit_tail+0x28bc/0x3930 [amdgpu]
kernel: ? __wake_up_klogd.part.0+0x3c/0x60
kernel: ? vprintk_emit+0x17f/0x200
kernel: commit_tail+0x91/0x130 [drm_kms_helper]
kernel: drm_atomic_helper_commit+0x116/0x140 [drm_kms_helper]
kernel: drm_atomic_commit+0x93/0xc0 [drm]
kernel: ? __pfx___drm_printfn_info+0x10/0x10 [drm]
kernel: drm_mode_obj_set_property_ioctl+0x146/0x3a0 [drm]
kernel: ? __pfx_drm_mode_obj_set_property_ioctl+0x10/0x10 [drm]
kernel: drm_ioctl_kernel+0xbe/0x160 [drm]
kernel: drm_ioctl+0x258/0x4d0 [drm]
kernel: ? __pfx_drm_mode_obj_set_property_ioctl+0x10/0x10 [drm]
kernel: amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
kernel: __x64_sys_ioctl+0x90/0xd0
kernel: do_syscall_64+0x38/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
kernel: RIP: 0033:0x7fea15cbe3fb
kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
kernel: RSP: 002b:00007fff02c67fd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fea15cbe3fb
kernel: RDX: 00007fff02c68060 RSI: 00000000c01864ba RDI: 000000000000000c
kernel: RBP: 00007fff02c68060 R08: 0000000000000093 R09: 0000000000001000
kernel: R10: 000000000ffaf041 R11: 0000000000000246 R12: 00000000c01864ba
kernel: R13: 000000000000000c R14: 000055c1aad58460 R15: 0000000000000fff
kernel: </TASK>
kernel: ---[ end trace 0000000000000000 ]--- |
That specific block repeats multiple times with little difference. It appeared 20 times the last time the issue happened. I can provide a full copy of the log if necessary.
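Counting the repeated blocks by hand is error-prone, so here is a rough sketch of how the count could be automated. It assumes the log keeps the "WARNING: CPU: ... PID: ... at <file>:<line> <symbol>" header format shown above; count_warnings is a hypothetical helper, not part of any existing tool.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the WARNING header line of a kernel trace and captures the
# "function+offset/size" symbol that follows the source location.
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at \S+ (\S+)")


def count_warnings(lines: Iterable[str]) -> Counter:
    """Count kernel WARNING blocks per reporting symbol."""
    counts: Counter = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It can be fed an open log file directly, for example count_warnings(open("/var/log/messages", errors="replace")), and the resulting counter shows how many times each warning site fired.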
System details (inxi -F)
Code: | System:
Host: morgoth Kernel: 6.5.11-gentoo-x86_64 arch: x86_64 bits: 64
Desktop: KDE Plasma v: 5.27.8 Distro: Gentoo Base System release 2.14
Machine:
Type: Desktop System: Gigabyte product: Z390 AORUS PRO v: N/A
serial: <superuser required>
Mobo: Gigabyte model: Z390 AORUS PRO-CF serial: <superuser required>
UEFI: American Megatrends v: F12 date: 11/05/2021
CPU:
Info: 8-core model: Intel Core i7-9700K bits: 64 type: MCP cache: L2: 2 MiB
Speed (MHz): avg: 800 min/max: 800/4900 cores: 1: 800 2: 800 3: 800 4: 800
5: 800 6: 800 7: 800 8: 800
Graphics:
Device-1: AMD Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] driver: amdgpu
v: kernel
Display: x11 server: X.org v: 1.21.1.9 with: Xwayland v: 23.2.2 driver: X:
loaded: amdgpu unloaded: modesetting,radeon dri: radeonsi gpu: amdgpu
resolution: 2560x1440~144Hz
API: OpenGL v: 4.6 Mesa 23.1.8 renderer: AMD Radeon RX 6800 (navi21 LLVM
16.0.6 DRM 3.54 6.5.11-gentoo-x86_64)
Audio:
Device-1: Intel Cannon Lake PCH cAVS driver: snd_hda_intel
Device-2: AMD Navi 21/23 HDMI/DP Audio driver: snd_hda_intel
API: ALSA v: k6.5.11-gentoo-x86_64 status: kernel-api
Server-1: PulseAudio v: 16.1 status: active
Network:
Device-1: Intel Ethernet I219-V driver: e1000e
IF: eno1 state: up speed: 1000 Mbps duplex: full mac: 18:c0:4d:2d:b3:7e
IF-ID-1: virbr0 state: down mac: 52:54:00:0a:95:c4
Drives:
Local Storage: total: 4.99 TiB used: 3.55 TiB (71.1%)
ID-1: /dev/nvme0n1 vendor: LDLC model: F8+M.2 480 size: 447.13 GiB
ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 970 EVO Plus 1TB
size: 931.51 GiB
ID-3: /dev/sda vendor: Western Digital model: WD40EZRZ-22GXCB0
size: 3.64 TiB
Partition:
ID-1: / size: 844.04 GiB used: 467.86 GiB (55.4%) fs: btrfs dev: /dev/dm-0
ID-2: /boot size: 487.2 MiB used: 142.7 MiB (29.3%) fs: ext4
dev: /dev/nvme1n1p5
ID-3: /home size: 844.04 GiB used: 467.86 GiB (55.4%) fs: btrfs
dev: /dev/dm-0
ID-4: /var size: 844.04 GiB used: 467.86 GiB (55.4%) fs: btrfs
dev: /dev/dm-0
Swap:
ID-1: swap-1 type: file size: 7.98 GiB used: 0 KiB (0.0%)
file: /var/swapfile
Sensors:
System Temperatures: cpu: 30.0 C pch: 46.0 C mobo: N/A gpu: amdgpu
temp: 44.0 C
Fan Speeds (RPM): cpu: 811 fan-2: 0 fan-3: 0 gpu: amdgpu fan: 0
Info:
Processes: 359 Uptime: 21m Memory: available: 31.27 GiB
used: 5.54 GiB (17.7%) Shell: Zsh inxi: 3.3.27 |
System information (neofetch --off)
Code: |
OS: Gentoo Linux x86_64
Host: Z390 AORUS PRO
Kernel: 6.5.11-gentoo-x86_64
Uptime: 42 mins
Packages: 1408 (emerge)
Shell: zsh 5.9
Resolution: 2560x1440
DE: Plasma 5.27.8
WM: KWin
Theme: Breeze Light [Plasma], Breeze [GTK2/3]
Icons: [Plasma], breeze [GTK2/3]
Terminal: kitty
CPU: Intel i7-9700K (8) @ 4.900GHz
GPU: AMD ATI Radeon RX 6800/6800 XT / 6900 XT
Memory: 4659MiB / 32024MiB |
I have the following firmware configured for the AMD GPU in /etc/portage/savedconfig/sys-kernel/linux-firmware-20231030. The sienna_cichlid files are for my current GPU, and I kept the navi14 files for my old GPU (AMD RX 5500 XT).
Code: | amdgpu/sienna_cichlid_vcn.bin
amdgpu/sienna_cichlid_ta.bin
amdgpu/sienna_cichlid_sos.bin
amdgpu/sienna_cichlid_smc.bin
amdgpu/sienna_cichlid_sdma.bin
amdgpu/sienna_cichlid_rlc.bin
amdgpu/sienna_cichlid_pfp.bin
amdgpu/sienna_cichlid_mec2.bin
amdgpu/sienna_cichlid_mec.bin
amdgpu/sienna_cichlid_me.bin
amdgpu/sienna_cichlid_dmcub.bin
amdgpu/sienna_cichlid_ce.bin
amdgpu/navi14_ta.bin
amdgpu/navi14_vcn.bin
amdgpu/navi14_sos.bin
amdgpu/navi14_smc.bin
amdgpu/navi14_sdma1.bin
amdgpu/navi14_sdma.bin
amdgpu/navi14_rlc.bin
amdgpu/navi14_pfp_wks.bin
amdgpu/navi14_pfp.bin
amdgpu/navi14_mec2_wks.bin
amdgpu/navi14_mec2.bin
amdgpu/navi14_mec_wks.bin
amdgpu/navi14_mec.bin
amdgpu/navi14_me_wks.bin
amdgpu/navi14_me.bin
amdgpu/navi14_gpu_info.bin
amdgpu/navi14_ce_wks.bin
amdgpu/navi14_ce.bin
amdgpu/navi14_asd.bin |
Any suggestions? |
|
jpsollie Apprentice
Joined: 17 Aug 2013 Posts: 291
|
Posted: Wed Nov 15, 2023 8:18 pm Post subject: |
|
|
MorgothSauron,
let's try to isolate the issue first:
Firefox may be using a software renderer and OpenGL / Vulkan to render the image,
or it may be using hardware video decoding. I think the former is true.
Can you download the video with a YouTube downloader and play it with e.g. VLC or MPV to see whether it works in a hardware-accelerated environment? |
|
MorgothSauron Tux's lil' helper
Joined: 24 Sep 2020 Posts: 75
|
Posted: Thu Nov 16, 2023 5:27 pm Post subject: |
|
|
jpsollie wrote: | MorgothSauron,
let's try to isolate the issue first:
Firefox may be using a software renderer and opengl / vulkan to render the image,
or may be using hardware video decoding. I think the former is true.
Can you use youtube downloader and play the video with eg VLC or MPV to see whether it works in a hardware accelerated environment? |
Is there a way to check what Firefox is using for rendering?
One thing I remembered after your post is that I added the hwaccel USE flag to Firefox back in May. That's still a few months before the first appearance of the issue I currently have.
I will try your suggestion and download the YouTube video for local playback with VLC or MPV. However, this approach assumes that a given video would trigger the problem each time.
To be honest, I never tried to play the same video a second time to see what happens. I have nothing to lose by trying your suggestion. It can only provide more information to continue troubleshooting.
Right now the issue is still unpredictable. I watch YouTube for a few hours every day. The issue can take days or even weeks to happen again. I know because I write down when it happens and I make a copy of /var/log/messages. |
|
CooSee Veteran
Joined: 20 Nov 2004 Posts: 1441 Location: Earth
|
Posted: Thu Nov 16, 2023 8:53 pm Post subject: |
|
|
Quote: | Is there a way to check what Firefox is using for rendering ? |
e.g.
Code: | Window Protocol wayland |
that's what i get on my only hyprland system - xwayland disabled.
Quote: | One thing I remembered after your post is that I added the hwaccel USE flag to Firefox |
i don't use the hwaccel USE flag! - no glitches - no freezes, but i use a very old RX590
have you tried another desktop environment, e.g. gnome or maybe hyprland?
_________________ " Reality is an illusion that arises from a lack of honest communication "
---
" Man is curious by nature; what remains in the end is greed " |
|
MorgothSauron Tux's lil' helper
Joined: 24 Sep 2020 Posts: 75
|
Posted: Mon Nov 20, 2023 4:51 pm Post subject: |
|
|
I've been using X11 (Window Protocol = x11) since I built this system 2 years ago. I'll check if I can "transition" to Wayland.
Quote: | i don't use hwaccel USE flag! - no glitches - no freezes |
I only experience rare display freezes. I know I'm playing with words, but what is being displayed is glitch-free. It just stops refreshing. No screen tearing, no visual artifacts.
Quote: | have you tried with other desktop environment, e.g. gnome or maybe hyprland ? |
I haven't tried another desktop environment. I have only had KDE Plasma installed from the beginning and I'd like to keep it that way. I only install what I really need and I usually do a test installation in a VM first (yes, I have a Gentoo VM that I maintain separately).
I'll try removing the hwaccel USE flag from Firefox and see how it goes in the long run. I'll post back when I have something new to share. |
|
logrusx Veteran
Joined: 22 Feb 2018 Posts: 1535
|
Posted: Mon Nov 20, 2023 6:18 pm Post subject: Re: AMD GPU RX 6800 Random display freeze |
|
|
MorgothSauron wrote: | - I cannot switch to a different console (e.g. Ctrl+Alt+F1) |
Try pressing ALT+PrtSc/SysRq+R prior to attempting to switch to a different VT.
Best Regards,
Georgi |
|
CooSee Veteran
Joined: 20 Nov 2004 Posts: 1441 Location: Earth
|
Posted: Thu Nov 23, 2023 7:56 pm Post subject: |
|
|
@MorgothSauron
if it's not too much to ask - can you try the Gentoo LiveGUI image - to make sure that this is not a hardware issue! |
|
MorgothSauron Tux's lil' helper
Joined: 24 Sep 2020 Posts: 75
|
Posted: Wed Dec 06, 2023 6:22 pm Post subject: |
|
|
CooSee wrote: | @MorgothSauron
if it's not much to ask - can you try Gentoo Live Gui - to get sure that this is not an Hardware issue !
|
I'm not sure I understand how booting from a Live ISO will help identify a hardware issue. I could try to stress test the CPU or run a memory check (e.g. memtest). I'm not sure how to test the GPU.
The issue happened again last Sunday. I spent the whole day gaming on Windows (an old EA game that doesn't work at all on Linux) without any kind of issue. Sure, it's Windows, but it is the same hardware, except that Windows is booted from an external drive. In September / October I was playing Baldur's Gate 3 for hours on my Gentoo system without any issue. No frame drops, no freezes, no visual glitches. It just worked.
I booted Gentoo after my gaming session on Sunday and a freeze happened within an hour. Same pattern as before.
Kernel 6.5.13. Firefox ~amd64 without hwaccel. I removed the hwaccel flag about 2 weeks ago.
I still wasn't able to switch to a console. I noticed that some keyboard shortcuts were working to some level (e.g. increase / decrease volume).
Trying ALT+PrtSc/SysRq+R does nothing, even when the system is working normally. CONFIG_MAGIC_SYSRQ is enabled. Maybe I'm missing something here or I'm not doing it correctly.
I was able to play back the same video multiple times from start to finish without issue. I did this test 2 times in Firefox and 2 times with VLC (MP4 downloaded using youtube-dl).
Firefox itself was in a clean state. By this I mean that the browser cache, cookies, history is cleared each time I quit Firefox. I had only two tabs opened: Youtube and web mail.
It remains completely random, at least based on what I could find so far.
Not sure what else I can do besides testing the hardware. Also not sure it is a good idea to report a bug on the AMDGPU Gitlab considering it is unpredictable.
Edit:
Pressing ALT+PrtSc/SysRq+R does cause messages to be logged (dmesg), but nothing else.
Code: | [ +5.729001] sysrq: Keyboard mode set to system default
[ +2.238008] sysrq: Keyboard mode set to system default
[Dec 6 18:57] sysrq: Keyboard mode set to system default
[ +10.729998] sysrq: Keyboard mode set to system default
[Dec 6 18:59] sysrq: Keyboard mode set to system default
[Dec 6 19:00] sysrq: Emergency Sync
[ +7.821371] Emergency Sync complete
[Dec 6 19:25] sysrq: Keyboard mode set to system default
[ +6.525996] sysrq: Keyboard mode set to system default
[ +11.579997] sysrq: Keyboard mode set to system default
[ +8.704000] sysrq: Keyboard mode set to system default
[ +9.415007] sysrq: Keyboard mode set to system default
[ +4.569995] sysrq: Keyboard mode set to system default
[Dec 6 19:27] sysrq: Keyboard mode set to system default
[ +21.819045] sysrq: Keyboard mode set to system default
|
|
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21635
|
Posted: Wed Dec 06, 2023 6:49 pm Post subject: |
|
|
MorgothSauron wrote: | CooSee wrote: | if it's not much to ask - can you try Gentoo Live Gui - to get sure that this is not an Hardware issue ! |
I'm not sure to understand how booting from a Live ISO will help identify a hardware issue. I could try to stress test the CPU or run memory check (e.g. memtest). Not sure how to test the GPU. |
I believe CooSee wanted to prove it was not a hardware issue, by having you run from presumed-good software. If the problem ceased to occur when using the presumed-good ISO, that would suggest your installed Gentoo system is at fault. If the problem persisted even in the ISO, that would suggest the hardware is at fault.
The next paragraph of your response looks to me like an equivalent test. The general successes with Windows and the October session on Gentoo suggest that the hardware is not fundamentally broken. I cannot offer advice on how to debug the software though. |
|
MorgothSauron Tux's lil' helper
Joined: 24 Sep 2020 Posts: 75
|
Posted: Wed Dec 06, 2023 7:02 pm Post subject: |
|
|
Hu wrote: | MorgothSauron wrote: | CooSee wrote: | if it's not much to ask - can you try Gentoo Live Gui - to get sure that this is not an Hardware issue ! |
I'm not sure to understand how booting from a Live ISO will help identify a hardware issue. I could try to stress test the CPU or run memory check (e.g. memtest). Not sure how to test the GPU. |
I believe CooSee wanted to prove it was not a hardware issue, by having you run from presumed-good software. If the problem ceased to occur when using the presumed-good ISO, that would suggest your installed Gentoo system is at fault. If the problem persisted even in the ISO, that would suggest the hardware is at fault.
The next paragraph of your response looks to me like an equivalent test. The general successes with Windows and the October session on Gentoo suggest that the hardware is not fundamentally broken. I cannot offer advice on how to debug the software though. |
Understood. It didn't cross my mind that it would be to check the software stack. I could try doing this for a while. The main "issue" is the unpredictability. Using a Live ISO would be great if I had a way to trigger the problem, but that's not the case at the moment.
In the meantime I will investigate to understand why ALT+PrtSc/SysRq+R doesn't seem to do anything special. |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21635
|
Posted: Wed Dec 06, 2023 7:16 pm Post subject: |
|
|
SysRq+R takes the keyboard out of raw mode (the kernel logs this as "Keyboard mode set to system default"), so that Alt+F1 is processed directly by the kernel and switches you to tty1 even if Xorg is frozen. However, SysRq+R only changes the keyboard mode. You still need to press keys the kernel handles to get a useful result afterward. Once the display hangs, press SysRq+R, then try different ttys to see which, if any, you can switch to. I suggest this since your first attempt might be for the tty on which Xorg is running, in which case getting no result is expected. |
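Whether the kernel honours SysRq+R at all depends on the kernel.sysrq bitmask in /proc/sys/kernel/sysrq, in addition to CONFIG_MAGIC_SYSRQ. The sketch below decodes that value using the bit meanings from the kernel's sysrq documentation; the function names decode_sysrq and unraw_allowed are made up for this example.

```python
from typing import List

# Bit values as listed in the kernel's admin-guide/sysrq.rst.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(mask: int) -> List[str]:
    """Translate a kernel.sysrq value into the SysRq functions it allows."""
    if mask == 0:
        return []  # SysRq disabled entirely
    if mask == 1:
        return list(SYSRQ_BITS.values())  # 1 means "all functions enabled"
    return [name for bit, name in SYSRQ_BITS.items() if mask & bit]


def unraw_allowed(mask: int) -> bool:
    """True if SysRq+R (unraw) is permitted by this mask."""
    return mask == 1 or bool(mask & 4)
```

For example, unraw_allowed(int(open("/proc/sys/kernel/sysrq").read())) reports whether SysRq+R should be accepted on the running system. The "Keyboard mode set to system default" lines in dmesg suggest it already is in this case.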
|
logrusx Veteran
Joined: 22 Feb 2018 Posts: 1535
|
Posted: Wed Dec 06, 2023 8:09 pm Post subject: |
|
|
@Hu, it is CTRL+ALT+F1 :)
@MorgothSauron, when my system starts artifacting, it's usually after wake-up, and switching to a framebuffer console, then to the main graphical console (where the login screen is, usually #1), and back to my session (usually the next one, in my case #2) fixes the issue for me. It goes through some kind of graphics re-initialization. However, sometimes the keyboard is grabbed by X/Wayland and doesn't respond, so I need to take the keyboard out of raw mode with SysRq+R so that, as Hu mentioned, the kernel processes those key combinations directly.
If it works for you, you'll at least be able to collect logs.
Best Regards,
Georgi |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21635
|
Posted: Wed Dec 06, 2023 8:46 pm Post subject: |
|
|
logrusx wrote: | @Hu, it is CTRL+ALT+F1 | When Xorg is in control, yes. When the kernel is in control, plain Alt+F1 will suffice. |
|
logrusx Veteran
Joined: 22 Feb 2018 Posts: 1535
|
Posted: Thu Dec 07, 2023 7:50 am Post subject: |
|
|
Hu wrote: | logrusx wrote: | @Hu, it is CTRL+ALT+F1 | When Xorg is in control, yes. When the kernel is in control, plain Alt+F1 will suffice. |
Didn't know that, thanks : ) |
|
MorgothSauron Tux's lil' helper
Joined: 24 Sep 2020 Posts: 75
|
Posted: Wed Dec 13, 2023 8:02 pm Post subject: |
|
|
So it happened again. Same pattern.
Kernel 6.5.13-r1
Pressing ALT+PrtSc/SysRq didn't help. I could see things being logged in /var/log/messages, but I still couldn't switch to a console.
Code: | Dec 13 20:14:57 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:15:48 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:16:09 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:18:51 myhost kernel: sysrq: Keyboard mode set to system default
Dec 13 20:19:31 myhost kernel: sysrq: Keyboard mode set to system default |
Luckily I could connect using SSH to try to gather some data. I still had to reset the power because I couldn't do a clean poweroff (SSH disconnected, no display, but the machine still responded to ping).
These messages appeared in the output of dmesg (I later found them in /var/log/kern.log as well):
Code: | amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:03:00.0: [drm] *ERROR* [CONNECTOR:99:DP-1] commit wait timed out
amdgpu 0000:03:00.0: [drm] *ERROR* flip_done timed out
amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out |
I searched for one of these lines in the previous kern.log files and found the same message going back to September, when this issue started to happen.
Code: | Sep 29 18:56:54 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 1 19:04:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 1 19:04:41 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 1 19:05:21 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 1 19:06:01 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 13 19:38:37 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 13 19:39:44 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 31 20:46:28 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Oct 31 20:47:39 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:48:17 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:48:47 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:49:48 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:50:28 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:50:58 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:51:38 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:52:08 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:52:38 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:53:08 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Nov 10 21:53:38 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 3 19:29:59 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 3 19:31:15 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:15:19 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:15:59 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:16:39 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:17:19 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:17:49 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:18:29 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:19:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:19:30 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out
Dec 13 20:20:00 myhost kernel: amdgpu 0000:03:00.0: [drm] *ERROR* [PLANE:65:plane-5] commit wait timed out |
A freeze happened on each of the days that message was logged.
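To keep track of freeze days without grepping by hand, the log can be summarised per day. This is only a sketch: it assumes the classic syslog prefix ("Dec 13 20:15:19 myhost kernel: ...") seen in the excerpts above, and freeze_days is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Syslog-style prefix ("Dec 13 20:15:19 host kernel: ...") followed by the
# amdgpu timeout text quoted in the post. Both the PLANE and CONNECTOR
# variants of "commit wait timed out" match.
TIMEOUT_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:"
    r".*\*ERROR\*.*commit wait timed out"
)


def freeze_days(lines: Iterable[str]) -> Counter:
    """Count 'commit wait timed out' errors per (month, day)."""
    days: Counter = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            days[(match["month"], int(match["day"]))] += 1
    return days
```

Calling freeze_days(open("/var/log/kern.log", errors="replace")) returns one entry per day with the number of timeouts logged, which can be compared against the handwritten list of freezes.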
Similar messages were logged before that, but they are not marked *ERROR* and they occurred every day without any display issue:
Code: | Sep 1 18:50:49 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 2 10:07:42 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 3 10:21:38 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 4 17:37:54 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 5 17:43:23 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 6 09:52:37 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 7 18:43:41 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 8 19:06:20 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 9 18:12:06 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 10 10:50:25 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 11 17:47:41 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 11 18:08:34 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 12 18:09:13 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 12 19:06:17 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 13 17:58:06 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 13 19:47:15 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 14 17:55:44 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 15 18:27:58 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 16 09:34:54 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 17 11:30:27 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 18 17:29:53 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update
Sep 19 17:42:08 myhost kernel: amdgpu 0000:03:00.0: amdgpu: [drm] [PLANE:65:plane-5] async flip with non-fast update |
I didn't switch to kernel 6.6 yet because genkernel is broken with that kernel version and I need an initramfs (luks, btrfs). I need to "test" dracut to make sure it can create the correct initramfs to boot my system.
The good thing is that this time I found some new messages. I'll have to investigate a little bit using the latest messages I could find. |
|
CooSee Veteran
Joined: 20 Nov 2004 Posts: 1441 Location: Earth
|