Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
[Solved] Possible XFS bug in kernel 5.15.32-r1 - how to dbg?
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware
View previous topic :: View next topic  
Author Message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Tue May 17, 2022 9:45 am    Post subject: [Solved] Possible XFS bug in kernel 5.15.32-r1 - how to dbg? Reply with quote

I was doing a benchmark of linux file systems and I'm running into some very strange behavior with XFS and kernel 5.15.32-r1. When I try to extract an uncompressed archive on this newly created (empty) XFS file system, after first low tens of gigabytes extracted the process hangs in iowait indefinitely. One CPU core is 100% occupied with iowait, the other CPU core is idle (on 2-core Intel Celeron G1610T).

No other filesystem tested before or after this (ext2, ext3, ext4, reiserfs3, jfs, nilfs2, f2fs, btrfs, zfs) on the same drive has shown this behavior. When I downgraded the kernel to 5.10.109, the XFS started working again. Kernel versions higher than 5.15 seem to be affected, I tried 5.17.1 and 5.17.6, they both hang up after a few minutes.

No error is reported to the system log or to dmesg when the process hangs. No error shows on stdout or stderr of the tar process either.

Next I want to address the elephant in the (other) room. This is not a SMR problem. None of the disks present in the test setup are SMR. All are CMR, and while they certainly are not brand new, they are all in good working condition.

I searched recent kernel bugs related to XFS and only found ones about mounting a corrupted XFS file system. Mine is not corrupted, it is in fact newly created.

Here are some screenshots when it hangs up: dstat, top

Is there any way I can be useful to the community? How can I trace the bug and where can I submit my findings?

UPDATE 2022-05-26: should be solved in future upstream kernel versions.


Last edited by araxon on Thu May 26, 2022 11:02 am; edited 1 time in total
Back to top
View user's profile Send private message
mike155
Advocate
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

PostPosted: Tue May 17, 2022 11:17 am    Post subject: Reply with quote

Which command do you run to extract the umcompressed archive?

What's the size of the uncompressed archive?

Do I understand correctly that the uncompressed archive is on another volume? And that your computer hangs while the extraction program writes files to the newly created XFS volume?

Which kernels do you use? Vanilla kernels? Or Gentoo kernels?
Back to top
View user's profile Send private message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Tue May 17, 2022 11:42 am    Post subject: Reply with quote

mike155 wrote:
Which command do you run to extract the umcompressed archive?


Code:
tar -x -f /var/backup/test.tar -C /mnt/test


mike155 wrote:
What's the size of the uncompressed archive?


The archive file size is 760.3 GiB. It hangs up when it has extracted somewhere between 15GiB and 32GiB (it varies from one run to another).

mike155 wrote:
Do I understand correctly that the uncompressed archive is on another volume? And that your computer hangs while the extraction program writes files to the newly created XFS volume?


Yes, the archive is on SW RAID1 (/dev/sda3 and /dev/sdb3) formatted to ext4 mounted as /
while the XFS is on single disk (/dev/sdc1) mounted as /mnt/test
Yes, the XFS volume is pristine, newly created.
The extracting process hangs in iowait, but the computer remains perfectly usable, I can even read and write the XFS volume itself from another terminal window.

mike155 wrote:
Which kernels do you use? Vanilla kernels? Or Gentoo kernels?


I'm using Gentoo kernels (sys-kernel/gentoo-sources).
Back to top
View user's profile Send private message
mike155
Advocate
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

PostPosted: Tue May 17, 2022 11:56 am    Post subject: Reply with quote

Retry with a vanilla kernel from https://www.kernel.org. If it happens there too, it might be time to contact the kernel maintainers. Take a look at this document, section "What information should I include when reporting a problem?"
Back to top
View user's profile Send private message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Tue May 17, 2022 12:05 pm    Post subject: Reply with quote

mike155 wrote:
Retry with a vanilla kernel from https://www.kernel.org. If it happens there too, it might be time to contact the kernel maintainers. Take a look at this document, section "What information should I include when reporting a problem?"


Thank you for your advice. I will try vanilla kernel and then see where to go from there.

Is sys-kernel/vanilla-kernel sufficient, or should I build it from git to be sure?
Back to top
View user's profile Send private message
mike155
Advocate
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

PostPosted: Tue May 17, 2022 12:29 pm    Post subject: Reply with quote

I think it's a good idea to use a kernel from http://www.kernel.org - or, even better, to clone the stable kernel git repository from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

Code:
mkdir /usr/src/linux-stable-git
cd /usr/src/linux-stable-git
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
git checkout linux-5.15.y

Make sure that you're not chasing a bug that was introduced by Gentoo.
Back to top
View user's profile Send private message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Tue May 17, 2022 3:05 pm    Post subject: Reply with quote

I compiled the latest kernel from the kernel.org (5.18.0-rc7), rebooted and tried the test again. Shortly after it is stuck in iowait, just 22GiB into the archive. So, the bug is not exclusive to Gentoo.
Back to top
View user's profile Send private message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Thu May 19, 2022 8:54 am    Post subject: Reply with quote

I compiled a more mainstream version of sys-kernel/gentoo-sources-5.15.32-r1 (removed my .config file and let genkernel to fill it with default options) and lo and behold, in this kernel I could not make it go stuck anymore.

The only difference in XFS options were ("my" config on the left, the genkernel default config on the right):
Code:
CONFIG_XFS_FS=y                                               | CONFIG_XFS_FS=m
# CONFIG_XFS_QUOTA is not set                                 | CONFIG_XFS_QUOTA=y
# CONFIG_XFS_POSIX_ACL is not set                             | CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set                                    | CONFIG_XFS_RT=y
# CONFIG_XFS_ONLINE_SCRUB is not set                          | CONFIG_XFS_ONLINE_SCRUB=y
                                                              > # CONFIG_XFS_ONLINE_REPAIR is not set

However, after I altered my old kernel config to contain these values and rebooting, I'm still triggering the bug. It may not be a XFS issue after all.
Back to top
View user's profile Send private message
mike155
Advocate
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

PostPosted: Thu May 19, 2022 9:17 am    Post subject: Reply with quote

Section 'What information should I include when reporting a problem?' recommends to run the commands below if the kernel hangs:
Code:
# echo w > /proc/sysrq-trigger
# dmesg

Are there any hung processes?
Back to top
View user's profile Send private message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Thu May 19, 2022 12:09 pm    Post subject: Reply with quote

mike155 wrote:
Section 'What information should I include when reporting a problem?' recommends to run the commands below if the kernel hangs:
Code:
# echo w > /proc/sysrq-trigger
# dmesg

Are there any hung processes?

Certainly!
Code:
[10271.222656] sysrq: Show Blocked State
[10271.222747] task:xfsaild/sdc1    state:D stack:    0 pid: 7298 ppid:     2 flags:0x00004000
[10271.222754] Call Trace:
[10271.222756]  <TASK>
[10271.222758]  __schedule+0x1c9/0x510
[10271.222767]  schedule+0x3f/0xa0
[10271.222771]  schedule_timeout+0x7c/0xf0
[10271.222777]  ? init_timer_key+0x30/0x30
[10271.222782]  xfsaild+0xcd/0x640
[10271.222788]  ? xfs_trans_ail_cursor_first+0x70/0x70
[10271.222792]  kthread+0x122/0x140
[10271.222798]  ? set_kthread_struct+0x30/0x30
[10271.222802]  ret_from_fork+0x22/0x30
[10271.222806]  </TASK>
[10271.222808] task:tar             state:D stack:    0 pid: 7303 ppid:  7302 flags:0x00004000
[10271.222812] Call Trace:
[10271.222812]  <TASK>
[10271.222814]  __schedule+0x1c9/0x510
[10271.222818]  ? lock_timer_base+0x5c/0x80
[10271.222822]  schedule+0x3f/0xa0
[10271.222826]  schedule_timeout+0x7c/0xf0
[10271.222830]  ? init_timer_key+0x30/0x30
[10271.222834]  io_schedule_timeout+0x47/0x70
[10271.222838]  congestion_wait+0x79/0xd0
[10271.222844]  ? wait_woken+0x60/0x60
[10271.222848]  xfs_buf_alloc_pages+0xd0/0x1b0
[10271.222853]  xfs_buf_get_map+0x259/0x300
[10271.222858]  xfs_trans_get_buf_map+0xa9/0x120
[10271.222863]  xfs_ialloc_inode_init+0x129/0x2d0
[10271.222867]  ? xfs_ialloc_ag_alloc+0x1df/0x630
[10271.222870]  xfs_ialloc_ag_alloc+0x1df/0x630
[10271.222874]  xfs_dialloc+0x1b4/0x720
[10271.222878]  xfs_create+0x1d7/0x450
[10271.222883]  xfs_generic_create+0x114/0x2d0
[10271.222888]  path_openat+0x510/0xe10
[10271.222892]  do_filp_open+0xad/0x150
[10271.222895]  ? xfs_blockgc_clear_iflag+0x93/0xb0
[10271.222898]  ? xfs_iunlock+0x52/0x90
[10271.222903]  do_sys_openat2+0x91/0x150
[10271.222908]  __x64_sys_openat+0x4e/0x90
[10271.222912]  do_syscall_64+0x43/0x90
[10271.222918]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[10271.222924] RIP: 0033:0x7fbe40f5a572
[10271.222928] RSP: 002b:00007ffc1ca6f2e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[10271.222931] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe40f5a572
[10271.222934] RDX: 00000000000809c1 RSI: 00005606893c4270 RDI: 0000000000000004
[10271.222936] RBP: 0000000000000180 R08: 000000000000c0c0 R09: 00005606893bf5f0
[10271.222938] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000000
[10271.222940] R13: 00000000000809c1 R14: 00005606893c4270 R15: 00005606881e4248
[10271.222942]  </TASK>


There is a slight variance when I run your command again after a while:

Code:
[10504.513978] sysrq: Show Blocked State
[10504.514069] task:xfsaild/sdc1    state:D stack:    0 pid: 7298 ppid:     2 flags:0x00004000
[10504.514076] Call Trace:
[10504.514078]  <TASK>
[10504.514080]  __schedule+0x1c9/0x510
[10504.514089]  schedule+0x3f/0xa0
[10504.514093]  schedule_timeout+0x7c/0xf0
[10504.514098]  ? init_timer_key+0x30/0x30
[10504.514103]  xfsaild+0xcd/0x640
[10504.514109]  ? xfs_trans_ail_cursor_first+0x70/0x70
[10504.514114]  kthread+0x122/0x140
[10504.514119]  ? set_kthread_struct+0x30/0x30
[10504.514124]  ret_from_fork+0x22/0x30
[10504.514128]  </TASK>
[10504.514129] task:tar             state:D stack:    0 pid: 7303 ppid:  7302 flags:0x00004000
[10504.514133] Call Trace:
[10504.514134]  <TASK>
[10504.514135]  __schedule+0x1c9/0x510
[10504.514138]  ? lock_timer_base+0x5c/0x80
[10504.514142]  schedule+0x3f/0xa0
[10504.514146]  schedule_timeout+0x7c/0xf0
[10504.514150]  ? init_timer_key+0x30/0x30
[10504.514154]  io_schedule_timeout+0x47/0x70
[10504.514157]  congestion_wait+0x79/0xd0
[10504.514164]  ? wait_woken+0x60/0x60
[10504.514168]  xfs_buf_alloc_pages+0xd0/0x1b0
[10504.514173]  xfs_buf_get_map+0x259/0x300
[10504.514177]  ? xfs_buf_item_init+0x150/0x160      <==== this line is new
[10504.514184]  xfs_trans_get_buf_map+0xa9/0x120
[10504.514189]  xfs_ialloc_inode_init+0x129/0x2d0
[10504.514193]  ? xfs_ialloc_ag_alloc+0x1df/0x630
[10504.514196]  xfs_ialloc_ag_alloc+0x1df/0x630
[10504.514200]  xfs_dialloc+0x1b4/0x720
[10504.514203]  xfs_create+0x1d7/0x450
[10504.514208]  xfs_generic_create+0x114/0x2d0
[10504.514213]  path_openat+0x510/0xe10
[10504.514217]  do_filp_open+0xad/0x150
[10504.514220]  ? xfs_blockgc_clear_iflag+0x93/0xb0
[10504.514224]  ? xfs_iunlock+0x52/0x90
[10504.514229]  do_sys_openat2+0x91/0x150
[10504.514234]  __x64_sys_openat+0x4e/0x90
[10504.514238]  do_syscall_64+0x43/0x90
[10504.514244]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[10504.514251] RIP: 0033:0x7fbe40f5a572
[10504.514254] RSP: 002b:00007ffc1ca6f2e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[10504.514258] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe40f5a572
[10504.514260] RDX: 00000000000809c1 RSI: 00005606893c4270 RDI: 0000000000000004
[10504.514262] RBP: 0000000000000180 R08: 000000000000c0c0 R09: 00005606893bf5f0
[10504.514264] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000000
[10504.514266] R13: 00000000000809c1 R14: 00005606893c4270 R15: 00005606881e4248
[10504.514269]  </TASK>
Back to top
View user's profile Send private message
araxon
Tux's lil' helper
Tux's lil' helper


Joined: 25 May 2011
Posts: 83

PostPosted: Tue May 24, 2022 8:17 am    Post subject: Reply with quote

To whom it may concern, this thread continues here: https://bugzilla.kernel.org/show_bug.cgi?id=216007
Back to top
View user's profile Send private message
mike155
Advocate
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

PostPosted: Sat Jun 04, 2022 7:15 pm    Post subject: Reply with quote

A similar bug or the same bug under ext4: https://forums.gentoo.org/viewtopic-t-1151138.html
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum