Gentoo Forums
btrfs raid1 + NVMe trouble
Atha
Apprentice


Joined: 22 Sep 2004
Posts: 229

Posted: Sun Jan 13, 2019 8:17 pm    Post subject: btrfs raid1 + NVMe trouble

Hi!

After a hardware and software upgrade I am seeing a strange error: booting now needs manual intervention, and during operation it gets even worse.

I have a desktop system, specs: AMD Ryzen 7 1800X, AMD X370 chipset (ASUS PRIME X370-Pro). The NVMe M.2 SSD is an Intel SSD 660p, available as /dev/nvme0n1p<1-n>. My root filesystem is btrfs. Everything ran stably.

I then added a SilverStone SST-ECM20 PCIe 3.0 x4 to M.2 (NGFF) expansion card with a Crucial P1 SSD installed. (The card's additional SATA M.2 slot is unused; it would not be routed through PCIe anyway, but would need an extra SATA connector on the card.)
After the upgrade everything worked at first. I moved Windows from the Intel 660p SSD to the new Crucial P1 SSD and changed the boot setup. Windows 10 runs stably, even for hours.

On Gentoo Linux everything looked stable as well, at first. I then created an additional swap partition on the Crucial P1, which is available as /dev/nvme1n1p<1-n>. I had this running only for a short period, but there seemed to be no issues.

Then I converted the btrfs root filesystem / to a raid1 setup. I followed this guide, section Conversion:
Code:
btrfs device add /dev/nvme1n1p6 /
btrfs balance start -dconvert=raid1 -mconvert=raid1 /

This worked very well:
Code:
btrfs filesystem show
Label: 'Gentoo Linux'  uuid: 4afba786-357f-4ac1-972c-363491cbcda5
        Total devices 2 FS bytes used 19.62GiB
        devid    1 size 119.67GiB used 25.06GiB path /dev/nvme0n1p9
        devid    2 size 119.67GiB used 25.06GiB path /dev/nvme1n1p6
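
Besides `btrfs filesystem show`, the per-type chunk profiles can be checked to confirm that the balance converted everything. The following is a minimal sketch, assuming the usual `btrfs filesystem df` output format (`Data, RAID1: total=..., used=...`); the function name `all_raid1` is made up for illustration and is not part of btrfs-progs.

```shell
# Reads `btrfs filesystem df <mountpoint>` output on stdin.
# Succeeds only if Data, System and Metadata lines are present and every
# one of them uses the RAID1 profile. Other lines such as GlobalReserve
# are ignored.
all_raid1() {
    awk -F'[,:] *' '
        $1 == "Data" || $1 == "System" || $1 == "Metadata" {
            seen++
            if ($2 != "RAID1") bad++
        }
        END { exit !(seen >= 3 && bad == 0) }
    '
}

# Usage (as root):
#   btrfs filesystem df / | all_raid1 && echo "all chunks are raid1"
```

A leftover `Data, single` or `Metadata, DUP` line next to the RAID1 lines would make the check fail, which would point to an incomplete conversion.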

I then ran into what I believed was THE ONE issue: booting. I use profile default/linux/amd64/17.1/desktop/plasma/systemd (dev), GRUB2 and systemd, and a genkernel-built kernel with initramfs (currently 4.20.1). So far I have been unable to boot without being prompted for either /dev/nvme0n1p9 or /dev/nvme1n1p6. I added rootflags=device=/dev/nvme0n1p9,device=/dev/nvme1n1p6 in /boot/grub/grub.cfg, but then the boot fails completely. Without it I can boot after entering the second btrfs device at the prompt. rootflags=degraded naturally works too, but that is not ideal at all.
Code:
[    5.867152] BTRFS: device label Gentoo Linux devid 1 transid 23347 /dev/nvme0n1p9
[    5.868289] BTRFS info (device nvme0n1p9): disk space caching is enabled
[    5.869082] BTRFS info (device nvme0n1p9): has skinny extents
[    5.870543] BTRFS error (device nvme0n1p9): devid 2 uuid a7405d5c-7da4-4256-b74c-acc167421ebe is missing
[    5.871843] BTRFS error (device nvme0n1p9): failed to read the system array: -2
[    5.911463] BTRFS error (device nvme0n1p9): open_ctree failed
[    5.929190] BTRFS info (device nvme0n1p9): disk space caching is enabled
[    5.930355] BTRFS info (device nvme0n1p9): has skinny extents
[    5.932072] BTRFS error (device nvme0n1p9): devid 2 uuid a7405d5c-7da4-4256-b74c-acc167421ebe is missing
[    5.933246] BTRFS error (device nvme0n1p9): failed to read the system array: -2
[    5.951477] BTRFS error (device nvme0n1p9): open_ctree failed

... prompt to enter device, I enter "/dev/nvme1n1p6[Enter]" (EN keyboard layout, my keyboard is DE) ...

[   15.494032] BTRFS: device label Gentoo Linux devid 2 transid 23339 /dev/nvme1n1p6
[   15.495858] BTRFS info (device nvme0n1p9): disk space caching is enabled
[   15.496771] BTRFS info (device nvme0n1p9): has skinny extents
[   15.498399] BTRFS error (device nvme0n1p9): bad tree block start, want 54001668096 have 0
[   15.499602] BTRFS error (device nvme0n1p9): bad tree block start, want 63352754176 have 0
[   15.500751] BTRFS error (device nvme0n1p9): parent transid verify failed on 65498460160 wanted 23347 found 23339
[   15.501989] BTRFS error (device nvme0n1p9): parent transid verify failed on 67642740736 wanted 23347 found 23326
[   15.503151] BTRFS error (device nvme0n1p9): parent transid verify failed on 64424452096 wanted 23347 found 23326
[   15.504310] BTRFS error (device nvme0n1p9): parent transid verify failed on 61304840192 wanted 23347 found 23339
[   15.505445] BTRFS error (device nvme0n1p9): parent transid verify failed on 63353196544 wanted 23347 found 23339
[   15.506663] BTRFS error (device nvme0n1p9): bad tree block start, want 67497779200 have 0
[   15.507771] BTRFS info (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26649, flush 17, corrupt 0, gen 0
[   15.508738] BTRFS error (device nvme0n1p9): parent transid verify failed on 64427171840 wanted 23347 found 23337
[   15.509729] BTRFS error (device nvme0n1p9): parent transid verify failed on 61204533248 wanted 23345 found 23337
[   15.510547] BTRFS error (device nvme0n1p9): parent transid verify failed on 61304733696 wanted 23345 found 23339
[   15.511372] BTRFS error (device nvme0n1p9): bad tree block start, want 59047186432 have 0
[   15.512421] BTRFS error (device nvme0n1p9): bad tree block start, want 61304995840 have 0
[   15.513469] BTRFS error (device nvme0n1p9): bad tree block start, want 61304782848 have 0
[   15.514622] BTRFS error (device nvme0n1p9): parent transid verify failed on 66597998592 wanted 23346 found 23312
[   15.515497] BTRFS error (device nvme0n1p9): bad tree block start, want 61304860672 have 0
[   15.516695] BTRFS error (device nvme0n1p9): parent transid verify failed on 66754658304 wanted 23346 found 23339
[   15.518004] BTRFS error (device nvme0n1p9): bad tree block start, want 66918699008 have 0
[   15.518905] BTRFS error (device nvme0n1p9): bad tree block start, want 61304897536 have 0
[   15.519799] BTRFS error (device nvme0n1p9): bad tree block start, want 67597086720 have 0
[   15.526800] BTRFS info (device nvme0n1p9): enabling ssd optimizations

I then also get a number of warnings, presumably because the last shutdown was done with only /dev/nvme0n1 present (without /dev/nvme1n1).
Code:
[   17.791467] BTRFS error (device nvme0n1p9): space cache generation (23337) does not match inode (23346)
[   17.792542] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 55108960256, rebuilding it now
[   17.794504] BTRFS error (device nvme0n1p9): csum mismatch on free space cache
[   17.795506] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 56182702080, rebuilding it now
[   17.797610] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304786944 (dev /dev/nvme1n1p6 sector 16363112)
[   17.799180] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304795136 (dev /dev/nvme1n1p6 sector 16363128)
[   17.799457] BTRFS error (device nvme0n1p9): csum mismatch on free space cache
[   17.801444] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 58330185728, rebuilding it now
[   17.803327] BTRFS error (device nvme0n1p9): space cache generation (23337) does not match inode (23347)
[   17.804084] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 60477669376, rebuilding it now
[   17.805333] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304807424 (dev /dev/nvme1n1p6 sector 16363152)
[   17.807370] BTRFS error (device nvme0n1p9): csum mismatch on free space cache
[   17.808147] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 61551411200, rebuilding it now
[   17.809497] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23347)
[   17.810291] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 62625153024, rebuilding it now
[   17.810739] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304889344 (dev /dev/nvme1n1p6 sector 16363312)
[   17.812388] BTRFS error (device nvme0n1p9): csum mismatch on free space cache
[   17.812676] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304893440 (dev /dev/nvme1n1p6 sector 16363320)
[   17.813268] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 63698894848, rebuilding it now
[   17.815550] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23347)
[   17.816423] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 64772636672, rebuilding it now
[   17.817856] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23347)
[   17.818745] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 65846378496, rebuilding it now
[   17.819052] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304901632 (dev /dev/nvme1n1p6 sector 16363336)
[   17.821663] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23347)
[   17.821959] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304848384 (dev /dev/nvme1n1p6 sector 16363232)
[   17.822618] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 66920120320, rebuilding it now
[   17.824522] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304832000 (dev /dev/nvme1n1p6 sector 16363200)
[   17.826770] BTRFS error (device nvme0n1p9): csum mismatch on free space cache
[   17.827646] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304852480 (dev /dev/nvme1n1p6 sector 16363240)
[   17.827785] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 67993862144, rebuilding it now
[   17.830710] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23346)
[   17.831720] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 69067603968, rebuilding it now
[   17.833271] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23347)
[   17.834290] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 70141345792, rebuilding it now
[   17.836000] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23346)
[   17.837024] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 71215087616, rebuilding it now
[   17.838406] BTRFS error (device nvme0n1p9): space cache generation (23339) does not match inode (23345)
[   17.839292] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 72288829440, rebuilding it now
[   17.840751] BTRFS warning (device nvme0n1p9): failed to load free space cache for block group 73362571264, rebuilding it now
[   19.185454] Adding 67108860k swap on /dev/nvme1n1p5.  Priority:-2 extents:1 across:67108860k SSDscFS
[   19.254449] Adding 67108860k swap on /dev/nvme0n1p4.  Priority:-3 extents:1 across:67108860k SSDscFS
[   21.641021] btree_readpage_end_io_hook: 439 callbacks suppressed
[   21.641023] BTRFS error (device nvme0n1p9): bad tree block start, want 61304905728 have 0
[   21.643895] repair_io_failure: 802 callbacks suppressed
[   21.643898] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304905728 (dev /dev/nvme1n1p6 sector 16363344)
[   21.664134] BTRFS error (device nvme0n1p9): bad tree block start, want 63353053184 have 0
[   21.665511] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 63353053184 (dev /dev/nvme1n1p6 sector 20363632)
[   21.666431] verify_parent_transid: 475 callbacks suppressed
[   21.666432] BTRFS error (device nvme0n1p9): parent transid verify failed on 67647361024 wanted 23346 found 23320
[   21.668166] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 67647361024 (dev /dev/nvme1n1p6 sector 28750952)
[   21.668979] BTRFS error (device nvme0n1p9): bad tree block start, want 63353106432 have 0
[   21.670331] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 63353106432 (dev /dev/nvme1n1p6 sector 20363736)
[   21.679915] BTRFS error (device nvme0n1p9): parent transid verify failed on 63353098240 wanted 23345 found 23339
[   21.681350] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 63353098240 (dev /dev/nvme1n1p6 sector 20363720)
[   21.682577] BTRFS error (device nvme0n1p9): parent transid verify failed on 64421150720 wanted 23345 found 23339
[   21.683892] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 64421150720 (dev /dev/nvme1n1p6 sector 22449760)
[   21.684975] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311643 off 0 csum 0x037b1994 expected csum 0x27f45f6e mirror 1
[   21.686438] BTRFS info (device nvme0n1p9): read error corrected: ino 3311643 off 0 (dev /dev/nvme1n1p6 sector 2560)
[   22.775645] BTRFS error (device nvme0n1p9): parent transid verify failed on 66596499456 wanted 23345 found 23339
[   22.775850] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 66596499456 (dev /dev/nvme1n1p6 sector 26698488)
[   22.776526] BTRFS error (device nvme0n1p9): parent transid verify failed on 66579021824 wanted 23347 found 23312
[   22.776682] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 66579021824 (dev /dev/nvme1n1p6 sector 26664352)
[   22.787160] BTRFS error (device nvme0n1p9): bad tree block start, want 68672991232 have 0
[   22.787349] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 68672991232 (dev /dev/nvme1n1p6 sector 30754136)
[   22.876908] BTRFS error (device nvme0n1p9): bad tree block start, want 67033698304 have 0
[   41.529569] BTRFS error (device nvme0n1p9): bad tree block start, want 61304750080 have 0
[   41.529777] repair_io_failure: 1 callbacks suppressed
[   41.529780] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304750080 (dev /dev/nvme1n1p6 sector 16363040)
[   41.529858] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311647 off 0 csum 0x98f94189 expected csum 0xca101a9f mirror 1
[   41.529876] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311647 off 4096 csum 0x98f94189 expected csum 0xa8130115 mirror 1
[   41.529886] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311647 off 8192 csum 0x21c4eead expected csum 0x487b651b mirror 1
[   41.529901] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311647 off 0 csum 0x98f94189 expected csum 0xca101a9f mirror 1
[   41.529905] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311647 off 4096 csum 0x98f94189 expected csum 0xa8130115 mirror 1
[   41.530002] BTRFS warning (device nvme0n1p9): csum failed root 5 ino 3311647 off 8192 csum 0x21c4eead expected csum 0x487b651b mirror 1
[   41.530025] BTRFS info (device nvme0n1p9): read error corrected: ino 3311647 off 0 (dev /dev/nvme1n1p6 sector 16363080)
[   41.530107] BTRFS info (device nvme0n1p9): read error corrected: ino 3311647 off 4096 (dev /dev/nvme1n1p6 sector 16363088)
[   41.530134] BTRFS info (device nvme0n1p9): read error corrected: ino 3311647 off 8192 (dev /dev/nvme1n1p6 sector 16363096)
[   46.300126] BTRFS error (device nvme0n1p9): parent transid verify failed on 66581753856 wanted 23347 found 23339
[   46.300276] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 66581753856 (dev /dev/nvme1n1p6 sector 26669688)
[   46.300884] BTRFS error (device nvme0n1p9): parent transid verify failed on 66596421632 wanted 23347 found 23339
[   46.301030] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 66596421632 (dev /dev/nvme1n1p6 sector 26698336)
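
The `wanted N found M` pairs in this log show how far the copy on the second device lags behind the first. Below is a minimal sketch for summarising them from a saved kernel log; the function name `transid_lag` is an illustrative assumption, not an existing tool.

```shell
# Reads kernel log lines on stdin. For all "parent transid verify failed"
# messages it prints how many there were and the largest difference
# between the wanted and the found generation.
transid_lag() {
    awk '
        /parent transid verify failed/ {
            for (i = 1; i < NF; i++) {
                if ($i == "wanted") want = $(i + 1)
                if ($i == "found")  have = $(i + 1)
            }
            n++
            if (want - have > max) max = want - have
        }
        END { printf "failures=%d max_lag=%d\n", n, max }
    '
}

# Usage: dmesg | transid_lag
```

Many failures with a small lag would be consistent with the second device having missed the most recent commits, for example after a mount without it, though the log alone does not prove that.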

And a few more errors of this sort during operation:
Code:
[  142.432796] BTRFS error (device nvme0n1p9): parent transid verify failed on 69752815616 wanted 23329 found 23310
[  142.432956] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 69752815616 (dev /dev/nvme1n1p6 sector 32863168)
[  295.139256] BTRFS warning (device nvme0n1p9): checksum error at logical 52946178048 on dev /dev/nvme1n1p6, physical 19304448, root 5, inode 907104, offset 0, length 3286, links 1 (path: etc/default/grub)
[  295.139262] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26649, flush 17, corrupt 1, gen 0
[  295.141874] BTRFS error (device nvme0n1p9): fixed up error at logical 52946178048 on dev /dev/nvme1n1p6
[  295.144309] BTRFS warning (device nvme0n1p9): checksum error at logical 52945190912 on dev /dev/nvme1n1p6, physical 18317312, root 5, inode 771596, offset 0, length 3555, links 1 (path: etc/grub.d/05_linux-static)
[  295.144313] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26649, flush 17, corrupt 2, gen 0
[  296.643498] BTRFS error (device nvme0n1p9): fixed up error at logical 52945190912 on dev /dev/nvme1n1p6
[  297.909986] BTRFS warning (device nvme0n1p9): checksum error at logical 54001668096 on dev /dev/nvme1n1p6, physical 1074794496: metadata leaf (level 0) in tree 3
[  297.909990] BTRFS warning (device nvme0n1p9): checksum error at logical 54001668096 on dev /dev/nvme1n1p6, physical 1074794496: metadata leaf (level 0) in tree 3
[  297.909992] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26649, flush 17, corrupt 3, gen 0
[  297.914210] BTRFS error (device nvme0n1p9): fixed up error at logical 54001668096 on dev /dev/nvme1n1p6
[  298.917602] BTRFS error (device nvme0n1p9): parent transid verify failed on 70814625792 wanted 23344 found 23310
[  298.927825] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 70814625792 (dev /dev/nvme1n1p6 sector 34937016)
[  298.928078] BTRFS error (device nvme0n1p9): parent transid verify failed on 61304713216 wanted 23344 found 23339
[  298.932317] BTRFS info (device nvme0n1p9): read error corrected: ino 0 off 61304713216 (dev /dev/nvme1n1p6 sector 16362968)
[  298.950141] BTRFS warning (device nvme0n1p9): checksum error at logical 54849617920 on dev /dev/nvme1n1p6, physical 1922744320, root 5, inode 3311641, offset 0, length 2585, links 1 (path: etc/cups/printers.conf.O)
[  298.950147] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26649, flush 17, corrupt 4, gen 0
[  298.960694] BTRFS error (device nvme0n1p9): fixed up error at logical 54849617920 on dev /dev/nvme1n1p6
......
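
The `errs: wr ..., rd ..., flush ..., corrupt ..., gen ...` lines carry btrfs's per-device error counters. A minimal sketch for pulling the most recent counters for one device out of a saved kernel log follows; the function name `last_errs` is hypothetical.

```shell
# Usage: last_errs /dev/nvme1n1p6 < dmesg.txt
# Prints the counters from the last "bdev <device> errs:" line that
# mentions the given device, or nothing if there is no such line.
last_errs() {
    awk -v dev="$1" '
        {
            key = "bdev " dev " errs: "
            pos = index($0, key)
            if (pos > 0) last = substr($0, pos + length(key))
        }
        END { if (last != "") print last }
    '
}
```

On a running system, `btrfs device stats /` reports the same counters directly. They are persistent across mounts until reset with `btrfs device stats -z`, so the large `wr` and `rd` values already present at mount time may stem from an earlier session rather than from the current boot.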

BUT THE REAL ISSUE is that after some time the system now loses /dev/nvme1n1 completely! This is the Crucial P1 NVMe M.2 PCIe SSD. After a few minutes of unresponsiveness the system continues to work. btrfs logs the failed raid1 member:
Code:
[  502.999688] nvme nvme1: I/O 225 QID 6 timeout, aborting
[  502.999705] nvme nvme1: I/O 100 QID 12 timeout, aborting
[  502.999710] nvme nvme1: I/O 101 QID 12 timeout, aborting
[  502.999715] nvme nvme1: I/O 102 QID 12 timeout, aborting
[  533.207576] nvme nvme1: I/O 225 QID 6 timeout, reset controller
[  544.983592] nvme nvme1: I/O 11 QID 0 timeout, reset controller
[  605.403827] INFO: task systemd:2816 blocked for more than 120 seconds.
[  605.403830]       Tainted: G           O    T 4.20.1-gentoo-RYZEN #1
[  605.403831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.403833] systemd         D    0  2816      1 0x00000000
[  605.403836] Call Trace:
[  605.403844]  __schedule+0x21c/0x720
[  605.403847]  schedule+0x27/0x80
[  605.403850]  io_schedule+0x11/0x40
[  605.403853]  wait_on_page_bit+0x11d/0x200
[  605.403855]  ? __page_cache_alloc+0x20/0x20
[  605.403859]  read_extent_buffer_pages+0x257/0x300
[  605.403863]  btree_read_extent_buffer_pages+0xc2/0x230
[  605.403865]  ? alloc_extent_buffer+0x35e/0x390
[  605.403868]  read_tree_block+0x5c/0x80
[  605.403871]  read_block_for_search.isra.13+0x1a9/0x380
[  605.403874]  btrfs_search_slot+0x226/0x970
[  605.403876]  btrfs_lookup_inode+0x63/0xfc
[  605.403879]  btrfs_iget_path+0x67e/0x770
[  605.403882]  btrfs_lookup_dentry+0x478/0x570
[  605.403885]  btrfs_lookup+0x18/0x40
[  605.403888]  path_openat+0xbbd/0x13e0
[  605.403891]  do_filp_open+0xa7/0x110
[  605.403894]  do_sys_open+0x18e/0x230
[  605.403896]  __x64_sys_openat+0x1f/0x30
[  605.403899]  do_syscall_64+0x55/0x100
[  605.403901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  605.403904] RIP: 0033:0x7f57bc1a731a
[  605.403909] Code: Bad RIP value.
[  605.403911] RSP: 002b:00007ffe14628540 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  605.403913] RAX: ffffffffffffffda RBX: 00007ffe14628638 RCX: 00007f57bc1a731a
[  605.403914] RDX: 00000000000a0100 RSI: 0000562ae1fd7dd0 RDI: 00000000ffffff9c
[  605.403915] RBP: 0000000000000008 R08: 91824bee752ca339 R09: 00007f57bbf11540
[  605.403917] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ae1fd7de6
[  605.403918] R13: 0000562ae1fd7b10 R14: 00007ffe146285c0 R15: 0000562ae1fa6168
[  655.735860] nvme nvme1: Device not ready; aborting reset
[  655.776155] print_req_error: I/O error, dev nvme1n1, sector 1214058880
[  655.776163] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26650, flush 17, corrupt 18, gen 8
[  655.776170] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26651, flush 17, corrupt 18, gen 8
[  655.776182] print_req_error: I/O error, dev nvme1n1, sector 1214059136
[  655.776184] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26652, flush 17, corrupt 18, gen 8
[  655.776188] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26653, flush 17, corrupt 18, gen 8
[  655.776191] print_req_error: I/O error, dev nvme1n1, sector 1214059392
[  655.776194] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26654, flush 17, corrupt 18, gen 8
[  655.776197] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26655, flush 17, corrupt 18, gen 8
[  655.776202] print_req_error: I/O error, dev nvme1n1, sector 1214059648
[  655.776205] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26656, flush 17, corrupt 18, gen 8
[  655.776208] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26657, flush 17, corrupt 18, gen 8
[  655.776212] print_req_error: I/O error, dev nvme1n1, sector 1214059904
[  655.776214] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26658, flush 17, corrupt 18, gen 8
[  655.776218] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32670, rd 26659, flush 17, corrupt 18, gen 8
[  655.776222] print_req_error: I/O error, dev nvme1n1, sector 1214060160
[  655.776227] print_req_error: I/O error, dev nvme1n1, sector 1214060416
[  655.776230] print_req_error: I/O error, dev nvme1n1, sector 1214060672
[  655.776236] print_req_error: I/O error, dev nvme1n1, sector 1214060928
[  655.776239] print_req_error: I/O error, dev nvme1n1, sector 1214061184
[  655.776340] nvme nvme1: Abort status: 0x7
[  655.776347] nvme nvme1: Abort status: 0x7
[  655.776349] nvme nvme1: Abort status: 0x7
[  655.776351] nvme nvme1: Abort status: 0x7
[  716.336303] nvme nvme1: Device not ready; aborting reset
[  716.336308] nvme nvme1: Removing after probe failure status: -19
[  716.384295] nvme nvme1: Device not ready; aborting reset
[  716.412576] btrfs_dev_stat_print_on_error: 60 callbacks suppressed
[  716.412581] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32674, rd 26728, flush 17, corrupt 18, gen 8
[  716.412584] print_req_error: 25 callbacks suppressed
[  716.412587] print_req_error: I/O error, dev nvme1n1, sector 1251491984
[  716.412592] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32674, rd 26729, flush 17, corrupt 18, gen 8
[  716.412607] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26729, flush 17, corrupt 18, gen 8
[  716.412614] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26730, flush 17, corrupt 18, gen 8
[  716.412618] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26731, flush 17, corrupt 18, gen 8
[  716.412622] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26732, flush 17, corrupt 18, gen 8
[  716.412625] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26733, flush 17, corrupt 18, gen 8
[  716.412629] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26735, flush 17, corrupt 18, gen 8
[  716.412634] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32675, rd 26737, flush 17, corrupt 18, gen 8
[  716.412639] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 32676, rd 26737, flush 17, corrupt 18, gen 8
[  716.447227] BTRFS warning (device nvme0n1p9): lost page write due to IO error on /dev/nvme1n1p6
[  716.447237] BTRFS warning (device nvme0n1p9): lost page write due to IO error on /dev/nvme1n1p6
[  716.447985] BTRFS error (device nvme0n1p9): error writing primary super block to device 2
[  716.448376] nvme nvme1: failed to set APST feature (-19)
[ 3593.450821] btrfs_dev_stat_print_on_error: 7454 callbacks suppressed
[ 3593.450824] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36550, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.450853] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36551, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.450867] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36552, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.450880] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36553, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.450892] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36554, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.451274] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36555, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.451290] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36556, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.451302] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36557, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.451315] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36558, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.451730] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36559, rd 30412, flush 18, corrupt 18, gen 8
[ 3593.512658] BTRFS warning (device nvme0n1p9): lost page write due to IO error on /dev/nvme1n1p6
[ 3593.512668] BTRFS warning (device nvme0n1p9): lost page write due to IO error on /dev/nvme1n1p6
[ 3593.514621] BTRFS error (device nvme0n1p9): error writing primary super block to device 2
[ 3728.318457] btrfs_dev_stat_print_on_error: 401 callbacks suppressed
[ 3728.318460] BTRFS error (device nvme0n1p9): bdev /dev/nvme1n1p6 errs: wr 36959, rd 30413, flush 19, corrupt 18, gen 8
......
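
The sequence in this log (I/O timeouts, controller resets, then `Removing after probe failure`) can be picked out of a long kernel log mechanically. The sketch below assumes the message wording shown above; the function name `nvme_dropped` is made up for illustration.

```shell
# Reads kernel log lines on stdin. For every NVMe controller the kernel
# gave up on ("Removing after probe failure") it prints the controller
# name and how many I/O timeout messages were logged for it before that.
nvme_dropped() {
    awk '
        $0 ~ / nvme nvme[0-9]+: / {
            for (i = 1; i <= NF; i++)
                if ($i == "nvme") { ctrl = $(i + 1); break }
            sub(/:$/, "", ctrl)
            if ($0 ~ /timeout, /) timeouts[ctrl]++
            if ($0 ~ /Removing after probe failure/)
                printf "%s removed after %d timeouts\n", ctrl, timeouts[ctrl]
        }
    '
}

# Usage: dmesg | nvme_dropped
```

The `failed to set APST feature` message refers to NVMe Autonomous Power State Transition. Whether power-state handling plays any role here is not established in this thread; as an assumption to test, the kernel parameter `nvme_core.default_ps_max_latency_us=0` disables APST.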

After that I can work on my system as before, except that /dev/nvme1n1 is completely gone:
Code:
# ls -Alh /dev/nvme*
crw------- 1 root root 244,  0 13. Jän 18:18 /dev/nvme0
brw-rw---- 1 root disk 259,  0 13. Jän 18:18 /dev/nvme0n1
brw-rw---- 1 root disk 259,  8 13. Jän 18:18 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 17 13. Jän 18:18 /dev/nvme0n1p10
brw-rw---- 1 root disk 259, 18 13. Jän 18:18 /dev/nvme0n1p11
brw-rw---- 1 root disk 259, 19 13. Jän 18:18 /dev/nvme0n1p12
brw-rw---- 1 root disk 259,  9 13. Jän 18:18 /dev/nvme0n1p2
brw-rw---- 1 root disk 259, 10 13. Jän 18:18 /dev/nvme0n1p3
brw-rw---- 1 root disk 259, 11 13. Jän 18:18 /dev/nvme0n1p4
brw-rw---- 1 root disk 259, 12 13. Jän 18:18 /dev/nvme0n1p5
brw-rw---- 1 root disk 259, 13 13. Jän 18:18 /dev/nvme0n1p6
brw-rw---- 1 root disk 259, 14 13. Jän 18:18 /dev/nvme0n1p7
brw-rw---- 1 root disk 259, 15 13. Jän 18:18 /dev/nvme0n1p8
brw-rw---- 1 root disk 259, 16 13. Jän 18:18 /dev/nvme0n1p9

# cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/nvme0n1p4                          partition       67108860        0       -3
/dev/nvme1n1p5\040(deleted)             partition       67108860        0       -2

# btrfs filesystem show /
Label: 'Gentoo Linux'  uuid: 4afba786-357f-4ac1-972c-363491cbcda5
        Total devices 2 FS bytes used 19.62GiB
        devid    1 size 119.67GiB used 25.06GiB path /dev/nvme0n1p9
        *** Some devices missing



The device is still listed when running lspci:
Code:
# dmesg | grep nvme
[    2.445139] nvme nvme0: pci function 0000:01:00.0
[    2.445190] nvme nvme1: pci function 0000:04:00.0
[    2.659406] nvme nvme1: missing or invalid SUBNQN field.
[    2.664904]  nvme1n1: p1 p2 p3 p4 p5 p6
[    2.665411]  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12
-- snip --

# lspci -s 0:1:0.0 -nn
01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03)

# lspci -s 0:4:0.0 -nn
04:00.0 Non-Volatile memory controller [0108]: Micron/Crucial Technology Device [c0a9:2263] (rev 03)


My first thought was that the controller and the SSD might have gotten too hot and simply shut itself down, but that can hardly be the case since I used smartctl before and it showed a nice temperature of 32°C on the Crucial P1 (30°C on the Intel 660p). According to smartctl the trigger is at 70°C or so...
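
A single reading taken before the failure does not show what the temperature was when the device dropped out, so logging it over time would be more conclusive. Here is a minimal sketch, assuming smartctl's NVMe health output contains a line of the form `Temperature: NN Celsius`; the function name `nvme_temp` is hypothetical.

```shell
# Reads `smartctl -A /dev/nvmeX` output on stdin and prints the first
# "Temperature: NN Celsius" value as a bare number. Per-sensor lines such
# as "Temperature Sensor 1:" are not matched.
nvme_temp() {
    awk '$1 == "Temperature:" && $3 == "Celsius" { print $2; exit }'
}

# Usage (as root), logging one reading every 10 seconds:
#   while sleep 10; do
#       printf '%s %s\n' "$(date +%T)" "$(smartctl -A /dev/nvme1 | nvme_temp)"
#   done
```

If the last readings before the controller disappears stay far below the roughly 70°C threshold mentioned above, overheating becomes an unlikely explanation.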

I've never encountered such an error before. Could this be because of the btrfs raid1? It worked before... But why?!?

It could be a hardware failure of either the SilverStone SST-ECM20 PCIe 3.0 x4 to M.2 (NGFF) expansion card or the Crucial P1 M.2 PCIe SSD. But I doubt that, since Windows 10 runs stably from this very SSD, and it also worked on Linux before the btrfs raid1 change.

I could revert the root btrfs from raid1 to single. But first, I don't know how; second, I could accomplish almost the same by adding rootflags=degraded to my kernel command line; and third, I would like to find the real fault. This simply shouldn't happen.

Help is highly appreciated. Thanks.
Atha
Apprentice


Joined: 22 Sep 2004
Posts: 229

Posted: Mon Jan 14, 2019 6:08 pm

Again I couldn't boot straight into Linux. Telling my system to use the NVMe didn't do the trick right after turning the PC on. Windows didn't boot either after a warm reboot (Ctrl+Alt+Del). After turning the PC off and back on, Windows 10 booted and ran stably. Windows 10 is on the Crucial P1 SSD, /dev/nvme1n1. After a reboot I could get into Linux again, once more with the prompt to enter the device for the btrfs raid1, except that this time the system seemed to have picked /dev/nvme1n1p6 and I had to enter /dev/nvme0n1p9 for the boot process to continue...
Code:
# ls -Al /dev/nvme*
crw------- 1 root root 244,  0 14. Jän 18:56 /dev/nvme0
brw-rw---- 1 root disk 259,  0 14. Jän 18:56 /dev/nvme0n1
brw-rw---- 1 root disk 259,  2 14. Jän 18:56 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 11 14. Jän 18:56 /dev/nvme0n1p10
brw-rw---- 1 root disk 259, 12 14. Jän 18:56 /dev/nvme0n1p11
brw-rw---- 1 root disk 259, 13 14. Jän 18:56 /dev/nvme0n1p12
brw-rw---- 1 root disk 259,  3 14. Jän 18:56 /dev/nvme0n1p2
brw-rw---- 1 root disk 259,  4 14. Jän 18:56 /dev/nvme0n1p3
brw-rw---- 1 root disk 259,  5 14. Jän 18:56 /dev/nvme0n1p4
brw-rw---- 1 root disk 259,  6 14. Jän 18:56 /dev/nvme0n1p5
brw-rw---- 1 root disk 259,  7 14. Jän 18:56 /dev/nvme0n1p6
brw-rw---- 1 root disk 259,  8 14. Jän 18:56 /dev/nvme0n1p7
brw-rw---- 1 root disk 259,  9 14. Jän 18:56 /dev/nvme0n1p8
brw-rw---- 1 root disk 259, 10 14. Jän 18:56 /dev/nvme0n1p9
crw------- 1 root root 244,  1 14. Jän 18:58 /dev/nvme1
brw-rw---- 1 root disk 259,  1 14. Jän 18:56 /dev/nvme1n1
brw-rw---- 1 root disk 259, 14 14. Jän 18:56 /dev/nvme1n1p1
brw-rw---- 1 root disk 259, 15 14. Jän 18:56 /dev/nvme1n1p2
brw-rw---- 1 root disk 259, 16 14. Jän 18:56 /dev/nvme1n1p3
brw-rw---- 1 root disk 259, 17 14. Jän 18:56 /dev/nvme1n1p4
brw-rw---- 1 root disk 259, 18 14. Jän 18:56 /dev/nvme1n1p5
brw-rw---- 1 root disk 259, 19 14. Jän 18:56 /dev/nvme1n1p6

I again see lots of error messages in dmesg concerning btrfs...
DawgG
l33t


Joined: 17 Sep 2003
Posts: 866

PostPosted: Wed Jan 16, 2019 11:31 am    Post subject: Reply with quote

if you do not have the necessary btrfs-tools inside your initramfs the
Code:
rootflags=device=/dev/nvme0n1p9,device=/dev/nvme1n1p6
in /boot/grub/grub.cfg is (probably) not enough.
i use raid-1 on a couple of btrfs filesystems but i have never converted one - i guess it's safest to do the conversion when booted from a rescue CD or something like it and do the btrfs balance right away.
what works for me with a raid-1 rootfs is using an "embedded" initramfs as described here: https://wiki.gentoo.org/wiki/Btrfs/Native_System_Root_Guide (starting after "Embedding an initram filesystem")
(before mounting the rootfs on raid you need the necessary btrfs-tools to "assemble" the raid-1).
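
To double-check which member devices the kernel was actually told about, the device= entries can be pulled out of the rootflags= option of the running command line. A minimal sketch, assuming a POSIX shell; the function name rootflags_devices is made up for illustration:

```shell
# Print every device= entry found in the rootflags= option of a kernel command line.
# $1: the command line as one string, e.g. "$(cat /proc/cmdline)".
# The body runs in a subshell so the IFS and globbing changes stay local.
rootflags_devices() (
    set -f                      # do not glob-expand words from the command line
    for word in $1; do
        case $word in
            rootflags=*)
                IFS=,           # rootflags is a comma-separated list
                for flag in ${word#rootflags=}; do
                    case $flag in
                        device=*) printf '%s\n' "${flag#device=}" ;;
                    esac
                done
                ;;
        esac
    done
)
```

On a running system, rootflags_devices "$(cat /proc/cmdline)" should print one line per member device; if only one device (or none) shows up, the rootflags line in grub.cfg did not reach the kernel as intended.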

OTOH looking at all the fs-errors it might be better and faster to back up the data from your rootfs (if you haven't already :wink: ), create a new and clean raid-1 btrfs and copy the data back and use an initramfs as mentioned above.
GOOD LUCK!
_________________
DUMM KLICKT GUT.
Atha
Apprentice


Joined: 22 Sep 2004
Posts: 229

PostPosted: Sun Jan 20, 2019 2:00 pm    Post subject: Reply with quote

I reverted the raid1 for now. This worked like a charm:
Code:
btrfs balance start -dconvert=single -mconvert=single /
btrfs device delete /dev/nvme1n1p6 /

Surprisingly, the balance action took quite a long time while the device delete action finished almost instantly.
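
Whether the balance really converted every block group type can be checked from the output of btrfs filesystem df. A minimal sketch, assuming the usual "Type, PROFILE: total=..., used=..." line format of that command; the helper name btrfs_profiles is made up for illustration:

```shell
# Reduce `btrfs filesystem df <mountpoint>` output (read from stdin)
# to one "Type profile" pair per line, e.g. "Data single".
btrfs_profiles() {
    sed -n 's/^\([A-Za-z]*\), \([A-Za-z0-9]*\):.*/\1 \2/p'
}
```

Usage would be btrfs filesystem df / | btrfs_profiles. After a complete conversion back to single, no line should still mention RAID1.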

After reverting the fstab and grub cmdline options my system starts normally again.

I am not sure why I kept losing the second PCIe NVMe... I don't think this is 100% related to btrfs, but I am not sure. Now I am not using the second NVMe (/dev/nvme1n1) at all except for swap, which normally isn't used since I have enough RAM anyway. I will keep an eye on whether the device becomes inaccessible again in the future. Specifically, I will watch for this recurring error:
Code:
[  502.999688] nvme nvme1: I/O 225 QID 6 timeout, aborting
[  502.999705] nvme nvme1: I/O 100 QID 12 timeout, aborting
[  502.999710] nvme nvme1: I/O 101 QID 12 timeout, aborting
[  502.999715] nvme nvme1: I/O 102 QID 12 timeout, aborting
[  533.207576] nvme nvme1: I/O 225 QID 6 timeout, reset controller
[  544.983592] nvme nvme1: I/O 11 QID 0 timeout, reset controller
[  605.403827] INFO: task systemd:2816 blocked for more than 120 seconds.
[  605.403830]       Tainted: G           O    T 4.20.1-gentoo-RYZEN #1
[  605.403831] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.403833] systemd         D    0  2816      1 0x00000000
[  605.403836] Call Trace:
[  605.403844]  __schedule+0x21c/0x720
[  605.403847]  schedule+0x27/0x80
[  605.403850]  io_schedule+0x11/0x40
[  605.403853]  wait_on_page_bit+0x11d/0x200
[  605.403855]  ? __page_cache_alloc+0x20/0x20
[  605.403859]  read_extent_buffer_pages+0x257/0x300
[  605.403863]  btree_read_extent_buffer_pages+0xc2/0x230
[  605.403865]  ? alloc_extent_buffer+0x35e/0x390
[  605.403868]  read_tree_block+0x5c/0x80
[  605.403871]  read_block_for_search.isra.13+0x1a9/0x380
[  605.403874]  btrfs_search_slot+0x226/0x970
[  605.403876]  btrfs_lookup_inode+0x63/0xfc
[  605.403879]  btrfs_iget_path+0x67e/0x770
[  605.403882]  btrfs_lookup_dentry+0x478/0x570
[  605.403885]  btrfs_lookup+0x18/0x40
[  605.403888]  path_openat+0xbbd/0x13e0
[  605.403891]  do_filp_open+0xa7/0x110
[  605.403894]  do_sys_open+0x18e/0x230
[  605.403896]  __x64_sys_openat+0x1f/0x30
[  605.403899]  do_syscall_64+0x55/0x100
[  605.403901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  605.403904] RIP: 0033:0x7f57bc1a731a
[  605.403909] Code: Bad RIP value.
[  605.403911] RSP: 002b:00007ffe14628540 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  605.403913] RAX: ffffffffffffffda RBX: 00007ffe14628638 RCX: 00007f57bc1a731a
[  605.403914] RDX: 00000000000a0100 RSI: 0000562ae1fd7dd0 RDI: 00000000ffffff9c
[  605.403915] RBP: 0000000000000008 R08: 91824bee752ca339 R09: 00007f57bbf11540
[  605.403917] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ae1fd7de6
[  605.403918] R13: 0000562ae1fd7b10 R14: 00007ffe146285c0 R15: 0000562ae1fa6168
[  655.735860] nvme nvme1: Device not ready; aborting reset


Thanks. If anyone has an idea what could cause it, please share your thoughts...
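
For keeping an eye on the device, the I/O timeout lines from the log above can be counted per controller. A minimal sketch that only matches the "I/O <n> QID <n> timeout" wording seen in this thread; the function name nvme_timeout_count is made up for illustration:

```shell
# Count NVMe I/O timeout messages for one controller in kernel log text.
# $1: controller name, e.g. nvme1; stdin: kernel log (for example from dmesg).
nvme_timeout_count() {
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, which "|| :" turns back into a success status.
    grep -c -- "nvme $1: I/O [0-9]* QID [0-9]* timeout" || :
}
```

Usage would be dmesg | nvme_timeout_count nvme1. A count that stays at 0 over several days of swap-only use would suggest the timeouts were tied to the heavier raid1 traffic, though that remains a guess.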

@DawgG: I have a backup of the root partition. I haven't used it yet since I want to see whether the btrfs as single will continue to show error messages... Thanks for your reply!
DawgG
l33t
l33t


Joined: 17 Sep 2003
Posts: 866

PostPosted: Tue Jan 22, 2019 12:48 pm    Post subject: Reply with quote

i suggest you re-create the rootfs with btrfs (w/out raid), restore the data from your backup onto it and not use the faulty btrfs partition on the other nvme (nvme-1). when the system is up again, you can re-create a btrfs partition on nvme-1 and see how it behaves.
some measure of security/redundancy could then be achieved by copying or 'btrfs send'-ing data from your rootfs to that.
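
A rough sketch of what the 'btrfs send' route could look like, written as a dry run that only prints the commands instead of executing them. The function name print_send_plan and all paths in the usage note are hypothetical; the two requirements it relies on are that the read-only snapshot lives on the same filesystem as the source subvolume and that the receiving side is a mounted btrfs.

```shell
# Print (do not run) the two commands for a one-off btrfs send/receive copy.
# $1: source subvolume, $2: directory for snapshots on the SAME filesystem,
# $3: mount point of the receiving btrfs, $4: snapshot name.
print_send_plan() {
    src=$1 snapdir=$2 target=$3 name=$4
    # btrfs send only accepts read-only snapshots, hence -r.
    printf 'btrfs subvolume snapshot -r %s %s/%s\n' "$src" "$snapdir" "$name"
    printf 'btrfs send %s/%s | btrfs receive %s\n' "$snapdir" "$name" "$target"
}
```

For example, print_send_plan / /.snapshots /mnt/nvme1 root-backup prints the snapshot and send/receive commands for review before anything is run by hand.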
GOOD LUCK!
_________________
DUMM KLICKT GUT.