BTRFS partition repair

Kernel not recognizing your hardware? Problems with power management or PCMCIA? What hardware is compatible with Gentoo? See here. (Only for kernels supported by Gentoo.)
14 posts • Page 1 of 1
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm

BTRFS partition repair


Post by krotuss » Fri Nov 17, 2023 10:37 pm

Hi,

I took a ddrescue image of my failed SSD and put it on the replacement drive. After that I also resized the single btrfs partition (the subject of this thread) to fill the new, bigger SSD. That probably wasn't the best idea on a filesystem that has errors, but it is what I did. Once the system was booting, I used cp to /dev/null to hunt down corrupted files, made ddrescue backups of those files, and removed the originals. I was left with one directory that is impossible to delete, because trying forces the partition to remount read-only, and still a lot of checksum errors in the btrfs scrub and check utilities. btrfs scrub also aborts prematurely. Finally I ran btrfs check --repair in an attempt to fix those errors, but that doesn't seem to have made the situation any better. It still reports errors; in fact it now reports capital ERRORs, and also errors in dmesg right after boot.

Code: Select all

[   51.908206] BTRFS critical (device dm-0): corrupt leaf: block=385507557376 slot=18 extent bytenr=1234911232 len=16384 invalid tree block info level, have 98 expect [0, 7]
[   51.908216] BTRFS error (device dm-0): read time tree block corruption detected on logical 385507557376 mirror 1
[   51.908676] BTRFS critical (device dm-0): corrupt leaf: block=385507557376 slot=18 extent bytenr=1234911232 len=16384 invalid tree block info level, have 98 expect [0, 7]
[   51.908683] BTRFS error (device dm-0): read time tree block corruption detected on logical 385507557376 mirror 1
This is a big disappointment, because I was expecting data loss, but I was hoping I would be able to get the filesystem into a "healthy" state.

What course of action do you recommend? Should I try to fix the filesystem, or recreate it from scratch and transfer the files from the backup image (cp, rsync, btrfs send/receive)?
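(For reference, the corrupted-file hunt described above can be sketched as a small shell script. The mount point and log path below are examples, not paths from this thread.)

```shell
# Read every file in full; btrfs verifies checksums on read, so
# corrupt extents surface as I/O errors here. Unreadable files are
# collected in a log for later ddrescue/removal.
scan_unreadable() {
    dir=$1
    log=$2
    : > "$log"
    find "$dir" -xdev -type f | while IFS= read -r f; do
        # cat to /dev/null forces a full read of the file
        if ! cat "$f" > /dev/null 2>&1; then
            printf '%s\n' "$f" >> "$log"
        fi
    done
}

# example invocation (paths are placeholders):
# scan_unreadable / /tmp/bad-files.txt
```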
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Fri Nov 17, 2023 10:49 pm

... funny story.
"btrfs check --repair" displays a warning, but between "look, numbers are appearing" and me, not being a native speaker, still churning through the message, the timer expired and the program moved on. So if you intend to use btrfs, make sure you are a fast reader.
toralf
Developer
Posts: 3944
Joined: Sun Feb 01, 2004 2:58 pm
Location: Hamburg


Post by toralf » Sat Nov 18, 2023 9:52 am

A year ago I used the btrfs repair tool - after that I needed a new laptop. (And I switched back to ext4.)
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Sat Nov 18, 2023 11:26 am

It seems that the focus of most articles I have read is how to mount a damaged fs to pull data off it, rather than how to repair it. So maybe my expectation was unrealistic.

My plan is to mkfs.btrfs a new filesystem and copy the data over from the image. I am leaning towards rsync, but what tool and command-line switches do you recommend? I want to preserve as much information as possible, including timestamps, extended attributes, user/owner, etc. The source will emit I/O errors, so the tool should be able to either skip over damaged files, or deal with them by copying as much data as possible (the ddrescue way). In either case, it should provide a log listing the damaged files. It is a root filesystem, so it should handle special files as well (/dev). Thanks.
wanne32
Tux's lil' helper
Posts: 85
Joined: Sat Nov 11, 2023 6:10 am


Post by wanne32 » Sat Nov 18, 2023 12:59 pm

krotuss wrote:"btrfs check --repair" displays warning, but between "look, there appear numbers" and me, not being native speaker, churning through the message timer expired and program moved on.
--repair is really dangerous, but it asks you to type "yes" or similar before it destroys anything, so that shouldn't be the problem. Still, do not use it. btrfs scrub and, in the worst case, btrfs rescue should be more than enough.
krotuss wrote:and also errors in dmesg right after boot.
Sounds to me like they are not only there at boot. Can you post dmesg? Or, more specifically, the part that gets added when you run a scrub?
In the end, the idea of btrfs is to ensure that your data is unmodified. So if the disk produces errors (or wrong data), it will stop working. ext behaves in a very similar way, with the difference that it won't recognize most of the errors, since it does not have checksums. FAT is the other way around: it will keep working in garbage-in, garbage-out mode even when obvious inconsistencies occur, such as free + used space not adding up to the capacity. A file should have 5 GiB but ends after 3 GiB? FAT will return a 3 GiB file without an error. ZFS will usually try to repair it, but ultimately give up if it can't. btrfs instead expects you to replace the broken device. In the best case you just add a new device to your raid and remove the old one (this can also be done in the opposite order). In the worst case you can use btrfs restore to copy the data to another device.
But it has the big problem of not saying this explicitly: the kernel writes the errors to the kernel log, but honestly, who reads that? And so most users start throwing more and more destructive operations at the filesystem and are later annoyed about losing data as a result.
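(A sketch of that replace-the-device path. The device names and mount point are placeholders, not from this thread; the commands are wrapped in a function so nothing runs by accident.)

```shell
# Sketch only -- /dev/old, /dev/new and /mnt are placeholders.
replace_failing_btrfs_device() {
    # add the replacement disk to the mounted filesystem...
    btrfs device add /dev/new /mnt
    # ...then remove the failing one; btrfs migrates data off it
    btrfs device remove /dev/old /mnt
    # or do both in one step, copying readable data directly:
    # btrfs replace start /dev/old /dev/new /mnt
}
```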
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Sat Nov 18, 2023 2:06 pm

Code: Select all

[29693.601930] BTRFS info (device dm-0): scrub: started on devid 1
[29698.650938] BTRFS warning (device dm-0): tree block 1234206720 mirror 1 has bad bytenr, has 5632815842398209513 want 1234206720
[29698.652009] BTRFS warning (device dm-0): tree block 1234468864 mirror 1 has bad csum, has 0x43fedf34 want 0x407df98d
[29698.652081] BTRFS warning (device dm-0): tree block 1234370560 mirror 1 has bad csum, has 0xf45c755a want 0x187652db
[29698.652112] BTRFS critical (device dm-0): corrupt leaf: block=385507557376 slot=18 extent bytenr=1234911232 len=16384 invalid tree block info level, have 98 expect [0, 7]
[29698.652120] BTRFS error (device dm-0): read time tree block corruption detected on logical 385507557376 mirror 1
[29698.653378] BTRFS warning (device dm-0): tree block 1234206720 mirror 0 has bad bytenr, has 5632815842398209513 want 1234206720
[29698.653396] BTRFS warning (device dm-0): checksum/header error at logical 1234206720 on dev /dev/mapper/root, physical 1234206720: metadata leaf (level 0) in tree 7
[29698.653402] BTRFS warning (device dm-0): checksum/header error at logical 1234206720 on dev /dev/mapper/root, physical 1234206720: metadata leaf (level 0) in tree 7
[29698.653408] btrfs_dev_stat_inc_and_print: 59 callbacks suppressed
[29698.653410] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 26418, flush 0, corrupt 85720, gen 0
[29698.653416] BTRFS error (device dm-0): unable to fixup (regular) error at logical 1234206720 on dev /dev/mapper/root
[29698.653458] BTRFS warning (device dm-0): tree block 1234239488 mirror 1 has bad csum, has 0xdc50936e want 0xbbd40cc2
[29698.653845] BTRFS warning (device dm-0): tree block 1234599936 mirror 1 has bad csum, has 0x75c8e224 want 0x20cb3e37
[29698.654058] BTRFS warning (device dm-0): tree block 1234370560 mirror 0 has bad csum, has 0xf45c755a want 0x187652db
[29698.654074] BTRFS warning (device dm-0): checksum error at logical 1234370560 on dev /dev/mapper/root, physical 1234370560: metadata leaf (level 0) in tree 7
[29698.654080] BTRFS warning (device dm-0): checksum error at logical 1234370560 on dev /dev/mapper/root, physical 1234370560: metadata leaf (level 0) in tree 7
[29698.654084] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 26418, flush 0, corrupt 85721, gen 0
[29698.654088] BTRFS error (device dm-0): unable to fixup (regular) error at logical 1234370560 on dev /dev/mapper/root
[29698.655005] BTRFS warning (device dm-0): tree block 1234239488 mirror 0 has bad csum, has 0xdc50936e want 0xbbd40cc2
[29698.655016] BTRFS warning (device dm-0): checksum error at logical 1234239488 on dev /dev/mapper/root, physical 1234239488: metadata leaf (level 0) in tree 7
[29698.655019] BTRFS warning (device dm-0): checksum error at logical 1234239488 on dev /dev/mapper/root, physical 1234239488: metadata leaf (level 0) in tree 7
[29698.655022] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 26418, flush 0, corrupt 85722, gen 0
[29698.655026] BTRFS error (device dm-0): unable to fixup (regular) error at logical 1234239488 on dev /dev/mapper/root
[29698.655169] BTRFS warning (device dm-0): tree block 1234468864 mirror 0 has bad csum, has 0x43fedf34 want 0x407df98d
[29698.655178] BTRFS warning (device dm-0): checksum error at logical 1234468864 on dev /dev/mapper/root, physical 1234468864: metadata leaf (level 0) in tree 7
[29698.655181] BTRFS warning (device dm-0): checksum error at logical 1234468864 on dev /dev/mapper/root, physical 1234468864: metadata leaf (level 0) in tree 7
[29698.655184] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 26418, flush 0, corrupt 85723, gen 0
[29698.655187] BTRFS error (device dm-0): unable to fixup (regular) error at logical 1234468864 on dev /dev/mapper/root
[29698.655251] BTRFS warning (device dm-0): tree block 1234534400 mirror 1 has bad csum, has 0xf9ec15a4 want 0x541381ec
[29698.655632] BTRFS warning (device dm-0): tree block 1234599936 mirror 0 has bad csum, has 0x75c8e224 want 0x20cb3e37
[29698.655646] BTRFS warning (device dm-0): checksum error at logical 1234599936 on dev /dev/mapper/root, physical 1234599936: metadata node (level 1) in tree 7
[29698.655650] BTRFS warning (device dm-0): checksum error at logical 1234599936 on dev /dev/mapper/root, physical 1234599936: metadata node (level 1) in tree 7
[29698.655656] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 26418, flush 0, corrupt 85724, gen 0
[29698.655660] BTRFS error (device dm-0): unable to fixup (regular) error at logical 1234599936 on dev /dev/mapper/root
[29698.655941] BTRFS warning (device dm-0): tree block 1234534400 mirror 0 has bad csum, has 0xf9ec15a4 want 0x541381ec
[29698.655949] BTRFS warning (device dm-0): checksum error at logical 1234534400 on dev /dev/mapper/root, physical 1234534400: metadata leaf (level 0) in tree 7
[29698.655952] BTRFS warning (device dm-0): checksum error at logical 1234534400 on dev /dev/mapper/root, physical 1234534400: metadata leaf (level 0) in tree 7
[29698.655955] BTRFS error (device dm-0): bdev /dev/mapper/root errs: wr 0, rd 26418, flush 0, corrupt 85725, gen 0
[29698.655957] BTRFS error (device dm-0): unable to fixup (regular) error at logical 1234534400 on dev /dev/mapper/root
[29698.655985] BTRFS info (device dm-0): scrub: not finished on devid 1 with status: -5
wanne32
Tux's lil' helper
Posts: 85
Joined: Sat Nov 11, 2023 6:10 am


Post by wanne32 » Sat Nov 18, 2023 3:16 pm

I'm surprised that there are no explicit hardware errors thrown, only checksum errors. That is not what good disks do: they have checksums too and should recognize the errors themselves. For SSDs, though, it's more complicated. Is this the full dmesg since the scrub?
What puzzles me more is that mirror 0 and mirror 1 both live on dm-0. Is this a dup filesystem? Is it sitting on top of a dm-crypt or dm-raid device? An error in the dm-0 device would explain why there are no hardware errors, but both dm-crypt and dm-raid are rock solid.
Can you also post btrfs filesystem usage / and smartctl -a of the SSD?
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Sat Nov 18, 2023 3:32 pm

This is a ddrescue image transferred onto a replacement SSD, which is why there are no hardware errors.

The filesystem in question sits on a single dm-crypt device.

Code: Select all

btrfs filesystem usage / 
Overall:
    Device size:                   1.82TiB
    Device allocated:            909.02GiB
    Device unallocated:          953.47GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                        702.44GiB
    Free (estimated):              1.13TiB      (min: 1.13TiB)
    Free (statfs, df):             1.13TiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:900.01GiB, Used:698.32GiB (77.59%)
   /dev/mapper/root      900.01GiB

Metadata,single: Size:8.98GiB, Used:4.11GiB (45.81%)
   /dev/mapper/root        8.98GiB

System,single: Size:36.00MiB, Used:128.00KiB (0.35%)
   /dev/mapper/root       36.00MiB

Unallocated:
   /dev/mapper/root      953.47GiB
Can it still be fixed, i.e. put into an internally consistent state, or should I create a new fs and transfer the files from the image?
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Sat Nov 18, 2023 8:20 pm

Is "rsync -avxHAX --numeric-ids /mnt/source/ /mnt/destination" the right command to get as close as possible to a 1:1 copy, or is there something I should add/omit? For example, that command retries files that fail to copy; can that be prevented?
wanne32
Tux's lil' helper
Posts: 85
Joined: Sat Nov 11, 2023 6:10 am


Post by wanne32 » Sat Nov 18, 2023 10:30 pm

You even set system manually to single? Couldn't you spare an extra 128 KiB for data safety on a 1 TiB disk? This kills most of the repair options. Or was that done by --repair?
krotuss wrote:Is "rsync -avxHAX --numeric-ids /mnt/source/ /mnt/destination" right command to get as close as possible to 1:1 copy,
Before you do that, you can run btrfs restore [DEV] [DIR], which restores all files to a given directory. It has the -i option to restore even files that have checksum errors. You can also mount with -o ro,rescue=all and then use btrfs send to recover all your snapshots etc., which would be the closest to a full 1:1 copy. Or just mount with -o ro,rescue=ignorebadroots to ignore files with checksum errors.

Just out of interest: what does btrfs device stats say?
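(A sketch of those recovery options. The device and destination paths are placeholders; the commands are wrapped in a function so nothing runs by accident.)

```shell
# Sketch only -- arguments are placeholders.
recover_from_damaged_btrfs() {
    dev=$1     # e.g. the dm-crypt device node
    dest=$2    # directory on a healthy filesystem
    # copy files off the damaged fs; -i ignores checksum errors
    # instead of aborting on the first bad file
    btrfs restore -i "$dev" "$dest"
    # alternatively, mount read-only with all rescue heuristics
    # enabled and copy with normal tools or btrfs send from there:
    # mount -o ro,rescue=all "$dev" /mnt/rescue
}
```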
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Sat Nov 18, 2023 11:06 pm

Code: Select all

btrfs device stats /
[/dev/mapper/root].write_io_errs    0
[/dev/mapper/root].read_io_errs     26418
[/dev/mapper/root].flush_io_errs    0
[/dev/mapper/root].corruption_errs  85725
[/dev/mapper/root].generation_errs  0
I haven't knowingly set it to single (I don't even know what that means).

I have read that "btrfs rescue" can bring back older versions of files and artifacts in general; I don't want anything like that (especially if such artifacts are not clearly marked). Currently the partition mounts, the system boots from it, and I am happy with the data that is accessible on it. My only problem is that it contains seemingly irreparable errors, so using rsync to transfer the files to a new filesystem seems like a viable option. I intend to keep the original image for some time, in case I notice anything important missing.
wanne
n00b
Posts: 15
Joined: Thu Oct 28, 2010 9:04 pm


Post by wanne » Sun Nov 19, 2023 4:52 am

Code: Select all

[/dev/mapper/root].write_io_errs    0
[/dev/mapper/root].read_io_errs     26418 
[/dev/mapper/root].corruption_errs  85725 
No write errors, but still a lot of missing and wrong data. Looks ugly. Which SSD manufacturer was that?
krotuss wrote:My only problem is that it contains, seemingly, irreparable errors, so using rsync to transfer files to new filesystem seem like viable option for me. I intend to keep original image for some time, in case that I notice anything important missing.
Like I said, consider the mount options, even if you are using rsync. If you use btrfs send, you will keep data that rsync can't handle, like different versions of files, deduplicated data, etc.
krotuss wrote:I have read that "btrfs rescue" can bring back older version of files and generally artifacts
These are not artifacts. If you have generation_errs (which you do not have), btrfs can't determine which is the newest fully functional version (usually because you didn't unmount the filesystem and later killed the log somehow). In that case rsync would return nothing, while btrfs rescue will restore a plausible version. But be aware: btrfs rescue is not btrfs restore. The first repairs a broken filesystem (this is the command you wanted instead of the destructive btrfs check --repair). btrfs restore copies data off a broken device (kind of like rsync, but able to recover broken files).
krotuss wrote:I haven't, knowingly, set it to single (I don't even know what does that means).
The system chunks store where your data is located. Since this information is not very big but very important, btrfs stores it twice by default. (Likewise, ext4 stores the superblock log(filesystem size) times.) Most recovery options use the second copy to restore corrupted data. You can disable writing the second copy of system and metadata chunks with the -m single option of mkfs.btrfs, or remove it later with btrfs balance start -mconvert=single -sconvert=single.
Sorry for my bad English
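(A sketch of checking and changing those profiles. /mnt is a placeholder mount point; the commands are wrapped in a function so nothing runs by accident, and if I remember correctly, converting system chunks additionally requires --force.)

```shell
# Sketch only -- /mnt is a placeholder.
show_and_convert_profiles() {
    mnt=$1
    # show current data/metadata/system profiles (single, DUP, ...)
    btrfs filesystem usage "$mnt"
    # on a healthy fs, add a second copy of metadata and system
    # chunks for future self-repair; system chunks need --force:
    # btrfs balance start -mconvert=dup -sconvert=dup --force "$mnt"
}
```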
krotuss
Apprentice
Posts: 253
Joined: Fri Aug 01, 2008 6:11 pm


Post by krotuss » Sun Nov 19, 2023 10:06 am

It's a Samsung 870 EVO; there is a thread linked in the original post.

I used btrfs in the first place to have checksums and prevent silent data rot. Bringing back plausible versions, or files with corrupted data but fixed checksums, goes right against that. That is, unless btrfs can clearly mark those files as corrupted, put them into a separate subvolume, etc.

Will btrfs send work on a filesystem where btrfs scrub aborts? Will the result be a consistent filesystem, i.e. won't it copy the errors over? I haven't used any "advanced" btrfs features like snapshots, so there probably won't be much deduplicated data on that partition.

"-m single" seems to be the mkfs.btrfs default for SSDs.
wanne32
Tux's lil' helper
Posts: 85
Joined: Sat Nov 11, 2023 6:10 am


Post by wanne32 » Tue Nov 21, 2023 8:19 pm

krotuss wrote:Will btrfs send work on filesystem where btrfs scrub aborts?
Only if you set the mentioned mount options.
krotuss wrote:That's unless btrfs can clearly mark those files as corrupted, or put them into separate subvol, etc.
In theory it writes the affected block numbers into dmesg; from those you can get the inode, and from the inode the filename. IMHO this is not a usable workflow, but I don't know a better one. I always use btrfs in raid1 or dup mode; there, scrub simply overwrites the broken copy with an intact one, which makes things much easier.
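(That dmesg → inode → filename chain can be done with btrfs inspect-internal. A sketch; the logical address and mount point are placeholders, and the function is never called so nothing runs by accident.)

```shell
# Sketch only -- arguments are placeholders.
resolve_corrupt_file() {
    mnt=$1
    logical=$2   # byte number from a dmesg "error at logical N" line
    # map the logical address to the file path(s) using that block:
    btrfs inspect-internal logical-resolve "$logical" "$mnt"
    # or, if dmesg reported an inode number instead:
    # btrfs inspect-internal inode-resolve <INO> "$mnt"
}
```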
krotuss wrote:I have used btrfs in the first place to have checksums and prevent silent data rot. Bringing back plausible versions, or files with corrupted data but fixed checksums goes right against that.
As I said: both variants have options for that; you do not have to set them. The bigger problem: --repair will have "fixed" some of the wrong checksums anyway... The reason I would go with restore instead of rsync is that I know it will just skip folders that are unreadable; I don't know how rsync behaves in such cases. But you can just try.
Red Hat wrote:"-m single" seems to be mkfs.btrfs default for SSDs.
Red Hat uses really old kernels... But I assume your filesystem is also older.
man wrote:Up to version 5.14 there was a detection of a SSD device (more precisely if it's a rotational device, determined by the contents of file /sys/block/DEV/queue/rotational) that used to select single. This has changed in version 5.15 to be always dup.
The idea behind that was that SSDs tend to deduplicate data internally anyway to reduce wear, so storing the same data twice makes no sense. That is not true for your encrypted device, though; and since there are many other setups where it is not true, the behavior was changed.