mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
Posted: Thu Sep 14, 2017 9:39 am Post subject: LVM cache recurring corruption -- how to reinitialize device |
Hello all!
Since yesterday I have had a strange problem with my LVM RAID 5 (HDD) + LVM cache (SSD) setup. Some details:
- this is a relatively new minimal setup (fresh install on 25 August 2017) on a J3455-ITX (4-core Celeron): cryptsetup plus LVM's built-in RAID 5, 1 x SSD as /dev/sda (rootfs/system is not encrypted), 5 x encrypted 1.5 TB Samsung HDDs: /dev/sdb1 -> /dev/mapper/crypt1 and so on
- on top of the crypt[1-5] devices sits LVM RAID 5 (no MD RAID layer)
- added a 100 GB SSD cache on the encrypted /dev/sda4 partition -- reference: https://rwmj.wordpress.com/2014/05/22/using-lvms-new-cache-feature/
All was working OK for about 3 weeks. Two days ago I had to replace one HDD as it began to fail.
So:
- I uncached the LVM (went OK) -- reference: https://rwmj.wordpress.com/2014/05/23/removing-the-cache-from-an-lv/
- removed the failing drive
- added a new encrypted 1.5 TB drive
- resynced the LVM RAID 5
- fsck -- all OK
- LVM and filesystem status -- healthy
- then I reattached the SSD cache; all seemed to be working OK, the cache was up and running
- first reboot: LVM missing, cache device corrupted
I had done the cache removal/attach a few times as a test before the drive was replaced, and it went without any errors then.
Since the cache corruption I have had to recover manually to uncache the LVM and get access to the data: I had to edit the vgcfgbackup output by hand and do a vgcfgrestore.
This is similar: https://www.redhat.com/archives/linux-lvm/2016-December/msg00015.html -- not a single tool could uncache an LVM with a corrupt cache.
Anyway, after the manual recovery the data was intact, so I tried again.
I did pvremove on the SSD, then pvcreate, vgextend and so on. None of those commands displayed any error message, so I was sure the SSD cache was properly reinitialized.
The LVM stayed cached until the next reboot (today), when it went missing again. It seems to be unusable now for reasons unknown to me.
Is there any way to check/wipe the SSD cache partition (apart from overwriting it with /dev/zero)?
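For reference, one way to clear stale on-disk signatures short of a full zero pass is wipefs from util-linux. A sketch only, demonstrated on a scratch file -- the real target would be the SSD cache partition (e.g. /dev/sda4 from the setup above), so triple-check the path before running anything like this:

```shell
# Sketch: erase known signatures (LVM2_member, crypto_LUKS, ...) that
# probing tools detect, instead of zeroing all 100 GB. The scratch file
# below stands in for the real partition.
DEV=$(mktemp)                                        # stand-in for /dev/sda4
dd if=/dev/zero of="$DEV" bs=1M count=8 status=none  # fake 8 MiB "device"
wipefs --all "$DEV"                                  # wipe all signatures found
# metadata also lives at the start of the device; belt and braces:
dd if=/dev/zero of="$DEV" bs=1M count=4 conv=notrunc status=none
rm -f "$DEV"
```

On a real SSD partition, blkdiscard (also util-linux) would additionally TRIM the whole range.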
mbar Veteran
Posted: Thu Sep 14, 2017 5:01 pm Post subject: |
I don't understand this:
Code: | root@carbon:~# lvcreate -n cache0meta -L 120M vg0 /dev/mapper/luks_cache
Logical volume "cache0meta" created.
root@carbon:~# lvcreate -n cache0 -l 25568 vg0 /dev/mapper/luks_cache
Logical volume "cache0" created.
root@carbon:~# ls /dev/mapper/
control luks_lvm1 luks_lvm3 luks_lvm5 vg0-cache0meta vg0-lvol0_rimage_0 vg0-lvol0_rimage_2 vg0-lvol0_rimage_4 vg0-lvol0_rmeta_1 vg0-lvol0_rmeta_3
luks_cache luks_lvm2 luks_lvm4 vg0-cache0 vg0-lvol0 vg0-lvol0_rimage_1 vg0-lvol0_rimage_3 vg0-lvol0_rmeta_0 vg0-lvol0_rmeta_2 vg0-lvol0_rmeta_4
root@carbon:~# cache_check
No input file provided.
Usage: cache_check [options] {device|file}
Options:
{-q|--quiet}
{-h|--help}
{-V|--version}
{--clear-needs-check-flag}
{--super-block-only}
{--skip-mappings}
{--skip-hints}
{--skip-discards}
root@carbon:~# cache_check /dev/mapper/vg0-cache0
examining superblock
superblock is corrupt
bad checksum in superblock
root@carbon:~# cache_check /dev/mapper/vg0-cache0meta
examining superblock
superblock is corrupt
bad checksum in superblock
root@carbon:~# lvconvert --type cache-pool --poolmetadata vg0/cache0meta vg0/cache0
Using 128,00 KiB chunk size instead of default 64,00 KiB, so cache pool has less then 1000000 chunks.
WARNING: Converting logical volume vg0/cache0 and vg0/cache0meta to cache pool's data and metadata volumes with metadata wiping.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Do you really want to convert vg0/cache0 and vg0/cache0meta? [y/n]: y
Converted vg0/cache0_cdata to cache pool.
root@carbon:~# ls /dev/mapper/
control luks_lvm1 luks_lvm3 luks_lvm5 vg0-lvol0_rimage_0 vg0-lvol0_rimage_2 vg0-lvol0_rimage_4 vg0-lvol0_rmeta_1 vg0-lvol0_rmeta_3
luks_cache luks_lvm2 luks_lvm4 vg0-lvol0 vg0-lvol0_rimage_1 vg0-lvol0_rimage_3 vg0-lvol0_rmeta_0 vg0-lvol0_rmeta_2 vg0-lvol0_rmeta_4
root@carbon:~# cache_check /dev/mapper/vg0-cache0meta
/dev/mapper/vg0-cache0meta: No such file or directory
root@carbon:~# cache_check /dev/mapper/luks_
luks_cache luks_lvm1 luks_lvm2 luks_lvm3 luks_lvm4 luks_lvm5
root@carbon:~# cache_check /dev/mapper/luks_cache
examining superblock
superblock is corrupt
bad checksum in superblock
|
This is on a newly wiped/trimmed SSD partition, with a NEW LUKS key, luksFormat, pvcreate, etc.
If I add the cache to my LVM RAID 5, it gets b0rked on reboot.
MageSlayer Apprentice
Joined: 26 Jul 2007 Posts: 252 Location: Ukraine
Posted: Fri Sep 15, 2017 9:15 am Post subject: |
Are you sure your SSD is OK?
Roman_Gruber Advocate
Joined: 03 Oct 2006 Posts: 3846 Location: Austro Bavaria
Posted: Fri Sep 15, 2017 11:59 am Post subject: |
Quote: | superblock is corrupt
bad checksum in superblock |
Did you check your cables, connections, power supply? Redo the wiring?
Of course, make sure the drive has the latest firmware. Is the drive healthy?
Quote: | /dev/mapper/vg0-cache0meta: No such file or directory |
Looks like it does not exist or is not visible to the operating system.
Sometimes I had to initialize it and make it visible to the OS with vgscan and vgchange -ay (or whatever the commands are, please check the manpages!). Sometimes only a reboot did the trick on some sysrescue-cd discs.
--
I never had a broken SSD; I sold my 5-year-old daily-used Plextor SSD recently. Out of habit I usually sell and replace HDDs every second or third year on average.
mbar Veteran
Posted: Fri Sep 15, 2017 4:51 pm Post subject: |
SSD seems to be healthy.
/dev/sda2 is a 16 GB system partition that has no trouble reading, writing, or updating.
SMART info is clean, dmesg too: no errors, not even CRC errors.
But I'll convert sda4 to plain ext4 and run some tests with files.
vg0-cache0meta is hidden by LVM after it is added to the pool as a cache for the HDDs, hence you can't check it explicitly.
mbar Veteran
Posted: Sat Sep 16, 2017 6:44 am Post subject: |
The SSD is OK, I just did a long SMART test plus a 25 GB copy and md5 checksum test on a BTRFS partition (of course I rebooted the machine in between):
Code: | smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.12.0-1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG MZNTD128HAGM-00000
Serial Number: S15YNYAD625624
LU WWN Device Id: 5 002538 50003cf55
Firmware Version: DXT2300Q
User Capacity: 128,035,676,160 bytes [128 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 16 08:29:39 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
(...)
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 4090
12 Power_Cycle_Count 0x0032 096 096 000 Old_age Always - 3664
177 Wear_Leveling_Count 0x0013 095 095 000 Pre-fail Always - 56
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 069 048 000 Old_age Always - 31
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 252
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 10202610775
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4089 -
root@carbon:~# free && sync && echo 3 > /proc/sys/vm/drop_caches && free
total used free shared buff/cache available
Mem: 8021580 232696 137676 5104 7651208 7497636
Swap: 7812092 4608 7807484
total used free shared buff/cache available
Mem: 8021580 232176 7703052 5104 86352 7599356
Swap: 7812092 4608 7807484
[dest dir on SSD, btrfs, after reboot] md5sum *
c7caf4e97cadf52a2489a176284ed8f4 1.mkv
1095527685e2aba668bee2c2958229af 2.mkv
19e2153cce10b4317e2add27747c4356 3.mkv
62b1cac0498a245a69056506e4d6356c 4.mkv
26f9230e3da7158a87c60d526ed7eb26 5.mkv
79e4d0db765a93ac5dee1b2ed1b53e39 6.mkv
a8c438783ee10fe75fcb3ba2cd636238 7.mkv
81e141a09074e7a16756fb472458df9e 8.mkv
706a2ee617186be6796b20d425eb836d 9.mkv
a13e62ddad20cc0e796f7e9a46c09a83 10.mkv
[source dir on HDD] md5sum *
c7caf4e97cadf52a2489a176284ed8f4 1.mkv
1095527685e2aba668bee2c2958229af 2.mkv
19e2153cce10b4317e2add27747c4356 3.mkv
62b1cac0498a245a69056506e4d6356c 4.mkv
26f9230e3da7158a87c60d526ed7eb26 5.mkv
79e4d0db765a93ac5dee1b2ed1b53e39 6.mkv
a8c438783ee10fe75fcb3ba2cd636238 7.mkv
81e141a09074e7a16756fb472458df9e 8.mkv
706a2ee617186be6796b20d425eb836d 9.mkv
a13e62ddad20cc0e796f7e9a46c09a83 10.mkv
|
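The manual eyeball comparison above can also be automated with md5sum's check mode -- record the checksums at the source and verify at the destination. A sketch on scratch files (the temp directories stand in for the HDD source and SSD destination):

```shell
# Record checksums where the files originate, then verify the copies;
# any mismatch is reported per file and md5sum exits non-zero.
SRC=$(mktemp -d); DST=$(mktemp -d); SUMS=$(mktemp)   # stand-in directories
printf 'payload' > "$SRC/1.mkv"
cp "$SRC/1.mkv" "$DST/1.mkv"
( cd "$SRC" && md5sum *.mkv ) > "$SUMS"              # checksums at source
( cd "$DST" && md5sum -c "$SUMS" )                   # prints "1.mkv: OK"
rm -rf "$SRC" "$DST" "$SUMS"
```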
mbar Veteran
Posted: Sat Sep 16, 2017 8:41 am Post subject: |
Seems I'm onto something:
Code: | root@carbon:~# dd if=/dev/zero of=/dev/mapper/luks_cache status=progress
433003520 bytes (433 MB, 413 MiB), 9 s, 48.1 MB/s ^C^C^C
root@carbon:~#
root@carbon:~# dd if=/dev/zero of=/dev/mapper/luks_cache bs=8M status=progress
3447717888 bytes (3.4 GB, 3.2 GiB), 3.00394 s, 1.1 GB/s
dd: error writing '/dev/mapper/luks_cache': No space left on device
489+0 records in
488+0 records out
4093915136 bytes (4.1 GB, 3.8 GiB), 3.80779 s, 1.1 GB/s
|
In short, I tried to wipe the encrypted block device (on top of the 100 GB /dev/sda4 partition) and the write failed after just over 4 GB with a "no space left on device" message.
dmesg has this at the end:
Code: | Sep 16 10:27:27 carbon systemd[1]: Stopped target Encrypted Volumes.
Sep 16 10:27:27 carbon systemd[1]: Stopping Cryptography Setup for luks_cache...
Sep 16 10:27:27 carbon systemd[1]: Stopped Cryptography Setup for luks_cache.
Sep 16 10:27:37 carbon kernel: CMCI storm detected: switching to poll mode
|
The encrypted block device "luks_cache" was simply kicked out of the system, and I think it is connected to the "CMCI storm" (the first time I have seen this).
https://forums.gentoo.org/viewtopic-p-8115134.html <-- similar hardware here (Intel Celeron J3355 CPU).
Slower writes to plain BTRFS ran at approx. 100 MB/s (copying from HDD) and the system handled 25 GB with no problem.
4 GB of high-speed writes (~1 GB/s) seems to overwhelm it. Where to look next -- software or hardware?
Is there any kernel switch that could help here (I'm still on the 4.12 series on this machine)?
mbar Veteran
Posted: Sat Sep 16, 2017 1:19 pm Post subject: |
Writing to the raw (unencrypted) sda4 device seems OK, no storm here:
Code: | 107487428608 bytes (107 GB, 100 GiB), 872.027 s, 123 MB/s
dd: error writing '/dev/sda4': No space left on device
25630+0 records in
25629+0 records out
107497914368 bytes (107 GB, 100 GiB), 887.286 s, 121 MB/s
|
mbar Veteran
Posted: Sat Sep 16, 2017 2:23 pm Post subject: |
OK, the question is:
why is a raw write to the encrypted device so much faster (I suspect some kind of buffering in the device-mapper layer?) than a raw write to the unencrypted device?
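One plausible answer (an assumption, not verified on this machine): without O_DIRECT or a flush, dd reports how fast pages enter the kernel's page cache, not how fast the device absorbs them, so the first few GB of a burst land in RAM at memory speed regardless of the dm layer. A sketch on a scratch file showing the difference a forced flush makes to the reported rate:

```shell
F=$(mktemp)                                          # stand-in for the device
# buffered write: dd returns as soon as the data sits in the page cache,
# so the reported rate can far exceed what the backing store sustains
dd if=/dev/zero of="$F" bs=1M count=64 status=none
# conv=fsync: dd flushes to stable storage before reporting, so the rate
# reflects the real device (on a raw block device, oflag=direct is similar)
dd if=/dev/zero of="$F" bs=1M count=64 conv=fsync status=none
rm -f "$F"
```

With ~8 GB of RAM, several GB of buffered writes at ~1 GB/s before the device has to catch up would match the dd numbers reported above.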
mbar Veteran
Posted: Sat Sep 16, 2017 4:50 pm Post subject: |
Small success here: just like in the referenced thread, upgrading the kernel to 4.13.x seems to have fixed the "superfast writes" and the device being kicked out of device-mapper:
Code: | dd if=/dev/zero of=/dev/mapper/luks_cache bs=4M status=progress
9441378304 bytes (9.4 GB, 8.8 GiB), 67.0111 s, 141 MB/s
...
|
Write speed looks normal and dmesg reports no storm.
mbar Veteran
Posted: Sun Sep 17, 2017 6:45 am Post subject: |
I reinitialized the LVM cache and I'm at a loss here:
Code: | root@carbon:~# cache_check /dev/mapper/luks_cache
examining superblock
superblock is corrupt
bad checksum in superblock |
EDIT:
I did a quick test:
Code: | root@carbon:~# lvconvert --type cache --cachepool vg0/cache0 vg0/lvol0
Do you want wipe existing metadata of cache pool vg0/cache0? [y/n]: y
WARNING: Data redundancy is lost with writeback caching of raid logical volume!
Logical volume vg0/lvol0 is now cached.
...
root@carbon:~# lvremove vg0/cache0
Do you really want to remove and DISCARD logical volume vg0/cache0? [y/n]: y
Flushing 0 blocks for cache vg0/lvol0.
Logical volume "cache0" successfully removed
|
No rebooting with the cache enabled yet.
mbar Veteran
Posted: Sun Sep 17, 2017 5:02 pm Post subject: |
This is the last episode in this series (I hope) -- or "how I learned to stop worrying and love the cache".
After extensive testing on the unencrypted device (I even moved the partition 20 gigabytes to another location and also tried a smaller size), wiped with zeroes, I came to the conclusion that the cache_check "superblock corruption" status is probably a bug. I even tried with a downgraded 0.6.1 version.
I disabled cache_check in lvm.conf and my LVM RAID 5 with BTRFS has already survived 3 reboots, with a btrfsck after each reboot (uncached and cached -- no errors).
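For reference, the lvm.conf knob is roughly this (a sketch only -- section and key names as documented in lvm.conf(5), so double-check against your LVM version):

```
# /etc/lvm/lvm.conf -- disable the external cache-metadata check that was
# producing the (probably spurious) "superblock is corrupt" result.
# An empty executable string skips the check entirely.
global {
    cache_check_executable = ""
    # alternatively, keep the check but pass relaxed options:
    # cache_check_options = [ "-q", "--clear-needs-check-flag" ]
}
```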
Code: | root@carbon:~# ./lvmcache-statistics.sh
-------------------------------------------------------------------------
LVM [2.02.173(2)] cache report of found device /dev/vg0/lvol0
-------------------------------------------------------------------------
- Cache Usage: 4.6% - Metadata Usage: 23.7%
- Read Hit Rate: 27.1% - Write Hit Rate: 65.9%
- Demotions/Promotions/Dirty: 0/5412/0
- Feature arguments in use: metadata2 writeback
- Core arguments in use : migration_threshold 2048 smq 0
- Cache Policy: stochastic multiqueue (smq)
- Cache Metadata Mode: rw
- MetaData Operation Health: ok
root@carbon:~# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lvol0 vg0 Cwi-aoC--- <5,46t [cache0] [lvol0_corig] 4,68 23,74 0,00 |
Now we wait.