Gentoo Forums
A few BTRFS questions...

The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jun 24, 2016 3:02 am    Post subject: A few BTRFS questions...

I am about to begin using BTRFS in a RAID environment both at home and at my job, so I switched the home partition on my work laptop to BTRFS today. I did this by moving everything in /home, except lost+found, to an external disk. I then shrank the partition in front of it from 500GiB to 250GiB, deleted the home partition (the last partition on the disk), which was 4xxGiB, and recreated it at almost 750GiB. I formatted it as BTRFS with the parameters "-d dup -m dup -L Home", mounted it with "compress=zlib" since space is more important to me, and moved my data back. All is good, but I do have a few questions.
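For reference, the format and mount described above would look roughly like this (the device name is an assumption):

Code:
mkfs.btrfs -d dup -m dup -L Home /dev/sda3    # duplicated data and metadata, labelled "Home" (device is an assumption)
mount -o compress=zlib LABEL=Home /home       # mount with transparent zlib compression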

First, I am not sure how the copy-on-write works. It is enabled and my understanding of it is limited. I did read that disabling CoW will disable checksums and compression, so I do not want it disabled. However, I would like to understand it so I know what's going on beneath the surface. I have two separate ideas on how it works, but I am not sure which is correct.

The first idea is that it works like this. A file is created. Later I edit the file either by appending or changing the data in it. When I save the new version it copies the existing data to a new spot and appends my new data, then marks the old location as free. This would prevent fragmentation, but I do not see the point since we can defragment BTRFS. I assume the old location would be wiped when I do a weekly scrub.

The second idea is that it works kind of like shadow copies in NTFS, allowing the original to remain and simply point to the current version. This would allow me to "roll back" changes to a file if I needed to, but this would also take more space on the system. Maybe a scrub would delete older copies?

So which is right? Are they both wrong? If it does work like shadow copies, how would I roll back a file? Thanks for helping me get my head around this!

Second, how does the checksum stuff work? What I mean is, assume I have a picture stored and a sector it's on fails. I have duplicate data and metadata, so I know it can be saved, but when does this checking occur and how would I know? Does this just magically happen and the system logs show the errors, or do I somehow manually make it check? Maybe during a scrub?
_________________
Ever picture systemd as what runs "The Borg"?
haarp
Guru

Joined: 31 Oct 2007
Posts: 535

PostPosted: Fri Jun 24, 2016 8:54 am

Both are wrong. Copy-on-write means that you can create copies of files whose data blocks are only actually duplicated when the copy is written to. Try it:

Code:
cp --reflink=auto hugefile hugefile2


is instantaneous. Both files point to the same data blocks. Only when you edit hugefile2 will it actually copy the blocks and write them. This is also a very useful concept for snapshotting.
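The same trick at the subvolume level is what a snapshot is; a minimal sketch, assuming /home is a btrfs subvolume (path and snapshot name are made up):

Code:
btrfs subvolume snapshot /home /home/.snap-2016-06-24    # instant, shares all data blocks with the original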

As for checksums, afaik they are verified on every read and during a scrub. You can manually trigger scrubs too.
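Triggering and checking a scrub by hand would look something like this (mount point is an assumption):

Code:
btrfs scrub start /home     # kick off a background scrub of the whole filesystem
btrfs scrub status /home    # show progress and any checksum errors found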

This article might prove to be interesting: http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jun 24, 2016 12:47 pm

I read that article ages ago, but only just now took the dive into BTRFS. The whole CoW thing just had me a tad confused. I had read something along the lines of: when a file is appended to, the entire file is written elsewhere and, once the write completes, the old location is marked as free. I also know I read something about old versions of files on the BTRFS mount options page. Below is the information on that page about the "nodatacow" mount option, which I am NOT using.
Quote:

Do not copy-on-write data for newly created files, existing files are unaffected. This also turns off checksumming! IOW, nodatacow implies nodatasum. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large. NOTE: switches off compression !

Also note that disabling CoW kills BTRFS in general according to that. No checksums or compression!
_________________
Ever picture systemd as what runs "The Borg"?
Roman_Gruber
Advocate

Joined: 03 Oct 2006
Posts: 3846
Location: Austro Bavaria

PostPosted: Fri Jun 24, 2016 2:50 pm

Quote:
I am about to begin using BTRFS in a RAID environment both at home and at my job


Are you sure you want to use something that is not very well proven for work?

When it's for work, the data needs some guarantee of integrity, some assurance that it will not be corrupted, and so on, so a well-tested filesystem like ext3 may be the better choice. I have my reasons for writing ext3; even ext4 is not that well tested.

Quote:
how would I roll back a file?


lvm2 + snapshots is worth looking into.
An rsync-based backup, or any other backup solution that generates a dated directory structure for the files, may also be worth looking into.
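A rough sketch of the dated rsync idea (paths and dates are made up for illustration):

Code:
# keep one dated tree per day; unchanged files are hard-linked against yesterday's copy
rsync -a --link-dest=/backup/home-2016-06-23 /home/ /backup/home-2016-06-24/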
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jun 24, 2016 4:39 pm

BTRFS is good enough for Google, Amazon, and many others. If people would check the dates on posts reporting issues with BTRFS, almost all of them are 2010~2012 or earlier. It is stable and works well unless you mean RAID 5 or 6, though it is incomplete, lacking a few extras like data deduplication.

As for ext4, I have been using it since 2009 without a single issue on MANY systems.

I was also referring to rolling back as stated on the BTRFS mount options page, which I quoted.
_________________
Ever picture systemd as what runs "The Borg"?
vaxbrat
l33t

Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Fri Jun 24, 2016 10:03 pm    Post subject: CoW is the special sauce for btrfs and zfs

Copy on Write (CoW) is what gives btrfs its special sauce over ext4 and xfs. Approach it by understanding the concept of extent-based files first. The idea is that when you create a file, rather than just allocating all of its blocks at once, you instead create a map of extents that will eventually get their contents. You may also hear this concept called "sparse file allocation".

The traditional copy-based file is one extreme for CoW: the file consists of a single extent because you wrote it all at once, sequentially. At the other extreme, you may have an extent for each 512-byte (or whatever size) block you write to the file as an individual I/O. btrfs "defragmentation" is the process of taking this large number of extents and merging them into a much smaller number of multi-block extents.
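If you ever want to trigger that merge by hand, it's a one-liner (mount point is an assumption):

Code:
btrfs filesystem defragment -r -v /home    # recursively defragment, printing each file as it goes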

The concept of a snapshot, whether of an entire filesystem or of a single file using the reflink option to cp, such as:

Code:
cp --reflink=always original_file snapshotted_file


takes the extent map that exists for the source and creates a new file with initially the same map. The files may then drift apart extent-wise as writes update one or the other. Because we are only copying extent maps and not duplicating entire blocks of data on disk, the operation takes on the order of a second or less, even for very large files such as the container for a virtual machine image. The other neat thing is that the operation is atomic, so you can snapshot a hot VM without worrying about its consistency.

The Holy Grail known as filesystem de-dupe looks at the extent maps for the filesystem from another angle. If the checksums (SHA512 or whatever) of two extents are identical, the de-dupe algorithm can discard one of them and then use the remaining extent in the maps of whatever files referenced either one.

So CoW and the btrfs checksumming capability are not that related. On your other recent thread,

https://forums.gentoo.org/viewtopic-t-1046730-highlight-.html

I mentioned making sure that you format the raid striped mirror set so that both metadata (-m raid10) and data (-d raid10) are set. By default, btrfs only dupes the metadata piece, which includes the extent maps. If you don't dupe your data portion, your scrubbing will be able to detect bitrot on the actual file data but will not be able to heal it.
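That format step would look roughly like this (device names are assumptions):

Code:
mkfs.btrfs -m raid10 -d raid10 -L Data /dev/sdb /dev/sdc /dev/sdd /dev/sde    # raid10 needs at least four devices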

Now that you know how snapshots happen, you can also understand why my friend with the ReadyNAS had steam coming out of his ears as he watched rsync after rsync get put onto his NAS for almost a year while the filesystem grew slowly. Granted he was getting a boatload of extent maps piling up in new inodes, but the basic data (mostly source code) wasn't changing all that much day to day.
krinn
Watchman

Joined: 02 May 2003
Posts: 7470

PostPosted: Sat Jun 25, 2016 10:17 pm

The_Great_Sephiroth wrote:
BTRFS is good enough for Google, Amazon, and many others. If people would check the dates on posts reporting issues with BTRFS, almost all of them are 2010~2012 or earlier. It is stable and works well unless you mean RAID 5 or 6, though it is incomplete, lacking a few extras like data deduplication.

Even BTRFS users will tell you not to do RAID with it.
First, I'm afraid your hopes about BTRFS maturity and stability are optimistic. From what I see, 2016 still has plenty of bugs; that's not so bad, it means the project is alive.

But I really doubt Google would use such a filesystem. Google doesn't care about filesystem features; they only care about performance, and they must run an unthinkable number of disks. I don't think a feature-rich FS (and features generally mean more options, more complexity, and of course lower performance) would help them, while something faster and lighter would do a better job, since they are surely using RAID. A filesystem built specifically for networking and RAID, with its own replication and error correction, making sure data is written quickly, everywhere, and reliably, would help them far more; they don't really have a problem saving disk space, nor any need for software FS features where hardware can do the job.

So your "good enough for Google" is, to me, more propaganda than fact. And while I was wondering what filesystem Google might actually use, I found they simply built their own: https://en.wikipedia.org/wiki/Google_File_System

For the same reasons, I'm afraid the claim that Amazon uses it is most likely false as well, even though I have no link to offer; while reality sometimes defies logic, generally logic and reality match.

While I see no problem with you using BTRFS, you said you will use it at work, and I feel the need to warn you: the assumption that it must be reliable because someone told you Google and Amazon (and whatever other big companies that handle lots of data) use it may have fooled you.
And while it is perfectly fine to follow random suggestions from random users on a random forum (yeah, just like mine!) when they have no real impact, when it comes to important things (like your work data) you had better think twice before following them without checking.

Going by your post, you might be risking your work data without knowing it, and you might also put other people's work data at risk if they follow your lead because, hey, Google uses BTRFS!
Reliability is the only way to keep your data and work safe; even backups won't help, because backing up bad or corrupted data won't bring it back as good data.


But BTRFS is an awesome FS to use; NASA, the NSA and CERN use it. (*)



*: in case you missed it, that's a joke; I don't think they would use such a filesystem either, seriously.
Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

PostPosted: Sat Jun 25, 2016 11:05 pm    Post subject: Re: A few BTRFS questions...

Since vaxbrat gave good answers for the rest already...
The_Great_Sephiroth wrote:
Second, how does the checksum stuff work? What I mean is, assume I have a picture stored and a sector it's on fails. I have duplicate data and metadata, so I know it can be saved, but when does this checking occur and how would I know? Does this just magically happen and the system logs show the errors, or do I somehow manually make it check? Maybe during a scrub?

I think that happens at read time but don't have enough data to be sure without reading kernel code; I've been really lucky with hard drive reliability for the last 6-7 years...

Putting a `btrfs scrub start -B /` command in a cron job is a very good idea all the same, do it. It'll run without adversely affecting performance.
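A minimal sketch of such a crontab entry (the schedule is an assumption; adjust to taste):

Code:
# /etc/crontab: scrub the root filesystem every Sunday at 03:00, -B waits for completion
0 3 * * 0    root    /sbin/btrfs scrub start -B /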
vaxbrat
l33t

Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Sun Jun 26, 2016 1:51 am    Post subject: how checksumming works

First, remember that btrfs filesystems consist of both a data and a metadata component, as well as a reserved system area. For example, here is the btrfs version of the df command for a single-disk filesystem:

Code:
thufir ~ # btrfs fi df /thufirraid
Data, single: total=2.63TiB, used=2.62TiB
System, DUP: total=8.00MiB, used=304.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=6.00GiB, used=4.65GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


By default, btrfs duplicates the metadata and system pools but does not mirror or raid the data pool. Thus thufirraid will be able to detect bit rot, but will not be able to do anything to heal it.
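If you wanted the data pool duplicated too on a single-disk filesystem like this, the profile can be converted after the fact; a sketch, assuming a reasonably recent kernel and btrfs-progs (mount point taken from the df above):

Code:
btrfs balance start -dconvert=dup /thufirraid    # rewrite data chunks with the DUP profile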

Here is an example of a simple two disk mirror which happens to be one of my ceph object stores:

Code:
thufir ~ # btrfs fi df /var/lib/ceph/osd/ceph-0
Data, RAID1: total=4.00TiB, used=3.99TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=592.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=35.00GiB, used=31.97GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


btrfs will be able to detect and heal bit rot in this object store as the ceph OSD does its normal job of reading and writing object shards with it. On the write of an extent for a file, btrfs calculates a checksum of the data that will go into the extent and then this information is stored in the metadata pool as the extent map for the file is updated. Writes to update the metadata pool also have a corresponding checksum calculated and then stored.

Whenever an extent is read back from disk as a part of normal i/o, the btrfs checksum also gets retrieved and then used to verify the data. If it doesn't match, the other drive in a mirror set or the xor drive in a raid set gets read to get the alternative buffer for data. If the checksum verifies ok with the other buffer, it is then used to re-write the extent back to the drive or drives that have suffered the bit rot. The whole process is transparent to the user, but you will see btrfs kernel messages in your /var/log/messages whenever it decides that it needed to heal an extent. I have actually seen these myself.
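If you want to check whether any healing has happened without grepping the logs, there is also a per-device counter query; a sketch (mount point taken from the df above):

Code:
btrfs device stats /var/lib/ceph/osd/ceph-0    # read/write/flush error and corruption counters per device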

So while your normal use of your mirrored or raided btrfs filesystem will take care of bit rot automatically, you are probably not going to touch every part of your data on a normal basis. Thus we have the btrfs scrub command. It schedules a background process which walks through everything in the system and metadata pools and then everything in the list of extents to read all of the data that has been written out. You can continue to read and write your filesystem while this is going on, but you will notice the added latency as your process competes with the background scrub.

In my case, I have to manually schedule the scrubs on /thufirraid. However I don't need to for the ceph object stores. The ceph cluster breaks up the contents of its object stores into a series of Placement Groups (PGs). Each PG consists of a Primary which goes into a directory tree in one OSD daemon's filesystem and then 1 or more replicas which go to other OSD filesystems, usually on entirely different host machines. If you want to know more start here:

http://docs.ceph.com/docs/master/rados/operations/crush-map/

ceph also computes and uses checksums as it breaks objects into shards and then distributes them to PGs out on the OSDs. It has its own concept of background processes which, among other things, perform scrubs of the PGs automatically. It's designed to touch both the metadata for a PG (normal scrub) and the data itself (deep scrub) of every PG in the cluster every two weeks. During this process, btrfs would transparently heal any bit rot that may have happened. Anything else that may possibly go wrong will then get handled by the ceph consistency check. Having at least two replicas of every object and more than one host means that I can lose a whole machine without losing my data.

So, to counter the naysayers, I state unequivocally that YES, I do use btrfs for production data at my work and I manage to sleep very well at night as a consequence. I even trust it on consumer hardware at home which lacks ECC memory, but that's because I use ceph to provide defense in depth and also very good performance, since my i/o spreads out across 4 hosts acting as OSD servers.
vaxbrat
l33t

Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Sun Jun 26, 2016 4:22 am    Post subject: btrfs as system drive and how it raids or mirrors

I just looked around my cluster and realized that 4 out of 6 of my hosts are running btrfs as their system roots on ssd drives. The other two are ext4 on ssd but that's only because I haven't touched them since I built them a number of years back before btrfs became tenable as a system drive (pre-grub2). When I looked at the history of my now outdated howto:

https://wiki.gentoo.org/wiki/Btrfs/Native_System_Root_Guide

I realized that I've been using it as system roots since around Jan of 2014 and probably first started doing that over Christmas break of 2013. On an unrelated note, I suspect I've been using ssd's for my system drives now since at least the beginning of 2013 or whenever it was that TRIM support got added to ext4 in whatever kernel it was (2.6.29?). I've yet to see one of my ssd drives get "bricked" in case some of you out there are still fence sitting about that. I'm probably still playing Russian Roulette with an OCZ Vertex 3 running around somewhere :P

There's another somewhat confusing difference between what btrfs does for raid and the traditional approaches: your choice of "serious" hardware raid cards (e.g. HP Smart Array, LSI, etc.), mobo-based pseudo-raid such as the native Intel junk (aka "scary raid", and called that for a reason), mdadm-based software raid and, of course, lvm volumes. zfs and btrfs can work on top of all of these, but you really want to cut out the middleman. In fact, btrfs is perfectly happy if you give it a plain old drive without even bothering to put a partition table on it.

The traditional raid concept is that you get and install n of exactly the same size drive (preferably the same brand and model #) and then buy one or a few extra as spares. This is because the hardware raid controllers tend to get their panties in a bunch if you give them a hodgepodge of odd-sized drives to work with (and btw interfere with Dell or HP's profit model for their "Enterprise" offerings). The smart player who does a software raid such as mdadm, thus freeing their data from being held hostage by a proprietary controller, quickly finds that the array size is based on the smallest drive that was added.

lvm, zfs and btrfs do things a bit differently, in that individual drives get strung together and then the raid i/o concept gets applied in a somewhat smarter fashion. You can throw whatever you want into the pool and it will get used as efficiently as possible. So let's go back and look at my ceph osd.0 filesystem again:

Code:
thufir ~ # cat /proc/mounts | grep ceph
/dev/sdc /var/lib/ceph/osd/ceph-0 btrfs rw,noatime,compress=lzo,space_cache,autodefrag 0 0
192.168.2.5,192.168.2.6:/ /kroll ceph rw,noatime,name=admin,secret=<hidden>,acl 0 0


That /kroll filesystem is actually an instance of cephfs, which runs on top of the ceph object store and provides a "mostly POSIX compliant" filesystem that I can share out over nfs and samba to Winders and non-ceph cluster members. Whatever I write there gets "sharded" and written out to 2 different btrfs mirrors in object stores somewhere on my 4 osd hosts. The two IP addresses in the mount entry are the primary and backup Metadata Servers (MDS) that manage the filesystem.

Notice the /dev/sdc for the btrfs mount. I'm using drives without partition tables since they are used entirely for btrfs and thus mbr or guid based tables are not necessary. I'm letting btrfs do its own raid management for the filesystem.

Here's the df again:

Code:
thufir ~ # btrfs fi df /var/lib/ceph/osd/ceph-0
Data, RAID1: total=4.00TiB, used=3.99TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=592.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=35.00GiB, used=31.97GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


Notice the nice even numbers in the total versus used for each of the filesystem pools. btrfs will dynamically grow the pools as necessary to hold its data until it runs out of real estate. Unlike zfs, it can also do a shrink (via a btrfs balance operation).
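Growing a pool by throwing in another drive and then spreading the existing chunks across all members looks roughly like this (device name is an assumption; mount point from above):

Code:
btrfs device add /dev/sdf /var/lib/ceph/osd/ceph-0    # add another drive to the pool
btrfs balance start /var/lib/ceph/osd/ceph-0          # rebalance existing chunks across all members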

Remember I mentioned that the osd.0 store was actually a "simple two drive mirror"? I bent the truth a little:

Code:
thufir ~ # btrfs fi show
Label: 'thufirraid'  uuid: 5f6e51a3-d8e7-41e1-bdb9-3cd9be0bf7fe
        Total devices 1 FS bytes used 2.63TiB
        devid    1 size 3.64TiB used 2.64TiB path /dev/sdb

Label: 'cephosd0'  uuid: 87a86762-05f6-44fa-860b-f96df085d967
        Total devices 3 FS bytes used 4.02TiB
        devid    1 size 3.64TiB used 2.69TiB path /dev/sdc
        devid    2 size 3.64TiB used 2.69TiB path /dev/sdd
        devid    3 size 3.64TiB used 2.69TiB path /dev/sde

Label: 'pny128_1'  uuid: 7d382834-3b5f-413c-98ad-f313bcae2ca4
        Total devices 1 FS bytes used 23.10GiB
        devid    1 size 115.21GiB used 26.04GiB path /dev/sdk2


My ceph object store actually spans three 4 tb drives and not the traditional two so I still have some breathing room:

Code:
thufir ~ # df
Filesystem                  1K-blocks        Used  Available Use% Mounted on
/dev/sda2                   197113132    47609744  139467576  26% /
/dev/sdb                   3907018584  2825318076 1079365828  73% /thufirraid
/dev/sdb                   3907018584  2825318076 1079365828  73% /raid
/dev/sdb                   3907018584  2825318076 1079365828  73% /home
/dev/sdc                   5860527876  4319065228 1027085716  81% /var/lib/ceph/osd/ceph-0
192.168.2.5,192.168.2.6:/ 23442108416 18475167744 4966940672  79% /kroll


btrfs does a "raid 1" thing by writing data to 2 of the 3 drives in the set while making sure that things are evenly spead out. I would need to give it four or more drives to do a raid10 mirror where it would begin to do striping as it wrote. I could have set this up as a raid5 array and bascially have 8tb of space instead of six, but I would be trading read performance for more storage space. There's also the whole "btrfs isn't ready for prime time raid5 or raid6" thing that people seem to be harping on. But drives are cheap and on a cluster you can see almost forever:

Code:
thufir ~ # ceph -w
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_WARN
            too many PGs per OSD (512 > max 300)
            noout flag(s) set
     monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
            election epoch 11964, quorum 0,1,2,3,4 0,1,3,4,5
      fsmap e1525: 1/1/1 up {0:0=2=up:active}, 1 up:standby
     osdmap e26424: 4 osds: 4 up, 4 in
            flags noout
      pgmap v17704231: 768 pgs, 4 pools, 7209 GB data, 8529 kobjects
            16709 GB used, 4736 GB / 22356 GB avail
                 767 active+clean
                   1 active+clean+scrubbing

2016-06-26 00:05:52.214055 mon.0 [INF] pgmap v17704231: 768 pgs: 1 active+clean+scrubbing, 767 active+clean; 7209 GB data, 16709 GB used, 4736 GB / 22356 GB avail


Ignore warnings about too many PGs. I used to do single drive btrfs filesystems as object stores before I changed strategies to going with mirror sets and then dropping the ceph replica count back down from 3 to 2. It saves me a fair amount of memory and lets me take advantage of the btrfs self healing from bit rot. noout is useful for very small clusters like this (Cern runs one with hundreds of nodes and petabytes of storage). I'm down to 4tb free on /kroll... will need to add some more object store hosts at some point 8)

One final thing: Did you notice the pny128_1 entry when I looked at btrfs filesystems? That's actually a 128gb usb3 thumb drive that I use to go back and forth between home and work. I put a gentoo live DVD iso on the first 4gb of it and then used the rest as a btrfs volume. I've been doing this or one based on the latest system rescue cd iso since the sweet spot of the thumb drive market hit around 32gb.
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jul 01, 2016 9:31 pm

Thanks for the in-depth answers, everybody. I have stumbled onto a new thing I am curious about. It seems that multiple partitions have identical UUIDs on this new system. This is a brand-new system with brand-new drives. I partitioned /dev/sda using parted, then cloned the partition table to /dev/sdb with "sgdisk -R /dev/sdb /dev/sda" and randomized the new partition UUIDs with "sgdisk -G /dev/sdb". After that I created the filesystems with btrfs. If it was a small partition (less than 6GiB) I used "mkfs.btrfs -m raid1 -d raid1 -L Whatever -M /dev/sda2 /dev/sdb2". If it was above 6GiB I dropped the "-M" parameter. Now, however, when I list disks by UUID, I see only sdb or sda. Both partitions have the same UUID, for example sda5 and sdb5. When listing by partuuid, I see them all and they are all unique. What is going on here? When I list with "btrfs fi show /dev/sd<x><y>" it shows a UUID which matches one of the UUIDs listed for the disk.
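Put together as commands, the sequence just described looks roughly like this (a sketch; device and partition numbers as in the post above):

Code:
sgdisk -R /dev/sdb /dev/sda                                        # clone sda's partition table onto sdb
sgdisk -G /dev/sdb                                                 # randomize the cloned disk and partition GUIDs
mkfs.btrfs -m raid1 -d raid1 -L Whatever -M /dev/sda2 /dev/sdb2    # small (<6GiB) mirrored filesystem, mixed mode (-M)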

So why are the UUIDs identical when I list /dev/disk/by-uuid, and why do they match the UUID shown for either partition involved in said RAID1 array? For example, why does the RAID1 built from /dev/sda2 and /dev/sdb2 have the same UUID listed for /dev/sda2 AND /dev/sdb2 in /dev/disk/by-uuid? I thought duplicate UUIDs were a bad thing?
_________________
Ever picture systemd as what runs "The Borg"?
frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Fri Jul 01, 2016 9:48 pm

Quote:
I thought that duplicate UUID's was a bad thing?


RAID is a special case: as long as it's redundant, it still has to work with either disk missing, so the UUID has to be on both disks.

It's the same for the much older mdadm RAID: you have the array UUID, which covers the whole array regardless of the number of disks (and is what goes in mdadm.conf), and then there's a device UUID unique to each individual member. None of these show up in /dev/disk/by-uuid at all, since they are more like background UUIDs such as PARTUUID; in /dev/disk/by-uuid you get the UUID of whatever is actually stored on the RAID itself.

This works for mdadm because it is a distinct RAID layer; btrfs rolls filesystem and raid into one, so it doesn't have that luxury.

It shouldn't matter which device the /dev/disk/by-uuid symlink points to as long as it's one of the devices that belong to btrfs, and then btrfs can go and look for its other devices by itself.
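As far as I know, that lookup is what `btrfs device scan` does (udev or the initramfs normally triggers it at boot); a sketch of running it by hand:

Code:
btrfs device scan        # register all btrfs member devices with the kernel
btrfs filesystem show    # list each btrfs filesystem and the member devices it found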

It would probably be bad if there was something else with the same UUID which isn't btrfs or a completely separate instance of btrfs.
Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

PostPosted: Fri Jul 01, 2016 9:51 pm

It's been a while since I tried using btrfs raid, but that's by design; it's how btrfs finds the drives when assembling the raid at boot time. (It also means, among other things, that you can mount an entire btrfs raid by mounting any single disk of the array; none of that /dev/mapper/ business to remember.)
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jul 01, 2016 9:56 pm

OK, that makes sense. So if UUID abcd1234 pointed to /dev/sda1 and /dev/sdb1 and sda died, it could mount sdb and keep on trucking until I replace sda. Thanks! I was worried.
_________________
Ever picture systemd as what runs "The Borg"?
frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Fri Jul 01, 2016 9:58 pm

Same story with LVM, by the way. Each PV has its own UUID, but they all share the same metadata: volume group name, volume group UUID, and so on. The only redundant part here is the metadata itself, so regardless of which disk is missing, LVM can tell which is which and which is missing. On the outside the LVM UUIDs are rarely used at all; it relies much more on names, which is why, if you put the disks of two systems together and they happen to use the same volume group name, you have to rename one first (the rare case where you have to resort to a UUID to identify one of them). There are quite a lot more UUIDs than you see on the surface, so it's easy to get confused. Some UUIDs matter more than others... (you can totally use the same PARTUUID for all your partitions and never notice a problem, and you can get by without using them at all).
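For that rare rename-by-UUID case, a rough sketch (the UUID placeholder and new name are made up):

Code:
vgs -o vg_name,vg_uuid             # find the UUID of the clashing volume group
vgrename <vg-uuid> vg_old_disks    # rename it by UUID so the names no longer collide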