Gentoo Forums
A few BTRFS questions...

The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jun 24, 2016 3:02 am    Post subject: A few BTRFS questions...

I am about to begin using BTRFS in a RAID environment both at home and at my job, so I switched the home partition on my work laptop to BTRFS today. I did this by moving everything in /home, except lost+found, to an external disk. I then shrank the partition in front of it from 500GiB to 250GiB, deleted the home partition (the last partition on the disk), which was 4xxGiB, and recreated it at almost 750GiB. I formatted it as BTRFS with the parameters "-d dup -m dup -L Home", mounted it with "compress=zlib" since space is more important to me, and moved my data back. All is good, but I do have a few questions.
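For reference, the format and mount described above would look roughly like this (the device name is an assumption):

Code:
mkfs.btrfs -d dup -m dup -L Home /dev/sda3    # duplicated data and metadata, labelled "Home" (device is an assumption)
mount -o compress=zlib LABEL=Home /home       # mount with transparent zlib compression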

First, I am not sure how the copy-on-write works. It is enabled and my understanding of it is limited. I did read that disabling CoW will disable checksums and compression, so I do not want it disabled. However, I would like to understand it so I know what's going on beneath the surface. I have two separate ideas on how it works, but I am not sure which is correct.

The first idea is that it works like this. A file is created. Later I edit the file either by appending or changing the data in it. When I save the new version it copies the existing data to a new spot and appends my new data, then marks the old location as free. This would prevent fragmentation, but I do not see the point since we can defragment BTRFS. I assume the old location would be wiped when I do a weekly scrub.

The second idea is that it works kind of like shadow copies in NTFS, allowing the original to remain and simply point to the current version. This would allow me to "roll back" changes to a file if I needed to, but this would also take more space on the system. Maybe a scrub would delete older copies?

So which is right? Are they both wrong? If it does work like shadow copies, how would I roll back a file? Thanks for helping me get my head around this!

Second, how does the checksum stuff work? What I mean is, assume I have a picture stored and a sector it's on fails. I have duplicate data and metadata, so I know it can be saved, but when does this checking occur and how would I know? Does this just magically happen and the system logs show the errors, or do I somehow manually make it check? Maybe during a scrub?
_________________
Ever picture systemd as what runs "The Borg"?
haarp
Guru

Joined: 31 Oct 2007
Posts: 535

PostPosted: Fri Jun 24, 2016 8:54 am

Both are wrong. Copy-on-write means that you can create copies of files whose data blocks are only actually duplicated when the copy is written to. Try it:

Code:
cp --reflink=auto hugefile hugefile2


is instantaneous. Both files point to the same data blocks. Only when you edit hugefile2 will it actually copy the blocks and write them. This is also a very useful concept for snapshotting.
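The same trick at the subvolume level is what a snapshot is; a minimal sketch, assuming /home is a btrfs subvolume (path and snapshot name are made up):

Code:
btrfs subvolume snapshot /home /home/.snap-2016-06-24    # instant, shares all data blocks with the original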

As for checksums, afaik they are verified on every read and during a scrub. You can manually trigger scrubs too.
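Triggering and checking a scrub by hand would look something like this (mount point is an assumption):

Code:
btrfs scrub start /home     # kick off a background scrub of the whole filesystem
btrfs scrub status /home    # show progress and any checksum errors found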

This article might prove to be interesting: http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jun 24, 2016 12:47 pm

I read that article ages ago, but only just now took the dive into BTRFS. The whole CoW thing just had me a tad confused. I had read something along the lines of: when a file is appended to, the entire file is written elsewhere and, once the write completes, the old location is marked as free. I also know I read something about old versions of files on the BTRFS mount options page. Below is the information on that page about the "nodatacow" mount option, which I am NOT using.
Quote:

Do not copy-on-write data for newly created files, existing files are unaffected. This also turns off checksumming! IOW, nodatacow implies nodatasum. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives slight performance boost by directly overwriting data (like ext[234]), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large. NOTE: switches off compression !

Also note that disabling CoW kills BTRFS in general according to that. No checksums or compression!
_________________
Ever picture systemd as what runs "The Borg"?
Roman_Gruber
Advocate

Joined: 03 Oct 2006
Posts: 3846
Location: Austro Bavaria

PostPosted: Fri Jun 24, 2016 2:50 pm

Quote:
I am about to begin using BTRFS in a RAID environment both at home and at my job


Are you sure you want to use something that is not very well proven for work?

When it's for work, the data needs some guarantee of integrity, some assurance that it will not be corrupted, and so on, so a well-tested filesystem like ext3 may be the better choice. I have my reasons for writing ext3; even ext4 is not that well tested.

Quote:
how would I roll back a file?


lvm2 + snapshots is worth looking into.
An rsync-based backup, or any other backup solution that generates a dated directory structure for the files, may also be worth looking into.
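A rough sketch of the dated rsync idea (paths and dates are made up for illustration):

Code:
# keep one dated tree per day; unchanged files are hard-linked against yesterday's copy
rsync -a --link-dest=/backup/home-2016-06-23 /home/ /backup/home-2016-06-24/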
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jun 24, 2016 4:39 pm

BTRFS is good enough for Google, Amazon, and many others. If people would check the dates on posts reporting issues with BTRFS, almost all of them are 2010~2012 or earlier. It is stable and works well unless you mean RAID 5 or 6, though it is incomplete, lacking a few extras like data deduplication.

As for ext4, I have been using it since 2009 without a single issue on MANY systems.

I was also referring to rolling back as stated on the BTRFS mount options page, which I quoted.
_________________
Ever picture systemd as what runs "The Borg"?
vaxbrat
l33t

Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Fri Jun 24, 2016 10:03 pm    Post subject: CoW is the special sauce for btrfs and zfs

Copy on Write (CoW) is what gives btrfs its special sauce over ext4 and xfs. Approach it by understanding the concept of extent-based files first. The idea is that when you create a file, rather than just allocating all of its blocks at once, you instead create a map of extents that will eventually get their contents. You may also hear this concept called "sparse file allocation".

The traditional copy-based file is one extreme for CoW: the file consists of a single extent because you wrote it all at once, sequentially. At the other extreme, you may have an extent for each 512-byte (or whatever size) block you write to the file as an individual I/O. btrfs "defragmentation" is the process of taking this large number of extents and merging them into a much smaller number of multi-block extents.
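If you ever want to trigger that merge by hand, it's a one-liner (mount point is an assumption):

Code:
btrfs filesystem defragment -r -v /home    # recursively defragment, printing each file as it goes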

The concept of a snapshot, whether of an entire filesystem or of a single file using the reflink option to cp, such as:

Code:
cp --reflink=always original_file snapshotted_file


takes the extent map that exists for the source and creates a new file with initially the same map. The files may then drift apart extent-wise as writes update one or the other. Because we are only copying extent maps and not duplicating entire blocks of data on disk, the operation takes on the order of a second or less, even for very large files such as the container for a virtual machine image. The other neat thing is that the operation is atomic, so you can snapshot a hot VM without worrying about its consistency.

The Holy Grail known as filesystem de-dupe looks at the extent maps for the filesystem from another angle. If the checksums (SHA512 or whatever) of two extents are identical, the de-dupe algorithm can discard one of them and then use the remaining extent in the maps of whatever files referenced either one.

So CoW and the btrfs checksumming capability are not that related. On your other recent thread,

https://forums.gentoo.org/viewtopic-t-1046730-highlight-.html

I mentioned making sure that you format the raid striped mirror set so that both metadata (-m raid10) and data (-d raid10) are set. By default, btrfs only dupes the metadata piece, which includes the extent maps. If you don't dupe your data portion, your scrubbing will be able to detect bitrot on the actual file data but will not be able to heal it.
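That format step would look roughly like this (device names are assumptions):

Code:
mkfs.btrfs -m raid10 -d raid10 -L Data /dev/sdb /dev/sdc /dev/sdd /dev/sde    # raid10 needs at least four devices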

Now that you know how snapshots happen, you can also understand why my friend with the ReadyNAS had steam coming out of his ears as he watched rsync after rsync get put onto his NAS for almost a year while the filesystem grew slowly. Granted he was getting a boatload of extent maps piling up in new inodes, but the basic data (mostly source code) wasn't changing all that much day to day.
krinn
Watchman

Joined: 02 May 2003
Posts: 7470

PostPosted: Sat Jun 25, 2016 10:17 pm

The_Great_Sephiroth wrote:
BTRFS is good enough for Google, Amazon, and many others. If people would check the dates on posts reporting issues with BTRFS, almost all of them are 2010~2012 or earlier. It is stable and works well unless you mean RAID 5 or 6, though it is incomplete, lacking a few extras like data deduplication.

Even BTRFS users will tell you not to do RAID with it.
First, I'm afraid your hopes about BTRFS maturity and stability are optimistic. From what I see, 2016 still has plenty of bugs; that's not so bad, it means the project is alive.

But I really doubt Google would use such a filesystem. Google doesn't care about filesystem features; they only care about performance, and they must run an unthinkable number of disks. I don't think a feature-rich FS (and features generally mean more options, more complexity, and of course lower performance) would help them, while something faster and lighter would do a better job, since they are surely using RAID. A filesystem built specifically for networking and RAID, with its own replication and error correction, making sure data is written quickly, everywhere, and reliably, would help them far more; they don't really have a problem saving disk space, nor any need for software FS features where hardware can do the job.

So your "good enough for Google" is, to me, more propaganda than fact. And while I was wondering what filesystem Google might actually use, I found they simply built their own: https://en.wikipedia.org/wiki/Google_File_System

For the same reasons, I'm afraid the claim that Amazon uses it is most likely false as well, even though I have no link to offer; while reality sometimes defies logic, generally logic and reality match.

While I see no problem with you using BTRFS, you said you will use it at work, and I feel the need to warn you: the assumption that it must be reliable because someone told you Google and Amazon (and whatever other big companies that handle lots of data) use it may have fooled you.
And while it is perfectly fine to follow random suggestions from random users on a random forum (yeah, just like mine!) when they have no real impact, when it comes to important things (like your work data) you had better think twice before following them without checking.

Going by your post, you might be risking your work data without knowing it, and you might also put other people's work data at risk if they follow your lead because, hey, Google uses BTRFS!
Reliability is the only way to keep your data and work safe; even backups won't help, because backing up bad or corrupted data won't bring it back as good data.


But BTRFS is an awesome FS to use; NASA, the NSA and CERN use it. (*)



*: in case you missed it, that's a joke; I don't think they would use such a filesystem either, seriously.
Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

PostPosted: Sat Jun 25, 2016 11:05 pm    Post subject: Re: A few BTRFS questions...

Since vaxbrat gave good answers for the rest already...
The_Great_Sephiroth wrote:
Second, how does the checksum stuff work? What I mean is, assume I have a picture stored and a sector it's on fails. I have duplicate data and metadata, so I know it can be saved, but when does this checking occur and how would I know? Does this just magically happen and the system logs show the errors, or do I somehow manually make it check? Maybe during a scrub?

I think that happens at read time but don't have enough data to be sure without reading kernel code; I've been really lucky with hard drive reliability for the last 6-7 years...

Putting a `btrfs scrub start -B /` command in a cron job is a very good idea all the same, do it. It'll run without adversely affecting performance.
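A minimal sketch of such a crontab entry (the schedule is an assumption; adjust to taste):

Code:
# /etc/crontab: scrub the root filesystem every Sunday at 03:00, -B waits for completion
0 3 * * 0    root    /sbin/btrfs scrub start -B /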
vaxbrat
l33t

Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Sun Jun 26, 2016 1:51 am    Post subject: how checksumming works

First, remember that btrfs filesystems consist of both a data and a metadata component, as well as a reserved system area. For example, here is the btrfs version of the df command for a single-disk filesystem:

Code:
thufir ~ # btrfs fi df /thufirraid
Data, single: total=2.63TiB, used=2.62TiB
System, DUP: total=8.00MiB, used=304.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=6.00GiB, used=4.65GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


By default, btrfs duplicates the metadata and system pools but does not mirror or raid the data pool. Thus thufirraid will be able to detect bit rot, but will not be able to do anything to heal it.
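If you wanted the data pool duplicated too on a single-disk filesystem like this, the profile can be converted after the fact; a sketch, assuming a reasonably recent kernel and btrfs-progs (mount point taken from the df above):

Code:
btrfs balance start -dconvert=dup /thufirraid    # rewrite data chunks with the DUP profile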

Here is an example of a simple two disk mirror which happens to be one of my ceph object stores:

Code:
thufir ~ # btrfs fi df /var/lib/ceph/osd/ceph-0
Data, RAID1: total=4.00TiB, used=3.99TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=592.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=35.00GiB, used=31.97GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


btrfs will be able to detect and heal bit rot in this object store as the ceph OSD does its normal job of reading and writing object shards with it. On the write of an extent for a file, btrfs calculates a checksum of the data that will go into the extent and then this information is stored in the metadata pool as the extent map for the file is updated. Writes to update the metadata pool also have a corresponding checksum calculated and then stored.

Whenever an extent is read back from disk as a part of normal i/o, the btrfs checksum also gets retrieved and then used to verify the data. If it doesn't match, the other drive in a mirror set or the xor drive in a raid set gets read to get the alternative buffer for data. If the checksum verifies ok with the other buffer, it is then used to re-write the extent back to the drive or drives that have suffered the bit rot. The whole process is transparent to the user, but you will see btrfs kernel messages in your /var/log/messages whenever it decides that it needed to heal an extent. I have actually seen these myself.
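If you want to check whether any healing has happened without grepping the logs, there is also a per-device counter query; a sketch (mount point taken from the df above):

Code:
btrfs device stats /var/lib/ceph/osd/ceph-0    # read/write/flush error and corruption counters per device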

So while your normal use of your mirrored or raided btrfs filesystem will take care of bit rot automatically, you are probably not going to touch every part of your data on a normal basis. Thus we have the btrfs scrub command. It schedules a background process which walks through everything in the system and metadata pools and then everything in the list of extents to read all of the data that has been written out. You can continue to read and write your filesystem while this is going on, but you will notice the added latency as your process competes with the background scrub.

In my case, I have to manually schedule the scrubs on /thufirraid. However I don't need to for the ceph object stores. The ceph cluster breaks up the contents of its object stores into a series of Placement Groups (PGs). Each PG consists of a Primary which goes into a directory tree in one OSD daemon's filesystem and then 1 or more replicas which go to other OSD filesystems, usually on entirely different host machines. If you want to know more start here:

http://docs.ceph.com/docs/master/rados/operations/crush-map/

ceph also computes and uses checksums as it breaks objects into shards and then distributes them to PGs out on the OSDs. It has its own concept of background processes which, among other things, perform scrubs of the PGs automatically. It's designed to touch both the metadata for a PG (normal scrub) and the data itself (deep scrub) of every PG in the cluster every two weeks. During this process, btrfs would transparently heal any bit rot that may have happened. Anything else that may possibly go wrong will then get handled by the ceph consistency check. Having at least two replicas of every object and more than one host means that I can lose a whole machine without losing my data.

So, to counter the naysayers, I state unequivocally that YES, I do use btrfs for production data at my work and I manage to sleep very well at night as a consequence. I even trust it on consumer hardware at home which lacks ECC memory, but that's because I use ceph to provide defense in depth and also very good performance, since my i/o spreads out across 4 hosts acting as OSD servers.
vaxbrat
l33t

Joined: 05 Oct 2005
Posts: 731
Location: DC Burbs

PostPosted: Sun Jun 26, 2016 4:22 am    Post subject: btrfs as system drive and how it raids or mirrors

I just looked around my cluster and realized that 4 out of 6 of my hosts are running btrfs as their system roots on ssd drives. The other two are ext4 on ssd but that's only because I haven't touched them since I built them a number of years back before btrfs became tenable as a system drive (pre-grub2). When I looked at the history of my now outdated howto:

https://wiki.gentoo.org/wiki/Btrfs/Native_System_Root_Guide

I realized that I've been using it as system roots since around Jan of 2014 and probably first started doing that over Christmas break of 2013. On an unrelated note, I suspect I've been using ssd's for my system drives now since at least the beginning of 2013 or whenever it was that TRIM support got added to ext4 in whatever kernel it was (2.6.29?). I've yet to see one of my ssd drives get "bricked" in case some of you out there are still fence sitting about that. I'm probably still playing Russian Roulette with an OCZ Vertex 3 running around somewhere :P

There's another somewhat confusing difference between what btrfs does for raid and the traditional approaches: your choice of "serious" hardware raid cards (e.g. HP Smart Array, LSI, etc.), mobo-based pseudo-raid such as the native Intel junk (aka "scary raid", and called that for a reason), mdadm-based software raid and, of course, lvm volumes. zfs and btrfs can work on top of all of these, but you really want to cut out the middleman. In fact, btrfs is perfectly happy if you give it a plain old drive without even bothering to put a partition table on it.

The traditional raid concept is that you get and install n of exactly the same size drive (preferably the same brand and model #) and then buy one or a few extra as spares. This is because the hardware raid controllers tend to get their panties in a bunch if you give them a hodgepodge of odd-sized drives to work with (and btw interfere with Dell or HP's profit model for their "Enterprise" offerings). The smart player who does a software raid such as mdadm, thus freeing their data from being held hostage by a proprietary controller, quickly finds that the array size is based on the smallest drive that was added.

lvm, zfs and btrfs do things a bit differently, in that individual drives get strung together and then the raid i/o concept gets applied in a somewhat smarter fashion. You can throw whatever you want into the pool and it will get used as efficiently as possible. So let's go back and look at my ceph osd.0 filesystem again:

Code:
thufir ~ # cat /proc/mounts | grep ceph
/dev/sdc /var/lib/ceph/osd/ceph-0 btrfs rw,noatime,compress=lzo,space_cache,autodefrag 0 0
192.168.2.5,192.168.2.6:/ /kroll ceph rw,noatime,name=admin,secret=<hidden>,acl 0 0


That /kroll filesystem is actually an instance of cephfs, which runs on top of the ceph object store and provides a "mostly POSIX compliant" filesystem that I can share out over nfs and samba to Winders and non-ceph cluster members. Whatever I write there gets "sharded" and written out to 2 different btrfs mirrors in object stores somewhere on my 4 osd hosts. The two IP addresses in the mount entry are the primary and backup Metadata Servers (MDS) that manage the filesystem.

Notice the /dev/sdc for the btrfs mount. I'm using drives without partition tables since they are used entirely for btrfs and thus mbr or guid based tables are not necessary. I'm letting btrfs do its own raid management for the filesystem.

Here's the df again:

Code:
thufir ~ # btrfs fi df /var/lib/ceph/osd/ceph-0
Data, RAID1: total=4.00TiB, used=3.99TiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=592.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=35.00GiB, used=31.97GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B


Notice the nice even numbers in the total versus used for each of the filesystem pools. btrfs will dynamically grow the pools as necessary to hold its data until it runs out of real estate. Unlike zfs, it can also do a shrink (via a btrfs balance operation).
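Growing a pool by throwing in another drive and then spreading the existing chunks across all members looks roughly like this (device name is an assumption; mount point from above):

Code:
btrfs device add /dev/sdf /var/lib/ceph/osd/ceph-0    # add another drive to the pool
btrfs balance start /var/lib/ceph/osd/ceph-0          # rebalance existing chunks across all members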

Remember I mentioned that the osd.0 store was actually a "simple two drive mirror"? I bent the truth a little:

Code:
thufir ~ # btrfs fi show
Label: 'thufirraid'  uuid: 5f6e51a3-d8e7-41e1-bdb9-3cd9be0bf7fe
        Total devices 1 FS bytes used 2.63TiB
        devid    1 size 3.64TiB used 2.64TiB path /dev/sdb

Label: 'cephosd0'  uuid: 87a86762-05f6-44fa-860b-f96df085d967
        Total devices 3 FS bytes used 4.02TiB
        devid    1 size 3.64TiB used 2.69TiB path /dev/sdc
        devid    2 size 3.64TiB used 2.69TiB path /dev/sdd
        devid    3 size 3.64TiB used 2.69TiB path /dev/sde

Label: 'pny128_1'  uuid: 7d382834-3b5f-413c-98ad-f313bcae2ca4
        Total devices 1 FS bytes used 23.10GiB
        devid    1 size 115.21GiB used 26.04GiB path /dev/sdk2


My ceph object store actually spans three 4 tb drives and not the traditional two so I still have some breathing room:

Code:
thufir ~ # df
Filesystem                  1K-blocks        Used  Available Use% Mounted on
/dev/sda2                   197113132    47609744  139467576  26% /
/dev/sdb                   3907018584  2825318076 1079365828  73% /thufirraid
/dev/sdb                   3907018584  2825318076 1079365828  73% /raid
/dev/sdb                   3907018584  2825318076 1079365828  73% /home
/dev/sdc                   5860527876  4319065228 1027085716  81% /var/lib/ceph/osd/ceph-0
192.168.2.5,192.168.2.6:/ 23442108416 18475167744 4966940672  79% /kroll


btrfs does a "raid 1" thing by writing data to 2 of the 3 drives in the set while making sure that things are evenly spead out. I would need to give it four or more drives to do a raid10 mirror where it would begin to do striping as it wrote. I could have set this up as a raid5 array and bascially have 8tb of space instead of six, but I would be trading read performance for more storage space. There's also the whole "btrfs isn't ready for prime time raid5 or raid6" thing that people seem to be harping on. But drives are cheap and on a cluster you can see almost forever:

Code:
thufir ~ # ceph -w
    cluster 1798897a-f0c9-422d-86b3-d4933a12c7ac
     health HEALTH_WARN
            too many PGs per OSD (512 > max 300)
            noout flag(s) set
     monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0}
            election epoch 11964, quorum 0,1,2,3,4 0,1,3,4,5
      fsmap e1525: 1/1/1 up {0:0=2=up:active}, 1 up:standby
     osdmap e26424: 4 osds: 4 up, 4 in
            flags noout
      pgmap v17704231: 768 pgs, 4 pools, 7209 GB data, 8529 kobjects
            16709 GB used, 4736 GB / 22356 GB avail
                 767 active+clean
                   1 active+clean+scrubbing

2016-06-26 00:05:52.214055 mon.0 [INF] pgmap v17704231: 768 pgs: 1 active+clean+scrubbing, 767 active+clean; 7209 GB data, 16709 GB used, 4736 GB / 22356 GB avail


Ignore warnings about too many PGs. I used to do single drive btrfs filesystems as object stores before I changed strategies to going with mirror sets and then dropping the ceph replica count back down from 3 to 2. It saves me a fair amount of memory and lets me take advantage of the btrfs self healing from bit rot. noout is useful for very small clusters like this (Cern runs one with hundreds of nodes and petabytes of storage). I'm down to 4tb free on /kroll... will need to add some more object store hosts at some point 8)

One final thing: Did you notice the pny128_1 entry when I looked at btrfs filesystems? That's actually a 128gb usb3 thumb drive that I use to go back and forth between home and work. I put a gentoo live DVD iso on the first 4gb of it and then used the rest as a btrfs volume. I've been doing this or one based on the latest system rescue cd iso since the sweet spot of the thumb drive market hit around 32gb.
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jul 01, 2016 9:31 pm

Thanks for the in-depth answers, everybody. I have stumbled onto a new thing I am curious about. It seems that multiple partitions have identical UUIDs on this new system. This is a brand-new system with brand-new drives. I partitioned /dev/sda using parted, then cloned the partition table to /dev/sdb with "sgdisk -R /dev/sdb /dev/sda" and randomized the new partition UUIDs with "sgdisk -G /dev/sdb". After that I created the filesystems with btrfs. If it was a small partition (less than 6GiB) I used "mkfs.btrfs -m raid1 -d raid1 -L Whatever -M /dev/sda2 /dev/sdb2". If it was above 6GiB I dropped the "-M" parameter. Now, however, when I list disks by UUID, I see only sdb or sda. Both partitions have the same UUID, for example sda5 and sdb5. When listing by partuuid, I see them all and they are all unique. What is going on here? When I list with "btrfs fi show /dev/sd<x><y>" it shows a UUID which matches one of the UUIDs listed for the disk.
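Put together as commands, the sequence just described looks roughly like this (a sketch; device and partition numbers as in the post above):

Code:
sgdisk -R /dev/sdb /dev/sda                                        # clone sda's partition table onto sdb
sgdisk -G /dev/sdb                                                 # randomize the cloned disk and partition GUIDs
mkfs.btrfs -m raid1 -d raid1 -L Whatever -M /dev/sda2 /dev/sdb2    # small (<6GiB) mirrored filesystem, mixed mode (-M)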

So why are the UUIDs identical when I list /dev/disk/by-uuid, and why do they match the UUID shown for either partition involved in said RAID1 array? For example, why does the RAID1 built from /dev/sda2 and /dev/sdb2 have the same UUID listed for /dev/sda2 AND /dev/sdb2 in /dev/disk/by-uuid? I thought duplicate UUIDs were a bad thing?
_________________
Ever picture systemd as what runs "The Borg"?
frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Fri Jul 01, 2016 9:48 pm

Quote:
I thought that duplicate UUID's was a bad thing?


RAID is a special case: as long as it's redundant, it still has to work with either disk missing, so the UUID has to be on both disks.

It's the same for the much older mdadm RAID: you have the array UUID, which covers the whole array regardless of the number of disks (and is what goes in mdadm.conf), and then there's a device UUID unique to each individual member. None of these show up in /dev/disk/by-uuid at all, since they are more like background UUIDs such as PARTUUID; in /dev/disk/by-uuid you get the UUID of whatever is actually stored on the RAID itself.

This works for mdadm because it is a distinct RAID layer; btrfs rolls filesystem and raid into one, so it doesn't have that luxury.

It shouldn't matter which device the /dev/disk/by-uuid symlink points to as long as it's one of the devices that belong to btrfs, and then btrfs can go and look for its other devices by itself.
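As far as I know, that lookup is what `btrfs device scan` does (udev or the initramfs normally triggers it at boot); a sketch of running it by hand:

Code:
btrfs device scan        # register all btrfs member devices with the kernel
btrfs filesystem show    # list each btrfs filesystem and the member devices it found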

It would probably be bad if there was something else with the same UUID which isn't btrfs or a completely separate instance of btrfs.
Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

PostPosted: Fri Jul 01, 2016 9:51 pm

It's been a while since I tried using btrfs raid, but that's by design; it's how btrfs finds the drives when assembling the raid at boot time. (It also means, among other things, that you can mount an entire btrfs raid by mounting any single disk of the array; none of that /dev/mapper/ business to remember.)
The_Great_Sephiroth
Veteran

Joined: 03 Oct 2014
Posts: 1602
Location: Fayetteville, NC, USA

PostPosted: Fri Jul 01, 2016 9:56 pm

OK, that makes sense. So if UUID abcd1234 pointed to /dev/sda1 and /dev/sdb1 and sda died, it could mount sdb and keep on trucking until I replace sda. Thanks! I was worried.
_________________
Ever picture systemd as what runs "The Borg"?
frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Fri Jul 01, 2016 9:58 pm

Same story with LVM, by the way. Each PV has its own UUID, but they all share the same metadata: volume group name, volume group UUID, and so on. The only redundant part here is the metadata itself, so regardless of which disk is missing, LVM can tell which is which and which is missing. On the outside the LVM UUIDs are rarely used at all; it relies much more on names, which is why, if you put the disks of two systems together and they happen to use the same volume group name, you have to rename one first (the rare case where you have to resort to a UUID to identify one of them). There are quite a lot more UUIDs than you see on the surface, so it's easy to get confused. Some UUIDs matter more than others... (you can totally use the same PARTUUID for all your partitions and never notice a problem, and you can get by without using them at all).
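For that rare rename-by-UUID case, a rough sketch (the UUID placeholder and new name are made up):

Code:
vgs -o vg_name,vg_uuid             # find the UUID of the clashing volume group
vgrename <vg-uuid> vg_old_disks    # rename it by UUID so the names no longer collide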