Removing hard links from backups (I want original onlies)

7 posts • Page 1 of 1
LIsLinuxIsSogood
Veteran
Posts: 1186
Joined: Sat Feb 13, 2016 11:41 pm

Removing hard links from backups (I want original onlies)


Post by LIsLinuxIsSogood » Tue Jul 04, 2017 8:45 am

Hi all,
My current backup schematic and folder structure is basically causing me some devastating eye soreness by having to look at the date and backup routine in order to decipher what the heck it is (alpha.0, alpha.1, beta.0, beta.1, beta.2, etc.).

So that is the problem I am running into. My attempted solution is to safely remove all the "duplicate" hard-link entries within the directory structure, to get a much clearer view of the incremental saving that was done when backing up the system.

My goal is ultimately to remove, folder by folder, this directory structure all the way back to the "beginning of time" for the folder, keeping intact the necessary files as they change, and even some of the very old ones -- but not necessarily every file. The result should be twofold.

First, I will clear the clutter of far too many hard links in the folder structure (and I'm not sure whether there is another way to do that, such as querying it from within the shell, or some file manager, etc.).

Second, by reducing the actual files (the original backups), I hope to shrink the overall backup to about 1 GB; currently the footprint or space utilization from rsnapshot appears to be about 10-15 GB, which is way too much for me!! I am certain there are entire folders that I will remove. In some rare cases files or folders will need to be left alone, probably for no reason other than being the location of the newest version of a file, which I could copy or save elsewhere and then decide whether recovering an older version is worth it (my gut says no, and I feel I should save that effort for some other similar disaster).

To keep this more manageable, I believe my questions are the following -- and this will help greatly, so thanks in advance:

If I remove any group of old files, what is the best all-around approach, so that I can be sure to eliminate both the hard-linked names and the file that originated the copies? And how might I do this safely, so as not to risk losing the actual file: remove the links first, leaving the original as the only one left. That would be the holy grail if I can get there. I don't think it is too difficult, but I am still putting it to the forum to check whether my logic is off somewhere!

That, and the documentation for rsnapshot has got me hella confused. Does that make sense? For example, if I have a folder in the backup with source directory /usr/sbin, I would like to be able to remove all links to its files but not the originating backup files located in the backup directory specified in rsnapshot.conf. Going forward I will probably only keep incremental backups for folders like /etc and /home, but for now I have gone a bit further (and fallen short at the same time: I forgot to back up the /home directory, so I basically only have the /etc and /var backups that I want). Given the situation, I would like to create or save a copy somewhere, and then immediately free up the majority of the space consumed by keeping every past version of every document. Right? Isn't that what that means?

At this point I am open to learning about the backups and salvaging what I can from the previous files. As for missing my chance with the /home folder: I have a monthly backup that is done bit for bit (a bare-metal-recovery type of tool), so I can always go in there later, back to a long time ago, and fetch those files if needed. Woop woop.

That is the way I am leaning for the majority of backups, since at this stage it seems more reasonable to do a full recovery (a la Windows system recovery) than to do all the extensive work of digging the actual file out of the specific rsnapshot run that has it. In terms of the shell, is there some simple utility I am missing (almost "ls"-like) that could show me which files are hard-linked and which are the original backups? I assume there is. Any help would be appreciated. Then the thing I really want help with: after finding the original files/folders, how do I identify all the links to them, and can those links be removed without removing a similarly named file that is not a link but another version that should not get removed?

Any suggestions? Thank you.

Thanks, and sorry if none of this seems to make perfect sense. It's late, and I'm looking for a way to get started on this once I have a better idea of the preferred method of reviewing the files. The backups were done using rsnapshot, with cron jobs for a sequence of routine backups, as I mentioned already (I think).
8O

Put another way...I need help with Removing hard links from backups (I want original onlies)

Update:
After running a simple command to list the hard links to a file, I am now left with the basic confusion: which of these files is the actual, oldest file? There must be some other way of identifying that file (other than by time), I would imagine...unless there really isn't!

Code:

playby backups # find . -samefile weekly.0/localhost/bin/chgrp 
./daily.3/localhost/bin/chgrp
./daily.6/localhost/bin/chgrp
./weekly.0/localhost/bin/chgrp
./monthly.0/localhost/bin/chgrp
./daily.4/localhost/bin/chgrp
./weekly.1/localhost/bin/chgrp
./daily.5/localhost/bin/chgrp
./weekly.3/localhost/bin/chgrp
./daily.2/localhost/bin/chgrp
./weekly.2/localhost/bin/chgrp
mv
Watchman
Posts: 6795
Joined: Wed Apr 20, 2005 12:12 pm


Post by mv » Tue Jul 04, 2017 10:22 am

I haven't read the whole posting, but if you are looking for a way to check whether two files are hard links of each other, you have to check whether their inode numbers are the same. The inode number can be obtained with

Code:

stat --printf=%i filename
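
For example (a sketch, assuming GNU coreutils; the paths are made up): a hard link shares the inode number of its sibling, while a plain copy gets a fresh one.

```shell
# Create a file, a hard link to it, and an independent copy.
tmp=$(mktemp -d)
echo data > "$tmp/a"
ln "$tmp/a" "$tmp/b"    # hard link: same inode as a
cp "$tmp/a" "$tmp/c"    # copy: new inode, new data blocks

# Print one inode number per file: the first two match,
# the third differs.
stat --printf='%i\n' "$tmp/a" "$tmp/b" "$tmp/c"
rm -r "$tmp"
```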
Atom2
Apprentice
Posts: 185
Joined: Mon Aug 01, 2011 9:16 am


Post by Atom2 » Tue Jul 04, 2017 11:26 am

I have to admit that I haven't read your complete post either, but it occurs to me that you have an incorrect perception of what an "original" file is and what hard links are.

Simply put: a file consists of a number of blocks allocated to and controlled by one "management structure" called an i-node. The i-node number is just an index into the complete list of all files on the partition (or file system), and each list entry (i.e. the i-node) contains all the meta-information about a file (e.g. allocated blocks, access rights, times of last access, modification, and i-node change, link count, etc.) with one notable exception: its name. The data of the file, on the other hand, is stored in blocks referenced by the i-node.

A directory entry then links the i-node (which records all meta-information about a file except its name) with a name in a directory. That name is the "file name"; together with the names of all intermediate directories traversed from the root directory down to the directory the name lives in, it forms the full path name of the file. Under this name, the data associated with the i-node (and referenced by the block numbers recorded within it) is accessible. This is achieved by storing the file name together with the i-node number (i.e. the index into the full i-node list) in a special file of type directory.

If a file is accessible through only one file name, the link count in the i-node is one; if a file is accessible through several file names (possibly even residing in different directories), the link count is greater than one.

Now, linking this all together, it is obvious that there is no way to distinguish the original file name from another file name created later on: both are just (hard) links to one and the same i-node, and as such they share all the meta-information and (through the list of allocated data blocks recorded in the i-node) the data stored in the file. If you delete one file name, the connection between that specific file name and the i-node (through the i-node number recorded in the directory) is cleared for just that one name, and the link count in the i-node is decreased by one. All other file names and the rest of the file's meta-information (stored in the i-node) remain in place. Once the link count reaches zero, the data blocks associated with the file are freed and the i-node is made available for re-use.

I hope this makes sense. Atom2
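
A quick shell sketch of the above (assuming GNU coreutils for `stat`): two names pointing at one i-node are indistinguishable, and removing one of them only decrements the link count.

```shell
# Create a file and a second hard link to the same i-node.
tmp=$(mktemp -d)
echo "payload" > "$tmp/first"
ln "$tmp/first" "$tmp/second"

stat --printf='%h\n' "$tmp/first"   # link count: 2

# Deleting the "original" name leaves the data fully intact,
# reachable through the remaining name.
rm "$tmp/first"
cat "$tmp/second"                   # prints: payload
stat --printf='%h\n' "$tmp/second"  # link count back to 1

rm -r "$tmp"
```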
The Doctor
Bodhisattva
Posts: 2678
Joined: Tue Jul 27, 2010 10:56 pm


Post by The Doctor » Tue Jul 04, 2017 6:25 pm

A hard link is an "original" file.

The way a file system works is that the data is written to disk and then another entry is made in an index. The index entry tells the OS where the data is located. A hardlink makes another index entry but does not affect the data. The data is only removed if all hardlinks are deleted. There is no "original" file.

A soft link is a separate entry that stores the path of the original name. If the original is moved or removed, the soft link breaks. The advantage of a soft link is that it can link across file systems, where a hard link cannot.

For an incremental backup you want hard links. Then your data will only be backed up once per version, thus reducing size. A 1 GB file that never changes will take 1 GB no matter how many backups you perform. If you use cp for a backup, then the space used will be that file size times the number of backups you keep.
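
A sketch of that size difference (assuming GNU coreutils; `cp -al` builds a hard-linked tree, which is essentially what rsnapshot does between snapshots):

```shell
tmp=$(mktemp -d)
mkdir "$tmp/daily.0"
head -c 1048576 /dev/urandom > "$tmp/daily.0/big"   # one 1 MiB file

cp -al "$tmp/daily.0" "$tmp/daily.1"   # "snapshot" via hard links
cp -a  "$tmp/daily.0" "$tmp/daily.2"   # "snapshot" via real copies

# du counts each inode once per invocation, so the hard-linked
# snapshot adds almost nothing beyond the directory itself:
du -sk "$tmp/daily.0" "$tmp/daily.1"
du -sk "$tmp/daily.2"                  # a full extra megabyte
rm -r "$tmp"
```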

For reducing the size of your backup there are only a few options. The first is compression. This is a very bad option, because you risk not being able to extract your files, and you must keep redundant copies, which defeats the purpose of compression. The second is to use a smart system that creates hard links whenever possible. Of course, it is also necessary to reduce what you back up to only the things that need backing up, and you may want to discard old backups.

Rsnapshot already shrinks the backup as much as possible without using compression. You can set how far back in time your backups go and how extensive they are. I recommend monthly for the system, weekly for sensitive settings like world, and daily for user data. I wouldn't keep fewer than 4 or 5 iterations, just in case.
LIsLinuxIsSogood wrote:
After running a simple command to list the hard links to a file, I am now left with the basic confusion: which of these files is the actual, oldest file? There must be some other way of identifying that file (other than by time), I would imagine...unless there really isn't!

Code:

playby backups # find . -samefile weekly.0/localhost/bin/chgrp 
./daily.3/localhost/bin/chgrp 
./daily.6/localhost/bin/chgrp 
./weekly.0/localhost/bin/chgrp 
./monthly.0/localhost/bin/chgrp 
./daily.4/localhost/bin/chgrp 
./weekly.1/localhost/bin/chgrp 
./daily.5/localhost/bin/chgrp 
./weekly.3/localhost/bin/chgrp 
./daily.2/localhost/bin/chgrp 
./weekly.2/localhost/bin/chgrp 
This means that these are all exactly the same file. There is no oldest, no newest, no original, and no copy. So no, what you want does not make any sense.

EDIT: Although, the fact that you have the same file appearing in daily, weekly, and monthly snapshots suggests that you have horribly misconfigured rsnapshot. Monthly should grab only the bits that don't change much, like the operating system; weekly should grab more change-prone and sensitive areas like /etc/ or your world file; and daily should grab user data (/home/). If you fix your configuration to exclude unnecessary things, the size of your backup should decrease considerably.

I'd make absolutely sure you're not trying to (unnecessarily) back up video files, as they are large and generally in no danger of being lost.
First things first, but not necessarily in that order.

Apologies if I take a while to respond. I'm currently working on the dematerialization circuit for my blue box.
The Doctor
Bodhisattva
Posts: 2678
Joined: Tue Jul 27, 2010 10:56 pm


Post by The Doctor » Tue Jul 04, 2017 6:54 pm

If it helps, here is how I manage my rsnapshot backups. My /etc/rsnapshot.d/monthly.conf:

Code:
include_conf	/etc/rsnapshot.d/base.conf

# Monthly (6 increments)
retain	monthly	6
exclude		/home/
exclude		/tmp/**
exclude		/usr/portage/distfiles/**
exclude		/boot/
exclude		/var/tmp/ccache/**
backup		/		tomcat/
backup		/home/doctor/windowsvm/		tomcat/
As you can see, I grab my operating system but nothing extra like distfiles, cache files, or temporary files. I also grab my Windows VM. Since this VM is mostly used for gaming, I don't need it to be particularly current.


My /etc/rsnapshot.d/weekly.conf:

Code:

include_conf	/etc/rsnapshot.d/base.conf

# Weekly (12 increments)
retain	weekly	12
backup	/etc/	tomcat/
backup		/var/lib/portage/world	tomcat/
backup		/boot/			tomcat/
This grabs my kernel, world file, and /etc/. These are moderately sensitive to change, but not too hard to fix if only a week or two out of date.

My /etc/rsnapshot.d/daily.conf

Code:

include_conf	/etc/rsnapshot.d/base.conf

# Daily (30 increments)
retain	daily	30
backup	/home/doctor	tomcat/
exclude	/home/doctor/windowsvm/	tomcat/
This grabs everything I'm working on daily. I exclude the VM so I don't save unnecessary changes; my videos are saved elsewhere, so there is no need to exclude them specifically.

If I need to restore, I can rsync my latest monthly, then weekly, then daily snapshots. This restores my system completely, minus a few bits and pieces that don't really matter. It takes all of 10 minutes or so to complete.

Backups are always going to be bigger than your working system. My backups have stabilized at ~700 GB, while my system is ~400 GB.
First things first, but not necessarily in that order.

Apologies if I take a while to respond. I'm currently working on the dematerialization circuit for my blue box.
Jaglover
Watchman
Posts: 8291
Joined: Sun May 29, 2005 1:57 am
Location: Saint Amant, Acadiana


Post by Jaglover » Tue Jul 04, 2017 7:20 pm

Note, rsnapshot uses hard links only when it can. If you think it is wasting space, it may not be using hard links at all; it depends on your setup.
My Gentoo installation notes.
Please learn how to denote units correctly!
LIsLinuxIsSogood
Veteran
Posts: 1186
Joined: Sat Feb 13, 2016 11:41 pm


Post by LIsLinuxIsSogood » Tue Jul 04, 2017 11:54 pm

Since this is a laptop that I am working on, it has limited HD space, yes. Thank you for that. Thanks to Atom2 as well for an excellent explanation of inodes and how they link to files and folders. I am still confused about some of it, especially whether rsnapshot is the right tool for my purposes. I think I will give rsync another try and set it up to sync to my 3TB desktop HD. That makes the most sense. Ultimately, what I want (and if it were possible I would rethink this completely) is a way of recreating the entire folder structure minus the areas I don't want to back up (similar to one of the backup utilities), and then a routine script that syncs it to the larger hard drive. That way I could leave the limited space on my laptop alone and still have an easy way to perform some backup (incremental is not really important anymore). I'd like to do something that takes less space locally.

Is this a job for rsync, maybe? I already do a complete archive of everything and have restored from it on occasion, so it is working. For convenience, could someone please explain the point of doing a network backup, and how to make sure the state of the filesystem does not leave the backup corrupted at the time it runs? Thanks, and yes, I understand that this is sort of a new topic.
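
One common pattern for this (a sketch, not a drop-in config; the paths are stand-ins) is rsync with --link-dest, which hard-links unchanged files against the previous snapshot on the destination, so only changed files consume new space on the big drive:

```shell
# Temporary directories stand in for the laptop source and the
# 3TB desktop drive; in real use these would be actual paths
# (or user@host:path for a network backup).
src=$(mktemp -d)
dst=$(mktemp -d)
echo "hello" > "$src/file"

rsync -a "$src/" "$dst/snap.0/"                            # full copy
rsync -a --link-dest="$dst/snap.0" "$src/" "$dst/snap.1/"  # incremental

# Unchanged files share one inode across the two snapshots:
stat --printf='%i\n' "$dst/snap.0/file" "$dst/snap.1/file"
rm -r "$src" "$dst"
```

Rotating the snap.N directories and pruning old ones is essentially what rsnapshot automates on top of this.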