Gentoo Forums :: Gentoo Chat
You can't trust hard drives. Now what?
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3310
Location: Rasi, Finland

Posted: Fri Apr 08, 2022 9:43 am    Post subject: You can't trust hard drives. Now what?

I'll try to keep this short.

Most of the hard drives and RAID implementations don't care if some bit of your data is corrupted. They'll serve it as-is. You'd need to manually issue a scrub or whatever to find, and possibly correct, the bit rot. Too many storage setups rely on the hard drive reporting that it has corrupted bits. If I'm not mistaken, it used to be more common for (enterprise) hard drives to have 520-byte sectors instead of 512. A smart enough storage solution (filesystem, raid implementation or what have you) would then use the 8 extra bytes for parity/checksum data and present the disk to the rest of the OS as if it had 512-byte sectors.
Sounds good, right? Yes. Online checksumming of all your data. Awesome! But... as far as I know, those 520-byte-sector drives are a rare find nowadays, and the price is astronomical compared to regular hard drives.

Enter the new generation of filesystems: zfs, btrfs, bcachefs... etc.
These filesystems have better checksumming. I have used btrfs for quite a long time. It's good, but not completely ready (and you can't put a swapfile on a multi-disk btrfs). zfs has a licensing problem, and using it is more complex than btrfs.

So I've finally settled on lvm handling my raid and logical volumes. So far ext4 and xfs have been my filesystems of choice.
I tend to lean towards ext4 on lvm. I think lvm even supports shrinking an lv that has ext4 on it.

But as far as I know, ext4 and xfs both "only" checksum the metadata. Is this enough? I haven't dug deep enough.
Also, Linux mdraid underneath doesn't do automatic error detection and serve you the uncorrupted data. Again, the scrubbing is a separate process, not "online".
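(To be clear, by scrubbing I mean manually kicking off something like the commands below; the mount point, md device and vg/lv names are just placeholders.)
Code:
btrfs scrub start /mnt/data                  # btrfs: re-read everything and verify checksums
echo check > /sys/block/md0/md/sync_action   # mdraid: compare data blocks against parity
lvchange --syncaction check vg0/lv_data      # lvm raid lv: same idea via dm-raid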

So what kind of setups do you guys use for your important data storage? Obviously backups, but corrupted data can creep into backups too. Incremental backups usually save you from that kind of disaster. Keeping backups of important data is important, but it saves a lot of hassle if your main data storage can fix corruption while serving the file, or give you the uncorrupted data from some other disk. And eventually, preferably, hot-swap the broken disk for a new one and sync.

The reason I wrote this post is that there seems to be no robust solution to this. btrfs is very close, but as I said, it's not complete. zfs seems very complex, but it seems to be pretty much at the top when it comes to data integrity.

I'm looking forward to reading about your experiences with different kinds of storage setups.
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!
szatox
Advocate

Joined: 27 Aug 2013
Posts: 3103

Posted: Fri Apr 08, 2022 11:33 am

Well, there is ceph too, which I think does check the data on reads in addition to scheduled scrubbing.
It's meant to be a distributed storage spanning at least hundreds of disks, but it is possible to use it as a "single node cluster" AKA raid for the localhost.
Unfortunately, it's quite resource hungry (RAM in particular), and sensitive to latency.
Write amplification doesn't help performance either. While it is possible to use HDDs for data without too much of a penalty, some of the components (monitors and OSD logs) effectively must be backed by SSDs for the whole thing to remain responsive.
Also, 1 Gbps Ethernet is not good enough for clustering; even if bandwidth is not a problem, you need 10 Gbps adapters to keep latency down.

So... Yeah, it takes quite a bit of hardware and dedication to set this up.


I wonder how often hard drives actually make mistakes, though. It doesn't happen every day, after all. And you can get a flipped bit in the electronics too (CPU/RAM).
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Fri Apr 08, 2022 4:28 pm

Zucca,

Read Partial-response maximum-likelihood and tremble in fear for your data.
The drive doesn't read it, it guesses. :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Sat Apr 09, 2022 2:15 am

All hard drives are ECC protected. They can correct single bit errors and detect double bit errors. Beyond that nothing is guaranteed, but such errors are likely caught too. You need to make clear what condition you're talking about, however - it is exceedingly rare for a disk to return corrupt data from the platters to the drive's read buffers without knowing about it; normally you get an out-of-band indication that the sector was read incorrectly.

However, end-to-end protection is not guaranteed, as it is not a function of the drive itself - it depends on the entire system, from the drive through the disk controller, memory, busses, CPU, caches, etc., etc.

Of all the corruption I've seen so far, it's most likely due to:

- Bad memory. I've had lots of this happen; by far this is the most likely cause of corruption.
- Chipset/motherboard issues.
- Overclocking. I've found that most CPUs, when run within spec, run correctly.

I have not caught any cosmic ray errors, but those would end up as silent data corruption, since I don't have end-to-end data protection. Not much I can do about this without spending $ARM$LEG on hardware.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
figueroa
Advocate

Joined: 14 Aug 2005
Posts: 2912
Location: Edge of marsh USA

Posted: Sat Apr 09, 2022 4:51 am

I think we are right to worry. I have some partial protection through serial, full backups. 100% of my personal data is read every night as it's archived, and I attend to my logs looking for errors. The serial backups include nightly, weekly, and monthly sets, plus an off-line set whose oldest copy, depending on the point in time, will be anywhere from 5-8 weeks old.

I admit that it's naive to count on read errors, but my hope is that bit flipping may worsen over a short period of time and eventually result in a read error, and I think it's the best I can reasonably do. I suppose I could transition to a more modern file system, but for now I use ext4.

The operating system gets backed up weekly (also at night and automatically) with weekly, monthly, and off-line sets. I don't really worry about the OS. It can be re-created, but it's comforting to have the backups.

I'm glad someone brings up this topic from time-to-time.
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/17.1/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Sat Apr 09, 2022 5:31 am

How often are people actually seeing silent data corruption, and do you have end to end protection?

Then there are the partially preventable problems:

Have you had a crash, whether it be an uncontrolled power outage, pressing reset/magic sysrq, a kill (whether by the OOM killer or not), software/driver bugs, segmentation faults, a kernel oops?

A disgruntled employee?

If you had one of the "preventable problems" (versus the hardware-caused issues -- the hardware did exactly what it was asked to do), then these require a different set of mitigations... Can't blame the hard drive for these types of problems.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Goverp
Veteran

Joined: 07 Mar 2007
Posts: 1972

Posted: Sat Apr 09, 2022 8:17 am

eccerr0r wrote:
All hard drives are ECC protected. ...
However, end-to-end protection is not guaranteed, as it is not a function of the drive itself - it depends on the entire system, from the drive through the disk controller, memory, busses, CPU, caches, etc., etc. ...

so install ECC memory - which may well require a new motherboard - rather than worry about drives.
More importantly, use only bug-free software and OS. ;-)
_________________
Greybeard
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Sat Apr 09, 2022 3:02 pm

So... question back to the original posting -- what is the nature/root cause of the errors seen? What is the exact failure mode you're worried about?

Is it a random failure due to a cosmic ray strike? (Blame hardware, but you need more hardware. This is by far exceedingly rare, but if you need to cover all bases it needs to be accounted for.)

Hardware not specified properly to meet requirements or hardware failure? (Blame hardware. Buy better hardware. As I have a lot of cheap hardware this is my biggest failure mode by far.)

or simply an inconsistency due to someone crashing the program before it gracefully exits? (Don't blame hardware, this is a software issue. This affects us all...)

All three of these are errors that may show up as a problem in your data but they need to be handled differently.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
figueroa
Advocate

Joined: 14 Aug 2005
Posts: 2912
Location: Edge of marsh USA

Posted: Sat Apr 09, 2022 3:38 pm

Just a reminder, Zucca's concern, and mine, is bit-rot, i.e. random flipped bits.
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/17.1/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Sat Apr 09, 2022 3:57 pm

figueroa,

There is a logical contradiction here.

You want a way to make a system that is imperfect detect and maybe correct its own imperfections.
It's possible to make the incidence of undetected imperfection as small as you like, but not to reduce it to zero.
How much do you want to spend?

Thought experiment ...

Your PC fails POST, so it's faulty ... or is it?
Did the POST get it wrong?

If it's really faulty, you expect this faulty equipment to pinpoint the fault; that's asking too much.
Code:
Keyboard Error ...
Press F1 to continue


Lots of engineers have spent a large part of their careers trying to address this problem.
You really, really don't want to go flying with a faulty flight control computer.

Why the focus on the hard drive?
The entire system is involved and can contribute to random bit flips.

Oh, on a side note, the DRAM/Cosmic ray problem was much reduced by two changes to DRAM manufacture.
1. Changing from P-Channel to N-Channel storage.
2. Reducing the radioactive content of the packages used for DRAMs
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Sat Apr 09, 2022 5:01 pm

figueroa/Zucca: The question is, what is the source of the "random flipped bits"? Also, are you specifically talking about silent data corruption, where the hardware could not tell whether a bit flipped or not? If you do not get an ECC (including ID or sector not found) error reported from the hard drive, the hard drive is likely not the source of the error. Single bit errors read from the hard drive without the out-of-band error indication are exceedingly rare - usually you get an empty sector if the hard drive gave up trying to read due to ECC/CRC failure.

How many "random flipped bits" are you getting? If you're getting more than 1 per year, likely the problem is not "random" and due to hardware failure like trying to make it run faster than it's able to do (whether it was labeled to do it is irrelevant as chips are binned to spec, not made to spec.)

Do you have ECC end to end protection? This pretty much gets the silent error rate to much less than 1 per year.

BTW: a lot of these CRC/ECC in-filesystem protection is not for protection against hardware problems. These in-filesystem checks actually are protecting against crashes due to software issues (and possible hardware hangs, driver bugs, cord pulling, etc.) - These are not "random" problems, even "random" power outages are solvable by UPS, and if you're worried this source of failure and don't have a UPS, then you're not doing all you can to prevent this...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
figueroa
Advocate

Joined: 14 Aug 2005
Posts: 2912
Location: Edge of marsh USA

Posted: Sat Apr 09, 2022 6:23 pm

I do understand the ambiguities. And, besides, I'm using decent, but bog-standard, off-the-shelf, consumer-grade hardware. I'm sort of a low budget, low bandwidth doomsday prepper.

I do have UPSs. Three in my small office, one for each computer, and four others to protect other computers and devices around the house. But, I don't have end-to-end ECC.

My objectives are:
1. Prevent loss of data, especially silent/hidden loss.
2. Avoid loss of use.
3. Not be surprised.

I can't remember the last time I had loss of data of any kind that wasn't self-inflicted. In other words, the incidence of known/found bit-rot is zero. But, given the limitations of my hardware and implementation (my use case), I can't really know for sure.
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/17.1/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3310
Location: Rasi, Finland

Posted: Sat Apr 09, 2022 6:25 pm

Whoa. Lots of replies. :)

I was thinking more along the lines of reasonable protection against errors in data on hard drives. Mainly software-based... and a bit of hardware, in the sense of multiple hard drives. ;) ECC RAM is also possible.

In short: what's the best filesystem + ~raid-like implementation in terms of data safety, assuming you could not take backups and there are no human errors?
Opinions? Experiences? This is 'Gentoo chat' after all.
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Sat Apr 09, 2022 7:52 pm

Zucca,

Let me just point you at Why Premature Optimization Is the Root of All Evil
You are spending your money on the wrong thing.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Goverp
Veteran

Joined: 07 Mar 2007
Posts: 1972

Posted: Sun Apr 10, 2022 8:53 am

Zucca wrote:
... what's the best filesystem + ~raid-like implementation in terms of data safety, assuming you could not take backups and there are no human errors? ...

And what about the case where you have a backup, but it disagrees with your system!

I think the dm-integrity layer in the kernel does what you want - or rather what I think you think you want. I looked at enabling it a while back but, after reading several posts in a discussion not unlike this one, came to the conclusion it was pointless. IIUC, the net is that bit rot from hard drive problems is exceedingly rare, as drives rely on internal algorithms to extract your data from the spinning rust, and said algorithms already include ECC. The danger, as pointed out above, is downstream - cables, memory, and most importantly IMHO software bugs and hacking.
RAID-6 might be nearly as useful as dm-integrity, though of course it introduces complexity - i.e. a source of entropy ...
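(For reference, standalone dm-integrity is driven by the integritysetup tool from cryptsetup. A minimal sketch, assuming a spare partition - the device name is a placeholder, and formatting destroys whatever is on it:)
Code:
integritysetup format /dev/sdX1          # write the integrity metadata (wipes the device)
integritysetup open /dev/sdX1 protected  # expose /dev/mapper/protected with per-sector checksums
mkfs.ext4 /dev/mapper/protected          # any filesystem on top; corrupted sectors now return read errors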

I'm not so sure the claim of "built-in ECC" is as strong for SSDs. I have an unsettling feeling that they run on the "hey, trust me" algorithm. Perhaps RAID-6 would be good here, if ludicrously expensive.
_________________
Greybeard
Fitzcarraldo
Advocate

Joined: 30 Aug 2008
Posts: 2034
Location: United Kingdom

Posted: Sun Apr 10, 2022 1:37 pm

I only use Btrfs on a single-disk nettop, and even then Btrfs was problematic initially (albeit I was able to fix the corrupt filesystem by using 'btrfs rescue chunk-recover'). I would never use Btrfs in a mission-critical installation, for the reasons given in the following article:

https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/
_________________
Clevo W230SS: amd64, VIDEO_CARDS="intel modesetting nvidia".
Compal NBLB2: ~amd64, xf86-video-ati. Dual boot Win 7 Pro 64-bit.
OpenRC udev elogind & KDE on both.

Fitzcarraldo's blog
mike155
Advocate

Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Sun Apr 10, 2022 2:25 pm

I store my files on an ext4 filesystem.

I have a script that calculates hash sums for those files and writes them to a text file, 'md5sums_<date>.txt':
Code:
bad2ca3c83cdf0fe4925323e61dec69b  ./Music/James Bond - Best Of/A-Ha - 16 - The Living Daylights.mp3
4d82b1d8edd6789ca4561bd04a3bbe98  ./Music/James Bond - Best Of/Carly Simon - 11 - Nobody Does It Better.mp3
a15257c0a99cebb5eceafbadfb500d3d  ./Music/James Bond - Best Of/Chris Cornell - 23 - You Know My Name.mp3
168deb8712703b2618b979343b3ce42e  ./Music/James Bond - Best Of/Duran Duran - 15 - A View To A Kill.mp3
...

This script runs automatically via cron every month.

A diff between two files
Code:
diff md5sums_2007-01-05.txt md5sums_2021-01-05.txt

shows missing files, added files, moved files and corrupted files.

I have never seen corrupted files. Consequently, I trust disks! :)

PS: here is my script:
Code:
#! /bin/bash

      DIR="/vol_library"
  DATESTR="$(date +'%Y-%m-%d')"

if cd "${DIR}"
then
    find . \
        -mindepth 2 \
        -type f \
        -exec /usr/bin/md5sum {} \; \
    | sort -k 2 \
    > "md5sums_${DATESTR}.txt"
fi
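
(As a usage sketch - the file name below is just an example - the same list also works directly with md5sum's check mode, which reports any file whose current hash no longer matches:)
Code:
cd /vol_library && md5sum --quiet -c md5sums_2022-04-01.txt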


Last edited by mike155 on Sun Apr 10, 2022 2:38 pm; edited 2 times in total
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Sun Apr 10, 2022 2:34 pm

mike155,

How do you differentiate between a corrupt checksum and a corrupt file?
All you know is that the checksum failed. You need a tiebreaker too.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Sun Apr 10, 2022 3:49 pm

The largest files I work with on my computers are map files from the OpenStreetMap project. I've been dealing with multi-gigabyte files and worry about them getting corrupted - single bit errors could cause a lot of headache. Fortunately, with OSM files a bit flip would confuse the reader scripts because of the compression. So far I haven't seen corruption that would show up as parse errors. I constantly read and write these files (I diff changes, I don't download fresh copies), so theoretically errors should accumulate over time, but I haven't seen it. I trust my computer, or at least these specific ones - and for now - they may break in the future.

I also tend to trust my disks and other storage media (tape too) - not to always be able to read my data back, but against single bit errors/corruption. I run RAID and do backups because I don't trust disks to return my data 100% of the time, but if a disk does return data, I'm quite confident it's exactly what I wrote to it... if I trust the computer that wrote it in the first place. By far the computer is the risk area, especially the RAM, because in general I do not have computers with ECC memory.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3310
Location: Rasi, Finland

Posted: Sun Apr 10, 2022 4:54 pm

NeddySeagoon wrote:
How do you differentiate between a corrupt checksum and a corrupt file?
I guess computing checksums with several different algorithms is the way to get a more reliable result? Portage relies on that.
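(Roughly what a Portage Manifest entry does - two independent algorithms per file, so a corrupted checksum and a corrupted file are very unlikely to agree. A sketch; the path is just an example borrowed from mike155's list:)
Code:
f='./Music/James Bond - Best Of/A-Ha - 16 - The Living Daylights.mp3'
sha512sum "$f"
b2sum "$f"    # BLAKE2B, as used in Portage Manifests; available in coreutils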

Fitzcarraldo wrote:
I only use Btrfs on a single-disk nettop, and even then Btrfs was problematic initially (albeit I was able to fix the corrupt filesystem by using 'btrfs rescue chunk-recover'). I would never use Btrfs in a mission-critical installation, for the reasons given in the following article:

https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/
I've used btrfs, and in the early days I had a pretty bad case where the filesystem became unmountable. I managed to recover all the data, but I had to write a script to fetch the most recent uncorrupted version of every file.
Then later I had another nasty one, which was the result of a newer kernel not letting it mount rw, because the new version had stricter data integrity checks. So I guess it was a "good problem".

Anyway, I wonder how big companies (like facebook/meta) cope with a "not-ready-yet-for-production-use" filesystem...
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Sun Apr 10, 2022 5:54 pm

I think the problem Neddy is pointing at is... what if the checksum itself was computed wrong, because the file was read wrong or the computer was bad? That is a risk when the checksums are computed and stored separately from the file.

Ideally the checksum is integral in the file (and the source guarantees that the checksum was computed correctly!). Otherwise keeping them separate is a slight risk in integrity. This is sort of the reason why I tend to keep the .torrent files when dealing with bittorrent - these .torrent files contain the checksums for each block computed from the source material on the original source. Any corruption in the .torrent file or the data downloaded by the torrent would trigger a redownload - and if it tries to do this, then we know corruption occurred. Where the corruption occurred would still be a mystery, but at least we know something went wrong. Ideally the torrent swarm has consensus which is correct by majority rule (data or torrent)... and of course seeders should have the right file.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2977
Location: Germany

Posted: Mon Apr 11, 2022 9:53 am    Post subject: Re: You can't trust hard drives. Now what?

Zucca wrote:
Most of the hard drives and RAID implementations don't care if some bit of your data is corrupted. They'll serve it as-is.


Drives do have a checksum for each sector. That's how they detect read errors. If a RAID encounters a read error, it will try to restore the data from the other drives. Basically that's the concept that makes things work for most people most of the time. The transfer on the wire itself is also protected (udma crc). So if there is corruption it would most likely have to happen in the disk controller/cache or in the system ram...

Zucca wrote:
As far as I know, those 520-byte-sector drives are a rare find nowadays, and the price is astronomical compared to regular hard drives.


This is simply never exposed to the end user; if anything, the checksumming per sector should have got more reliable with 4K sectors.

I run regular SMART tests (looking for reallocated/pending/uncorrectable sectors) as well as RAID checks (looking for parity mismatches, mismatch_cnt), and on my own systems I have never encountered any. In a RAID, if any single drive started misbehaving on its own, it would show up here, but... there's nothing.
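(For anyone who wants to do the same, roughly the commands involved - device and md names are placeholders:)
Code:
smartctl -t long /dev/sda                    # start an extended SMART self-test
smartctl -A /dev/sda                         # check reallocated/pending/uncorrectable attribute counts
echo check > /sys/block/md0/md/sync_action   # start an md RAID check
cat /sys/block/md0/md/mismatch_cnt           # parity mismatch counter after the check finishes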

Most corruption issues I come in touch with (mostly users asking for help in forums, irc, on stackexchange) usually go back to software misbehaving, user error, and the like. Even parity mismatches can go back to drives that went missing and were then forced back into the array w/o resyncing, or to an array that was created with assume-clean, which is only safe if the drives really are all zeroed.

For RAID, everyone should check for parity mismatches once in a while. If you do have them, you're in trouble, because the RAID implementation (at least mdadm) is such that parity mismatches don't necessarily get fixed by subsequent writes. Parity can be recalculated from parity, not from data, so the new parity is only correct if the old parity was correct too. So the parity mismatches stay around, and with parity mismatches you have no redundancy at all.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Mon Apr 11, 2022 10:19 am

eccerr0r,

Quote:
Ideally the checksum is integral in the file (and the source guarantees that the checksum was computed correctly!)

That's not valid, as the checksum is not computed atomically.
Where it's stored does not matter.

Real world problem ...
You are in an aircraft with two flight control computers. The outputs diverge slowly. Which one is correct?

Pick the wrong one and you may experience CFIT.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9645
Location: almost Mile High in the USA

Posted: Mon Apr 11, 2022 4:15 pm

NeddySeagoon wrote:

Quote:
Ideally the checksum is integral in the file (and the source guarantees that the checksum was computed correctly!)

That's not valid as the checksum is not computed atomically.
Where its stored does not matter.

The assumption is that the source of the file (in this case an AV file) computed the checksum, and that the checksum is kept with the file. The problem is if someone generated the checksum without knowing what the checksum should have been.
Sort of like in portage, doing an ebuild file.ebuild digest - you're making the assumption that the file you got is correct. But if you downloaded the digest and the ebuild, the source of the file should have ensured the file was correct to begin with.
Quote:
Real world problem ...
You are in an aircraft with two flight control computers. The outputs diverge slowly. Which one is correct?

Pick the wrong one and you may experience CFIT.

The point of these checksums was always to tell you that something went wrong, not to dictate what to do when something does go wrong. It's easier for files: just redownload, restore from backup, or recreate the file. For lockstep computers, usually a flag is set that something went wrong and no further updates are possible. Typically a reboot is the solution, but I'm not sure that's the right solution for flight controllers, as critical data may be lost during the reboot. Perhaps a best-of-3 set of computers is the solution here, and majority rule is the correct course of action.

---

frostschutz wrote:
the checksumming per sector should have got more reliable with 4K sectors.

That 4k sector hoo-hah was solely for density purposes. Hard drive manufacturers got the bright idea that if fewer ECC/CRC bits were used, they'd waste less space and have more left for user data.
It's not all a capacity-tradeoff conspiracy here: the gaps between sectors can also be reduced. Even so, I think the recovery of errors (meaning the ability to restore corrupted data back to its original with ECC) is reduced with 4K sectors, but I'm sure the detection of errors was made at least as good as, if not better than, before.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
pjp
Administrator

Joined: 16 Apr 2002
Posts: 20053

Posted: Thu Apr 14, 2022 7:33 pm    Post subject: Re: You can't trust hard drives. Now what?

Zucca wrote:
Enter the new generation of filesystems: zfs, btrfs, bcachefs... etc.
These filesystems have better checksumming. I have used btrfs for quite a long time. It's good, but not completely ready (and you can't put a swapfile on a multi-disk btrfs). zfs has a licensing problem, and using it is more complex than btrfs.

So I've finally settled on lvm handling my raid and logical volumes. So far ext4 and xfs have been my filesystems of choice.
I tend to lean towards ext4 on lvm. I think lvm even supports shrinking an lv that has ext4 on it.
I very much dislike the legacy solutions, but there isn't a "good" alternative yet. ZFS would theoretically be the answer you're looking for (licensing doesn't seem enough of a problem). However, the "new" parts of ZFS seem to have introduced complexity that didn't previously exist. That's the main reason I've avoided it. For basic use on Solaris, it was very simple. I'm very disappointed the complexity has made it effectively unusable in my opinion. At least that's my deduction from reading various posts over the years about getting it to work.

My interim solution was to use multiple copies of everything. I probably need to break down, configure a system with FreeBSD / ZFS, and call it done (unless of course the complexity is more about ZoL the project than about where the code is used).
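(For context, the Solaris-era simplicity I'm referring to was roughly this - pool, dataset and device names are placeholders:)
Code:
zpool create tank mirror /dev/sdX /dev/sdY   # checksummed, self-healing mirror
zfs create tank/data                         # a dataset to hold the files
zpool scrub tank                             # verify every block against its checksum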
_________________
Quis separabit? Quo animo?
Goto page 1, 2  Next
Page 1 of 2