Gentoo Forums
Solved: Which sdX drive choked - ext4fs on LUKS on MDRAID?

 
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Sun Mar 03, 2024 8:55 am    Post subject: Solved: Which sdX drive choked - ext4fs on LUKS on MDRAID?

I have a cryptsetup root on MDRAID5 on /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2 /dev/sde2, and this is driving me nuts. Well, the setup isn't driving me nuts, but determining which drive has the read/write errors is.

I noticed:
Code:
m4a785 ~ # cp /usr/share/binutils-data/x86_64-pc-linux-gnu/2.41/locale/fr/LC_MESSAGES/gas.mo /dev/null
cp: error reading '/usr/share/binutils-data/x86_64-pc-linux-gnu/2.41/locale/fr/LC_MESSAGES/gas.mo': Input/output error

Okay, so there is a read error. But in dmesg:
Code:
[   27.579918] (udev-worker) (1184) used greatest stack depth: 12160 bytes left
[  627.804361] kworker/dying (265) used greatest stack depth: 11912 bytes left
[20057.809441] EXT4-fs warning (device dm-0): ext4_end_bio:343: I/O error 10 writing to inode 67780700 starting block 2599101)
[20057.809453] Buffer I/O error on device dm-0, logical block 2599101
[20057.809461] Buffer I/O error on device dm-0, logical block 2599102
[20057.809463] Buffer I/O error on device dm-0, logical block 2599103
[20057.809464] Buffer I/O error on device dm-0, logical block 2599104
[20057.809466] Buffer I/O error on device dm-0, logical block 2599105
[20057.809467] Buffer I/O error on device dm-0, logical block 2599106
[20057.809468] Buffer I/O error on device dm-0, logical block 2599107
[20057.809470] Buffer I/O error on device dm-0, logical block 2599108
[20057.809471] Buffer I/O error on device dm-0, logical block 2599109
[20057.809472] Buffer I/O error on device dm-0, logical block 2599110
[20112.123937] EXT4-fs warning (device dm-0): ext4_end_bio:343: I/O error 10 writing to inode 67780692 starting block 2598822)
[20112.123949] buffer_io_error: 164 callbacks suppressed

First off, these are write errors, so unrelated to the read error. Second, I don't see any read errors at all, and there is no indication of which /dev/sd[a-e]2 it's choking on for either the read or the write?! Are underlying block device errors suppressed when running ext4 on a LUKS container over mdraid5?
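For reference, a quick way to see the device stacking and md's per-member error counters - a sketch only, where dm-0 and md127 are this system's names and nothing here modifies state:

```shell
# Walk the stack beneath the filesystem and check md's per-member error counters.
# dm-0 (the LUKS mapping) and md127 are examples; substitute your own names.
lsblk -s /dev/dm-0 2>/dev/null       # dm-0 -> md127 -> sd[a-e]2
cat /proc/mdstat 2>/dev/null         # member list and state, e.g. [UUUUU]
for f in /sys/block/md127/md/dev-*/errors; do
    if [ -e "$f" ]; then printf '%s: %s\n' "$f" "$(cat "$f")"; fi
done
```

The loop prints nothing if the array does not exist, so it is safe to run anywhere.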

---

The problem turned out to be that MDRAID superblock 1.2 keeps a per-device bad block list, and these were phantom errors: md never touched the underlying block devices because the sectors were already on its internal bad block list, so no disk was named in the logs - the "bad" blocks were never even attempted.
Clearing the bad block list was the ultimate solution -- but watch out: doing this on a whim can cause silent data corruption.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?


Last edited by eccerr0r on Tue Mar 05, 2024 9:01 pm; edited 1 time in total
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54253
Location: 56N 3W

Posted: Sun Mar 03, 2024 11:22 am

eccerr0r,

It all depends ...

Code:
[20057.809453] Buffer I/O error on device dm-0, logical block 2599101
That's an unhappy raid set.
The underlying problem device(s) may have done sector reallocation, so the writes eventually succeeded but not fast enough to prevent the error report.

Can you post the output of
Code:
smartctl -x /dev/sd[a-e]

Any non-zero pending sector count is a bad thing. The drive knows that it has sectors that it cannot read.
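A loop over the members can pull out just that attribute - a sketch that needs root, assuming the standard ATA SMART attribute name Current_Pending_Sector:

```shell
#!/bin/sh
# Report the pending-sector count for each raid member disk (needs root).
# Exits quietly if smartctl is not installed or the devices are absent.
command -v smartctl >/dev/null 2>&1 || exit 0
for d in /dev/sd[a-e]; do
    [ -b "$d" ] || continue
    printf '%s pending: ' "$d"
    smartctl -A "$d" | awk '/Current_Pending_Sector/ {print $10}'
done
```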

Code:
echo check > /sys/block/md0/md/sync_action

Before you do that, look at the mismatch count:
Code:
cat /sys/devices/virtual/block/md0/md/mismatch_cnt
It should be zero both before and after the check.
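The check-and-compare routine can be scripted - a sketch, with md0 assumed as the array name; it is a no-op if that array is not present:

```shell
#!/bin/sh
# Run an md consistency check and report mismatch_cnt before and after.
# md0 is an example array name; the script does nothing if it does not exist.
MD=/sys/block/md0/md
if [ -w "$MD/sync_action" ]; then
    echo "mismatch_cnt before: $(cat "$MD/mismatch_cnt")"
    echo check > "$MD/sync_action"
    # sync_action returns to "idle" when the check completes
    while [ "$(cat "$MD/sync_action")" != "idle" ]; do sleep 30; done
    echo "mismatch_cnt after:  $(cat "$MD/mismatch_cnt")"
fi
```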

There is also
Code:
echo repair > /sys/block/md0/md/sync_action
but be sure you have a backup first.
check tells you whether something is wrong: an element of the array cannot be read, or can be read but the parity data at that point is not correct.
repair tells md to fix the parity data when check finds a problem.

A check will encourage the sector reallocation mechanism too.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r

Posted: Sun Mar 03, 2024 3:50 pm

Yeah, I definitely know one or more of the disks are choking from reading the SMART data, but I find it odd the kernel isn't reporting the underlying devices directly when they are accessed. It's possible the writes eventually succeeded later on (perhaps those are the callbacks being suppressed), but I'm surprised it doesn't report devices as they choke.

Also I would have seen the md subsystem reporting that it was trying to repair if it did find and fix an inconsistency.

BTW, I think the problems I'm seeing with this array are not the disks but rather a bad power supply's connectors. I've always hated SATA power connectors, as they are hard to repair (the connector has to be replaced when it fails) and I never have any spare SATA connectors...

---

Yay!
Code:
# cat mismatch_cnt
88272840

md127 : active raid5 sda2[5] sdd2[4] sde2[7] sdc2[6] sdb2[1]
      1951965184 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [=================>...]  check = 89.9% (438713184/487991296) finish=21.0min speed=38953K/sec
      bitmap: 2/4 pages [8KB], 65536KB chunk

The check did point out issues on sdb2, but I can't figure out why there are so many mismatches...
eccerr0r

Posted: Mon Mar 04, 2024 4:14 pm

The "Pending sectors" count on /dev/sdb was at a high of 52; after forcing a repair on the array it went down to 49.
I'm writing big zeroed files to the array and now it's down to 43. The reallocated sector count is still 0.

I was able to clean two SATA power connectors. One of them was completely useless; now it seems to be at least somewhat usable, as the disk it's connected to is working.

Argh. The perils of trying to reuse equipment.

(The array is all 500G disks, surprisingly all the same size. A WD Green, a WD AV-Green, a WD Blue, a Seagate ES, and a Seagate 7200.14 4K-sector. The AV-Green seems to be the questionable one though the Seagate ES was initially causing issues prior to swapping SATA power connectors. I really wonder how much money Seagate saved by making those reduced height 3.5" drives...)
eccerr0r

Posted: Mon Mar 04, 2024 7:38 pm

AHH, I figured it out.

These bad blocks are not necessarily real... or rather, they were real before the sectors recovered.
This is an mdraid superblock 1.2 "issue": md records bad blocks in a per-device bad block list and never retries them once recorded, even after the underlying sector becomes readable again, so it keeps returning errors for blocks that are now fine.

Do any other RAID subsystems (dm-raid, btrfs-raid, ???) handle bad blocks better? This seems to imply I need to wipe and recreate the RAID after fixing the PSU problem, since the blocks aren't really bad anymore but md won't try them again. Perhaps there's a way to clear these "bad" blocks, but the correct data would need to be written back into them first...

Probably easier to wipe and redo :(
szatox
Advocate

Joined: 27 Aug 2013
Posts: 3138

Posted: Mon Mar 04, 2024 11:31 pm

How 'bout failing and removing the drive with the bad blocks and then adding it to the raid set again?
A 500GB HDD should get fully resilvered in something like 30 minutes.
_________________
Make Computing Fun Again
eccerr0r

Posted: Tue Mar 05, 2024 1:02 am

It's actually more than 30 mins for some reason - some of my disks are slow, around 60-70MB/sec, though I do have some 90-120MB/sec units. Based on the 90-minute estimate shown at the start of the disk, it's more like 2 hours in practice, plus another 30% or so for inner-track slowdown.

I have two disks with false bad blocks, so that's 4 hours. I also need to get the temporary disk (a 500G 2.5" unit) back when this is done, and I want to make sure the array never runs in degraded mode...
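The back-of-the-envelope estimate above can be sketched as capacity over throughput, padded for the inner tracks (65 MB/s is an assumed average, not a measured value):

```shell
# Rough rebuild-time estimate: capacity / throughput, padded 30% for
# inner-track slowdown. Integer shell arithmetic, so results are approximate.
size_mb=500000      # one 500 GB member
rate_mb=65          # assumed average sequential rate in MB/s
secs=$(( size_mb / rate_mb ))
secs_adj=$(( secs * 130 / 100 ))
echo "base: $(( secs / 60 )) min, with 30% padding: $(( secs_adj / 60 )) min"
```

That works out to roughly 2 to 3 hours per member, consistent with the estimate above.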
NeddySeagoon

Posted: Tue Mar 05, 2024 10:06 am

eccerr0r,

You can run
Code:
mdadm /dev/md127 --replace /dev/sdb2 --with /dev/sdf2
(device names here are examples; --with names the spare) if you can connect the extra drive while the raid set is not degraded.
That's a lot safer than resilvering from a degraded set, as the array keeps full redundancy for the whole rebuild.
frostschutz
Advocate


Joined: 22 Feb 2005
Posts: 2977
Location: Germany

Posted: Tue Mar 05, 2024 3:45 pm

If you replace / fail a drive in an array with mismatches, the data on the array will change. Suppose all data is \0, but the parity is \1 (a mismatch). If you rebuild / replace one drive, the data on it previously \0 will be rebuilt as \1.

If it's a benign mismatch (filesystem free space? trim/discard?) it won't matter, but if it's actual data, then it's bye-bye data. It's very bad to have mismatches: RAID can't figure out which copy is right and which is wrong, so at this point you have to verify file contents yourself. Preferably *before* you repair, rebuild, or permanently "fix" mismatches the wrong way.

For mdadm's bad block list - which causes read errors on the md device even if all underlying devices were replaced - you can --assemble with --update=no-bbl or --update=force-no-bbl. Same problem: you don't necessarily get correct data for these blocks afterwards.
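Sketched out, that route looks like the following - array and device names are this thread's, and since this particular array holds root it would have to be done from a rescue environment, only once the underlying sectors are trusted again:

```shell
#!/bin/sh
# Inspect and then discard md's recorded bad block list on reassembly.
# Device/array names are examples; this is a no-op if the array is absent.
# Run from a rescue environment if the array holds the root filesystem.
set -e
[ -b /dev/md127 ] || exit 0
mdadm --examine-badblocks /dev/sdb2        # show what md has recorded
mdadm --stop /dev/md127
mdadm --assemble /dev/md127 --update=force-no-bbl /dev/sd[a-e]2
```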
eccerr0r

Posted: Tue Mar 05, 2024 4:52 pm

Yeah, that was my concern: I had to deal with the inconsistencies.
I first found all the files containing the bad blocks and saved a list of them.
Then I got rid of the bad blocks by disabling the bad block list and re-enabling it.
Then I deleted those files and recopied them from source.

I think I'm good now. equery check \* says my base install is still good, I restored the rest of the affected files from my main machine, and so far I'm not seeing any more bad block behavior.

I probably should do one more diff of this array and call it good. I think the main fix was cleaning those SATA power connectors, and the hard drive is stable now. In fact, the drive that had 52 pending sectors now reports zero pending sectors and, surprisingly, zero reallocates...
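The "find which files contain the bad blocks" step can be sketched with debugfs on ext4 - assuming dm-0 as the filesystem device and using the block number from the earlier dmesg output; icheck maps a filesystem block to an inode, ncheck maps the inode to a path, and both are read-only:

```shell
#!/bin/sh
# Map an ext4 filesystem block to an inode, then the inode to a path.
# 2599101 is the block from dmesg; this is a no-op if the device is absent.
command -v debugfs >/dev/null 2>&1 || exit 0
[ -e /dev/dm-0 ] || exit 0
debugfs -R "icheck 2599101" /dev/dm-0 2>/dev/null   # block -> inode
# then, with the inode number that prints:
# debugfs -R "ncheck <inode>" /dev/dm-0             # inode -> path
```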
NeddySeagoon

Posted: Tue Mar 05, 2024 6:32 pm

eccerr0r,

Quote:
In fact that hard drive with 52 pending sectors now reports zero pending sectors and surprisingly zero reallocates...

That means that those 52 sectors can be read now and the drive no longer wants to relocate them.

I don't know if drives try in-place rewrites before they move the data.