strange issues with raid6 (file corruption or kernel oops)

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Hello,

I have a raid6 array with a damaged hard drive. However, when a write error occurs on the array, it doesn't fail the harddrive, instead one of two things happen:

If I'm using kernel 3.18.12, it will log messages to dmesg saying I/O error, and the file on the array will be corrupt. The array does not fail the disk, as it should, so I end up with tons of corrupt files

If I'm using any 4.x version of kernel (I have tried both 4.0.9 and 4.1.12) then when a write error occurs, I get a kernel oops logged to dmesg and all I/O to the array will hang. I have to forcefully reboot the server, because a ton of processes get stuck in state D, and the discs are never marked as failed.

Here is the output from dmesg of a write error when it occurs on kernel version 3.18.12:

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

can you post mdadm --detail for /dev/md* and mdadm --examine for /dev/sd* and tune2fs -l /dev/md4?

Your issue is strange because it actually reports as I/O error on md4. With a bad disk it should report I/O error on /dev/sdx instead. It's a raid with double redundancy so a bad disk should not cause I/O errors on the md device until you have triple failure.

So it may be your issue is something different after all, such as a filesystem that believes itself to be larger than the device its on, or some other structural / logical problem rather than a hardware one.

The kernel panic you should probably take to the raid mailing list (try the latest stable kernel first, in case it was fixed somewhere already)

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Thanks for the reply.

If I re-enable ncq on the discs then the errors in the log do report at the /dev/sdb for example, but since I set the queue_depth to 1, it reports the raid device.

Here is all the info you requested:

mdadm --detail for /dev/md*

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

Maybe your issue has something to do with the bad block log, which is a relatively new feature in MD. A drive might get a bad block recorded in this log instead of being kicked from the array.

But /dev/sda1, /dev/sdg1, /dev/sdi1, /dev/sdp1 all claim to have "bad blocks present" and that probably shouldn't be, it shouldn't affect this many disks.

Do the disks all pass a 'smartctl -t long' self-test?

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

I did notice while pasting the logs that I had badblocks on multiple discs. But surely the raid array should take the approach of degrading if a write fails. I assume that if a write fails and it records it in the badblock list, that it will use another part of the disk to write that data?

It's also strange that on 3.18.12 I have I/O errors and on 4.0.8 / 4.0.12 I get a kernel oops, as if the condition is being handled differently.

I will post this to the kernel raid mailing list as well.
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Makes sense.

I have posted my issues to the linux-raid kernel mailing list, and linked them to this thread for more information.

I have issued the smartctl tests on all drives. That is going to take 4 hours so I will come back with the results from them later. I am expecting some of the drives to fail, in which case I would have thought that linux raid would degrade the array, unless obviously the badblocks list can work around the errors - but if that was the case I should have corrupt files / kernel oops.

Thanks for all your help.

Matt
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

and what does the --examine-badblocks look like?

If my theory was right it should show the same blocks bad on 3 disks and that block should translate to the sector ext4 was complaining about.

Since it's incredibly unlikely for same block to go back on three disks, maybe a controller issue that triggered it. you'd have to check your logs for old messages if you have them

NeddySeagoon · Posted: Fri Nov 06, 2015 8:13 pm Post subject:

matt2kjones,

When you get a write fail on a single drive in a raid set the drive will attempt to reallocate the failed sector.
This is a internal to the drive thing. The kernel is not involved.
Similarly with a read fail. The drive will want to reallocate the sector but can't because it can't read it.
Events like this are recorded in the drives internal SMART log. Take a look with smartmontools.

Drive level errors look like

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Yeah I'm not sure whats happening.

I subscribed to the linux-raid mailing list, and everything went fine, I am now receiving mails sent to that list.

I posted a message to the list and I get no response back, however if I send a command like "help" to the list, I do get a reply, so not sure why my message isn't being posted to the list.
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

OK, I have managed to post to the kernel mailing list using a different email address.

I have the output from mdadm --examine-badblocks. I am only listing drives here which have anything in the list:

/dev/sda1:

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

Yup, I'm not sure if that's how bad blocks are supposed to work. In your case it seems to have resulted in a "raid that never fails" which is not particularly useful if that in turn leaves the filesystem or system to deal with the mess or even crash...

I wish the bad blocks feature would be more exposed, say in /proc/mdstat instead of showing [UUU] or [U_U] it could do something like [BBB] for disks with known bad blocks, and mdadm monitor should send you mails about it if it doesn't already.

RAID survival depends on detecting errors early, and replacing disks immediately; if the bad blocks is designed to hide errors from you then it would be better to go without this feature. (even though it is a nice idea depending on the implementation as I mentioned earlier in this thread)

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Thanks again for the reply.

I think I read somewhere that if you use metadata version 0.9, the badblock functionality isn't enabled.

This array is split over two controllers (two 8port sas cards), and one of those drives with badblocks is on a different controller to the other, so I don't think it would be controller error, unless there was a power glitch or something, which could be possible, although the array and server are attached to a ups.

I can actually destroy this array. This server contains backups of our live, master server which uses hardware raid10 with many more discs. So I can easily destroy this array and re-create it with good discs and see if the problem goes away. The main reason I am looking to resolve it without destroying the data is so that I can understand why it's happened, and how to get around it in the future if it happens again.

I will probably go down the route you suggest and clear the badblock logs for all the drives and replace the known faulty drive (we have lots of unopened spares here).

One question, if you don't mind? If a drive has a write error, then the block is added to the badblocks list and if the write to the badblocks list fails, then the drive is set as faulty, I understand that. But what happens if the drive sucessfully writes the badblock to the badblock list? Do I only have one copy of that data elsewhere on the array? What happens if I have drive A with badblocks then Drive B and C fail. Theoretically I can recover the array, but I assume that the data in those badblocks would be lost.
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

DingbatCA · Guru Joined: 07 Jul 2004 Posts: 384 Location: Portland Or

I am having what I think is the same issue. After 50+ hours of trouble shooting, I ordered in 2 new LSI controller cards. I think the problem is with the mvsas card/driver. Matt2kjones, can you give us the output of lspci? What type of drives are you using? Mine are all 3TB WD Greens.

I have 10 drives in question and they keep failing. The drive(s) gets a sector marked as "pending" bad. From there I can use hdparm --write-sector to toggle the exact sector in question. The drive says the sector is fine. I have gone as far as running SMART long tests, and secure erase. The drives always come back healthy. Every test I can run on the drive shows they are in good healthy.

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Hi DingbatCA,

My issue seems to be that I have multiple drives with blocks all in the same area. If I take a drive out of the array with no badblocks, then add it back in, the badblocks from the other drives propagate to the badblocks list on the drive added back in. Not sure if this is meant to happen.

I have different cards to you:

lspci |grep SAS
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
02:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

Also, on my system, I'm not actually getting any harddrive errors on the harddrives themselves, only on the raid array as a whole, which makes me think that no read/write errors are actually happening on any of the drives and I think the badblock list is faulty somehow
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

OK, Changes I have done since my last post.

I have failed, removed and re-added 3 drives, one at a time.

/dev/sdp - This had the full list of badblocks above. When it re-added, it had none, when the sync completed, the badblock list was full again.
/dev/sda - Same as above
/dev/sdi - This drive only had two entries in the badblock list prior to removal - After a full sync, it had the full list... same as sdp and sda.

I have also switched to the latest mainline kernel 4.3.0

Since I have done these two things, write have been considerably faster, and I haven't had any dmesg errors yet (written over 400GB so far).

So I'm not sure whether taking the drives out of the array, and adding them back in, 1 at a time has fixed the issue, or whether the badblocks implementation is broken in earlier kernels, and it works correctly in 4.3.0

I plan to fill all the free space (about 6TB) to see if I have any write errors - If not, I assume this is fixed.
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

This issue seems to be resolved.

Wrote over 4TB of data to the array this morning, and finally hit an I/O error on /dev/sdd

drive was marked as faulty, and array degraded.

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Spoke to soon...

After writing about 6TB of data I have hit buffer I/O errors again:

krinn · Watchman Joined: 02 May 2003 Posts: 7470

If you want sector2 of drive1 the same content of sector2 in drive2, 3... you duplicates their sectors content. And have no way than accepting if any of the drive have sector3 dead, all drives will have sector3 mark dead (dead sector count == total different sectors dead on all drives)

You have another way by duplicating files instead, that is more flexible (you can use compression, dead sectors count == the biggest total of sector count on all drives, files content are the same on all disks, but sectors content are not), but the complexity to handle that have a great impact on performance.

with software raid, you can duplicate logical sectors, with hw you can only duplicate hw sectors (because to know the logical sectors, you must know the partition). So software raid array can combine different partitions from different disks, while hw arrays can only be made from disks.

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

if the same blocks are bad on 3+ disks, data for those blocks is gone (or at least considered such by mdadm), so sync won't get you data for those blocks back.

So after syncing the synced disks don't have valid data for these blocks, thus they are bad in a way.

You might have to turn off the bad block log to get rid of this issue (and remove disks that were not previously part of the raid, as those will be guaranteed to have wrong data in those blocks).

Please note my own experience with the bbl is very limited, hence I suggested the mailing list, ...

You can enable/disable bad block log using bbl / no-bbl options on assemble update, check mdadm manpage for details.

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

Hi,

I was going to remove the badblocks log but according to man, if there is anything stored in the badblocks log you can't remove it, IE, you can only remove the badblocks log if there are currently no badblocks logged on that drive.

So it seems that I am stuck in an error state that I can't get out of. MDADM adds badblocks to all drives that I add or remove.. the badblocks are not passed down to the filesystem level, so I can't even get ext4 to ignore the badblocks to avoid corruption, and I can't remove the badblocks list from any harddrives.

I could fail and replace each disc with a new disc one at a time and would still have array that is unusable.

I have posted this thread and additional information to the kernel mailing list and I haven't had any replies, and as there is so little information on mdadm badblocks on the internet, im going to have to destroy the array and start fresh, rebuilding the data from the master server as I can't spend all of next week on this as well (Spent two weeks trying to get it operating so far). When I re-create the array I will leave badblocks on - I guess it got into this state due to an early broken implementation.

Thanks for all the help
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions

frostschutz · Advocate Joined: 22 Feb 2005 Posts: 2977 Location: Germany

matt2kjones · Tux's lil' helper Joined: 03 Mar 2004 Posts: 89

This server acts as a backup that we can quickly grab files off, or a server we can switch over to if our master fails, so I am in a position where I can just destroy the array and re-create it.

Would have been nice to find a way out of this situation other than starting clean though.

I was thinking of stopping the array, then using dd to write zeros to the location of the badblocks list on each drive as I'm not worried about the data in those locations, then force a check on the array, but I guess I would have run into issues doing that and seemed like a lot of work for something that probably wouldn't have worked.

Again, thanks for all the help. Really appreciate all the help you've given.
_________________
OSST - Formally: The Linux Mirror Project
OSST - Open Source Software Downloads - Torrents for over 80 Distributions