mdadm error log -- does it exist?

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland

I had one HDD crash a few minutes ago. It's RAID5, so no worries (yet) about the data. Right now I have only remote SSH access to the server.
I'd like to know what caused the crash and whether it's recoverable after a power cycle (restart) of the computer with the failed drive (it may just as well be a broken SATA cable, which wouldn't be the first time for me). The faulty HDD got kicked out of the RAID, and it is also not responding to anything:

jbest · n00b Joined: 29 Sep 2011 Posts: 3

My raid5 array and my raid1 array failed this morning, too, in a very similar way to yours:

drescherjm · Posted: Thu Sep 29, 2011 4:45 pm Post subject:

I see this when a drive has too many UREs and goes completely offline trying to fix them, or when a drive totally dies. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives. I have now moved all arrays to raid 6 and monitor the status of 5 key SMART parameters to better predict drive failure. BTW, when I have had failures such as more than 2 drives kicked out of a raid6, I was able to recover by running ddrescue (to recover the readable parts) on the drives that were kicked out of the array, copying them to new disks.
_________________
John

My gentoo overlay
Instructons for overlay

jbest · n00b Joined: 29 Sep 2011 Posts: 3

drescherjm · Posted: Thu Sep 29, 2011 5:03 pm Post subject:

I believe it is "unrecoverable read error". These show up as Current_Pending_Sector and/or Offline_Uncorrectable in SMART.
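For anyone who wants to check those two attributes without reading the whole table by eye, here is a minimal Python sketch. It assumes the usual column layout of `smartctl -A` on ATA drives (numeric ID first, attribute name second, raw value in the tenth column). The function names are hypothetical and this is not taken from any script mentioned in this thread.

```python
import subprocess

# Attribute names as mentioned in the post above; the selection is illustrative.
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_raw_values(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes in `smartctl -A` text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row and other text are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Only plain integer raw values are kept (assumption: good enough here).
        if name in watched and raw.isdigit():
            values[name] = int(raw)
    return values


def read_smart_attributes(device):
    """Run `smartctl -A` on a device such as /dev/sda (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_raw_values(result.stdout)
```

A non-zero raw value for either attribute is the thing to look for; `read_smart_attributes("/dev/sda")` returns an empty dict if the drive does not report them in this format.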

BTW, here is a link to my script that checks the smart params:

https://raw.github.com/drescherjm/jmdgentoooverlay/master/Other/shell-scripts/examine_mdraid.sh

drescherjm · Posted: Thu Sep 29, 2011 5:06 pm Post subject:

As for the OP's question: I do not believe a log exists for this. However, most of the time you will see errors for the drive in your dmesg, more than what you posted.
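To pull the relevant lines out of a long dmesg, something like the following Python sketch may help. It is only a text filter under stated assumptions: the device name (for example `sdc`) and the optional ATA port number are supplied by you, and `filter_kernel_log` and `read_dmesg` are hypothetical helper names. Lines logged by the ATA layer usually name the port (`ata3.00:`) rather than the `sdX` device, which is why the port can be passed separately.

```python
import re
import subprocess


def filter_kernel_log(log_text, device, ata_port=None):
    """Return kernel log lines that mention a disk or, optionally, an ATA port.

    device   -- short device name such as "sdc" (partitions like sdc1 also match)
    ata_port -- optional port number, e.g. 3 to match "ata3:" and "ata3.00:"
    """
    patterns = [re.compile(rf"\b{re.escape(device)}\d*\b")]
    if ata_port is not None:
        patterns.append(re.compile(rf"\bata{int(ata_port)}(\.\d+)?:"))
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in patterns)
    ]


def read_dmesg():
    """Return the kernel ring buffer as text (may need root on some systems)."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout
```

Usage would be along the lines of `for line in filter_kernel_log(read_dmesg(), "sdc", ata_port=3): print(line)`. The ring buffer is limited in size, so older messages may already be gone.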

jbest · n00b Joined: 29 Sep 2011 Posts: 3

drescherjm · Posted: Thu Sep 29, 2011 5:37 pm Post subject:

A note about that script: for some manufacturers (like Seagate), some of the params may be bogus. You will know that when a value is something like 5443455 and you are expecting 10.

BTW, I did not explain exactly what the script does. It enumerates all /dev/sd devices, checks whether each device is in any of your mdadm arrays, and prints 5 key SMART params for the drive. I use this at work for my 75 to 100 drives in mdadm arrays. I also use Nagios to monitor the temps and the reallocated sectors count for each drive.

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland

Thanks for all your replies!
Fortunately for me, it was only a matter of reseating the SATA cable for the failed drive. After that it showed up fine and smartctl did not find any errors. Also, the write-intent bitmap saved me from 1.5 days of resync.

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland

drescherjm · Posted: Fri Sep 30, 2011 11:28 am Post subject:

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland

Yeah, you are right:

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland

fcuk, it happened again, in the same funny way:

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland

This time maybe not so "good health":

mbar · Veteran Joined: 19 Jan 2005 Posts: 1990 Location: Poland