Gentoo Forums
Can mdadm say if an array is broken?
doublehp
Guru

Joined: 11 Apr 2005
Posts: 473
Location: FRANCE

Posted: Wed Jan 06, 2010 3:40 am

When a device is missing from an array, /proc/mdstat puts an "_" in the status string. If only one device is missing in RAID5, or one or two in RAID6, the array can be repaired by adding a new volume: after "mdadm -a", recovery starts automatically.

When two devices are faulty in RAID5, or three in RAID6, mdadm displays things the same way, but the array cannot be repaired by adding more drives and the data cannot be recovered. After adding more drives, mdadm does NOT start recovery; it just lists the new devices as "spare". In this second case the data are lost forever, and this is not made clear at all ...

neither in /proc/mdstat nor in mdadm -D ...

Did I miss something? Or is mdadm just unable to "measure" this?

I understand that during boot the kernel adds volumes to arrays one by one, so that for a few milliseconds arrays are in degraded mode until all elements of the array are found. I understand that degraded mode is acceptable for a few seconds at boot time, and is thus an essential and "not so alarming" state for an array. Still, I would like mdadm or the kernel to tell me "right now, at once, the array is not usable, broken" and "if you don't add devices pretty fast, and perform any write on the array, you will lose data forever".
_________________
DEMAINE Benoît-Pierre (aka DoubleHP ) http://www.demaine.info/
>o_/ Coin coin coin \_o<
to contact me (MSN,ICQ, JABBER, Skype ... ) http://benoit.demaine.info/contact.png

drescherjm
Advocate

Joined: 05 Jun 2004
Posts: 2792
Location: Pittsburgh, PA, USA

Posted: Wed Jan 06, 2010 4:06 am

Quote:
When two devices are faulty in RAID5, or three in RAID6, mdadm displays things the same way, but the array cannot be repaired by adding more drives and the data cannot be recovered. After adding more drives, mdadm does NOT start recovery; it just lists the new devices as "spare". In this second case the data are lost forever, and this is not made clear at all ...


How are you losing so many drives? This is definitely not normal. In 6 years I have not lost a single software RAID 5 or 6 array out of dozens. If your drives are just being kicked out of the array for some reason (a loose power connector, a bad SATA cable, a bad SATA controller ...) and the drives are still in working condition (good enough to duplicate), you can duplicate the missing drives and force the array to use the out-of-sync member if needed. If the cause of being kicked out of the array is not a bad drive, there is no need to copy the drive; just force mdadm to assemble. There may be some inconsistencies, but that is better than losing everything.


Quote:
I understand that during boot the kernel adds volumes to arrays one by one, so that for a few milliseconds arrays are in degraded mode until all elements of the array are found.


This is wrong. The array usually is not started until all members are found.

Quote:
Still, I would like mdadm or the kernel to tell me "right now, at once, the array is not usable, broken" and "if you don't add devices pretty fast, and perform any write on the array, you will lose data forever".


You can monitor this in several ways. The mdadm daemon can be set up to email you when the array becomes degraded, or there are other programs that monitor arrays, such as Nagios.

Also, if you have these kinds of problems often, add a spare. On more than 10 arrays (over 50 hard drives) that have run 24/7 for the last 3 to 6 years, I have replaced 3 drives in total.
_________________
John

My gentoo overlay
Instructions for overlay

doublehp
Posted: Wed Jan 06, 2010 4:11 am

I have only used RAID0 and RAID1 in the past. I bought 4 drives this morning and am doing heavy tests to understand how mdadm works, and what my computer will tell me when one drive fails, then a second one. For now I am just simulating failures with mdadm -f and -r.

Still, if I do -f vol4 -f vol3 -r vol4 -r vol3 ... /proc/mdstat says that everything is normal, while I would expect some "error", warning, or explicit message about possible data loss. IMHO, it should show a graphical difference between "you are losing redundancy", "you have lost redundancy" and "your data are corrupt". mdadm -D is not really better than mdstat.

drescherjm
Posted: Wed Jan 06, 2010 4:14 am

Quote:
Still, if I do -f vol4 -f vol3 -r vol4 -r vol3 ... /proc/mdstat says that everything is normal.


That is not normal. The status (/proc/mdstat) normally tells you as soon as the drive is kicked out of the array.

It does this by marking missing drives with "_", and mdadm can also email you about it.

That said, I have not played around with failing drives/members in software in years.

Last edited by drescherjm on Wed Jan 06, 2010 4:19 am; edited 1 time in total

drescherjm
Posted: Wed Jan 06, 2010 4:18 am

Quote:
IMHO, it should show a graphical difference between "you are losing redundancy", "you have lost redundancy" and "your data are corrupt".


I have not seen this type of message in any program.

However, as an admin you should address a single drive failure as soon as you can.

doublehp
Posted: Wed Jan 06, 2010 4:18 am

It just says [UU__], as you expect ... and *I* have to know that if the array is RAID5 then it's dead, and if it is RAID6 I have to hurry up.

Unless the messages are suppressed because I removed the members manually? Maybe things are different when the kernel detects the faults by itself? Still, before letting me mark a volume as faulty, it should warn me, because if another volume breaks just before I mark it faulty ... I may rapidly end up with a broken system.

mdadm really seems to have absolutely no safeguard against PEBCAK, or human mistakes.

doublehp
Posted: Wed Jan 06, 2010 4:19 am

So mdadm can send me email/SMS automatically? Where do I set this?

drescherjm
Posted: Wed Jan 06, 2010 4:24 am

Quote:
It just says [UU__], as you expect


The output is the same for hardware failures. Again, you can have mdadm email you that there is a problem, or have Nagios monitor your array for you and email you as well.

If you are looking for it to tell you that one "_" in RAID 5, or two "_" in RAID 6, is one step from data loss, the software does not spell it out that clearly. As an admin you are supposed to know that.

I think webmin will also show you this.
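
One thing that may come close to a built-in "is it broken?" answer is the exit status of `mdadm --detail --test`. The mdadm man page documents it as 0 (functioning normally), 1 (at least one failed device), 2 (multiple failed devices, array unusable) and 4 (error getting information). Below is a minimal Python sketch that wraps it. The function names are illustrative assumptions, the call needs root and an installed mdadm, and it is untested against old mdadm versions.

```python
import subprocess

# Exit statuses documented for "mdadm --detail --test" in the mdadm man page.
TEST_STATUS = {
    0: "array is functioning normally",
    1: "array has at least one failed device",
    2: "array has multiple failed devices and is unusable",
    4: "error while trying to get information about the device",
}


def describe_status(code):
    """Translate an exit status into a human-readable message."""
    return TEST_STATUS.get(code, "unexpected exit status %d" % code)


def check_array(device):
    """Run mdadm on one array (needs root) and return (status, message)."""
    result = subprocess.run(
        ["mdadm", "--detail", "--test", device],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode, describe_status(result.returncode)
```

For example, `check_array("/dev/md0")` (the device path is only an example) returns a pair such as `(1, "array has at least one failed device")`, which separates "degraded" from "unusable" without counting "_" characters by hand.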

drescherjm
Posted: Wed Jan 06, 2010 4:25 am

doublehp wrote:
So mdadm can send me email/SMS automatically? Where do I set this?


In your mdadm.conf.

Also remember to start the mdadm daemon.
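
To make that a bit more concrete: the mdadm.conf keywords involved are `MAILADDR` (where alert mail goes) and `PROGRAM` (a program run for each event), and the daemon is `mdadm --monitor --scan`. According to the man page, the program receives the event name, the md device and sometimes a component device as arguments. Here is a rough Python sketch of the core of such a handler. The severity grouping and the message wording are assumptions made for illustration, not something mdadm ships.

```python
# Event names that, per the mdadm man page (MONITOR MODE), also trigger mail.
ALERT_EVENTS = {"Fail", "FailSpare", "DegradedArray", "SparesMissing"}


def format_event(argv):
    """Build a one-line report from the arguments mdadm passes to PROGRAM.

    argv is [event, md_device] or [event, md_device, component_device].
    """
    if len(argv) < 2:
        raise ValueError("expected: EVENT MD_DEVICE [COMPONENT_DEVICE]")
    event, md_device = argv[0], argv[1]
    component = argv[2] if len(argv) > 2 else None
    # The ALERT/info split is an assumption made for this sketch.
    severity = "ALERT" if event in ALERT_EVENTS else "info"
    line = "%s: %s on %s" % (severity, event, md_device)
    if component:
        line += " (device %s)" % component
    return line
```

A wrapper script that ends with `print(format_event(sys.argv[1:]))` could then be referenced from mdadm.conf with a line such as `PROGRAM /usr/local/bin/md-event` (a hypothetical path), next to `MAILADDR root`.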

drescherjm
Posted: Wed Jan 06, 2010 4:27 am

Also, /proc/mdstat will say "degraded" for an array that does not have all of its drives.

doublehp
Posted: Wed Jan 06, 2010 4:27 am

I have never used the conf file; I always let the magic do things. Can't mdadm record the email address directly on the drives? Or can the monitoring daemon do it without arrays being declared in the conf?

I tried several times to declare the drives in the conf, and always ran into trouble.

drescherjm
Posted: Wed Jan 06, 2010 4:29 am

There is some good info here:
http://en.gentoo-wiki.com/wiki/RAID/Software

doublehp
Posted: Wed Jan 06, 2010 4:29 am

drescherjm wrote:
Also, /proc/mdstat will say "degraded" for an array that does not have all of its drives.


No, it did not; not even that. Unless it's a kernel/mdadm version problem. I am ATM using an old system (stable Debian); I will have a better one in two days (stable Gentoo).

drescherjm
Posted: Wed Jan 06, 2010 4:33 am

I may be wrong about /proc/mdstat saying "degraded". I do know that it sends out emails, though.

drescherjm
Posted: Wed Jan 06, 2010 4:36 am

BTW, here are examples of the emails Nagios can send:

Code:
***** Nagios 2.10 *****

Notification Type: PROBLEM

Service: Linux Raid Status for md1
Host: dev6
Address: dev6.radimg.pitt.edu
State: CRITICAL

Date/Time: Wed Sept 17 15:39:30 EDT 2008

Additional Info:

CRITICAL md1 status=[UUU_].


Code:
***** Nagios 2.10 *****

Notification Type: PROBLEM

Service: Linux Raid Status for md2
Host: dev6
Address: dev6.radimg.pitt.edu
State: WARNING

Date/Time: Thu Sept 18 13:08:20 EDT 2008

Additional Info:

WARNING md2 status=[UUU_], recovery=80.2%, finish=29.5min.
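
A check along these lines can be sketched in a few lines of Python. This is not the plugin that produced the mails above, only a minimal illustration that assumes the conventional Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL) and the usual /proc/mdstat layout. The function name and the sample text, including its block counts and speeds, are made up.

```python
import re

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def check_mdstat(text):
    """Return {array_name: (exit_code, message)} for /proc/mdstat content."""
    results = {}
    name = flags = None
    for line in text.splitlines():
        header = re.match(r"(md\S+) : ", line)
        if header:
            name, flags = header.group(1), None
            continue
        if name is None:
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if status:
            flags = status.group(1)
            if "_" in flags:
                results[name] = (CRITICAL, "CRITICAL %s status=[%s]." % (name, flags))
            else:
                results[name] = (OK, "OK %s status=[%s]." % (name, flags))
            continue
        rebuild = re.search(r"recovery\s*=\s*([\d.]+)%.*finish=(\S+)", line)
        if rebuild and flags and "_" in flags:
            # Degraded but rebuilding: report WARNING instead of CRITICAL.
            results[name] = (
                WARNING,
                "WARNING %s status=[%s], recovery=%s%%, finish=%s."
                % (name, flags, rebuild.group(1), rebuild.group(2)),
            )
    return results


# Made-up sample in the usual /proc/mdstat layout.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdc1[2] sdb1[1] sda1[0]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

md2 : active raid5 sdh1[4] sdg1[2] sdf1[1] sde1[0]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [================>....]  recovery = 80.2% (783510016/976759936) finish=29.5min speed=109040K/sec

unused devices: <none>
"""
```

Calling `check_mdstat(SAMPLE)` classifies md1 as CRITICAL and md2 as WARNING, mirroring the two mails. Note that it only looks for "_" and a recovery line; it does not take the RAID level into account.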

doublehp
Posted: Wed Jan 06, 2010 12:07 pm

drescherjm wrote:
BTW, here are examples of the emails Nagios can send:


That's nice :) Thanks.

Monkeh
Veteran

Joined: 06 Aug 2005
Posts: 1656
Location: England

Posted: Wed Jan 06, 2010 12:53 pm

doublehp wrote:
It just says [UU__], as you expect ... and *I* have to know that if the array is RAID5 then it's dead, and if it is RAID6 I have to hurry up.


I know, it's horrible, you have to understand how it works.

doublehp
Posted: Wed Jan 06, 2010 1:23 pm

Monkeh wrote:
doublehp wrote:
It just says [UU__], as you expect ... and *I* have to know that if the array is RAID5 then it's dead, and if it is RAID6 I have to hurry up.


I know, it's horrible, you have to understand how it works.


When I have a full page with half a dozen or a dozen RAID arrays, I will not count the "_" for every single one of them and check whether that number is acceptable for the array's RAID level. A very simple symbol could make this nicely explicit:
- - array is fully operational
- / some redundancy is missing
- ! state is critical, you no longer have any redundancy
- ? you lost too many drives, data are not recoverable (unless you can introduce a drive that you are sure is in sync with this array, and the array is RO)

A very simple sign, a single character, put just after the [UUUU__] (in /proc/mdstat, with the corresponding full description in mdadm -D) could easily show what I am talking about. mdadm says "degraded" in every case.

For example, 3 missing drives in a 5-drive RAID1 should just be "/". To illustrate: [UU___] /
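
The proposal above can be prototyped outside the kernel. The following Python sketch is only an illustration of the idea, not an mdadm feature: the function names are invented, the tolerance table covers RAID0/1/4/5/6 only, and RAID10 is deliberately left out because its tolerance depends on which members are lost.

```python
import re

# Number of member losses each level survives. RAID1 is handled separately
# because it depends on the number of mirrors. RAID10 is left out on purpose.
FIXED_TOLERANCE = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def redundancy_symbol(level, flags):
    """Map a level such as 'raid5' and flags such as 'UU__' to - / ! or ?."""
    total, missing = len(flags), flags.count("_")
    if level == "raid1":
        tolerance = total - 1
    elif level in FIXED_TOLERANCE:
        tolerance = FIXED_TOLERANCE[level]
    else:
        raise ValueError("level not covered by this sketch: %s" % level)
    if missing == 0:
        return "-"  # fully operational
    if missing > tolerance:
        return "?"  # too many members lost
    if missing == tolerance:
        return "!"  # no redundancy left
    return "/"      # some redundancy missing


def annotate_mdstat(text):
    """Append the proposed symbol after each [UU__] field of mdstat text."""
    level = None
    out = []
    for line in text.splitlines():
        if re.match(r"md\S+ : ", line):
            found = re.search(r"\b(raid\d+)\b", line)
            level = found.group(1) if found else None
        status = re.search(r"\[[U_]+\]", line)
        if status and level:
            try:
                symbol = redundancy_symbol(level, status.group(0)[1:-1])
            except ValueError:
                symbol = None  # unknown level: leave the line untouched
            if symbol:
                line = line[:status.end()] + " " + symbol + line[status.end():]
        out.append(line)
    return "\n".join(out)
```

With this, `redundancy_symbol("raid1", "UU___")` gives "/", matching the example above, while the same [UU__] pattern gives "?" for RAID5 and "!" for RAID6.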

The mail system proposed by drescherjm seems nice to me.

drescherjm
Posted: Wed Jan 06, 2010 2:26 pm

Here is some info on where to begin for nagios:

http://www.gentoo.org/doc/en/nagios-guide.xml

All times are GMT