doublehp Guru
Joined: 11 Apr 2005 Posts: 473 Location: FRANCE
Posted: Wed Jan 06, 2010 3:40 am Post subject: can mdadm say if an array is broken?

When a device is missing from an array, /proc/mdstat puts an "_" in the status. If only one device is missing in RAID5, or one or two in RAID6, the array can be repaired by adding a new volume: after "mdadm -a", recovery starts automatically.
When two devices are faulty in RAID5, or three in RAID6, mdadm reports things the same way, yet the array cannot be repaired by adding more drives and the data can never be recovered. After adding more drives, mdadm does NOT start recovery; it just marks the new devices as "spare". In this second case the data are lost forever, and this is not made clear at all, neither in /proc/mdstat nor in mdadm -D.
Did I miss something? Or is mdadm simply unable to "measure" this?
I understand that during boot the kernel adds volumes to arrays one by one, so that for a few milliseconds arrays are in degraded mode until all their elements are found. I understand that degraded mode is acceptable for a few seconds at boot time, and is therefore an essential and "not so alarming" state for an array. Still, I would like mdadm or the kernel to tell me "right now, the array is not usable, it is broken, and <<if you don't add blocks pretty fast, and you perform any write on the array, you will lose data forever>>".
_________________
DEMAINE Benoît-Pierre (aka DoubleHP) http://www.demaine.info/
>o_/ Coin coin coin \_o<
to contact me (MSN, ICQ, JABBER, Skype ...) http://benoit.demaine.info/contact.png
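The "_" notation described in this post can be read mechanically. Here is a minimal Python sketch, assuming the usual /proc/mdstat layout in which each array's status line ends with something like `[4/3] [UUU_]`. The sample text and the function name `parse_mdstat` are made up for illustration; they are not output from the poster's machine, and only levels written as `raidN` are recognised.

```python
import re

# Illustrative /proc/mdstat content (invented for this sketch).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid6 sdh1[3](F) sdg1[2](F) sdf1[1] sde1[0]
      1953524992 blocks level 6, 64k chunk, algorithm 2 [4/2] [UU__]

unused devices: <none>
"""

# "md0 : active raid5 ..." -> name, state, level
ARRAY_LINE = re.compile(r"^(md\S*)\s*:\s*(\S+)\s+(raid\d+)\b")
# "[4/3] [UUU_]" -> total members, active members, per-member flags
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return {array name: {"level", "total", "missing", "status"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {"level": header.group(3)}
            continue
        status = STATUS.search(line)
        if status and current:
            flags = status.group(3)
            arrays[current].update(
                total=int(status.group(1)),
                missing=flags.count("_"),  # each "_" is one missing member
                status="[" + flags + "]",
            )
            current = None
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
        print(name, info)
```

Counting the "_" characters only says how many members are gone; whether that number is fatal still depends on the RAID level, which is exactly the gap the post complains about.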
drescherjm Advocate
Joined: 05 Jun 2004 Posts: 2792 Location: Pittsburgh, PA, USA
Posted: Wed Jan 06, 2010 4:06 am

Quote: | When two devices are faulty in RAID5, or three in RAID6, mdadm reports things the same way, yet the array cannot be repaired by adding more drives and the data can never be recovered. After adding more drives, mdadm does NOT start recovery; it just marks the new devices as "spare". In this second case the data are lost forever, and this is not made clear at all ... |
How are you losing so many drives? This is definitely not normal. In 6 years I have not lost a single software RAID 5 or 6 array out of dozens. If your drives are just being kicked out of the array for some reason (a loose power connector, a bad SATA cable, a bad SATA controller ...) and the drives are still in working condition (good enough to duplicate), you can duplicate the missing drives and force the array to use the out-of-sync member if needed. If the drive was kicked out for a reason other than a bad drive, there is no need to copy it: just force mdadm to assemble. There may be some inconsistencies, but that is better than losing everything.
Quote: | I understand that during boot the kernel adds volumes to arrays one by one, so that for a few milliseconds arrays are in degraded mode until all their elements are found. |
This is wrong. The array usually is not started until all members are found.
Quote: | Still, I would like mdadm or the kernel to tell me "right now, the array is not usable, it is broken, and <<if you don't add blocks pretty fast, and you perform any write on the array, you will lose data forever>>". |
You can monitor this in several ways. The mdadm daemon can be set up to email you when the array becomes degraded, and other programs such as Nagios can monitor the arrays too.
Also, if you have these kinds of problems often, add a spare. On more than 10 arrays (over 50 hard drives) that have run 24/7 for the last 3 to 6 years, I have replaced 3 drives in total.
_________________
John
My gentoo overlay
Instructions for overlay
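Besides email, mdadm's monitor mode can run an alert program for each event, passing it the event name, the md device and, where relevant, the component device. Below is a minimal sketch of what such a handler might do with those arguments. The function name `format_alert`, the output format and the choice of which events count as urgent are assumptions made for this example, not behaviour of mdadm itself.

```python
# Event names as documented for mdadm's monitor mode. Treating exactly
# these four as urgent is a choice made for this sketch.
URGENT_EVENTS = {"Fail", "FailSpare", "DegradedArray", "SparesMissing"}


def format_alert(args):
    """Build a one-line message from the arguments given to an alert program.

    args[0] is the event name, args[1] the md device, and args[2]
    (optional) the component device involved.
    """
    event, md_device = args[0], args[1]
    component = args[2] if len(args) > 2 else None
    severity = "URGENT" if event in URGENT_EVENTS else "info"
    detail = f" (component {component})" if component else ""
    return f"[{severity}] {event} on {md_device}{detail}"


if __name__ == "__main__":
    print(format_alert(["Fail", "/dev/md0", "/dev/sdb1"]))
    print(format_alert(["RebuildStarted", "/dev/md0"]))
```

A script built around this would be registered with the PROGRAM line in mdadm.conf or the --program option of mdadm --monitor; the device names in the example calls are hypothetical.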
doublehp Guru
Posted: Wed Jan 06, 2010 4:11 am

I have only used RAID0 and RAID1 in the past. I bought 4 drives this morning and am running heavy tests to understand how mdadm works and what my computer will tell me when one drive fails, then a second one. For now I am just simulating failures with mdadm -f -r.
Still, if I do -f vol4 -f vol3 -r vol4 -r vol3 ... /proc/mdstat says that everything is normal, while I would expect some error, warning, or explicit message about possible data loss. IMHO, it should show a graphical difference between "you are losing redundancy", "you have lost redundancy" and "your data are corrupt". mdadm -D is not really better than mdstat.
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:14 am

Quote: | Still, if I do -f vol4 -f vol3 -r vol4 -r vol3 ... /proc/mdstat says that everything is normal. |
That is not normal. The status (/proc/mdstat) normally tells you as soon as a drive is kicked out of the array.
It does this by marking missing drives with "_", and mdadm can also email you about it.
That said, I have not played around with failing drives/members in software for years.
Last edited by drescherjm on Wed Jan 06, 2010 4:19 am; edited 1 time in total
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:18 am

Quote: | IMHO, it should show a graphical difference between "you are losing redundancy", "you have lost redundancy" and "your data are corrupt". |
Now this type of message I have not seen in any program.
However, as an admin you should address a single drive failure as soon as you can.
doublehp Guru
Posted: Wed Jan 06, 2010 4:18 am

It just says, as you expect, [UU__] ... and *I* have to know that if the array is RAID5 it is dead, and if it is RAID6 I have to hurry up.
Unless the messages are missing because I removed the members manually? Maybe things are different when the kernel detects the faults by itself? Still, before letting me mark a volume as faulty, it should warn me. Because if another volume breaks just before I mark one as faulty ... I may quickly end up with a broken system.
mdadm really seems to have absolutely no safeguard against PEBCAK, or human mistakes.
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:24 am

Quote: | It just says, as you expect, [UU__] |
This is the same output as for hardware failures. Again, you can have mdadm email you that there is a problem, or have Nagios monitor your array and email you.
If you want it to tell you that one "_" in RAID 5, or two "_" in RAID 6, is one step from data loss, the software does not spell it out that clearly. As an admin you are supposed to know that.
I think Webmin will also show you this.
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:25 am

doublehp wrote: | mdadm should send me email/sms automatically ? where do i set this ? |
In your mdadm.conf.
Also remember to start the mdadm daemon.
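In mdadm.conf, the address used for alerts is given on a MAILADDR line, and it is picked up when the monitor runs with --scan (for example `mdadm --monitor --scan --daemonise`). As a quick sanity check, here is a minimal Python sketch that looks for that line in the text of a config file. The sample config, the address and the function name `find_mailaddr` are hypothetical, and the parsing is simplified: it ignores continuation lines and abbreviated keywords.

```python
# Hypothetical mdadm.conf content, for illustration only.
SAMPLE_CONF = """\
# example configuration
DEVICE partitions
MAILADDR root@localhost
"""


def find_mailaddr(conf_text):
    """Return the address on the first MAILADDR line, or None if absent."""
    for raw in conf_text.splitlines():
        # Simplification: treat everything after "#" as a comment.
        line = raw.split("#", 1)[0].strip()
        words = line.split()
        # Keywords in mdadm.conf are case-insensitive.
        if len(words) >= 2 and words[0].lower() == "mailaddr":
            return words[1]
    return None


if __name__ == "__main__":
    address = find_mailaddr(SAMPLE_CONF)
    print(address if address else "no MAILADDR line: the monitor has nowhere to send mail")
```

Running the monitor once with --test should generate a test message for each array, which is a simple way to confirm that mail delivery works before a real failure happens.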
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:27 am

Also, /proc/mdstat will say "degraded" for an array that does not have all of its drives.
doublehp Guru
Posted: Wed Jan 06, 2010 4:27 am

I have never used the conf file; I always let the magic do things. Can't mdadm record the email address directly on the drives? Or can the monitoring daemon do it without the arrays being declared in the conf?
I have tried several times to declare the drives in the conf, and always ran into trouble.
doublehp Guru
Posted: Wed Jan 06, 2010 4:29 am

drescherjm wrote: | Also, /proc/mdstat will say "degraded" for an array that does not have all of its drives. |
No, it did not; not even that. Unless it is a kernel/mdadm version problem. ATM I am using an old system (stable Debian); I will have a better one in two days (stable Gentoo).
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:33 am

I may be wrong about /proc/mdstat saying "degraded". I do know that it sends out emails, though.
drescherjm Advocate
Posted: Wed Jan 06, 2010 4:36 am

BTW, here are examples of the emails Nagios can send:
Code: | ***** Nagios 2.10 *****
Notification Type: PROBLEM
Service: Linux Raid Status for md1
Host: dev6
Address: dev6.radimg.pitt.edu
State: CRITICAL
Date/Time: Wed Sept 17 15:39:30 EDT 2008
Additional Info:
CRITICAL md1 status=[UUU_]. |
Code: | ***** Nagios 2.10 *****
Notification Type: PROBLEM
Service: Linux Raid Status for md2
Host: dev6
Address: dev6.radimg.pitt.edu
State: WARNING
Date/Time: Thu Sept 18 13:08:20 EDT 2008
Additional Info:
WARNING md2 status=[UUU_], recovery=80.2%, finish=29.5min. |
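The two notifications above suggest a simple rule: a degraded array is CRITICAL, and a degraded array that is already rebuilding is only a WARNING. Here is a minimal Python sketch of a check that follows that rule and uses the conventional Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function name `check_array` is hypothetical, and the rule is inferred from the two sample emails only; the actual plugin used for them is not shown in this thread.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_array(name, flags, recovery=None):
    """Return (exit_code, message) for one md array.

    flags    -- member flags from /proc/mdstat without brackets, e.g. "UUU_"
    recovery -- rebuild progress in percent, or None if no rebuild is running
    """
    status = f"{name} status=[{flags}]"
    if "_" not in flags:
        return OK, f"OK {status}."
    if recovery is not None:
        # Degraded but rebuilding: redundancy is on its way back.
        return WARNING, f"WARNING {status}, recovery={recovery:.1f}%."
    # Degraded with no rebuild in progress.
    return CRITICAL, f"CRITICAL {status}."


if __name__ == "__main__":
    for args in (("md0", "UUUU"), ("md1", "UUU_"), ("md2", "UUU_", 80.2)):
        code, message = check_array(*args)
        print(code, message)
```

Note that this rule does not look at the RAID level, so like /proc/mdstat it cannot tell "one failure from data loss" apart from "data already lost".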
doublehp Guru
Posted: Wed Jan 06, 2010 12:07 pm

drescherjm wrote: | BTW, here are examples of the emails Nagios can send: |
That's nice. Thanks.
Monkeh Veteran
Joined: 06 Aug 2005 Posts: 1656 Location: England
Posted: Wed Jan 06, 2010 12:53 pm

doublehp wrote: | It just says, as you expect, [UU__] ... and *I* have to know that if the array is RAID5 it is dead, and if it is RAID6 I have to hurry up. |
I know, it's horrible, you have to understand how it works.
doublehp Guru
Posted: Wed Jan 06, 2010 1:23 pm

Monkeh wrote: | doublehp wrote: | It just says, as you expect, [UU__] ... and *I* have to know that if the array is RAID5 it is dead, and if it is RAID6 I have to hurry up. |
I know, it's horrible, you have to understand how it works. |
When I have a full page with half a dozen or a dozen RAID arrays, I will not count the "_" for every single one of them and check whether that number is acceptable for the array's RAID type. A very simple marker could be nicely explicit:
- - array is fully operational
- / some redundancy is missing
- ! state is critical, you have no redundancy left
- ? you lost too many drives, data are not recoverable (unless you can introduce a drive that you are sure is in sync with this array, and the array is RO)
A very simple sign, a single character placed just after the [UUUU__] (in /proc/mdstat, with the respective full descriptions in mdadm -D), could easily show what I am talking about. mdadm says "degraded" in every case.
For example, 3 missing drives in a 5-drive RAID1 should just be "/". Just to illustrate: [UU___] /
The mail system proposed by drescherjm seems nice to me.
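The four-symbol scheme proposed in this post can be computed from the RAID level, the number of members and the number of missing members. Here is a minimal Python sketch, assuming the standard redundancy of each level: RAID0 survives no failure, RAID1 survives all but one member, RAID4/5 one failure, RAID6 two. The function names are hypothetical, nothing like this exists in mdadm or /proc/mdstat, and RAID10 is left out because its tolerance depends on which members fail.

```python
def tolerated_failures(level, total):
    """Number of missing members the array can survive without data loss."""
    if level == "raid0":
        return 0
    if level == "raid1":
        return total - 1  # one surviving mirror is enough
    if level in ("raid4", "raid5"):
        return 1
    if level == "raid6":
        return 2
    raise ValueError(f"level not covered by this sketch: {level}")


def state_symbol(level, total, missing):
    """Map an array's condition to the '-', '/', '!' or '?' marker from the post."""
    if missing == 0:
        return "-"  # fully operational
    margin = tolerated_failures(level, total) - missing
    if margin > 0:
        return "/"  # some redundancy is missing
    if margin == 0:
        return "!"  # critical: no redundancy left
    return "?"      # too many members lost, data not recoverable


if __name__ == "__main__":
    print("[UU___]", state_symbol("raid1", 5, 3))
    print("[UU__] ", state_symbol("raid6", 4, 2))
    print("[UU__] ", state_symbol("raid5", 4, 2))
```

The same `[UU__]` string maps to "!" for RAID6 and "?" for RAID5, which is the distinction the thread is asking the tools to make.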