Gentoo Forums
[SOLVED] drive was removed from my RAID 5 array; is it dead?
mhelvens
Guru
Joined: 17 Mar 2005
Posts: 337
Location: The Netherlands

Posted: Mon Oct 29, 2012 1:32 pm    Post subject: [SOLVED] drive was removed from my RAID 5 array; is it dead?

Hello all!

My /home dir consists of a RAID 5 array with three 1.5TB disks. Yesterday I did an `emerge --update --deep world`. Today upon reboot, /home didn't mount.

So, I did an `mdadm --assemble --scan` and got the message:

Code:
mdadm: /dev/md127 has been started with 2 drives (out of 3)


`mdadm --detail /dev/md127` now shows one of the drives as 'removed':

Code:
mhelvens-pc mhelvens # mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Thu Oct 20 19:41:06 2011
     Raid Level : raid5
     Array Size : 2930272256 (2794.53 GiB 3000.60 GB)
  Used Dev Size : 1465136128 (1397.26 GiB 1500.30 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 29 14:06:06 2012
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : michiel-pc:0
           UUID : 82da8dc5:42efff78:bcce5cab:0baa4591
         Events : 24760

    Number   Major   Minor   RaidDevice State
       3       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       0        0        2      removed


If I press on, and tell `mdadm` specifically about /dev/sdd1, I will get something like:

Code:
mdadm: /dev/md/michiel-pc:0_0 assembled from 1 drive - not enough to start the array


Also `mdadm /dev/md127 --re-add /dev/sdd1` doesn't work:

Code:
mdadm: --re-add for /dev/sdd1 to /dev/md127 is not possible


Of course, it occurred to me that the drive may be dead (that's why I have RAID5, after all). But it seems like too much of a coincidence that this happened after a long overdue world update where I... didn't pay particular attention to the messages afterwards.

How can I be sure?

Thanks in advance!


Last edited by mhelvens on Tue Oct 30, 2012 7:53 pm; edited 1 time in total

DaggyStyle
Advocate
Joined: 22 Mar 2006
Posts: 4957

Posted: Mon Oct 29, 2012 3:02 pm

I'm far from being an expert, but can you see anything on that drive? Partition tables? SMART status?

Also, if I'm not mistaken, getting the redundancy feature of RAID 5 working (i.e. lose one drive, data still intact) requires 4 drives; running RAID 5 on three drives is RAID 5 without redundancy.
_________________
Only two things are infinite, the universe and human stupidity and I'm not sure about the former - Albert Einstein
ProjectFootball

mhelvens
Posted: Mon Oct 29, 2012 3:06 pm

DaggyStyle wrote:
I'm far from being an expert, but can you see anything on that drive? Partition tables? SMART status?


Looks like it. In fdisk I can still see this info (looks fine):

Code:
Disk /dev/sdd: 1500.3 GB, 1500301910016 bytes, 2930277168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xa7a50b94

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1            2048  2930277167  1465137560   fd  Linux raid autodetect


Is there any other specific info I should look up?

DaggyStyle wrote:
Also, if I'm not mistaken, getting the redundancy feature of RAID 5 working (i.e. lose one drive, data still intact) requires 4 drives; running RAID 5 on three drives is RAID 5 without redundancy.


No, that's not true. RAID 5 gives up one drive's worth of capacity to parity and can survive the loss of any single drive, as long as you have 3 drives or more in total. Right now the array is running fine on 2 drives, without redundancy.
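
The capacity figures in the `mdadm --detail` output from the first post are consistent with this: if one member's worth of space goes to parity, the usable size should be (members - 1) times the per-member size. A quick arithmetic check with those numbers (the function name is made up for illustration, it is not part of mdadm):

```shell
# Usable RAID 5 capacity in KiB: (members - 1) * per-member size.
# raid5_usable_kib is a hypothetical helper, not an mdadm command.
raid5_usable_kib() {
    members=$1
    used_dev_kib=$2
    echo $(( (members - 1) * used_dev_kib ))
}

# Values from the --detail output: 3 raid devices, Used Dev Size 1465136128.
raid5_usable_kib 3 1465136128
```

The result, 2930272256, matches the reported Array Size.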

DaggyStyle
Posted: Mon Oct 29, 2012 3:15 pm

OK, it seems that my IT guy is wrong.

I think it may be worthwhile to check the superblock on that drive and see if it matches the others' superblocks. Also, did you upgrade your kernel before it happened?

In addition, try running smartctl on the drive to get some data and see whether the drive is pre-fail.

mhelvens
Posted: Mon Oct 29, 2012 3:24 pm

DaggyStyle wrote:
I think it may be worthwhile to check the superblock on that drive and see if it matches the others' superblocks.


I'm not sure how to do that.

DaggyStyle wrote:
Also, did you upgrade your kernel before it happened?


Nope.

DaggyStyle wrote:
In addition, try running smartctl on the drive to get some data and see whether the drive is pre-fail.


Brilliant! Never used that. Anyway, the drive in question PASSED with flying colours. No errors reported, etc.

Seems the drive is not dying. Just don't know how to add it back to the array.

Should I try --add? I didn't want to try that yet, as I assumed this would add it as a new drive, and completely resync.

DaggyStyle
Posted: Mon Oct 29, 2012 4:05 pm

There are specific entries to watch in smartctl's output; search the forum.

Why didn't you try to add it again?

As for the superblock, I assume dd and md5sum are the way.
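
For what it's worth, a raw checksum comparison would probably not match even on a healthy array, since each member's superblock carries per-device fields (its own device UUID and role). `mdadm --examine` prints the superblock in readable form, and the `Events` counter is the field usually compared between members: a member whose counter lags behind was absent while the others were being written to. A minimal sketch, assuming the `Events : N` line format seen in the `--detail` output earlier in the thread; `events_of` is a made-up helper name:

```shell
# Print the event counter from `mdadm --examine` output read on stdin.
# events_of is a hypothetical helper name, not part of mdadm.
events_of() {
    awk -F' : ' '/^ *Events : / { print $2; exit }'
}

# Real use (as root), with the device names from this thread:
#   for d in /dev/sdb1 /dev/sdc1 /dev/sdd1; do
#       printf '%s %s\n' "$d" "$(mdadm --examine "$d" | events_of)"
#   done

# Offline demonstration on a pasted fragment:
printf '           Name : michiel-pc:0\n         Events : 24760\n' | events_of
```

If all members print the same number, their superblocks agree on the array's history; a lower number on one device marks it as stale.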

mhelvens
Posted: Mon Oct 29, 2012 4:24 pm

DaggyStyle wrote:
There are specific entries to watch in smartctl's output; search the forum.

I looked at all the info. Everything looks good. Health status: PASSED. Everything completed without errors. I ran a short self-test: no errors.

DaggyStyle wrote:
Why didn't you try to add it again?

Because the drive was already in the array before, so I assumed it would be a quick fix. When I --add, it takes quite a while to complete.

My guess now is that a 'quick' fix (like --re-add) couldn't work because there were writes while the array was running with only two drives, so the third drive had fallen out of sync with the other two. Just guessing.

So I used --add anyway. It's now recovering. 1150 minutes to go. ;-) I assume it will go faster once I stop using the array. But right now, I have no choice. Work to complete.
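
The recovery estimate comes from `/proc/mdstat`, which shows a progress line with a `finish=...min` field while a rebuild is running. A small sketch for pulling out just that number, for instance to poll it from a script; `recovery_eta_min` is a made-up helper name, and the sample line is invented to match the 1150-minute estimate mentioned above:

```shell
# Extract the estimated minutes remaining from /proc/mdstat text on stdin.
# recovery_eta_min is a hypothetical helper name.
recovery_eta_min() {
    sed -n 's/.*finish=\([0-9.]*\)min.*/\1/p'
}

# Real use:  recovery_eta_min < /proc/mdstat
# Offline demonstration; this progress line is illustrative, not real output
# from the thread:
printf '      [>....................]  recovery =  1.0%% (14651361/1465136128) finish=1150.0min speed=21000K/sec\n' | recovery_eta_min
```

The kernel throttles the rebuild when other I/O is hitting the array, which fits the expectation that it speeds up once the array is idle. The limits live in `/proc/sys/dev/raid/speed_limit_min` and `/proc/sys/dev/raid/speed_limit_max`.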

Thanks! I'll report back when it completes.

energyman76b
Advocate
Joined: 26 Mar 2003
Posts: 2031
Location: Germany

Posted: Tue Oct 30, 2012 5:40 pm

OK, I only read the start and not the rest.

I hope you didn't do anything stupid in the meantime.

First of all:

Most of the time, when a drive is not added to an array, nothing serious has happened: the driver was not done initializing the hardware, or something similar. Nothing bad, just timing.

smartctl is a good call. Please have smartd run. Always. Especially with raid devices.
Check dmesg. No errors?
Then continue:

First things first: log out as user. Unmount /home. The less you do on that FS, the smaller the chance that the resync will run into problems.
Second. Stop the array

mdadm -S /dev/md127 (or whatever it is called)

Third. Start the array

mdadm -A /dev/md127

and report back.
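
The steps above can be sketched as a dry-run script that only prints the commands unless `RUN=1` is set. The mount point and device names are the ones used in this thread; `run` and `restart_array` are made-up helper names, and listing the member devices explicitly on the `--assemble` line is an assumption made here rather than something from the post:

```shell
# Dry-run wrapper: print each command instead of running it unless RUN=1.
# `run` and `restart_array` are hypothetical helper names.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

restart_array() {
    run umount /home &&
    run mdadm --stop /dev/md127 &&
    run mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 &&
    run cat /proc/mdstat
}

restart_array
```

Chaining with `&&` means a failed unmount stops the sequence before the array is touched.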

I almost never experience problems with kernel assembled arrays. But the 'new' superblock 1.2 arrays that have to be assembled by mdraid during boot are a different story....

Edit: read the rest now.
Ok, add/readd might work or not... restarting is easier...
the next time your array is degraded and not mounted, don't mount it. Fix it first. If it is degraded and mounted... well...
_________________
AidanJT wrote:

Libertardian denial of reality is wholly unimpressive and unconvincing, and simply serves to demonstrate what a bunch of delusional fools they all are.

Satan's got perfectly toned abs and rocks a c-cup.

mhelvens
Posted: Tue Oct 30, 2012 5:55 pm

energyman76b wrote:
OK, I only read the start and not the rest.

I hope you didn't do anything stupid in the meantime.

Maybe I did. But if so, it appears to have gone well enough.

Please read on and help me find out whether everything is fine now.

energyman76b wrote:
smartctl is a good call. Please have smartd run. Always. Especially with raid devices.

Thanks for the tip! I'll find out more about smartd. Until now I've been running the mdadm daemon, which is also supposed to warn me if anything goes wrong. Would that be redundant?

energyman76b wrote:
Check dmesg. No errors?
Then continue:

First things first: log out as user. Unmount /home. The less you do on that FS, the smaller the chance that the resync will run into problems.
Second. Stop the array

mdadm -S /dev/md127 (or whatever it is called)

Third. Start the array

mdadm -A /dev/md127

and report back.

Did all that (sort of). I reported the outcome in my first post. I just neglected to mention some of the steps (unmount, stop array, etc.). Except that I started right away with --assemble --scan.

Anyway, now that you've read the rest... So I did an --add. It was recovering the array all night, and it seems to have gone ok. Here's `mdadm --detail /dev/md127`:

Code:
mhelvens-pc / # mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Thu Oct 20 19:41:06 2011
     Raid Level : raid5
     Array Size : 2930272256 (2794.53 GiB 3000.60 GB)
  Used Dev Size : 1465136128 (1397.26 GiB 1500.30 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Tue Oct 30 18:48:08 2012
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : michiel-pc:0
           UUID : 82da8dc5:42efff78:bcce5cab:0baa4591
         Events : 56187

    Number   Major   Minor   RaidDevice State
       3       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       4       8       49        2      active sync   /dev/sdd1


As you can see, it looks fine. Only thing is: The 'Number' column now shows '4' for sdd1. As far as mdadm is concerned, I suppose, nr. 2 died, and I put nr. 4 in its place.

Anyway, can you recommend any final tests to make sure everything is OK?
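
One possible sanity check is to let a script read the same `--detail` output and confirm that the state is clean (or active) and that every raid device is active. A minimal sketch, assuming the field labels shown in the output above; `array_is_healthy` is a made-up helper name:

```shell
# Succeeds when `mdadm --detail` text on stdin reports a clean (or active)
# array with every raid device active.
# array_is_healthy is a hypothetical helper name.
array_is_healthy() {
    awk -F' : ' '
        /^ *State : /          { state = $2; sub(/ +$/, "", state) }
        /^ *Raid Devices : /   { want = $2 }
        /^ *Active Devices : / { active = $2 }
        END {
            ok = (state == "clean" || state == "active") && want != "" && want == active
            exit !ok
        }
    '
}

# Real use (as root):  mdadm --detail /dev/md127 | array_is_healthy && echo OK
# Offline demonstration with the three relevant lines from the output above:
printf '   Raid Devices : 3\n          State : clean\n Active Devices : 3\n' |
    array_is_healthy && echo "array looks healthy"
```

Beyond that, the md driver can scrub the array on request: writing `check` to `/sys/block/md127/md/sync_action` starts a read-and-compare pass, and `/sys/block/md127/md/mismatch_cnt` reports what it found afterwards.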

energyman76b wrote:
the next time your array is degraded and not mounted, don't mount it. Fix it first. If it is degraded and mounted... well...

I'll remember that!

Thanks!


Last edited by mhelvens on Tue Oct 30, 2012 6:00 pm; edited 2 times in total

Jaglover
Advocate
Joined: 29 May 2005
Posts: 4742
Location: Saint Amant, Acadiana

Posted: Tue Oct 30, 2012 5:57 pm

I had a similar issue with my RAID-0. Looking at /proc/mdstat, I found I had two broken arrays, md0 and md127, instead of one working array. After fiddling with /etc/mdadm.conf it started working.
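
The post does not say what was changed in /etc/mdadm.conf. One commonly suggested thing to check is whether the array has an `ARRAY` line there at all, since arrays that mdadm does not recognise as local tend to be assembled under high numbers such as md127. `mdadm --detail --scan` prints such lines directly; the sketch below builds one from the UUID in `--detail` output instead, so it can be tried without an array. `array_line` is a made-up helper name and `/dev/md0` is an assumed device name:

```shell
# Build an mdadm.conf ARRAY line from `mdadm --detail` text on stdin.
# array_line is a hypothetical helper; /dev/md0 is an assumed device name.
array_line() {
    awk -v dev="${1:-/dev/md0}" -F' : ' '
        /^ *UUID : / { u = $2; sub(/ +$/, "", u); print "ARRAY " dev " UUID=" u; exit }
    '
}

# Real use (as root):  mdadm --detail /dev/md127 | array_line /dev/md0
# Offline demonstration with the UUID from this thread:
printf '           UUID : 82da8dc5:42efff78:bcce5cab:0baa4591\n' | array_line /dev/md0
```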
_________________
Please learn how to denote units correctly!

energyman76b
Posted: Tue Oct 30, 2012 6:34 pm

mdadm will only scream when a disk is dead.

smartd can warn you so you might be able to act before the disk is dead.

It also runs self tests - if you configure it that way - which also help to find worrisome developments.

For example this:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1

happens. If it does not change over weeks or months, there is nothing to worry about. But if it goes up quickly... get a new disk ASAP.

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Or those two. One or two? Might happen. Constant growth? Time for a backup.
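
To watch those counters over time, the raw values can be pulled out of `smartctl -A` output and logged. A minimal sketch, assuming the usual ten-column attribute table with the raw value in the last column; the ID list simply mirrors the attributes quoted above, and `smart_watch` is a made-up helper name:

```shell
# Print "name raw_value" for watched attribute IDs from `smartctl -A`
# output on stdin. smart_watch is a hypothetical helper name.
smart_watch() {
    awk '$1 ~ /^(196|197|199|200|223)$/ { print $2, $10 }'
}

# Real use (as root):  smartctl -A /dev/sdd | smart_watch
# Offline demonstration with two of the rows quoted above:
printf '%s\n' \
    '199 UDMA_CRC_Error_Count 0x0036 100 100 000 Old_age Always - 2' \
    '197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0' |
    smart_watch
```

Saving this output once a day and comparing it with the previous run is one simple way to tell a stable count from one that keeps growing.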

mhelvens
Posted: Tue Oct 30, 2012 7:53 pm

Ok. Thanks!

Jaglover
Posted: Wed Oct 31, 2012 11:33 pm

And the solution was?

mhelvens
Posted: Sun Nov 04, 2012 12:41 pm

Jaglover wrote:
And the solution was?


I described the 'solution' in my earlier post. I used `mdadm --add` to add the drive back into the array. It had to completely resync, but is working fine now.

This was not the ideal solution. I could have possibly let it 'catch back up' with the array, but I had already mounted it, and I didn't know how.

I marked the topic as [SOLVED] because my problem is now gone. Is this not common practice?

Cheers!