Gentoo Forums
mdadm removing drives?
FizzyWidget
Veteran
Joined: 21 Nov 2008
Posts: 1133
Location: 127.0.0.1

Posted: Mon Dec 03, 2012 11:26 am    Post subject: mdadm removing drives?

Strange issue. I have finally got Gentoo installed with everything I require, and I notice that mdadm has removed two partitions from one of my RAID arrays. Being more than just a little pissed off, I downloaded some diagnostic tools from the HDD manufacturer's site and tested the drives; all came back clean, even after full SMART scans and media tests. I couldn't test one drive as it is too new for the DOS version, and the only other option was a Windows version, and I wasn't going to install Windows just for that. So I have tested the remaining drive using smartmontools under Gentoo, and even that says it is fine.

I have re-added the missing partitions to the array and have replaced the cables on all the drives with brand new ones. But if it was a drive or a cable, why did it only affect one partition on each of two separate drives, and not knock partitions out of the other arrays as well?

Should I be concerned about this?
_________________
I know 43 ways to kill with a SKITTLE, so taste my rainbow bitch.
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54096
Location: 56N 3W

Posted: Mon Dec 03, 2012 6:39 pm

Dark Foo,

Be afraid ... very very afraid :)

It would have been useful to see the dmesg content concerning the failures, but I guess that's gone.

The following, from one of my own systems, is a very bad sign:
Code:
[417885.092394] sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 bf 46 92 30 00 00 f8 00
[417885.092406] end_request: I/O error, dev sda, sector 3209073200
[417885.092412] md/raid:md2: read error NOT corrected!! (sector 3193072176 on sda3).
[417885.092418] md/raid:md2: Disk failure on sda3, disabling device.
[417885.092420] md/raid:md2: Operation continuing on 4 devices.

In this instance, the drive failed to relocate a bad sector in time, so the kernel got fed up waiting and kicked the underlying block device, /dev/sda3, out of the array.
It was a 5-element RAID 5, so it dropped to degraded mode on 4 drives. The other RAIDs, using sda1 and sda2, kept going.

Run a check on your raid sets
Code:
echo check > /sys/block/md2/md/sync_action

Change md2 to whatever mdX you want to check. This verifies that the RAID redundancy data is consistent everywhere, even where the space is not used by the filesystem yet.
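
A rough sketch of the same step wrapped in a couple of shell functions, in case you want to script it. The function names and the SYS_BLOCK override are made up for illustration and are not part of mdadm; the sysfs files sync_action and mismatch_cnt are the kernel's own. Writing to sync_action needs root.

```shell
#!/bin/sh
# SYS_BLOCK is a hypothetical override so the functions can be tried against
# a scratch directory; on a real system leave it at /sys/block.
SYS_BLOCK="${SYS_BLOCK:-/sys/block}"

# start_check md2 : ask the kernel to begin a read-only consistency check
start_check() {
    echo check > "$SYS_BLOCK/$1/md/sync_action"
}

# check_state md2 : print the current action ("check", "idle", ...) and the
# mismatch counter recorded by the most recent check
check_state() {
    printf '%s: action=%s mismatch_cnt=%s\n' "$1" \
        "$(cat "$SYS_BLOCK/$1/md/sync_action")" \
        "$(cat "$SYS_BLOCK/$1/md/mismatch_cnt")"
}
```

Typical use would be `start_check md2`, then `check_state md2` once the action has gone back to idle. A mismatch count of 0 is what you want to see; on raid1 and raid10 a small non-zero count is not always a sign of trouble, so treat it as a prompt to look closer rather than proof of a bad drive.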

Check the output of
Code:
smartctl -a /dev/sdX
for each drive (replace sdX with sda, sdb, ...) before and after the check too
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

The above sample is good: no reallocated sectors and none pending. Reallocated sectors are mostly harmless; that's how the drive never seems to have any bad blocks. When the drive struggles to read a sector, it's rewritten to a spare, which is good when it works. This example
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       40
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       40
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       14
is from a less healthy drive. In fact, I've just replaced it in my desktop RAID.
What happened was that another drive was kicked out of the set, and during the rebuild, this drive was kicked out too. That's a really bad thing, as I now had a RAID 5 missing two drives.
ddrescue imaged the dud above onto a new drive, all except 58 sectors. I was able to determine that the unread sectors were in /usr somewhere, and I'm guessing they were unused as they were not in the Current_Pending_Sector count, which is only 14.

A write to the dud sectors will force them to be relocated, but do I really trust a drive that looks like it's slowly dying?
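
To compare the before and after readings without eyeballing the whole table, a small filter along these lines would do. It is only a sketch: smart_summary is a name I have made up, it assumes the ten-column attribute table shown above with RAW_VALUE as a plain number in the last column, and it treats any non-zero raw value on attributes 5, 196, 197 or 198 as worth a warning.

```shell
#!/bin/sh
# smart_summary: read a smartctl attribute table on stdin, print the
# sector-health attributes (IDs 5, 196, 197, 198) and a one-line verdict.
smart_summary() {
    awk '
        $1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 {
            raw = $10 + 0                      # RAW_VALUE column
            printf "%-24s raw=%d\n", $2, raw
            if (raw > 0) bad++
        }
        END {
            if (bad > 0) print "WARN: non-zero sector counters"
            else         print "OK: no reallocated or pending sectors"
        }'
}
```

Use it as `smartctl -A /dev/sda | smart_summary`, once before the check and once after. A count that climbs between the two runs says more than any single reading.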
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
FizzyWidget
Posted: Mon Dec 03, 2012 6:56 pm

From memory it was saying buffer I/O error. Looking at the smartctl output, there are 0 bad sectors and none pending correction or needing to be swapped out, and a long pass with smartmontools has passed all drives. I am wondering if it was a cable issue: I replaced all of them with brand new ones and haven't seen any issues so far. I will keep an eye on it.

I am running the check you suggested now and will leave the PC for a while.

Code:
[12956.221875] md: data-check of RAID array md3
[12956.221880] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[12956.221882] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[12956.221888] md: using 128k window, over a total of 943010816k.
[13245.715051] md: delaying data-check of md0 until md3 has finished (they share one or more physical units)
[13249.427192] md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
[13253.899782] md: delaying data-check of md2 until md3 has finished (they share one or more physical units)


Code:
sda

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0


Code:
sdb

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0


Code:
sdc


ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0


Code:
sdd

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0

NeddySeagoon
Posted: Mon Dec 03, 2012 7:07 pm

Dark Foo,

/proc/mdstat will tell you about progress. If you use the system, the check will take longer, as it gets out of the way to let you read and write your data.
mdadm should have emailed you about the issue, if you had it set up and running.
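
If you would rather have a one-line-per-array summary than read /proc/mdstat by eye, something like the following would work. It is a sketch only: md_status and the MDSTAT override are names made up here, and it assumes the usual mdstat layout where the member map such as [UUUU] is the last field of the blocks line and a running check, resync or recovery shows a percentage on the line after it.

```shell
#!/bin/sh
# MDSTAT is a hypothetical override for trying this on a saved copy;
# on a real system leave it at /proc/mdstat.
MDSTAT="${MDSTAT:-/proc/mdstat}"

# md_status: print "<array> <member map> ok|DEGRADED" for each array, plus
# "<array> progress <percent>" while a check, resync or recovery is running.
md_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember which array we are in
        $NF ~ /^\[[U_]+\]$/ {                  # member map, e.g. [UUUU] or [UU_U]
            state = ($NF ~ /_/) ? "DEGRADED" : "ok"
            print name, $NF, state
        }
        / (check|resync|recovery) *=/ {        # progress line under the array
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, "progress", $i
        }' "$MDSTAT"
}
```

Running `md_status` every so often (or under `watch`) shows the percentage creep up. An underscore in the member map means a member is missing from that array.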
FizzyWidget
Posted: Mon Dec 03, 2012 7:09 pm

I haven't got that part of it set up yet. I had only just got Gentoo on there and working when mdadm removed, or something happened to remove, 2 of the 4 partitions (which were on separate drives). None of the partitions of the other RAID arrays were touched, so I am at a loss. I will leave the other machine for a few hours and see what happens.
FizzyWidget
Posted: Mon Dec 03, 2012 8:49 pm

After a resync of all drives

Code:
cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid10 sda3[0] sdd3[2] sdc3[3] sdb3[1]
      2096128 blocks 512K chunks 2 near-copies [4/4] [UUUU]

md3 : active raid10 sdc4[3] sda4[0] sdb4[1] sdd4[2]
      943010816 blocks 512K chunks 2 near-copies [4/4] [UUUU]

md0 : active raid1 sdd1[2] sda1[0] sdc1[3] sdb1[1]
      102336 blocks [4/4] [UUUU]

md1 : active raid10 sda2[0] sdc2[3] sdd2[2] sdb2[1]
      31456256 blocks 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>


All are showing as being there, so that is good :)

smart output

sda


Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0


sdb

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0


sdc

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0


sdd

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0


I will try to set up email, but my ISP has issues with people running an email server on their machine. Is there a simpler way to do this?

Any links you could point me to?

From the looks of things, I am hoping it was just something freaky that happened, seeing as all the drives check out OK.
NeddySeagoon
Posted: Tue Dec 04, 2012 12:55 pm

Dark Foo,

You don't need a mail server. mdadm just needs to know how to reach you by email.
It will send email to your address at your ISP.
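
For what it's worth, the setting itself is a single line in mdadm's config file. The address below is a placeholder and the path is an assumption (it is usually /etc/mdadm.conf, though some distributions use /etc/mdadm/mdadm.conf):

```text
# /etc/mdadm.conf
# Placeholder address; put your own here.
MAILADDR you@example.org
```

Once that is in place, running `mdadm --monitor --scan --oneshot --test` as root should send one test message per array. As far as I know mdadm hands the message to a local sendmail-compatible program rather than talking to your ISP directly, so a small send-only relay is enough; a full mail server is not needed.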

As you say, all looks good.
FizzyWidget
Posted: Tue Dec 04, 2012 12:56 pm

Yes, I googled it last night and the test email worked :)
All times are GMT