FizzyWidget Veteran
Joined: 21 Nov 2008 Posts: 1133 Location: 127.0.0.1
Posted: Mon Dec 03, 2012 11:26 am Post subject: mdadm removing drives?
Strange issue. I have finally got Gentoo installed with everything I require, and I notice that mdadm has removed 2 partitions from one of my raid arrays. Being more than just a little pissed off, I downloaded some diagnostic tools from the HDD manufacturer's site and tested the drives. All came back clean, even after full SMART scans and media tests. I couldn't test one drive as it is too new for the DOS version of the tool; the only other option was a Windows version, and I wasn't going to install Windows just for that, so I tested the remaining drive using smartmontools under Gentoo, and even that says it is fine.
I have re-added the missing partitions to the array and have replaced the cables on all the drives with brand new ones. But if it was a drive or a cable, why did it only affect one partition on each of two separate drives, and why are no partitions missing from the other arrays?
Should I be concerned about this?
_________________
I know 43 ways to kill with a SKITTLE, so taste my rainbow bitch.
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54232 Location: 56N 3W
Posted: Mon Dec 03, 2012 6:39 pm Post subject:
Dark Foo,
Be afraid ... very very afraid :)
It would have been useful to see the dmesg content concerning the failures, but I guess that's gone.
The following is a very bad sign:
Code: | [417885.092394] sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 bf 46 92 30 00 00 f8 00
[417885.092406] end_request: I/O error, dev sda, sector 3209073200
[417885.092412] md/raid:md2: read error NOT corrected!! (sector 3193072176 on sda3).
[417885.092418] md/raid:md2: Disk failure on sda3, disabling device.
[417885.092420] md/raid:md2: Operation continuing on 4 devices. |
In this instance, the drive failed to relocate a bad sector in time, so the kernel got fed up waiting and kicked the underlying block device /dev/sda3 out of the array.
It was a 5 element raid 5, so it dropped to degraded mode on 4 drives. The other raids, using sda1 and sda2, kept going.
Run a check on your raid sets:
Code: | echo check > /sys/block/md2/md/sync_action |
Change md2 to whatever mdX you want to check. This checks that the raid redundant data is valid everywhere, even where the space is not used by the filesystem yet.
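The check above can be wrapped in a few small helpers. This is only a sketch: it assumes the standard md sysfs files `sync_action` and `mismatch_cnt`, and `SYSFS_ROOT` is a hypothetical override added here purely so the functions can be dry-run against a scratch directory.

```shell
# SYSFS_ROOT is a hypothetical override for dry runs; leave it unset on a real system.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Ask md to start a redundancy check on one array, e.g. md_start_check md2 (needs root).
md_start_check() {
    echo check > "$SYSFS_ROOT/block/$1/md/sync_action"
}

# Print the current sync action: "check" while it runs, "idle" once it has finished.
md_sync_action() {
    cat "$SYSFS_ROOT/block/$1/md/sync_action"
}

# Print the mismatch counter left behind by the most recent check.
md_mismatch_count() {
    cat "$SYSFS_ROOT/block/$1/md/mismatch_cnt"
}
```

Typical use would be `md_start_check md2`, then polling `md_sync_action md2` until it reports `idle`, then reading `md_mismatch_count md2`.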
Check your SMART data before and after the check too:
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 |
The above sample is good: no reallocated sectors and none pending. Reallocated sectors are mostly harmless; that's how the drive never seems to have any bad blocks. When the drive struggles to read a sector, it's rewritten to a spare, which is good when it works. This example
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 40
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 40
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 14 |
is from a less healthy drive. In fact, I've just replaced it in my desktop raid.
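The comparison between the two samples can be automated. The following is a rough sketch, not a definitive health test: `smart_sector_warnings` is a hypothetical helper name, and it assumes the attribute table layout shown above, with RAW_VALUE as a plain number in the last column.

```shell
# Read `smartctl -A` output on stdin and print any of the sector-health
# attributes (IDs 5, 196, 197, 198) whose RAW_VALUE (last column) is non-zero.
# Exit status is 1 if at least one such attribute was printed, otherwise 0.
smart_sector_warnings() {
    awk '
        ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) && $NF + 0 > 0 {
            printf "%s raw=%s\n", $2, $NF
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

It would be fed from the drive directly, for example `smartctl -A /dev/sda | smart_sector_warnings`. Run it before and after the raid check and compare the results.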
What happened was that another drive was kicked out of the set and, during the rebuild, this drive was kicked out too. That's a really bad thing, as I now had a raid5 missing two drives.
ddrescue imaged the dud above onto a new drive, all except 58 sectors. I was able to determine that the unread sectors were in /usr somewhere, and I'm guessing they were unused as they were not in the Current_Pending_Sector count, which is only 14.
A write to the dud sectors will force them to be relocated, but do I really trust a drive that looks like it's slowly dying?
_________________
Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
FizzyWidget Veteran
Posted: Mon Dec 03, 2012 6:56 pm Post subject:
From memory, it was saying buffer I/O error. Looking at the smartctl output, there are 0 bad sectors and none pending correction or needing to be swapped out, and a long test with smartmontools has passed on all drives. I am wondering if it was a cable issue: I replaced all of them with brand new ones and haven't seen any issues so far. I will keep an eye on it.
I am running the test you suggested now and will leave the PC for a while.
Code: | [12956.221875] md: data-check of RAID array md3
[12956.221880] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[12956.221882] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[12956.221888] md: using 128k window, over a total of 943010816k.
[13245.715051] md: delaying data-check of md0 until md3 has finished (they share one or more physical units)
[13249.427192] md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
[13253.899782] md: delaying data-check of md2 until md3 has finished (they share one or more physical units) |
Code: | sda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 |
Code: | sdb
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 |
Code: | sdc
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 |
Code: | sdd
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 |
NeddySeagoon Administrator
Posted: Mon Dec 03, 2012 7:07 pm Post subject:
Dark Foo,
/proc/mdstat will tell you about progress. If you use the system while the check runs, the check will take longer, as it gets out of the way for you to read and write your data.
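To pull just the progress lines out of /proc/mdstat, something like the following may help. It is a sketch that assumes the usual mdstat layout, where an array in the middle of a check shows a line such as `[=>....]  check = 5.2% (...) finish=... speed=...`. `md_progress` is a hypothetical helper name, and arrays are assumed to be named mdN.

```shell
# Print one line per array that has a check, resync or recovery under way.
# Reads the file given as $1 (default /proc/mdstat).
md_progress() {
    awk '
        /^md[0-9]+ :/ { array = $1 }          # remember which array we are in
        /(check|resync|recovery) *=/ {
            sub(/^[ \t]+/, "")                # drop the indentation
            sub(/^\[[^]]*\][ \t]+/, "")       # drop the [==>....] progress bar
            print array ": " $0
        }
    ' "${1:-/proc/mdstat}"
}
```

Running `md_progress` with no argument reads the live /proc/mdstat; wrapping it in `watch` gives a running view.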
mdadm should have emailed you about the issue, if you had it set up and running.
FizzyWidget Veteran
Posted: Mon Dec 03, 2012 7:09 pm Post subject:
I haven't got that part set up yet. I had only just got Gentoo installed and working when mdadm removed, or something happened to remove, 2 of the 4 partitions (which were on separate drives). None of the partitions of the other raid arrays were touched, so I am at a loss. I will leave the other machine for a few hours and see what happens.
FizzyWidget Veteran
Posted: Mon Dec 03, 2012 8:49 pm Post subject:
After a resync of all drives
Code: | cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid10 sda3[0] sdd3[2] sdc3[3] sdb3[1]
2096128 blocks 512K chunks 2 near-copies [4/4] [UUUU]
md3 : active raid10 sdc4[3] sda4[0] sdb4[1] sdd4[2]
943010816 blocks 512K chunks 2 near-copies [4/4] [UUUU]
md0 : active raid1 sdd1[2] sda1[0] sdc1[3] sdb1[1]
102336 blocks [4/4] [UUUU]
md1 : active raid10 sda2[0] sdc2[3] sdd2[2] sdb2[1]
31456256 blocks 512K chunks 2 near-copies [4/4] [UUUU]
unused devices: <none> |
All are showing as being there, so that is good
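The same "are all members present" test can be scripted. This is a sketch only, assuming the member map is the last field of each status line as in the output above, with `_` marking a missing member (for example `[4/3] [UU_U]`). `md_degraded` is a hypothetical helper name.

```shell
# Print the name of every array whose member map contains "_".
# Reads the file given as $1 (default /proc/mdstat).
# Exit status is 1 if any degraded array was found, otherwise 0.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { array = $1 }
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print array; bad = 1 }
        END { exit bad + 0 }
    ' "${1:-/proc/mdstat}"
}
```

With every array showing `[UUUU]`, as here, it prints nothing and returns 0, so it can be dropped into a cron job that only complains when something is missing.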
smart output
sda
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 |
sdb
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 |
sdc
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 |
sdd
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 |
I will try to set up email, but my ISP has issues with people running an email server on their machine. Is there a simpler way to do this?
Any links you could point me to?
From the looks of things, I am hoping it was just something freaky that happened, seeing as all the drives check out OK.
NeddySeagoon Administrator
Posted: Tue Dec 04, 2012 12:55 pm Post subject:
Dark Foo,
You don't need a mail server. mdadm just needs to know how to reach you by email.
It will send email to your address at your ISP.
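A minimal sketch of that setup follows. It assumes mdadm reads `MAILADDR` from /etc/mdadm.conf (the path differs between distributions) and that a send-only, sendmail-compatible relay is already installed and pointed at the ISP's SMTP server. `mdadm_set_mailaddr` is a hypothetical helper name and the address is a placeholder.

```shell
# Append "MAILADDR <address>" to an mdadm config file unless one is already set.
# $1 = address, $2 = config file (default /etc/mdadm.conf, an assumed path).
mdadm_set_mailaddr() {
    conf="${2:-/etc/mdadm.conf}"
    if grep -q '^MAILADDR' "$conf" 2>/dev/null; then
        echo "MAILADDR already set in $conf" >&2
        return 0
    fi
    echo "MAILADDR $1" >> "$conf"
}

# Afterwards, as root, send a test alert for every array and exit:
#   mdadm --monitor --scan --oneshot --test
```

For example, `mdadm_set_mailaddr you@example.com` followed by the test command should deliver one test message per array if the relay works.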
As you say, all looks good.
FizzyWidget Veteran
Posted: Tue Dec 04, 2012 12:56 pm Post subject:
Yes, I googled it last night and the test email worked.