Gentoo Forums
[NOT QUITE SOLVED] recover from raid failure
Vieri
Guru


Joined: 18 Dec 2005
Posts: 347

Posted: Thu Sep 05, 2013 7:21 am    Post subject: [NOT QUITE SOLVED] recover from raid failure

Hi,

I have a RAID1 with dmraid and this morning I got the following error messages which seem to indicate that /dev/sdb is failing and that /dev/sda is still active.

Code:

ERROR: asr: reading /dev/sdb[Input/output error]
ERROR: ddf1: reading /dev/sdb[Input/output error]
ERROR: ddf1: reading /dev/sdb[Input/output error]
ERROR: hpt37x: reading /dev/sdb[Input/output error]
ERROR: hpt45x: reading /dev/sdb[Input/output error]
ERROR: isw: reading /dev/sdb[Input/output error]
ERROR: jmicron: reading /dev/sdb[Input/output error]
ERROR: lsi: reading /dev/sdb[Input/output error]
ERROR: nvidia: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: pdc: reading /dev/sdb[Input/output error]
ERROR: sil: reading /dev/sdb[Input/output error]
ERROR: via: reading /dev/sdb[Input/output error]
ERROR: pdc: wrong # of devices in RAID set "pdc_bccidebfaf" [1/2] on /dev/sda
ERROR: pdc: wrong # of devices in RAID set "pdc_bccidebfaf" [1/2] on /dev/sda
*** *Inconsistent* Active Set
name   : pdc_bccidebfaf
size   : 488281216
stride : 128
type   : mirror
status : inconsistent
subsets: 0
devs   : 1
spares : 0


I'd like to confirm which steps I should take to fix this, keeping in mind that this is a production server and I need to keep the downtime as short as possible.

These aren't hot-swappable drives, so I was thinking of stopping the server, removing /dev/sdb, and booting with just /dev/sda so the server can keep running in degraded mode while I try to fix /dev/sdb.
Then I'd connect /dev/sdb to a test system, boot a live CD, run fsck on it, and see whether it can be repaired.
If that goes well, I could stop the server again, reconnect /dev/sdb, and boot it.
Then, on the command line, I'd issue

# dmraid -R pdc_bccidebfaf /dev/sdb

in order to rebuild the array.

Would this be the correct approach?

I also ran the following commands:

Code:

# hdparm -I /dev/sdb

/dev/sdb:
 HDIO_DRIVE_CMD(identify) failed: Input/output error


Code:

# hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
        Model Number:       WDC WD2500AAKS-00F0A0
        Serial Number:      WD-WCAT1F043246
        Firmware Revision:  12.01B02
        Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5
Standards:
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  488397168
        device size with M = 1024*1024:      238475 MBytes
        device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
                Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    64-bit World wide name
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    SATA-I signaling speed (1.5Gb/s)
           *    SATA-II signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
                DMA Setup Auto-Activate optimization
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT Long Sector Access (AC1)
           *    SCT LBA Segment Access (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[12]
                unknown 206[13]
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
                supported: enhanced erase
        46min for SECURITY ERASE UNIT. 46min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct


By the way, I'm guessing that /dev/sdb is physically connected to SATA II port 2 and /dev/sda to SATA II port 1, but is there a way to be absolutely sure from the drive information? (The disks are identical: same brand, model and size.)
For instance, can I get a unique identifier via software that I can then match against the physical drive?
I'm thinking of using the serial number, which is generally printed on the drive label. In the output above, /dev/sda reports Serial Number: WD-WCAT1F043246.
Is there a better way to do this?
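
For reference, a minimal sketch of pulling the serial number out of `hdparm -I` output so it can be compared with the label on the drive. The helper name `serial_from_hdparm` is made up for illustration. The `/dev/disk/by-id` symlinks mentioned in the comments are normally created by udev, but they may not be present on every setup.

```shell
#!/bin/sh
# Hypothetical helper: print the serial number found in `hdparm -I`
# output read from stdin (first match only).
serial_from_hdparm() {
    sed -n 's/^[[:space:]]*Serial Number:[[:space:]]*//p' | head -n 1
}

# Usage on a live system (needs root):
#   hdparm -I /dev/sda | serial_from_hdparm
#
# udev usually also exposes model and serial as symlinks to the sdX nodes:
#   ls -l /dev/disk/by-id/
```

A drive that no longer answers the identify command (as /dev/sdb above) will print nothing here, so in a two-disk mirror the failed disk is the one whose label does not match the serial of the working drive.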

Well, I guess that in my case, if I were to disconnect the wrong drive, the system wouldn't be able to reboot.

Thanks,

Vieri


Last edited by Vieri on Tue Oct 29, 2013 7:48 am; edited 2 times in total
eccerr0r
Advocate


Joined: 01 Jul 2004
Posts: 3598
Location: USA

Posted: Thu Sep 05, 2013 1:31 pm

It's quite strange that the kernel or something else is trying to read that drive with all those different controllers. Is this happening at boot?

But you should be able to find the right device by noting the good drive's serial number and leaving that one alone. If you can still get the serial of the broken drive, that's even better. (Hard drives have their serial numbers printed on the label, and they do match what the firmware reports, though I have seen some drives that have extra letters appended to one or the other.) Yes, this is a fear I have with my RAID5 on mdraid (same problem with software dmraid): the sdX mapping doesn't always match the physical layout in the computer. I have a hodgepodge hotswap bay for my RAID5, and I can also identify a drive by watching the activity lights - the drive bay whose activity light is dead is probably the dead drive.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed to be advocating?
Vieri
Guru


Joined: 18 Dec 2005
Posts: 347

Posted: Fri Sep 06, 2013 10:48 am

Thanks again!

Those messages don't come up at boot but are issued after I run:

Code:

dmraid -s


I haven't rebooted the system yet.
eccerr0r
Advocate


Joined: 01 Jul 2004
Posts: 3598
Location: USA

Posted: Fri Sep 06, 2013 2:12 pm

Another thing I (finally) did with my RAID 5 was to arrange my SATA hotswap bays in /dev/sdX order. That took a couple of iterations, but now the top bay is sda, the bottom bay is sde, and the hot spare is in sde. When you do figure it out, it's worth labeling the drives or putting them in order. (Another thing you can do when both drives work is to separate the disks a bit and put one of them to sleep; then you can tell which one is which.)
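
The "put one drive to sleep" trick can be sketched with hdparm: `-y` asks a drive to enter standby (which usually spins it down) and `-C` reports its current power state. This is only a sketch. `/dev/sdX` is a placeholder, the helper name `drive_state` is made up for illustration, and not every drive or controller honours the standby request.

```shell
#!/bin/sh
# Hypothetical helper: extract the power state from `hdparm -C`
# output read from stdin.
drive_state() {
    sed -n 's/^[[:space:]]*drive state is:[[:space:]]*//p'
}

# On a live system (as root), with /dev/sdX standing in for the drive:
#   hdparm -y /dev/sdX               # request standby
#   hdparm -C /dev/sdX | drive_state # report the power state
```

Any pending I/O wakes the drive up again, so this works best on a disk that is otherwise idle.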

Fortunately, if you check your motherboard, I think the port numbering *tends* to still be correct: "port 0" tends to show up as sda (but not always). On my PATA machine the /dev/sdX assignment is all over the place. Since it has 6 IDE ports (12 drives max), things got out of hand:
Promise Ultra133-sda
Promise Ultra66-sdb, sdc
Onboard SiS-sdd, sde

sdb through sde are its RAID, sda is an extra drive... not sure how that could have been predicted.

I wish hard drives had individual LEDs on them again... Read, Write, and Alert "I'm Dead" LEDs...
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 31342
Location: 56N 3W

Posted: Fri Sep 06, 2013 9:09 pm

Vieri,

Code:
# hdparm -I /dev/sdb

/dev/sdb:
 HDIO_DRIVE_CMD(identify) failed: Input/output error


Says the drive never responds to the identify command. Drives don't get any deader than that.
However, it may not be the drive. Check the data cable, the power cable and even the motherboard SATA interface.

I had a SATA power connector fail not long ago. It was easy to find from the burning smell and the soot on the HDD.
It overloaded the PSU and the server shut down. The disk was fine.
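
As a follow-up to the cable suggestion: if the drive starts answering again, SMART attribute 199 (UDMA_CRC_Error_Count) counts transfer errors on the interface rather than on the platters, so a rising value is commonly taken as a hint of a cable or connector problem. A minimal sketch, assuming smartmontools is installed and the drive reports this attribute (not all do); the helper name `crc_errors` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the raw value of UDMA_CRC_Error_Count
# from `smartctl -A` output read from stdin.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
crc_errors() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# On a live system (as root):
#   smartctl -A /dev/sdb | crc_errors
```

This only helps while the drive still responds to commands. A drive that fails identify, as above, will not return SMART data either.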
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Vieri
Guru


Joined: 18 Dec 2005
Posts: 347

Posted: Mon Sep 09, 2013 6:38 am

Thanks, I'll check the drive with different data and power cables, just in case.
Vieri
Guru


Joined: 18 Dec 2005
Posts: 347

Posted: Tue Oct 29, 2013 7:48 am

A strange thing happened: I rebooted the server and, after that, dmraid -s reported that everything was OK.
After 2 weeks of running fine (no reboots), dmraid -s reported a failure of /dev/sdb again...
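
Since the failure came and went, a periodic check of the `status` line printed by `dmraid -s` could catch the next occurrence sooner. A minimal sketch, with a made-up helper name `raid_status`; it assumes a single RAID set and that a healthy set reports `ok`, which should be verified on the actual system.

```shell
#!/bin/sh
# Hypothetical helper: print the "status" field from `dmraid -s`
# output read from stdin. ERROR lines do not match the filter.
raid_status() {
    sed -n 's/^status[[:space:]]*:[[:space:]]*//p'
}

# Possible periodic check (e.g. from cron, as root); "ok" is an assumption:
#   [ "$(dmraid -s 2>/dev/null | raid_status)" = "ok" ] || echo "RAID set needs attention"
```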

So I decided to replace the hard disk with a new one.
I booted the system and tried to rebuild the RAID:

Code:

# dmraid -R pdc_bccidebfaf /dev/sdb
ERROR: pdc: wrong # of devices in RAID set "pdc_bccidebfaf" [1/2] on /dev/sda
ERROR: pdc: wrong # of devices in RAID set "pdc_bccidebfaf" [1/2] on /dev/sda
Segmentation fault


And dmesg shows:
Code:

dmraid[5545]: segfault at 00000038 eip b7f6e5b2 esp bfc661e0 error 6


Code:

# dmraid -s
ERROR: pdc: wrong # of devices in RAID set "pdc_bccidebfaf" [1/2] on /dev/sda
ERROR: pdc: wrong # of devices in RAID set "pdc_bccidebfaf" [1/2] on /dev/sda
*** *Inconsistent* Active Set
name   : pdc_bccidebfaf
size   : 488281216
stride : 128
type   : mirror
status : inconsistent
subsets: 0
devs   : 1
spares : 0


What else can I do?

Thanks,

Vieri