Gentoo Forums
Help with failing disk

remix
l33t

Joined: 28 Apr 2004
Posts: 797
Location: hawaii

PostPosted: Sat Apr 19, 2014 1:09 pm    Post subject: Help with failing disk

I have a 4-disk RAID5 array. One disk has completely died, and before I found time to replace it (while running with 3 of 4 disks), one of my other hard drives started to die: first one partition, then another, while the others on that same disk continue to work.

I really need to recover some files from one of the partitions on the newly failing disk, and I just bought a couple of new hard drives and replaced the completely destroyed disk.

My question is: is there any way to recover one of the partitions on that 'fourth' disk so that I can assemble and mount it (using 3 of 4 partitions)?

Code:
mdadm --assemble  --force /dev/md5 /dev/sda6 /dev/sdc6 /dev/sdd6
mdadm: cannot open device /dev/sdd6: No such file or directory
mdadm: /dev/sdd6 has no superblock - assembly aborted


/dev/sdd is the failing drive
/dev/sdb is the completely failed drive that has been replaced


Here are some snippets from smartctl -a /dev/sdd:

Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

Error 16129 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 5f bb 7e 00  Error: UNC 1 sectors at LBA = 0x007ebb5f = 8305503

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 02 5e bb 7e e0 0a   3d+16:18:43.635  READ DMA EXT
  25 00 08 76 ba 7e e0 0a   3d+16:18:43.275  READ DMA EXT
  ca 00 08 3f 00 00 e0 0a   3d+16:18:43.141  WRITE DMA
  ca 00 08 6f 30 00 e0 0a   3d+16:18:43.091  WRITE DMA
  ca 00 08 67 30 00 e0 0a   3d+16:18:43.038  WRITE DMA

Error 16128 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 73 bb 7e 00  Error: UNC 5 sectors at LBA = 0x007ebb73 = 8305523

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:28:37.227  READ DMA EXT
  27 00 00 00 00 00 e0 0a   3d+14:28:37.225  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a   3d+14:28:37.102  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a   3d+14:28:36.982  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a   3d+14:28:36.981  READ NATIVE MAX ADDRESS EXT

Error 16127 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 77 bb 7e 00  Error: UNC 1 sectors at LBA = 0x007ebb77 = 8305527

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:27:52.480  READ DMA EXT
  27 00 00 00 00 00 e0 0a   3d+14:27:52.479  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a   3d+14:27:52.356  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a   3d+14:27:52.236  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a   3d+14:27:52.234  READ NATIVE MAX ADDRESS EXT

Error 16126 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 05 73 bb 7e 00  Error: UNC 5 sectors at LBA = 0x007ebb73 = 8305523

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:27:00.958  READ DMA EXT
  27 00 00 00 00 00 e0 0a   3d+14:27:00.956  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 0a   3d+14:27:00.834  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 0a   3d+14:27:00.713  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 0a   3d+14:27:00.712  READ NATIVE MAX ADDRESS EXT

Error 16125 occurred at disk power-on lifetime: 35980 hours (1499 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 70 bb 7e 00  Error: UNC 8 sectors at LBA = 0x007ebb70 = 8305520

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 70 bb 7e e0 0a   3d+14:26:40.956  READ DMA EXT
  25 00 08 30 02 8a e0 0a   3d+14:26:40.854  READ DMA EXT
  c8 00 40 20 00 00 e0 0a   3d+14:26:40.853  READ DMA
  25 00 08 a8 6d 70 e0 0a   3d+14:26:40.725  READ DMA EXT
  c8 00 20 00 00 00 e0 0a   3d+14:26:40.633  READ DMA


I'm hoping someone who understands this stuff can point me in the right direction for preserving even a little of the unreadable partition.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54211
Location: 56N 3W

PostPosted: Sat Apr 19, 2014 3:47 pm

remix,

Install ddrescue and use that to make an image of the most recently failed drive onto the new drive.
Make sure to put the ddrescue log on a third drive.
Code:
# Rescue Logfile. Created by GNU ddrescue version 1.15
# Command line: ddrescue -b 4096 -r 8 -f /dev/sde3 /dev/null /root/rescue_log.txt
# current_pos  current_status
0x18D786D0000     ?
#      pos        size  status
0x00000000  0x16E4BE9E000  +
0x16E4BE9E000  0x00002000  *
0x16E4BEA0000  0xFD4F9D000  +
0x17E20E3D000  0x00003000  *
0x17E20E40000  0x8FBF8000  +
0x17EB0A38000  0x00008000  *
0x17EB0A40000  0x358CC2000  +
0x18209702000  0x0000E000  *
0x18209710000  0x2DE00000  +
0x18237510000  0x00010000  *
0x18237520000  0x01AC0000  +
0x18238FE0000  0x00010000  *
0x18238FF0000  0x012C1000  +
0x1823A2B1000  0x0000F000  *
0x1823A2C0000  0x2D752000  +
0x18267A12000  0x0000E000  *
0x18267A20000  0x11EDD4000  +
0x183867F4000  0x0000C000  *
0x18386800000  0x9F1ED0000  +
0x18D786D0000  0x4260530000  ?
That is one I did earlier.
DON'T DO THIS YET. Notice the output file here is /dev/null ... all I was trying to do was get the drive to do one last read and relocate the data so I could grab it later.
You need the best image you can get first.

You need the -b 4096 for advanced format drives. There is no point in trying to recover 512 bytes if the drive has a 4k block size.
-r 8 (eight retries) is a good place to start.

The log allows ddrescue to resume recovery, even with a different command. It will only work on areas not yet recovered.
ddrescue can work much harder and you can help it too, but that's for another post.
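
As a rough sketch (device names and paths here are only placeholders), a later invocation that reuses the same log file skips everything already recovered and only retries the areas still marked bad:
Code:
# first pass - image the failing drive onto the new one, log on a third drive
ddrescue -b 4096 -r 8 -f /dev/sdOLD /dev/sdNEW /mnt/third_drive/rescue.log
# later pass with the same log: only the bad areas are read again
ddrescue -b 4096 -r 16 -f /dev/sdOLD /dev/sdNEW /mnt/third_drive/rescue.log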

remix
PostPosted: Sun Apr 20, 2014 4:29 am

I got confused when you said "Don't do this yet".

If I am understanding you correctly, I should not perform a disk-to-disk ddrescue yet;
first, I should just copy to /dev/null and output the errors to rescue_log.txt.


My setup:
/dev/sda good
/dev/sdb newly installed, partitioned to match raid ( -b 4096)
/dev/sdc good
/dev/sdd failing ( -b 512)

and another brand new HD ( -b 4096) that is waiting to replace /dev/sdd once I can recover anything I can from my 5th raid partition.

/dev/sdd1, working
/dev/sdd2, working
/dev/sdd3, starting to fail
/dev/sdd4, extended
/dev/sdd5, failed (most important)
/dev/sdd6, failed (don't really care)


Code:
ddrescue -b 512 -r 8 -f /dev/sdd5 /dev/null /root/rescue_log.txt


Then I'll inspect /root/rescue_log.txt (knowing nothing of what those hex addresses mean),

and then actually perform the copy:

Code:
ddrescue -b 4096 -f -n /dev/sdd5 /dev/sdb5 /root/rescue_copy_log.txt


Or should I be copying over the entire disk?

Code:
ddrescue -b 4096 -f -n /dev/sdd /dev/sdb /root/rescue_copy_log.txt


NeddySeagoon
PostPosted: Sun Apr 20, 2014 8:31 am

remix,

The disk may fail completely at any time.
Do the disk-to-disk rescue first. The disk to /dev/null is a final desperate attempt to get more data recovered.

You may as well copy the entire disk. If you only copy a partition, how will you recover your raid sets?
You will need to get the good partitions onto the new drive at some time.
Of course, if you are still using the raid sets in degraded mode, the data will change and whatever ddrescue copies now from the degraded arrays will be useless.

You should not use ddrescue at all until you know what it's telling you. Read its man and/or info pages.
The hex numbers are block numbers. The symbols at the end of each line tell what the block numbers mean.
A log showing perfect data recovery will have exactly one line of data.
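For illustration only (the size below is hypothetical), such a log reduces to a single '+' line covering the whole device:
Code:
#      pos        size  status
0x00000000  0xE8E0DB6000  +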

Being able to do arithmetic in hex is useful, since you can work out where and how many blocks you have lost, or still have to recover.
With a bit more poking about the filesystem, you can get a rough idea of what's there and determine its importance.
That enables an informed decision about giving up or trying harder.

Only use -b 4096 on drives with 4k physical sectors. Use -b 512 on other drives.
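
If you are not sure which sector size a drive has, the kernel and smartctl will both tell you (a quick check; substitute your own device name):
Code:
cat /sys/block/sdd/queue/physical_block_size
smartctl -i /dev/sdd | grep -i 'sector size'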

remix
PostPosted: Sun Apr 20, 2014 10:47 pm

Good point.

Sounds like it would be safer to boot into a live DVD and not mount any of the degraded raid partitions.

I'll install the new blank disk and perform the ddrescue:

Code:
ddrescue -b 4096 -r 8 -n /dev/sdd /dev/sdb /sshfs_mounted_volume/rescue_copy_log.txt


I've read this guide, http://wiki.gentoo.org/wiki/Ddrescue
I'll check out the man page as well

Thanks!

remix
PostPosted: Sun Apr 20, 2014 11:59 pm

I just read the man page and -b is the block size of the input device, which in my case is 512.
The output device's block size is 4096.

So it is:
Code:
ddrescue -b 512 -r 8 -n /dev/sdd /dev/sdb /sshfs_mounted_volume/rescue_copy_log.txt


NeddySeagoon
PostPosted: Mon Apr 21, 2014 5:46 pm

remix,

Give it a go. Post the log when it stops.
You will find that gravity can assist the data recovery.

Rerun the same command using the same log file a total of six times.
Make a copy of the log each time the command completes, so you can look at the differences later.
Between each invocation of the command, move the drive so you try it with each edge and both faces 'down'.

If you have a bearing failure, gravity and the odd orientations can get you one last read, and that's all you need.
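
A minimal sketch of that routine (paths are just examples; keep the log and its copies off the failing drive):
Code:
ddrescue -b 512 -r 8 -f /dev/sdd /dev/sdb /sshfs_mounted_volume/rescue_copy_log.txt
cp /sshfs_mounted_volume/rescue_copy_log.txt /sshfs_mounted_volume/rescue_copy_log.pass1
# reorient the drive, rerun with the same log, then copy it to .pass2, .pass3, and so on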

remix
PostPosted: Tue Apr 22, 2014 8:34 am

Awesome tip! Should I be flipping it during the retries (without stopping or rebooting)?

I just finished the first go-through; I did set it to retry 8 times.

Code:
GNU ddrescue 1.16
Press Ctrl-C to interrupt
rescued:     1000 GB,  errsize:    543 kB,  current rate:        0 B/s
   ipos:   717936 MB,   errors:      40,    average rate:    8546 kB/s
   opos:   717936 MB,     time since last successful read:     4.3 m
Retrying bad sectors... Retry 1


NeddySeagoon
PostPosted: Wed Apr 23, 2014 5:57 pm

remix,

That looks fairly good so far.
Code:
rescued:     1000 GB,  errsize:    543 kB,    errors:      40,

You have 543 kB still to recover in 40 regions of the drive.

It's time to tell ddrescue to try harder, now that most of the data has been read.
By using the same input device, output device and log file, ddrescue will try to fill in the holes in your image and ignore data already recovered.

Look back at the copies of the logs and determine which drive orientation produced the best results.
You will still run all four edges and two faces, but treat each drive spin-up as if it were the last, so start with that orientation.
--retries= can be increased. I tend to try 8, 16, 32, 64 and 128
--direct may be useful, it has no effect on some operating systems
--try-again can help when you have a group of contiguous blocks that can't be read.
--retrim will help too.

After you have tried the above on all 6 faces (just with --retries=8), it's time to look at what is still missing.
If it's unallocated space, it doesn't matter.
If it's a file or two, they are gone - you need to decide if you need these files and how much time you want to spend on data recovery.
If it's a directory, then the files in that directory and its child directories cannot be accessed normally, but they may be perfectly recovered.
If it's filesystem metadata, it depends what aspects of the metadata are damaged.

Please post the log - like my sample above - next time and we can begin to take into account what's damaged.

remix
PostPosted: Thu Apr 24, 2014 8:00 am

I'm OK with some of the files being completely inaccessible; well, let's see how many that would be.

Code:
# Rescue Logfile. Created by GNU ddrescue version 1.16
# Command line: ddrescue -b 512 -r 8 -f /dev/sdd /dev/sdb ddrescue.log
# current_pos  current_status
0xA728A5FE00     +
#      pos        size  status
0x00000000  0xA2FBE8F000  +
0xA2FBE8F000  0x00003000  -
0xA2FBE92000  0x008E0000  +
0xA2FC772000  0x00006000  -
0xA2FC778000  0x0001A000  +
0xA2FC792000  0x00001000  -
0xA2FC793000  0x00199000  +
0xA2FC92C000  0x00003000  -
0xA2FC92F000  0x000C6000  +
0xA2FC9F5000  0x00001000  -
0xA2FC9F6000  0x00488000  +
0xA2FCE7E000  0x00001000  -
0xA2FCE7F000  0x00008000  +
0xA2FCE87000  0x00001000  -
0xA2FCE88000  0x000E6000  +
0xA2FCF6E000  0x00005000  -
0xA2FCF73000  0x0060A000  +
0xA2FD57D000  0x00004000  -
0xA2FD581000  0x0000B000  +
0xA2FD58C000  0x00001000  -
0xA2FD58D000  0x00002000  +
0xA2FD58F000  0x0000F000  -
0xA2FD59E000  0x000D0000  +
0xA2FD66E000  0x00002000  -
0xA2FD670000  0x00042000  +
0xA2FD6B2000  0x00006000  -
0xA2FD6B8000  0x000A1000  +
0xA2FD759000  0x00001000  -
0xA2FD75A000  0x00016000  +
0xA2FD770000  0x00002000  -
0xA2FD772000  0x00ADB000  +
0xA2FE24D000  0x00002000  -
0xA2FE24F000  0x00005000  +
0xA2FE254000  0x00001000  -
0xA2FE255000  0x00001000  +
0xA2FE256000  0x0000B000  -
0xA2FE261000  0x00172000  +
0xA2FE3D3000  0x00005000  -
0xA2FE3D8000  0x002C8000  +
0xA2FE6A0000  0x00006000  -
0xA2FE6A6000  0x00006000  +
0xA2FE6AC000  0x00001000  -
0xA2FE6AD000  0x00002000  +
0xA2FE6AF000  0x00001000  -
0xA2FE6B0000  0x1C121000  +
0xA31A7D1000  0x00002000  -
0xA31A7D3000  0x00137000  +
0xA31A90A000  0x00001000  -
0xA31A90B000  0x2AD8C000  +
0xA345697000  0x00002000  -
0xA345699000  0x001BE000  +
0xA345857000  0x00002000  -
0xA345859000  0x3E2D43000  +
0xA72859C000  0x00001000  -
0xA72859D000  0x00005000  +
0xA7285A2000  0x00007000  -
0xA7285A9000  0x003E4000  +
0xA72898D000  0x00010000  -
0xA72899D000  0x000B4000  +
0xA728A51000  0x0000F000  -
0xA728A60000  0x41B8356000  +



Would it be safe then to just replace that 'fourth' drive with this new copied drive? Will it function normally except for those few files or directories?

NeddySeagoon
PostPosted: Thu Apr 24, 2014 9:58 pm

remix,

Code:
#      pos        size  status
0x00000000  0xA2FBE8F000  +


Says that from the start of the drive to block 0xA2FBE8F000, all the data has been recovered.
At
Code:
#      pos        size  status
0xA2FBE8F000  0x00003000  -
is the first bad area. A 512-byte block is 0x200 bytes, so 0x1000 is eight blocks. (It's hex.)
So this area is 24 (decimal) blocks.
Each status + is recovered data. Each status - is data yet to be recovered.
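
If you would rather not do the hex by hand, the shell can do it for you (a quick sketch using that first bad extent):
Code:
# 0x3000 bytes / 0x200 bytes per block = 24 blocks (12288 bytes)
printf '%d blocks (%d bytes)\n' $((0x3000 / 0x200)) $((0x3000))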

You can do better than
Code:
ddrescue -b 512 -r 8 -f /dev/sdd /dev/sdb ddrescue.log


Code:
ddrescue -b 512 -r 16 --direct  --try-again --retrim -f /dev/sdd /dev/sdb ddrescue.log
may get back more data.
Don't forget to do all six faces/edges.

The idea is to restart the raid with this drive in place of the failing drive but not yet.
What metadata version is the raid set? You need to know that the metadata is recovered.
mdadm -E /dev/.... will tell.

What filesystem is on the raid set?

Once you add the recovered drive into the raid set, you have decided that further data recovery is not worthwhile.
The raid can't tell the data is corrupt due to the unrecovered data. It will just operate in degraded mode and assume all is well.
When you add another drive, it will regenerate the redundant data based on whatever is on the other drives at that time.

remix
PostPosted: Mon Apr 28, 2014 5:24 am

Thanks for the info, it makes sense to me.

The filesystem on the raid partitions is reiserfs.

The new log is long, so I pastie'd it here: http://pastie.org/9118746

remix
PostPosted: Mon Apr 28, 2014 6:14 am

I don't think I recovered enough; not sure what I did wrong (other than not having backups).

Code:
OptimusPrime / # mdadm --assemble /dev/md4 --scan --force
mdadm: /dev/sdd5 has no superblock - assembly aborted
OptimusPrime / # mdadm --assemble /dev/md5 --scan --force
mdadm: /dev/sdb6 has no superblock - assembly aborted
OptimusPrime / # mdadm --assemble /dev/md6 --scan --force
mdadm: /dev/sdb7 has no superblock - assembly aborted


I have 2 raid sets that seem to be OK:

Code:
OptimusPrime / # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md4 : inactive sda5[0](S) sdd5[3](S) sdc5[2](S)
      527373312 blocks

md1 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      102558528 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md2 : active raid5 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      776397120 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md3 : inactive sda3[0](S) sdb3[4](S) sdd3[3](S) sdc3[2](S)
      859412736 blocks

unused devices: <none>



Should I just forfeit all the data in those 3 partitions and reformat?

NeddySeagoon
PostPosted: Mon Apr 28, 2014 9:47 pm

remix,

We are not done yet.

Can you remember the parameters to --create that you used when you created md3 and md4?
What do these show:
Code:
mdadm -E /dev/sd[abcd]3

Also
Code:
mdadm -E /dev/sd[abcd]5

How many elements are in each raid set?
How many have at least some data in now?

Read and understand RAID Recovery. In a nutshell, you can recreate the metadata.
You need to do it on a degraded array. It's best if you don't get the raid metadata version wrong, as metadata version 0.9 is at the end of each partition and the filesystem starts in the normal place, as if raid was not in use. With metadata version >=1, the metadata is at the start of the volume, where the filesystem superblock would be.
You can recover from getting it wrong but it's best that you don't need to.

The basic idea is to run a mdadm --create to rewrite the raid metadata in exactly the way you did when you first made the raid set but in degraded mode with the known clean option, so the raid is not resynced. That leaves your original data in place.

Were you able to recover any more data?

remix
PostPosted: Thu May 08, 2014 10:57 pm

I created the md devices using

Code:
mdadm --create --verbose --level=5 --raid-devices=4 /dev/md4 /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5
mdadm --create --verbose --level=5 --raid-devices=4 /dev/md5 /dev/sdb6 /dev/sdc6 /dev/sdd6 /dev/sde6
mdadm --create --verbose --level=5 --raid-devices=4 /dev/md6 /dev/sdb7 /dev/sdc7 /dev/sdd7 /dev/sde7
...


From the output of mdadm -E /dev/sd[abcd]5, it looks like I'll need to restore the superblocks on /dev/sdb.
/dev/sdb is the drive that I restored the old failed drive onto.

When you wrote
Quote:
The basic idea is to run a mdadm --create to rewrite the raid metadata in exactly the way you did when you first made the raid set but in degraded mode with the known clean option, so the raid is not resynced. That leaves your original data in place.


Do you mean 'in degraded mode' by adding only 3 of the 4 disks?
I read the RAID Recovery guide and I didn't get how to perform what you asked.

NeddySeagoon
PostPosted: Fri May 09, 2014 11:44 am

remix,

There is no need to use --create on the raid sets that are now working. There is nothing to do to them if /proc/mdstat shows that they are up to strength.
If it's /dev/md4 that's the problem, I need the output of
Code:
mdadm -E /dev/sd[abcd]5
and I need to know what you think is in each /dev/sd?5.

Your command
Code:
mdadm --create --verbose --level=5 --raid-devices=4 /dev/md4 /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5
makes use of a few mdadm defaults.
Like
--chunk= ... it's now 512k, it used to be 64k
--metadata= it's now 1.2, it used to be 0.90

The raid metadata that you need to create is a data structure that points to your data. Getting --chunk= incorrect is harmless; you can have as many goes as you want, but it must be correct to allow the kernel to read the filesystem on the raid. It tells how big the individual data elements are on the drive, so reading 512k at a time when it's actually 64k doesn't work.

The --metadata= is rather more important. It tells where the raid metadata is on the underlying block device. If it's wrong, it will either overwrite the end of your filesystem or the filesystem (not raid) superblock at the start.

I was considering --create in degraded mode possibly with --assume-clean ... depending what
Code:
mdadm -E /dev/sd[abcd]5
shows, and what you believe is on each partition, also with the explicit --chunk= and --metadata= values that
Code:
mdadm -E /dev/sd[abcd]5
will show.
It's also important to choose the 'best' 3 of the four elements from the raid set.


Code:
$ sudo mdadm -E /dev/sda5
Password:
/dev/sda5:
          Magic : a92b4efc
        Version : 0.90.00       <----
           UUID : 5e3cadd4:cfd2665d:96901ac7:6d8f5a5d
  Creation Time : Sat Apr 11 20:30:16 2009
     Raid Level : raid5
  Used Dev Size : 5253120 (5.01 GiB 5.38 GB)
     Array Size : 15759360 (15.03 GiB 16.14 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 126

    Update Time : Sun Mar 16 11:02:16 2014
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 78b17729 - correct
         Events : 77

         Layout : left-symmetric
     Chunk Size : 64K       <-----

I've highlighted my --chunk= and --metadata= above.

remix
PostPosted: Sun Jun 15, 2014 6:48 am

Chunk Size : 64K
Metadata : 0.90.00

Code:

# mdadm -E /dev/sd[abcd]5
/dev/sda5:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bfb60f39:66b601a2:fbf2ea5a:12bfd232 (local to host OptimusPrime)
  Creation Time : Mon Mar  8 22:09:01 2010
     Raid Level : raid5
  Used Dev Size : 175791104 (167.65 GiB 180.01 GB)
     Array Size : 527373312 (502.94 GiB 540.03 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4

    Update Time : Fri Apr 18 18:23:26 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 87adfde0 - correct
         Events : 17379

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8        5        0      active sync   /dev/sda5

   0     0       8        5        0      active sync   /dev/sda5
   1     1       0        0        1      faulty removed
   2     2       8       37        2      active sync   /dev/sdc5
   3     3       0        0        3      faulty removed
mdadm: No md superblock detected on /dev/sdb5.
/dev/sdc5:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bfb60f39:66b601a2:fbf2ea5a:12bfd232 (local to host OptimusPrime)
  Creation Time : Mon Mar  8 22:09:01 2010
     Raid Level : raid5
  Used Dev Size : 175791104 (167.65 GiB 180.01 GB)
     Array Size : 527373312 (502.94 GiB 540.03 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4

    Update Time : Fri Apr 18 18:23:26 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 0
       Checksum : 87adfe04 - correct
         Events : 17379

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       37        2      active sync   /dev/sdc5

   0     0       8        5        0      active sync   /dev/sda5
   1     1       0        0        1      faulty removed
   2     2       8       37        2      active sync   /dev/sdc5
   3     3       0        0        3      faulty removed
/dev/sdd5:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bfb60f39:66b601a2:fbf2ea5a:12bfd232 (local to host OptimusPrime)
  Creation Time : Mon Mar  8 22:09:01 2010
     Raid Level : raid5
  Used Dev Size : 175791104 (167.65 GiB 180.01 GB)
     Array Size : 527373312 (502.94 GiB 540.03 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4

    Update Time : Fri Apr 18 18:22:24 2014
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 87adb9e3 - correct
         Events : 17375

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       53        3      active sync   /dev/sdd5

   0     0       8        5        0      active sync   /dev/sda5
   1     1       0        0        1      faulty removed
   2     2       8       37        2      active sync   /dev/sdc5
   3     3       8       53        3      active sync   /dev/sdd5


NeddySeagoon
PostPosted: Sun Jun 15, 2014 12:02 pm

remix,

First of all, understand that something will be corrupt but we have no idea what.
As it stands, that raid set should assemble and run with the --force option; if not, we need to rewrite the raid metadata, which does nothing to the user data on the raid.
It's just like rewriting a partition table.
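
Something along these lines would be the first thing to try (a sketch only, using the three members that still have superblocks; md4 is sitting inactive in your /proc/mdstat, so stop it first):
Code:
mdadm --stop /dev/md4
mdadm --assemble --force /dev/md4 /dev/sda5 /dev/sdc5 /dev/sdd5
cat /proc/mdstat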

We must be sure to pass the chunk size and metadata version to mdadm --create, as 64k and 0.90 are no longer the defaults, and we want to recreate the raid metadata as it was, so your data reappears.

Code:
/dev/sda5:
    Update Time : Fri Apr 18 18:23:26 2014
         Events : 17379

/dev/sdc5:
    Update Time : Fri Apr 18 18:23:26 2014
          Events : 17379

/dev/sdd5:
      Update Time : Fri Apr 18 18:22:24 2014
         Events : 17375

Notice the update times and event counts: /dev/sdd5 is a few writes behind. They may be anything. Also, /dev/sdb5 is missing.

Before you go any further, understand that assembling the raid and mounting any filesystem it may contain are separate operations.
Getting the raid assembled is a prerequisite to reading the filesystem, but depending on what's damaged, there may be further steps to get at your data.

If all else fails ...
Code:
mdadm --create /dev/md4 --metadata=0.90 --raid-devices=4 --chunk=64 --level=raid5 --assume-clean /dev/sda5 missing  /dev/sdc5 /dev/sdd5

Before you do that, make sure you understand what it is trying to do.

After it completes
Code:
mdadm -E /dev/sd[abcd]5
should show
Code:
      Number   Major   Minor   RaidDevice State
   0     0       8        5        0      active sync   /dev/sda5
   1     1       0        0        1      missing
   2     2       8       37        2      active sync   /dev/sdc5
   3     3       8       53        3      active sync   /dev/sdd5
It's important that the partitions are in the same slots.
The Event counts will all be zero and the raid should be assembled and running. Look in /proc/mdstat.

So far so good. The next step is to try to mount /dev/md4 read only and look around.
Code:
mount -o ro /dev/md4 /mnt/someplace
There are lots of reasons that can fail, and a few things to try to fix it.
Do not be tempted to run fsck. That makes guesses about what to do and often does the wrong thing.
You are not ready to allow any writes to the filesystem yet, even if it mounts.