SSD data recovery


Post by **krotuss** » Sat Nov 11, 2023 4:44 pm

Hi,

I need to transfer data from my failing 1TB SSD to intermediate storage and then to another drive with minimal data loss. The main partition is LUKS-encrypted btrfs. What tools/process do you recommend? I am thinking about using ddrescue at the drive level, but I have read that it may be slow (days) and that it seems to be optimized for HDDs. Are there any special switches I should use with an SSD? Or would it be possible to use btrfs send/receive even if the drive has read errors? Thanks.

Post by **NeddySeagoon** » Sat Nov 11, 2023 6:30 pm

krotuss,

As you say, ddrescue is about the best there is, even for SSDs.

It is HDD-optimised. The tricks it uses to coax one last read out of rotating rust will most likely fail on an SSD.
ddrescue itself won't take days unless you refuse to accept defeat.

Do a normal default run first. That won't take much longer than reading the entire drive.
That recovers the maximum amount of data in the minimum time.

You must use a map file. At the end of the run, the recovered data will have 'holes' in it where the SSD could not be read. The map is a list of where the 'holes' are.
On subsequent runs, use the same map file. ddrescue will only try to fill the holes.
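The two-step workflow above (a default run first, then follow-up runs that reuse the same map file) can be sketched in Python. This is a minimal sketch: the helpers `first_pass` and `retry_pass` are hypothetical, and they only build and print the argument lists rather than running ddrescue. The device and file names are the ones krotuss uses further down the thread, and the retry options are the ones quoted from the mapfile header; substitute your own.

```python
import shlex


def first_pass(device: str, image: str, mapfile: str) -> list[str]:
    """Default ddrescue run: copy everything readable and record the holes in the map file."""
    return ["ddrescue", device, image, mapfile]


def retry_pass(device: str, image: str, mapfile: str, extra_opts: list[str]) -> list[str]:
    """Follow-up run: same image and SAME map file, so only the holes are attempted."""
    return ["ddrescue", *extra_opts, device, image, mapfile]


if __name__ == "__main__":
    dev, img, mp = "/dev/sdf", "870_EVO.img", "870_EVO.map"
    # Options as quoted in this thread; check the ddrescue man page before using them.
    retry_opts = ["-b512", "-O", "-d", "-A", "-r16"]
    for cmd in (first_pass(dev, img, mp), retry_pass(dev, img, mp, retry_opts)):
        # Printed, not executed; subprocess.run(cmd, check=True) would actually run it.
        print(shlex.join(cmd))
```

The only thing that makes the second run a "fill the holes" run is that it points at the same map file; the image file is updated in place.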

Post or pastebin the map file (older ddrescue versions called it the logfile), along with the output of smartctl -x for the drive.

I had my first SSD fail the other day. /var was full of bad blocks. /var/db/pkg was broken, so portage had lost its mind, but world was intact, so it could all be recreated in a new home.

Code:

# Mapfile. Created by GNU ddrescue version 1.27
# Command line: ddrescue -b512 -O -d -A -r16 /dev/sdi /home/PiRouter.img /home/PiRouter.map

That command line is the last one I gave ddrescue before I gave up.

Code:

#      pos        size  status
0x00000000  0x1863D000  +
0x1863D000  0x00002000  -
...

The first line, ending in +, is a block of recovered data; the next, ending in -, is two unrecovered 4k blocks (0x2000 bytes), and so the map file continues.
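To get totals out of a long map file instead of reading it line by line, a small parser is enough. This is a sketch under assumptions: the helper names are hypothetical, and it relies on the mapfile layout shown above (hex position, hex size, status character) plus the status characters documented in the ddrescue manual. It skips comments and the current-position line, whose second field is a status character rather than a hex size.

```python
from collections import Counter

# ddrescue block status characters (as documented in the GNU ddrescue manual)
STATUS_NAMES = {
    "?": "non-tried",
    "*": "non-trimmed",
    "/": "non-scraped",
    "-": "bad-sector",
    "+": "finished",
}


def parse_blocks(text: str) -> list[tuple[int, int, str]]:
    """Return (pos, size, status) for every data block line in a ddrescue mapfile."""
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments and short lines.
        if not fields or fields[0].startswith("#") or len(fields) < 3:
            continue
        # Skip the current_pos line: its second field is a status char, not a hex size.
        if not fields[1].startswith("0x"):
            continue
        blocks.append((int(fields[0], 16), int(fields[1], 16), fields[2]))
    return blocks


def bytes_by_status(text: str) -> dict[str, int]:
    """Sum block sizes per status, keyed by the names ddrescue prints on screen."""
    totals = Counter()
    for _pos, size, status in parse_blocks(text):
        totals[STATUS_NAMES.get(status, status)] += size
    return dict(totals)


example = """\
# Mapfile. Created by GNU ddrescue version 1.27
#      pos        size  status
0x00000000  0x1863D000  +
0x1863D000  0x00002000  -
"""

print(bytes_by_status(example))  # {'finished': 409194496, 'bad-sector': 8192}
```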

Read the man page to see what those options do. I was surprised that making ddrescue try harder recovered more data after the first pass.

It's possible to mount the partitions in the drive image file and look around too.
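For a LUKS-encrypted btrfs partition like krotuss's, looking around inside the image means attaching it as a loop device, unlocking the container and mounting it, all read-only so the image is never modified. The sketch below only builds the commands. The loop device name, partition number, mapper name and mount point are hypothetical placeholders: use whatever `losetup --show` prints and whatever lsblk shows for the LUKS partition.

```python
import shlex


def mount_image_cmds(image: str, loop_dev: str, part: int,
                     mapper_name: str, mountpoint: str) -> list[list[str]]:
    """Commands to attach a disk image, unlock a LUKS partition and mount it read-only.

    loop_dev, part, mapper_name and mountpoint are hypothetical; check them first.
    """
    return [
        # Attach the image as a loop device, scan its partition table, never write to it.
        ["losetup", "--find", "--show", "--partscan", "--read-only", image],
        # Open the LUKS container read-only; cryptsetup prompts for the passphrase.
        ["cryptsetup", "open", "--readonly", f"{loop_dev}p{part}", mapper_name],
        # Mount the unlocked btrfs filesystem read-only to browse and copy files off.
        ["mount", "-o", "ro", f"/dev/mapper/{mapper_name}", mountpoint],
    ]


for cmd in mount_image_cmds("870_EVO.img", "/dev/loop0", 2, "rescued", "/mnt/rescue"):
    print(shlex.join(cmd))
```

Working from the image rather than the failing SSD means the drive only has to be read once, however many times you explore the copy.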

Post by **krotuss** » Sun Nov 12, 2023 6:51 pm

Thanks, I have done a basic run and it took slightly over 7 hours:

Code:

ddrescue /dev/sdf 870_EVO.img 870_EVO.map
GNU ddrescue 1.27
Press Ctrl-C to interrupt
     ipos:    1000 GB, non-trimmed:    7372 kB,  current rate:  90570 kB/s
     opos:    1000 GB, non-scraped:        0 B,  average rate:  73516 kB/s
non-tried:    2356 MB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:  997841 MB,   bad areas:        0,        run time:  3h 46m 13s
pct rescued:   99.76%, read errors:      212,  remaining time:         28s
                              time since last successful read:          0s
Copying non-tried blocks... Pass 1 (forwards)
     ipos:    1793 MB, non-trimmed:   13938 kB,  current rate:   7372 kB/s
     opos:    1793 MB, non-scraped:        0 B,  average rate:  73002 kB/s
non-tried:  937820 kB,  bad-sector:        0 B,    error rate:   98304 B/s
  rescued:  999253 MB,   bad areas:        0,        run time:  3h 48m  8s
pct rescued:   99.90%, read errors:      390,  remaining time:      1m 26s
                              time since last successful read:          0s
Copying non-tried blocks... Pass 2 (backwards)
     ipos:    1793 MB, non-trimmed:   19312 kB,  current rate:    200 kB/s
     opos:    1793 MB, non-scraped:        0 B,  average rate:  72602 kB/s
non-tried:  884015 kB,  bad-sector:        0 B,    error rate:   61440 B/s
  rescued:  999301 MB,   bad areas:        0,        run time:  3h 49m 24s
pct rescued:   99.90%, read errors:      537,  remaining time:         16m
                              time since last successful read:          0s
Copying non-tried blocks... Pass 4 (backwards)
     ipos:  995881 MB, non-trimmed:  176263 kB,  current rate:   1536 kB/s
     opos:  995881 MB, non-scraped:        0 B,  average rate:  64983 kB/s
non-tried:        0 B,  bad-sector:        0 B,    error rate:   36864 B/s
  rescued:    1000 GB,   bad areas:        0,        run time:  4h 16m 29s
pct rescued:   99.98%, read errors:     4209,  remaining time:      1m 50s
                              time since last successful read:          0s
Copying non-tried blocks... Pass 5 (forwards)
     ipos:  995883 MB, non-trimmed:        0 B,  current rate:   32768 B/s
     opos:  995883 MB, non-scraped:   82750 kB,  average rate:  60627 kB/s
non-tried:        0 B,  bad-sector:    3770 kB,    error rate:    4096 B/s
  rescued:    1000 GB,   bad areas:     7364,        run time:  4h 34m 56s
pct rescued:   99.99%, read errors:    11573,  remaining time:         19m
                              time since last successful read:          0s
Trimming failed blocks... (forwards)
     ipos:  995883 MB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:  995883 MB, non-scraped:        0 B,  average rate:  38324 kB/s
non-tried:        0 B,  bad-sector:   41680 kB,    error rate:     512 B/s
  rescued:    1000 GB,   bad areas:     6762,        run time:  7h 14m 56s
pct rescued:   99.99%, read errors:    85616,  remaining time:         n/a
                              time since last successful read:         28s
Scraping failed blocks... (forwards)
Finished

It's a 2021 870 EVO with ~500 reallocated sectors (Reallocated_Sector_Ct) and dmesg filling up with 'failed command: READ DMA EXT'.
If I run

Code:

ddrescue -b512 -O -d -A -r16 /dev/sdf 870_EVO.img 870_EVO.map

will I be able to abort it, without compromising what has been recovered so far, if it takes too long? Also, is there a way to tell whether it is succeeding in recovering additional data? I found the output a little cryptic.

Post by **NeddySeagoon** » Mon Nov 13, 2023 6:12 pm

krotuss,

Yes and yes.

When you feed it the same map file, ddrescue fills in the holes.

It should start at

Code:

     ipos:  995883 MB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:  995883 MB, non-scraped:        0 B,  average rate:  38324 kB/s
non-tried:        0 B,  bad-sector:   41680 kB,    error rate:     512 B/s
  rescued:    1000 GB,   bad areas:     6762,        run time:  7h 14m 56s
pct rescued:   99.99%, read errors:    85616,  remaining time:         n/a
                              time since last successful read:         28s

Watch the

Code:

bad areas:     6762

and

Code:

average rate:  38324 kB/s

and time since last successful read.
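If the screen is hard to eyeball, two snapshots can be compared mechanically. This is a sketch with hypothetical helper names; it assumes the "name: value" layout of the status screens pasted in this thread and ddrescue's default SI units (1 kB = 1000 B). It treats a drop in bad-sector bytes as the clearest sign of progress, because the bad areas count can go up while data is being recovered: a successful read in the middle of a hole splits it into two.

```python
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_status(screen: str) -> dict[str, str]:
    """Turn a pasted ddrescue status screen into {field name: raw value string}."""
    fields = {}
    for line in screen.splitlines():
        for chunk in line.split(","):
            key, sep, value = chunk.partition(":")
            if sep:  # lines such as "Finished" have no colon and are ignored
                fields[key.strip()] = value.strip()
    return fields


def to_bytes(value: str) -> int:
    """Convert a ddrescue size such as '41680 kB' to bytes (SI prefixes assumed)."""
    number, unit = value.split()
    return int(float(number) * UNITS[unit])


def made_progress(before: dict[str, str], after: dict[str, str]) -> bool:
    """True if the later snapshot reports fewer unreadable (bad-sector) bytes."""
    return to_bytes(after["bad-sector"]) < to_bytes(before["bad-sector"])


# The snapshot the retry run should start from, as quoted above.
snapshot = """\
     ipos:  995883 MB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:  995883 MB, non-scraped:        0 B,  average rate:  38324 kB/s
non-tried:        0 B,  bad-sector:   41680 kB,    error rate:     512 B/s
  rescued:    1000 GB,   bad areas:     6762,        run time:  7h 14m 56s
pct rescued:   99.99%, read errors:    85616,  remaining time:         n/a
                              time since last successful read:         28s
"""

status = parse_status(snapshot)
print(to_bytes(status["bad-sector"]), status["bad areas"])  # 41680000 6762
```

Paste a later screen into `parse_status` and pass both results to `made_progress`.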

One of the options in

Code:

 -b512 -O -d -A -r16

is -d (--idirect), which uses direct disc access for the input device.

-r16 means 16 retry passes. It will only re-read the 41680 kB it still lacks, up to 16 times, because it already has the rest.
That 41680 kB is in 6762 bad areas.
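A quick bit of arithmetic on those figures shows why the retry run is cheap in volume. The numbers come from the status screen quoted in this thread; the only assumption is ddrescue's default SI kB (1000 B). The re-read total is an upper bound, since each pass only retries sectors that are still bad; how long the run takes depends mostly on how slowly each failed read times out, not on the amount of data.

```python
# Figures from the status screen quoted in this thread.
BAD_BYTES = 41_680 * 1000   # bad-sector: 41680 kB (SI kB assumed)
BAD_AREAS = 6762            # bad areas
RETRY_PASSES = 16           # -r16
SECTOR = 512                # -b512

# Average size of one hole, in bytes and in 512-byte sectors.
avg_area = BAD_BYTES / BAD_AREAS
# Upper bound: every pass re-reads every still-bad byte.
worst_case_reread = RETRY_PASSES * BAD_BYTES

print(f"average bad area: {avg_area:.0f} B (~{avg_area / SECTOR:.0f} sectors)")
print(f"upper bound re-read over {RETRY_PASSES} passes: {worst_case_reread / 1e6:.0f} MB")
# average bad area: 6164 B (~12 sectors)
# upper bound re-read over 16 passes: 667 MB
```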

A non-zero Reallocated_Sector_Ct is expected as drives age: they remap hard-to-read sectors to spares.
Some spare sectors are used during manufacture to make drives appear to have no bad sectors when new.

A non-zero pending sector count (Current_Pending_Sector, attribute 197 in smartctl output) is a problem. That's a count of the sectors the drive knows it cannot read.
If it could read them, they would be reallocated.
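Checking those two counters can be scripted. This is a sketch under stated assumptions: it assumes smartctl 7.0 or newer with `--json` output, whose `ata_smart_attributes.table` entries carry an `id` and a `raw.value` (verify against your own output), and the helper names are hypothetical. Not every drive reports attribute 197, so its absence is reported rather than treated as zero.

```python
import json
import subprocess

# SMART attribute IDs as smartctl names them on ATA drives.
REALLOCATED = 5   # Reallocated_Sector_Ct
PENDING = 197     # Current_Pending_Sector


def read_smart(device: str) -> dict:
    """Run `smartctl --json -A`. Its exit status is a bit mask that is non-zero
    on failing drives, so it is deliberately not treated as an error here."""
    result = subprocess.run(["smartctl", "--json", "-A", device],
                            capture_output=True, text=True)
    return json.loads(result.stdout)


def raw_attrs(smart: dict) -> dict[int, int]:
    """Map attribute ID -> raw value (JSON layout assumed, see above)."""
    table = smart.get("ata_smart_attributes", {}).get("table", [])
    return {row["id"]: row["raw"]["value"] for row in table}


def assess(smart: dict) -> list[str]:
    """Plain-language notes on the reallocated and pending sector counts."""
    attrs = raw_attrs(smart)
    notes = []
    if attrs.get(REALLOCATED, 0) > 0:
        notes.append(f"{attrs[REALLOCATED]} sectors reallocated to spares")
    if attrs.get(PENDING, 0) > 0:
        notes.append(f"{attrs[PENDING]} sectors pending: known unreadable")
    elif PENDING not in attrs:
        notes.append("drive does not report Current_Pending_Sector")
    return notes
```

Usage would be `assess(read_smart("/dev/sdf"))`, run as root against the failing drive.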

ddrescue has likely bumped the reallocated sector count.