Unable to read LVM partitions after reboot

Post by **mimosinnet** » Wed Nov 18, 2020 11:48 pm

The server had the root partition in an LVM partition, and it seems that something happened to the disk after the reboot. The disk booting the system had a structure similar to this one:

Code: Select all

/dev/sda1  2M BIOS    boot
/dev/sda2  512M EFI   Grub
/dev/sda3  RestOfDisk Linux LVM

The server acted as a backup server and has several other disks. This is the output of pvdisplay when looking at the disks from sysrescuecd:

Code: Select all

% pvdisplay
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdc: read failed after 0 of 4096 at 2000398843904: Input/output error
  /dev/sdc: read failed after 0 of 4096 at 2000398925824: Input/output error
  /dev/sdc: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdf: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdf: read failed after 0 of 4096 at 2000398843904: Input/output error
  /dev/sdf: read failed after 0 of 4096 at 2000398925824: Input/output error
  /dev/sdf: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 2000398843904: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 2000398925824: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdi: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdi: read failed after 0 of 4096 at 2000398843904: Input/output error
  /dev/sdi: read failed after 0 of 4096 at 2000398925824: Input/output error
  /dev/sdi: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdj: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdj: read failed after 0 of 4096 at 2000398843904: Input/output error
  /dev/sdj: read failed after 0 of 4096 at 2000398925824: Input/output error
  /dev/sdj: read failed after 0 of 4096 at 4096: Input/output error

These are the errors that appear in dmesg during pvdisplay:

Code: Select all

[32391.608966] sd 3:0:3:0: [sdf] Unhandled sense code
[32391.609001] sd 3:0:3:0: [sdf]  
[32391.609007] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[32391.609015] sd 3:0:3:0: [sdf]  
[32391.609020] Sense Key : Medium Error [current] 
[32391.609029] Info fld=0xe8e088a0
[32391.609036] sd 3:0:3:0: [sdf]  
[32391.609042] Add. Sense: Mechanical positioning error
[32391.609049] sd 3:0:3:0: [sdf] CDB: 
[32391.609054] Read(10): 28 00 e8 e0 88 a0 00 00 08 00
[32391.609073] end_request: I/O error, dev sdf, sector 3907029152
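As a cross-check, the numbers in this dmesg excerpt can be tied back to the pvdisplay errors. The sketch below assumes 512-byte logical sectors and the standard READ(10) layout, where bytes 2 to 5 of the CDB hold the LBA and bytes 7 to 8 hold the transfer length in blocks. The function name is made up for illustration.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command descriptor block")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


# CDB copied from the dmesg output above.
lba, blocks = decode_read10("28 00 e8 e0 88 a0 00 00 08 00")
print("LBA:        ", lba)
print("byte offset:", lba * SECTOR_SIZE)
print("read length:", blocks * SECTOR_SIZE)
```

Under those assumptions the LBA is 3907029152 (the sector in the end_request line, and 0xe8e088a0 in the Info field), the byte offset is 2000398925824 (one of the offsets pvdisplay complains about), and the read length is 4096 bytes. So dmesg and pvdisplay appear to be describing the same failed read.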

The situation is that (a) before the reboot the system was working and (b) after the reboot half of the disks with LVM have issues. The disks that did not have LVM partitions work correctly. My only guess is that something happened before the reboot (there was an electrical repair at the housing facility) and that the system booted with some form of RAID that affected the disks (Dell PowerEdge R515).

My questions are:

a) Is it possible to recover the disks? I have found articles suggesting that LVM partitions can be recovered from the information in /etc/lvm/. However, the root partition was in an LVM partition and I cannot access that folder. I have found posts like this one, but I am unsure whether this applies to this case.

b) Is it possible to recover the information from the readable LVM partitions? Booting from sysrescuecd does not show LVM partitions in /dev/mapper.

I would very much appreciate any hint on these issues. Thanks!

Post by **NeddySeagoon** » Thu Nov 19, 2020 1:44 pm

mimosinnet,

Go through all the drives with

Code: Select all

smartctl -x /dev/...

If you don't understand it, put it all on a pastebin.
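With several drives to check, the collection step can be scripted. This is only a sketch: it assumes smartctl is installed and the script runs as root, and the helper names and the `smart-reports` directory are invented for illustration. Only the `smartctl -x /dev/...` invocation comes from the post above.

```python
import subprocess
from pathlib import Path


def smartctl_command(drive):
    """Build the argument list for 'smartctl -x /dev/<drive>'."""
    return ["smartctl", "-x", f"/dev/{drive}"]


def report_path(drive, out_dir="smart-reports"):
    """Hypothetical file name for one drive's report (one file per drive)."""
    return Path(out_dir) / f"smartctl-{drive}.txt"


def dump_reports(drives, out_dir="smart-reports"):
    """Run smartctl on each drive and save the full output to a text file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for drive in drives:
        # smartctl uses a non-zero exit status to flag problems,
        # so the output is kept regardless of the return code.
        result = subprocess.run(
            smartctl_command(drive), capture_output=True, text=True, check=False
        )
        report_path(drive, out_dir).write_text(result.stdout + result.stderr)
```

Calling `dump_reports(["sdc", "sdf"])` would leave one text file per drive, ready to paste.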

All those

Code: Select all

/dev/sdf: read failed after 0 of 4096 at 0: Input/output error

messages suggest it's not the drives themselves.
It's saying that the MBR cannot be read. Maybe the drive controller has failed somehow?
Having all those drives fail at the same instant is unlikely.

We need to know if the partition tables are MSDOS or EFI, or even if there is no partition table.
If there is RAID there, how is it done and at what RAID level?

It may be possible to reconstruct the metadata by hand to find your data, but we need to know how all the layers are stacked up.

Post by **mimosinnet** » Fri Nov 20, 2020 8:11 am

NeddySeagoon! Many thanks for the answer!

Most of the disks have a GPT partition table and some of them an msdos one, and there was no RAID. I will take the disks out of the server and see if they can be read in another box, to rule out a faulty drive controller :O. This is what I get from parted -l:

Code: Select all

Error: /dev/sdc: unrecognised disk label
Model: TOSHIBA MK2001TRKB (scsi)                                          
Disk /dev/sdc: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags: 

Model: TOSHIBA MK2001TRKB (scsi)
Disk /dev/sdd: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      32.3kB  1020MB  1020MB  primary
 2      1020MB  1025GB  1024GB  primary  ext4
 3      1025GB  2000GB  975GB   primary  ext4


Model: TOSHIBA MK2001TRKB (scsi)
Disk /dev/sde: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name   Flags
 1      1049kB  2000GB  2000GB               lvm_d  lvm


Error: /dev/sdf: unrecognised disk label
Model: TOSHIBA MK2001TRKB (scsi)                                          
Disk /dev/sdf: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags:
...

This is the output of smartctl -x /dev/sdc:

Code: Select all

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE SEEK ERROR RATE TOO HIGH [asc=5d, ascq=43]

Current Drive Temperature:     38 C
Drive Trip Temperature:        65 C

scsiGetStartStopData Failed [Input/output error]
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   352239    352239    352239     414331    1031926.001           0
write:         0    10441     10441     10441      11835      22003.606           0
verify:        0        0         0         0          0         20.004           0

Non-medium error count:      500

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                  32   65535                 - [-   -    -]
# 2  Background long   Completed                  32   65535                 - [-   -    -]
....
#19  Background long   Completed                  32   65535                 - [-   -    -]
#20  Background long   Completed                  32   65535                 - [-   -    -]
Long (extended) Self Test duration: 20997 seconds [349.9 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 77156:15 [4629375 minutes]
    Number of background scans performed: 229,  scan progress: 0.00%
    Number of background medium scans performed: 0

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 35616:23  0000000012e35625  [1,18,7]   Recovered via rewrite in-place
   2 40656:07  0000000005fb07bf  [1,18,7]   Recovered via rewrite in-place
   ...
  68 76272:22  00000000156a703e  [1,18,7]   Recovered via rewrite in-place
  69 76608:09  000000000967f7b2  [1,18,7]   Recovered via rewrite in-place
  70 76944:04  0000000004902372  [1,18,7]   Recovered via rewrite in-place

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000039388cadf0e
    attached SAS address = 0x500065b36789abff
    attached phy identifier = 2
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0

Post by **NeddySeagoon** » Fri Nov 20, 2020 8:51 am

mimosinnet,

Code: Select all

Error: /dev/sdc: unrecognised disk label

means that either the drive was never partitioned (which is OK, you can use unpartitioned drives if you want to) or the partition table is corrupt.
Parted will read both MSDOS and GPT partition tables.

Code: Select all

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE SEEK ERROR RATE TOO HIGH

The SMART Health Status is usually optimistic too.

Code: Select all

Accumulated power on time, hours:minutes 77156:15

That's a lot of running hours.
That drive is not fit for further service.

Exactly what drive controller were these drives attached to, and how was it configured?
At least some Dell servers provide fakeraid if you are not careful. That matters here.

Given that /dev/sdc will be replaced, how will you restore the data?
Do you have a suitable backup or do you need to try to recover it?

Code: Select all

Error: /dev/sdc: unrecognised disk label

is consistent with your original post, where block 0 could not be read.

I really wanted every drive's entire smartctl output on a pastebin (one paste per drive), as the headline SMART Health Status is often wrong (optimistic).
It's not until you delve into the numbers that the actual health of the drive emerges.

Post by **mimosinnet** » Fri Nov 20, 2020 7:44 pm

Many thanks NeddySeagoon!

This is the pastebin of each drive: sdc, sdd, sde, sdf, sdg, sdh, sdi, sdj
This is the output of parted -l.

The controller is a SAS2008 PCI-Express Fusion-MPT, and this is the output of lshw. The server did not have any special configuration (no RAID). The server worked as a backup service, and most of the contents exist somewhere else. There is some data that would be nice to recover. What is important is to know what might have happened, and I assume there may be an issue with the controller. I hope to be able to go to the housing facility next week and check if another server can read the disks. LVM drives cannot be mounted with the sysrescuecd USB.

I very much appreciate your help! It has been quite stressful!

Post by **NeddySeagoon** » Fri Nov 20, 2020 8:10 pm

mimosinnet,

sdc is not fit for further use. To try to get its data back, you need to make an image of it with ddrescue.
I'll say no more about that right now. Just keep it in mind.

sdd still works but it's well past its use-by date.

Code: Select all

Manufactured in week 40 of year 2011
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  35
Specified load-unload count over device lifetime:  600000
...
Accumulated power on time, hours:minutes 77180:06

It's worn out by too-frequent power saving.
It is not fit for further service, but the data can probably be read.

sde is the same as sdd.

sdf is the same as sdc which I covered previously.

Code: Select all

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE SEEK ERROR RATE TOO HIGH
...
scsiGetStartStopData Failed [Input/output error]

It can't get all the SMART data.

sdg is the same as sde and sdd.

sdh is another

Code: Select all

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE SEEK ERROR RATE TOO HIGH
...
scsiGetStartStopData Failed [Input/output error]

and sdi is the same

Code: Select all

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE SEEK ERROR RATE TOO HIGH [asc=5d, ascq=43]

Current Drive Temperature:     38 C
Drive Trip Temperature:        65 C

scsiGetStartStopData Failed [Input/output error]
Elements in grown defect list: 0

as is sdj

Code: Select all

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SERVO IMPENDING FAILURE SEEK ERROR RATE TOO HIGH [asc=5d, ascq=43]

Current Drive Temperature:     33 C
Drive Trip Temperature:        65 C

scsiGetStartStopData Failed [Input/output error]

It looks like sdc and sdf have failed. At least they have areas that cannot be read.
The other drives are worn out mechanically.

The power on hours is equivalent to 8.8 years of continuous operation so they have done very well.
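The 8.8-year figure can be checked with a little arithmetic. This sketch uses the two power-on times quoted in the thread and assumes an average year of 365.25 days; the helper name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length


def power_on_years(hhmm):
    """Convert smartctl's 'hours:minutes' power-on time to years."""
    hours, minutes = (int(part) for part in hhmm.split(":"))
    return (hours + minutes / 60) / HOURS_PER_YEAR


# Power-on times as quoted from the smartctl output for sdc and sdd.
for drive, stamp in {"sdc": "77156:15", "sdd": "77180:06"}.items():
    print(f"{drive}: {power_on_years(stamp):.1f} years")
```

Both drives come out at roughly 8.8 years, which fits drives manufactured in late 2011 that have run almost continuously since.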

Post by **mimosinnet** » Wed Nov 25, 2020 7:51 pm

NeddySeagoon wrote: sdc is not fit for further use. To try to get its data back, you need to make an image of it with ddrescue.

Code: Select all

Used ddrescue with /dev/sdc: 

% ddrescue -d -r1 /dev/sdc test.img test.log
GNU ddrescue 1.16
Press Ctrl-C to interrupt
rescued:         0 B,  errsize:   2000 GB,  current rate:        0 B/s
   ipos:     2952 MB,   errors:       1,    average rate:        0 B/s
   opos:     2952 MB,     time since last successful read:     1.3 d
Splitting failed blocks... 
Interrupted by user

This is the log file. It does not seem to be reading anything. It looks like the disk is dead. This is still with the same disk controller, but the controller reads other disks. This is the listing of the image (0 bytes) and the logfile:

Code: Select all

% ls -lisah
total 132K
 2 4.0K drwxr-xr-x 3 root root 4.0K Nov 23 00:12 .
17    0 drwxr-xr-x 7 root root  140 Nov 18 11:54 ..
11  16K drwx------ 2 root root  16K Nov 23 00:08 lost+found
12    0 -rw-r--r-- 1 root root    0 Nov 23 00:12 test.img
13 112K -rw-r--r-- 1 root root 109K Nov 24 09:25 test.log

I will keep updating the information.

Post by **NeddySeagoon** » Wed Nov 25, 2020 8:59 pm

mimosinnet,

Code: Select all

ddrescue -d -r1

There are other command-line options that are useful too.
I need to find one of my ddrescue logs or read the man page.

That

Code: Select all

rescued:         0 B,

looks bad.

Meanwhile, as a working hypothesis, let's assume that the platter spin bearings are worn.
Not unreasonable for an old drive that has had more than its rated start stops.
The spin bearings are 'air bearings'. There is no contact between the bearing surfaces in use at the design speeds.

When the bearings wear, the platter may wobble, so the alignment is disturbed enough that the drive cannot get good reads in places.
We can use gravity to try to help. The drive has spent most of its life operating in a single position, on one of its faces or edges.
Call that position 1. It's got another 5 faces and edges, so try to continue the rescue another 5 times with the drive on a new face/edge every time.

If you give the same command

Code: Select all

ddrescue -d -r1 /dev/sdc test.img test.log

(or at least the same destination and log), ddrescue will read the log and carry on from where it left off. It won't try to recover data it has already read.

From a brief read of the man page, aided by memory. Look at what they do.
-d is useful near the end but its very slow.
-M is good any time.
-A is good any time.
-r sets the number of retries. Use a large number, as you only need one more read. I use -r256
-R is good. Read from the end of the drive to the beginning. Very slow as it defeats read ahead, but all the head steps are from the other side of the track

Lots to try there.
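When repeating the rescue with different options or drive positions, it helps to see whether each pass actually recovered anything. ddrescue's log is plain text: comment lines start with `#`, the first data line records the current position, and each following line is `pos size status` in hex. The sketch below totals the bytes per status. The function name and the sample data are invented for illustration and are not taken from the log in this thread.

```python
from collections import Counter

# Status characters used in the ddrescue log.
STATUS_NAMES = {
    "?": "non-tried",
    "*": "non-trimmed",
    "/": "non-scraped (non-split in older versions)",
    "-": "bad-sector",
    "+": "rescued",
}


def summarise_log(text):
    """Sum the bytes in each state listed in a ddrescue log."""
    totals = Counter()
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # The first data line is the current-position line, not a block; skip it.
    for line in data_lines[1:]:
        _pos, size, status = line.split()[:3]
        totals[STATUS_NAMES.get(status, status)] += int(size, 16)
    return totals


# Hypothetical sample in the same layout, NOT the real test.log.
SAMPLE = """\
# comment lines start with '#'
0x00000400     /
#      pos        size  status
0x00000000  0x00000200  +
0x00000200  0x00000400  -
0x00000600  0x00000A00  /
"""

for state, size in summarise_log(SAMPLE).items():
    print(f"{state}: {size} bytes")
```

For the real file, something like `summarise_log(open("test.log").read())` between passes would show whether the rescued total moves off zero.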