grooveman Veteran
Joined: 24 Feb 2003 Posts: 1217
|
Posted: Thu Dec 23, 2021 12:59 pm Post subject: Interpreting nvme-cli logs |
|
|
Hi.
I had some problems with an NVMe drive. The system kept locking up. I couldn't back up the drive because it would get about 30 gigs in, then crash. So I got a new drive and restored my last good backup to it, and my system now works perfectly. A very happy ending to a story that could have been a disaster, and certainly a testament to having regular backups running...
but...
The old drive is still under warranty, and I'm trying to determine if it is any good anymore. I ran a shred on it... and it gave no complaints. That surprised me, so I wrote zeros to it -- and to my surprise, it executed this on the entire drive without a single complaint. At this point, I begin to wonder if there really is a problem with the drive... I hook it back up, and I use nvme-cli. I do a long test, and after a couple hours, I get my results:
Code: | Device Self Test Log for NVME device:nvme0
Current operation : 0
Current Completion : 0%
Self Test Result[0]:
Operation Result : 0
Self Test Code : 2
Valid Diagnostic Information : 0
Power on hours (POH) : 0x25d0
Vendor Specific : 0 0
Self Test Result[1]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x25cf
Vendor Specific : 0 0
Self Test Result[2]:
Operation Result : 0
Self Test Code : 2
Valid Diagnostic Information : 0
Power on hours (POH) : 0x25c6
Vendor Specific : 0 0
Self Test Result[3]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x25c2
Vendor Specific : 0 0
Self Test Result[4]:
Operation Result : 0
Self Test Code : 1
Valid Diagnostic Information : 0
Power on hours (POH) : 0x1161
Vendor Specific : 0 0
Self Test Result[5]:
Operation Result : 0xf
Self Test Result[6]:
Operation Result : 0xf
Self Test Result[7]:
Operation Result : 0xf
Self Test Result[8]:
Operation Result : 0xf
Self Test Result[9]:
Operation Result : 0xf
Self Test Result[10]:
Operation Result : 0xf
Self Test Result[11]:
Operation Result : 0xf
Self Test Result[12]:
Operation Result : 0xf
Self Test Result[13]:
Operation Result : 0xf
Self Test Result[14]:
Operation Result : 0xf
Self Test Result[15]:
Operation Result : 0xf
Self Test Result[16]:
Operation Result : 0xf
Self Test Result[17]:
Operation Result : 0xf
Self Test Result[18]:
Operation Result : 0xf
Self Test Result[19]:
Operation Result : 0xf
|
But what the heck do they mean? I cannot find this documented anywhere... I was expecting something less cryptic than this... or at least some thorough documentation on how to interpret the results... But what does Self Test Code 1 or 2 mean? If the drive is showing as healthy, there is no point in sending it back to Western Digital (it is an SN750, by the way). They will just throw it back in my face, and it will waste both of our time. Meanwhile, I'll have an NVME that I do not trust... that is of marginal use to me.
Anyone know of any documentation on this subject? Anyone know how to interpret this?
Thanks.
G _________________ To look without without looking within is like looking without without looking at all. |
|
Anon-E-moose Watchman
Joined: 23 May 2008 Posts: 6098 Location: Dallas area
|
Posted: Thu Dec 23, 2021 1:35 pm Post subject: |
|
|
Not sure what the tests are, but there are some things you can check/investigate:
Is there a firmware update for the NVMe drive? (Check with WD support.)
Not sure which kernel version you're running, but there's always the possibility that it needs a newer driver (later kernel).
Could be something not right between the MB and the nvme. _________________ PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland |
|
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Thu Dec 23, 2021 4:28 pm Post subject: |
|
|
Quote: | I couldn't backup the drive because it would get about 30 gigs in, then crash. |
How often did you run fstrim on your old drive? |
|
Hu Moderator
Joined: 06 Mar 2007 Posts: 21619
|
Posted: Thu Dec 23, 2021 4:50 pm Post subject: |
|
|
Did you ever get any kernel logs from when the old drive crashed, or was the system too broken to save those? If you got them, what did the kernel print?
My guess based on your reported symptoms is that the drive had a bad area that it handled very poorly, but when you rewrote the entire drive, you forced the drive to remap that area out of existence. The remaining sectors are usable, at least for now. Whether they will remain that way is unknown. |
|
grooveman Veteran
Joined: 24 Feb 2003 Posts: 1217
|
Posted: Thu Jan 27, 2022 3:38 pm Post subject: |
|
|
I didn't think you needed to run the trim function on contemporary drives.
The thing behaves normally, so I'm not sure why it got so grumpy.
Anyway, thanks for the input. _________________ To look without without looking within is like looking without without looking at all. |
|
jonas21 n00b
Joined: 24 Oct 2022 Posts: 1
|
Posted: Mon Oct 24, 2022 6:44 am Post subject: |
|
|
I was trying to make sense of these cryptic results, too. It seems this is not well documented in nvme-cli. The codes actually come from the NVMe spec; their meaning is as follows:
The "Operation Result" field:
0h Operation completed without error
1h Operation was aborted by a Device Self-test command
2h Operation was aborted by a Controller Level Reset
3h Operation was aborted due to a removal of a namespace from the namespace inventory
4h Operation was aborted due to the processing of a Format NVM command
5h A fatal error or unknown test error occurred while the controller was executing the device self-test operation and the operation did not complete
6h Operation completed with a segment that failed and the segment that failed is not known
7h Operation completed with one or more failed segments and the first segment that failed is indicated in the Segment Number field
8h Operation was aborted for unknown reason
9h Operation was aborted due to a sanitize operation
Ah to Eh Reserved
Fh Entry not used (does not contain a test result)
"Self Test Code" field:
0h Reserved
1h Short device self-test operation
2h Extended device self-test operation
3h to Dh Reserved
Eh Vendor specific
Fh Reserved
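For anyone who wants to translate the numbers without looking them up each time, here is a minimal Python sketch built from the two lists above. The function and table names are my own and are not part of nvme-cli; the long descriptions for 5h to 7h are shortened, and it assumes the values are typed in by hand exactly as nvme-cli printed them (decimal or 0x-prefixed hex).

```python
# Meanings transcribed from the lists quoted in this post
# (NVMe Device Self-test log). All names here are hypothetical helpers.
OPERATION_RESULT = {
    0x0: "Operation completed without error",
    0x1: "Operation was aborted by a Device Self-test command",
    0x2: "Operation was aborted by a Controller Level Reset",
    0x3: "Operation was aborted due to a removal of a namespace "
         "from the namespace inventory",
    0x4: "Operation was aborted due to the processing of a Format NVM command",
    0x5: "Fatal or unknown test error; the operation did not complete",
    0x6: "Completed with a failed segment; the failed segment is not known",
    0x7: "Completed with failed segments; first failure is in the "
         "Segment Number field",
    0x8: "Operation was aborted for unknown reason",
    0x9: "Operation was aborted due to a sanitize operation",
    0xF: "Entry not used (does not contain a test result)",
}

SELF_TEST_CODE = {
    0x1: "Short device self-test operation",
    0x2: "Extended device self-test operation",
    0xE: "Vendor specific",
}


def _to_nibble(value):
    """Accept an int or a string such as '0xf' or '2'; return a 4-bit int."""
    number = int(value, 0) if isinstance(value, str) else int(value)
    if not 0 <= number <= 0xF:
        raise ValueError(f"expected a 4-bit value, got {value!r}")
    return number


def describe_operation_result(value):
    # Anything not listed explicitly (Ah to Eh) is reserved.
    return OPERATION_RESULT.get(_to_nibble(value), "Reserved")


def describe_self_test_code(value):
    # 0h, 3h to Dh and Fh are reserved.
    return SELF_TEST_CODE.get(_to_nibble(value), "Reserved")


def power_on_hours(value):
    """Convert the POH field (printed as hex in the log above) to an int."""
    return int(value, 0) if isinstance(value, str) else int(value)


if __name__ == "__main__":
    # Entry [0] and entry [5] from the log in the first post.
    print(describe_self_test_code(2), "->", describe_operation_result(0),
          "at", power_on_hours("0x25d0"), "power-on hours")
    print(describe_operation_result("0xf"))
```

Applied to the log in the first post, entries [0] to [4] are short and extended self-tests that completed without error, and entries [5] to [19] are simply unused slots.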
"Segment number" field:
Segment Number: This field indicates the segment number (refer to section 8.11) where the first self-test failure occurred. If Device Self-test Status field bits [3:0] are not set to 7h, then this field should be ignored.
"Valid Diagnostic information" field:
Bits 7:4 are reserved.
Bit 3 (SC Valid): If set to ‘1’, then the contents of Status Code field is valid. If cleared to ‘0’, then the contents of Status Code field is invalid.
Bit 2 (SCT Valid): If set to ‘1’, then the contents of Status Code Type field is valid. If cleared to ‘0’, then the contents of Status Code Type field is invalid.
Bit 1 (FLBA Valid): If set to ‘1’, then the contents of Failing LBA field is valid. If cleared to ‘0’, then the contents of Failing LBA field is invalid.
Bit 0 (NSID Valid): If set to ‘1’, then the contents of Namespace Identifier field is valid. If cleared to ‘0’, then the contents of Namespace Identifier field is invalid. |
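The "Valid Diagnostic Information" value is a bit field, so it can be unpacked with a few bit tests. A rough Python sketch, assuming the value has been copied out of the nvme-cli output as an integer (the names below are made up for illustration and are not from nvme-cli):

```python
# Bit positions taken from the list above; bits 7:4 are reserved.
VDI_FLAGS = [
    (3, "SC Valid", "Status Code"),
    (2, "SCT Valid", "Status Code Type"),
    (1, "FLBA Valid", "Failing LBA"),
    (0, "NSID Valid", "Namespace Identifier"),
]


def valid_fields(vdi):
    """Return the names of the log fields that the bit field marks as valid."""
    if not 0 <= vdi <= 0xFF:
        raise ValueError(f"expected a one-byte value, got {vdi!r}")
    return [field for bit, _flag, field in VDI_FLAGS if vdi & (1 << bit)]


if __name__ == "__main__":
    print(valid_fields(0))       # no diagnostic fields are valid
    print(valid_fields(0b1010))  # Status Code and Failing LBA are valid
```

In the log from the first post this field is 0 for every used entry, so none of the Status Code, Status Code Type, Failing LBA or Namespace Identifier fields carry valid data there.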
|