Gentoo Forums
Interpreting nvme-cli logs
grooveman
Veteran


Joined: 24 Feb 2003
Posts: 1217

Posted: Thu Dec 23, 2021 12:59 pm    Post subject: Interpreting nvme-cli logs

Hi.

I had some problems with an nvme drive I had. The system kept locking up. I couldn't backup the drive because it would get about 30 gigs in, then crash. So, I got a new drive, and restored from my last good backup to it, and my system now works perfectly. A very happy ending to a story that could have been a disaster, and certainly a testimony to having regular backups running...

but...

The old drive is still under warranty, and I'm trying to determine if it is any good anymore. I ran a shred on it... and it gave no complaints. That surprised me, so I wrote zeros to it -- and to my surprise, it executed this on the entire drive without a single complaint. At this point, I begin to wonder if there really is a problem with the drive... I hook it back up, and I use nvme-cli. I do a long test, and after a couple hours, I get my results:

Code:
Device Self Test Log for NVME device:nvme0
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x25d0
  Vendor Specific              : 0 0
Self Test Result[1]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x25cf
  Vendor Specific              : 0 0
Self Test Result[2]:
  Operation Result             : 0
  Self Test Code               : 2
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x25c6
  Vendor Specific              : 0 0
Self Test Result[3]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x25c2
  Vendor Specific              : 0 0
Self Test Result[4]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1161
  Vendor Specific              : 0 0
Self Test Result[5]:
  Operation Result             : 0xf
Self Test Result[6]:
  Operation Result             : 0xf
Self Test Result[7]:
  Operation Result             : 0xf
Self Test Result[8]:
  Operation Result             : 0xf
Self Test Result[9]:
  Operation Result             : 0xf
Self Test Result[10]:
  Operation Result             : 0xf
Self Test Result[11]:
  Operation Result             : 0xf
Self Test Result[12]:
  Operation Result             : 0xf
Self Test Result[13]:
  Operation Result             : 0xf
Self Test Result[14]:
  Operation Result             : 0xf
Self Test Result[15]:
  Operation Result             : 0xf
Self Test Result[16]:
  Operation Result             : 0xf
Self Test Result[17]:
  Operation Result             : 0xf
Self Test Result[18]:
  Operation Result             : 0xf
Self Test Result[19]:
  Operation Result             : 0xf


But what the heck do they mean? I cannot find this documented anywhere... I was expecting something less cryptic than this... or at least some thorough documentation on how to interpret the results... But what does Self Test Code 1 or 2 mean? If the drive is showing as healthy, there is no point in sending it back to Western Digital (it is an SN750, by the way). They will just throw it back in my face, and it will waste both of our time. Meanwhile, I'll have an NVME that I do not trust... that is of marginal use to me.

Anyone know of any documentation on this subject? Anyone know how to interpret this?

Thanks.

G
_________________
To look without without looking within is like looking without without looking at all.
Anon-E-moose
Watchman


Joined: 23 May 2008
Posts: 6098
Location: Dallas area

Posted: Thu Dec 23, 2021 1:35 pm

Not sure what the tests are, but there are some things you can check/investigate:

Is there a firmware update for the nvme drive? (Check with WD support.)
Not sure which kernel version you're running, but there's always the possibility that it needs a newer driver (a later kernel).
Could be something not right between the MB and the nvme.
_________________
PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Thu Dec 23, 2021 4:28 pm

Quote:
I couldn't backup the drive because it would get about 30 gigs in, then crash.

How often did you run fstrim on your old drive?
Hu
Moderator


Joined: 06 Mar 2007
Posts: 21619

Posted: Thu Dec 23, 2021 4:50 pm

Did you ever get any kernel logs from when the old drive crashed, or was the system too broken to save those? If you got them, what did the kernel print?

My guess based on your reported symptoms is that the drive had a bad area that it handled very poorly, but when you rewrote the entire drive, you forced the drive to remap that area out of existence. The remaining sectors are usable, at least for now. Whether they will remain that way is unknown.
grooveman
Veteran


Joined: 24 Feb 2003
Posts: 1217

Posted: Thu Jan 27, 2022 3:38 pm

I didn't think you needed to run the trim function on contemporary drives.

The thing behaves normally, so I'm not sure why it got so grumpy.

Anyway, thanks for the input.
_________________
To look without without looking within is like looking without without looking at all.
jonas21
n00b


Joined: 24 Oct 2022
Posts: 1

Posted: Mon Oct 24, 2022 6:44 am

I was trying to make sense of these cryptic results, too. It seems this is not well documented in nvme-cli. The codes actually come from the NVMe spec; their meaning is as follows:

The "Operation Result" field:

0h Operation completed without error
1h Operation was aborted by a Device Self-test command
2h Operation was aborted by a Controller Level Reset
3h Operation was aborted due to a removal of a namespace from the namespace inventory
4h Operation was aborted due to the processing of a Format NVM command
5h A fatal error or unknown test error occurred while the controller was executing the device self-test operation and the operation did not complete
6h Operation completed with a segment that failed and the segment that failed is not known
7h Operation completed with one or more failed segments and the first segment that failed is indicated in the Segment Number field
8h Operation was aborted for unknown reason
9h Operation was aborted due to a sanitize operation
Ah to Eh Reserved
Fh Entry not used (does not contain a test result)
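
If you would rather not look these up by hand, the table can be turned into a small lookup. This is only a sketch in Python: the names OPERATION_RESULT and describe_result are made up for this example and are not part of nvme-cli. It assumes you pass in the value exactly as nvme-cli prints it after "Operation Result" (for example "0" or "0xf").

```python
# Hypothetical helper, not part of nvme-cli: maps an Operation Result
# value to the description given in the table above.
OPERATION_RESULT = {
    0x0: "Operation completed without error",
    0x1: "Operation was aborted by a Device Self-test command",
    0x2: "Operation was aborted by a Controller Level Reset",
    0x3: "Operation was aborted due to a removal of a namespace "
         "from the namespace inventory",
    0x4: "Operation was aborted due to the processing of a Format NVM command",
    0x5: "A fatal error or unknown test error occurred and the "
         "operation did not complete",
    0x6: "Operation completed with a failed segment; the segment is not known",
    0x7: "Operation completed with one or more failed segments; "
         "see the Segment Number field",
    0x8: "Operation was aborted for unknown reason",
    0x9: "Operation was aborted due to a sanitize operation",
    0xF: "Entry not used (does not contain a test result)",
}


def describe_result(value):
    """Describe an Operation Result given as an int or a string like '0xf'."""
    # int(text, 0) accepts both decimal ("0") and hex ("0xf") notation.
    code = int(value, 0) if isinstance(value, str) else int(value)
    if not 0 <= code <= 0xF:
        raise ValueError(f"Operation Result is a 4-bit field, got {code:#x}")
    # Ah to Eh are not in the table and fall through to "Reserved".
    return OPERATION_RESULT.get(code, "Reserved")


print(describe_result("0"))
print(describe_result("0xf"))
```

Read this way, entries 0 to 4 in the log from the first post completed without error, and entries 5 to 19 are simply unused slots.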

"Self Test Code" field:

0h Reserved
1h Short device self-test operation
2h Extended device self-test operation
3h to Dh Reserved
Eh Vendor specific
Fh Reserved
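
The same approach works for the Self Test Code, and the "Power on hours" values are just hexadecimal numbers. Below is a rough sketch that applies both to the five used entries from the log in the first post. SELF_TEST_CODE, describe_code and entries are names invented for this example; the number pairs are copied by hand from that log.

```python
# Hypothetical helper, not part of nvme-cli: maps the Self Test Code
# to the descriptions listed above.
SELF_TEST_CODE = {
    0x1: "Short device self-test operation",
    0x2: "Extended device self-test operation",
    0xE: "Vendor specific",
}


def describe_code(code):
    """Return the meaning of a Self Test Code; unlisted values are reserved."""
    return SELF_TEST_CODE.get(code, "Reserved")


# (Self Test Code, Power on hours) pairs copied from Self Test Result[0..4]
# in the first post. POH is printed in hex by nvme-cli.
entries = [(2, 0x25D0), (1, 0x25CF), (2, 0x25C6), (1, 0x25C2), (1, 0x1161)]

for index, (code, poh) in enumerate(entries):
    print(f"Result[{index}]: {describe_code(code)} at {poh} power-on hours")
```

So the most recent entry, Result[0], is an extended test logged at 0x25d0 = 9680 power-on hours.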

"Segment number" field:

Segment Number: This field indicates the segment number (refer to section 8.11) where the first self-test failure occurred. If Device Self-test Status field bits [3:0] are not set to 7h, then this field should be ignored.


"Valid Diagnostic information" field:

Bits 7:4 are reserved.
Bit 3 (SC Valid): If set to ‘1’, then the contents of Status Code field is valid. If cleared to ‘0’, then the contents of Status Code field is invalid.
Bit 2 (SCT Valid): If set to ‘1’, then the contents of Status Code Type field is valid. If cleared to ‘0’, then the contents of Status Code Type field is invalid.
Bit 1 (FLBA Valid): If set to ‘1’, then the contents of Failing LBA field is valid. If cleared to ‘0’, then the contents of Failing LBA field is invalid.
Bit 0 (NSID Valid): If set to ‘1’, then the contents of Namespace Identifier field is valid. If cleared to ‘0’, then the contents of Namespace Identifier field is invalid.
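
Since "Valid Diagnostic Information" is a bit field, it can be decoded with a few bit tests. A minimal sketch, assuming the bit layout quoted above; VALID_BITS and valid_fields are names made up for this example, not anything nvme-cli provides.

```python
# Hypothetical helper, not part of nvme-cli: bit position -> the field
# whose contents are valid when that bit is set (bits 7:4 are reserved).
VALID_BITS = {
    0: "Namespace Identifier",
    1: "Failing LBA",
    2: "Status Code Type",
    3: "Status Code",
}


def valid_fields(vdi):
    """List the fields flagged as valid by a Valid Diagnostic Information byte."""
    return [name for bit, name in VALID_BITS.items() if vdi & (1 << bit)]


# The log in the first post shows 0 for every used entry, so none of the
# extra fields carry information there.
print(valid_fields(0))

# Illustrative value only: bits 0 and 1 set.
print(valid_fields(0b0011))
```

A value of 0, as in every used entry of the log above, means none of those extra fields should be read.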
All times are GMT