Interpreting nvme-cli logs

grooveman · Veteran Joined: 24 Feb 2003 Posts: 1217

Hi.

I had some problems with an nvme drive I had. The system kept locking up. I couldn't back up the drive because it would get about 30 gigs in, then crash. So, I got a new drive and restored my last good backup to it, and my system now works perfectly. A very happy ending to a story that could have been a disaster, and certainly testimony to having regular backups running...

but...

The old drive is still under warranty, and I'm trying to determine if it is any good anymore. I ran a shred on it... and it gave no complaints. That surprised me, so I wrote zeros to it -- and to my surprise, it executed this on the entire drive without a single complaint. At this point, I begin to wonder if there really is a problem with the drive... I hook it back up, and I use nvme-cli. I do a long test, and after a couple hours, I get my results:

Anon-E-moose · Posted: Thu Dec 23, 2021 1:35 pm Post subject:

Not sure what the tests are, but there are some things you can check/investigate:

Is there a firmware update for the nvme drive? (Check with WD support.)
Not sure which kernel version you're running, but there's always a possibility that it might need a newer driver (later kernel).
Could be something not right between the MB and the nvme.
_________________
PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland

mike155 · Posted: Thu Dec 23, 2021 4:28 pm Post subject:

Hu · Moderator Joined: 06 Mar 2007 Posts: 21619

Did you ever get any kernel logs from when the old drive crashed, or was the system too broken to save those? If you got them, what did the kernel print?

My guess based on your reported symptoms is that the drive had a bad area that it handled very poorly, but when you rewrote the entire drive, you forced the drive to remap that area out of existence. The remaining sectors are usable, at least for now. Whether they will remain that way is unknown.

grooveman · Veteran Joined: 24 Feb 2003 Posts: 1217

I didn't think you needed to run the trim function on contemporary drives.

The thing behaves normally, so I'm not sure why it got so grumpy.

Anyway, thanks for the input.
_________________
To look without without looking within is like looking without without looking at all.

jonas21 · n00b Joined: 24 Oct 2022 Posts: 1

I was looking for the meaning of the cryptic results, too. It seems this is not well documented in nvme-cli. The codes are actually taken from the NVMe spec; their meaning is as follows:

The "Operating Result" field:

0h Operation completed without error
1h Operation was aborted by a Device Self-test command
2h Operation was aborted by a Controller Level Reset
3h Operation was aborted due to a removal of a namespace from the namespace inventory
4h Operation was aborted due to the processing of a Format NVM command
5h A fatal error or unknown test error occurred while the controller was executing the device self-test operation and the operation did not complete
6h Operation completed with a segment that failed and the segment that failed is not known
7h Operation completed with one or more failed segments and the first segment that failed is indicated in the Segment Number field
8h Operation was aborted for unknown reason
9h Operation was aborted due to a sanitize operation
Ah to Eh Reserved
Fh Entry not used (does not contain a test result)

"Self Test Code" field:

0h Reserved
1h Short device self-test operation
2h Extended device self-test operation
3h to Dh Reserved
Eh Vendor specific
Fh Reserved
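
If you'd rather not look these up by hand, here is a minimal Python sketch of the two tables above. The function names are made up for illustration and are not part of nvme-cli. The descriptions are shortened from the spec wording. split_status() assumes you have the raw Device Self-test Status byte, with the self-test code in bits 7:4 and the result in bits 3:0; if your tool already prints the two fields separately, only the lookups are needed.

```python
# Operation result codes, transcribed (and shortened) from the list above.
OPERATION_RESULT = {
    0x0: "Operation completed without error",
    0x1: "Aborted by a Device Self-test command",
    0x2: "Aborted by a Controller Level Reset",
    0x3: "Aborted due to removal of a namespace from the namespace inventory",
    0x4: "Aborted due to the processing of a Format NVM command",
    0x5: "Fatal or unknown test error; operation did not complete",
    0x6: "Completed with a failed segment; failed segment not known",
    0x7: "Completed with failed segment(s); see Segment Number field",
    0x8: "Aborted for unknown reason",
    0x9: "Aborted due to a sanitize operation",
    0xF: "Entry not used (does not contain a test result)",
}

# Self-test codes, transcribed from the list above.
SELF_TEST_CODE = {
    0x1: "Short device self-test operation",
    0x2: "Extended device self-test operation",
    0xE: "Vendor specific",
}


def describe_result(value):
    """Return the text for a 4-bit operation result; unlisted values are reserved."""
    return OPERATION_RESULT.get(value & 0xF, "Reserved")


def describe_code(value):
    """Return the text for a 4-bit self-test code; unlisted values are reserved."""
    return SELF_TEST_CODE.get(value & 0xF, "Reserved")


def split_status(status):
    """Split a raw status byte into (self_test_code, operation_result).

    Assumption: code is in bits 7:4 and result is in bits 3:0.
    """
    return (status >> 4) & 0xF, status & 0xF


if __name__ == "__main__":
    # 0x27 is a hypothetical raw byte, not a value from this thread.
    code, result = split_status(0x27)
    print(describe_code(code), "/", describe_result(result))
```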

"Segment number" field:

Segment Number: This field indicates the segment number (refer to section 8.11) where the first self-test failure occurred. If Device Self-test Status field bits [3:0] are not set to 7h, then this field should be ignored.
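
In other words, the segment number only means something when the operation result is 7h. A tiny sketch of that rule in Python (failed_segment is a hypothetical helper name, not something nvme-cli provides):

```python
def failed_segment(operation_result, segment_number):
    """Return the first failed segment, or None if the field should be ignored.

    Per the text above, Segment Number is only meaningful when the
    operation result (status bits 3:0) equals 7h.
    """
    if (operation_result & 0xF) == 0x7:
        return segment_number
    return None


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(failed_segment(0x7, 3))
    print(failed_segment(0x0, 3))
```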

"Valid Diagnostic information" field:

Bits 7:4 are reserved.
Bit 3 (SC Valid): If set to ‘1’, then the contents of Status Code field is valid. If cleared to ‘0’, then the contents of Status Code field is invalid.
Bit 2 (SCT Valid): If set to ‘1’, then the contents of Status Code Type field is valid. If cleared to ‘0’, then the contents of Status Code Type field is invalid.
Bit 1 (FLBA Valid): If set to ‘1’, then the contents of Failing LBA field is valid. If cleared to ‘0’, then the contents of Failing LBA field is invalid.
Bit 0 (NSID Valid): If set to ‘1’, then the contents of Namespace Identifier field is valid. If cleared to ‘0’, then the contents of Namespace Identifier field is invalid.
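
The field is a plain bitmask, so working out which of the other fields can be trusted is just a matter of testing bits 0 to 3. A rough Python sketch, using the bit names from the list above (valid_fields is a hypothetical helper, not an nvme-cli function):

```python
# Bit position -> field whose contents are valid when that bit is set.
VALID_BITS = {
    0: "NSID",  # Namespace Identifier
    1: "FLBA",  # Failing LBA
    2: "SCT",   # Status Code Type
    3: "SC",    # Status Code
}


def valid_fields(valid_diag_info):
    """List the fields flagged as valid, lowest bit first.

    Bits 7:4 are reserved and are ignored here.
    """
    return [
        name
        for bit, name in sorted(VALID_BITS.items())
        if valid_diag_info & (1 << bit)
    ]


if __name__ == "__main__":
    # 0x3 is a hypothetical value: NSID and Failing LBA would be valid.
    print(valid_fields(0x3))
```

A value of 0 would mean none of the four fields holds usable data for that log entry.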