SSD drive failure in dmesg

Post by sysnull008 » Sun Mar 06, 2022 3:47 am

I've had this problem before with my previous Samsung 860 EVO 2TB SSD. After replacing the SATA cables 3 times and moving the SATA connector to various other SATA ports on the motherboard, I finally decided to purchase a replacement.

So I bought a new 2TB Samsung EVO (same model). After copying all the data from the old one via ddrescue and getting my system set up again, I thought I was past this issue. However, the SAME error is popping up for my brand-new, 1-week-old SSD!

See below:

Code: Select all

[1227765.345170] ata8.00: exception Emask 0x10 SAct 0x7c200 SErr 0x0 action 0x6 frozen
[1227765.345176] ata8.00: irq_stat 0x08000000, interface fatal error
[1227765.345178] ata8.00: failed command: READ FPDMA QUEUED
[1227765.345179] ata8.00: cmd 60/08:48:80:cd:2a/00:00:35:00:00/40 tag 9 ncq dma 4096 in
                          res 40/00:88:f8:6b:29/00:00:35:00:00/40 Emask 0x10 (ATA bus error)
[1227765.345184] ata8.00: status: { DRDY }
[1227765.345186] ata8.00: failed command: WRITE FPDMA QUEUED
[1227765.345187] ata8.00: cmd 61/10:70:50:6b:29/00:00:35:00:00/40 tag 14 ncq dma 8192 out
                          res 40/00:88:f8:6b:29/00:00:35:00:00/40 Emask 0x10 (ATA bus error)
[1227765.345191] ata8.00: status: { DRDY }
[1227765.345192] ata8.00: failed command: WRITE FPDMA QUEUED
[1227765.345193] ata8.00: cmd 61/38:78:90:6b:29/00:00:35:00:00/40 tag 15 ncq dma 28672 out
                          res 40/00:88:f8:6b:29/00:00:35:00:00/40 Emask 0x10 (ATA bus error)
[1227765.345196] ata8.00: status: { DRDY }
[1227765.345197] ata8.00: failed command: WRITE FPDMA QUEUED
[1227765.345198] ata8.00: cmd 61/10:80:e0:6b:29/00:00:35:00:00/40 tag 16 ncq dma 8192 out
                          res 40/00:88:f8:6b:29/00:00:35:00:00/40 Emask 0x10 (ATA bus error)
[1227765.345202] ata8.00: status: { DRDY }
[1227765.345203] ata8.00: failed command: WRITE FPDMA QUEUED
[1227765.345204] ata8.00: cmd 61/08:88:f8:6b:29/00:00:35:00:00/40 tag 17 ncq dma 4096 out
                          res 40/00:88:f8:6b:29/00:00:35:00:00/40 Emask 0x10 (ATA bus error)
[1227765.345207] ata8.00: status: { DRDY }
[1227765.345208] ata8.00: failed command: READ FPDMA QUEUED
[1227765.345209] ata8.00: cmd 60/08:90:48:cd:cd/00:00:35:00:00/40 tag 18 ncq dma 4096 in
                          res 40/00:88:f8:6b:29/00:00:35:00:00/40 Emask 0x10 (ATA bus error)
[1227765.345212] ata8.00: status: { DRDY }
[1227765.345215] ata8: hard resetting link
[1227765.805161] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1227765.805390] ata8.00: supports DRM functions and may not be fully accessible
[1227765.805751] ata8.00: disabling queued TRIM support
[1227765.808022] ata8.00: supports DRM functions and may not be fully accessible
[1227765.808376] ata8.00: disabling queued TRIM support
[1227765.810456] ata8.00: configured for UDMA/133
[1227765.810472] sd 7:0:0:0: [sdd] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
[1227765.810474] sd 7:0:0:0: [sdd] tag#9 Sense Key : Illegal Request [current]
[1227765.810476] sd 7:0:0:0: [sdd] tag#9 Add. Sense: Unaligned write command
[1227765.810479] sd 7:0:0:0: [sdd] tag#9 CDB: Read(10) 28 00 35 2a cd 80 00 00 08 00
[1227765.810481] blk_update_request: I/O error, dev sdd, sector 891997568 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[1227765.810498] ata8: EH complete
[1227765.810557] ata8.00: Enabling discard_zeroes_data
The sdd drive is my new replacement SSD. Here is the block device info from lsblk:

Code: Select all

sdd                           8:48   0   1.8T  0 disk
└─systemvms                  253:18   0   1.8T  0 crypt /mnt/systemvms
The new SSD came with a SATA cable, so I used it as well; that's also brand new. I'm very confused and not sure if my motherboard is the problem or something else entirely!

Not sure if it's relevant, but the SSD is a LUKS volume, as can be seen above. The parameters are:

Code: Select all

/dev/mapper/systemvms is active and is in use.
  type:    LUKS2
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: keyring
  device:  /dev/sdd
  sector size:  512
  offset:  32768 sectors
  size:    3906996400 sectors
  mode:    read/write
  flags:   discards
Any help would be greatly appreciated.

Post by Zucca » Sun Mar 06, 2022 9:26 am

Did this start out of nowhere? Or after a kernel upgrade, maybe?

It could also just be overly aggressive SATA power saving.
You can check the power saving state with:

Code: Select all

cat /sys/class/scsi_host/host*/link_power_management_policy
(Note that I only have an NVMe-only laptop at hand, so I can't be sure of the path. Google gave me this path.)

By echoing max_performance to the file associated with the SSD in question, you can disable all power saving.
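
A minimal sketch of checking and setting the policy per host. It assumes the sysfs path above, and that the SSD sits on host7, going by the `sd 7:0:0:0: [sdd]` lines in the dmesg excerpt (the first number is the SCSI host number). The function names and the optional sysfs-root argument are hypothetical helpers made up for illustration; the extra argument only exists so the functions can be tried against a scratch directory.

```shell
# Print "hostN: policy" for every host that exposes the policy file.
# $1 is an optional sysfs root and defaults to the real /sys.
show_lpm_policies() {
    sysfs_root="${1:-/sys}"
    for f in "$sysfs_root"/class/scsi_host/host*/link_power_management_policy; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

# Write a policy for one host. On a real system this needs root.
# $1 = host name (e.g. host7), $2 = policy, $3 = optional sysfs root.
set_lpm_policy() {
    host="$1"
    policy="$2"
    sysfs_root="${3:-/sys}"
    printf '%s\n' "$policy" > "$sysfs_root/class/scsi_host/$host/link_power_management_policy"
}
```

On the machine in question that would be `set_lpm_policy host7 max_performance` as root, followed by `show_lpm_policies` to confirm. A value written to sysfs this way does not survive a reboot.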

Post by sysnull008 » Sun Mar 06, 2022 1:04 pm

Thanks for the reply!

Here is the result of that command (which was the correct path):

Code: Select all

max_performance
max_performance
max_performance
max_performance
max_performance
max_performance
max_performance
max_performance
max_performance
max_performance
This particular SSD is a SATA SSD, if that is relevant. Also, this is an X399 Ryzen Threadripper board with a Zen+ 2950X. One thing I can say for sure about this CPU is that it suffers from the hardware bug noted in the following link:

https://bugs.gentoo.org/724314

I don't know if that is related or not. As for a recent kernel upgrade: I don't recall *exactly* which kernel it started with, but I believe it began after I upgraded to kernel 5.13.x or 5.14.x and has happened ever since. It definitely started after a kernel upgrade, though I don't know whether that is just a coincidence.

Post by NeddySeagoon » Sun Mar 06, 2022 1:09 pm

sysnull008,

Code: Select all

irq_stat 0x08000000, interface fatal error 
The kernel thinks that it's an interface error, but we need all of dmesg after a clean start to make sure we are looking at the first error.
That's the only one that matters.
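
To pull the first libata exception out of a long dmesg, a filter along these lines can help. This is only a sketch: the function name is made up, and the pattern assumes the error lines look like the `ata8.00: exception ...` lines quoted earlier in the thread.

```shell
# Read kernel log text on stdin and print only the first line that
# reports a libata exception (e.g. "ata8.00: exception Emask ...").
first_ata_exception() {
    grep -m1 -E 'ata[0-9]+(\.[0-9]+)?: exception'
}
```

After a clean boot, `dmesg | first_ata_exception` (as root if the kernel log is restricted) shows the earliest such event, and its timestamp tells you where to start reading in the full log.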

Install smartmontools and run

Code: Select all

smartctl -x /dev/... 
and post the output. That's your drive's view of the world.

ddrescue is really designed for rotating rust. Did it complete with no errors or do you have 'holes' in your recovered data?
The log would be good to see.
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Post by sysnull008 » Tue Mar 08, 2022 2:10 am

I've attached the full dmesg and output of the smartctl command.

Since dmesg is pretty large I pastebin'd it.

Full dmesg

http://dpaste.com/G73WRZK78

smartctl output

Code: Select all

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.6-gentoo] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 2TB
Serial Number:    S620NJ0RA11598N
LU WWN Device Id: 5 002538 f31a16767
Firmware Version: SVT01B6Q
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar  7 21:03:33 2022 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x53) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 160) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    393
 12 Power_Cycle_Count       -O--CK   099   099   000    -    3
177 Wear_Leveling_Count     PO--C-   100   100   000    -    0
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   064   053   000    -    36
195 ECC_Error_Rate          -O-RC-   200   200   000    -    0
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
235 POR_Recovery_Count      -O--C-   099   099   000    -    1
241 Total_LBAs_Written      -O--CK   099   099   000    -    3506605724
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1           SL  VS      16  Device vendor specific log
0xa5           SL  VS      16  Device vendor specific log
0xce           SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  256        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    36 Celsius
Power Cycle Min/Max Temperature:     26/40 Celsius
Lifetime    Min/Max Temperature:     40/47 Celsius
Specified Max Operating Temperature:    70 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)

SCT Temperature History Version:     2
Temperature Sampling Period:         10 minutes
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:            0/70 Celsius
Temperature History Size (Index):    128 (72)

Index    Estimated Time   Temperature Celsius
  73    2022-03-06 23:50    34  ***************
  74    2022-03-07 00:00    34  ***************
  75    2022-03-07 00:10    35  ****************
 ...    ..(  2 skipped).    ..  ****************
  78    2022-03-07 00:40    35  ****************
  79    2022-03-07 00:50    34  ***************
 ...    ..(  4 skipped).    ..  ***************
  84    2022-03-07 01:40    34  ***************
  85    2022-03-07 01:50    33  **************
  86    2022-03-07 02:00    34  ***************
  87    2022-03-07 02:10    34  ***************
  88    2022-03-07 02:20    34  ***************
  89    2022-03-07 02:30    33  **************
 ...    ..(  2 skipped).    ..  **************
  92    2022-03-07 03:00    33  **************
  93    2022-03-07 03:10    34  ***************
  94    2022-03-07 03:20    33  **************
 ...    ..(  3 skipped).    ..  **************
  98    2022-03-07 04:00    33  **************
  99    2022-03-07 04:10    34  ***************
 100    2022-03-07 04:20    33  **************
 ...    ..( 15 skipped).    ..  **************
 116    2022-03-07 07:00    33  **************
 117    2022-03-07 07:10    34  ***************
 118    2022-03-07 07:20    33  **************
 ...    ..(  3 skipped).    ..  **************
 122    2022-03-07 08:00    33  **************
 123    2022-03-07 08:10    34  ***************
 124    2022-03-07 08:20    34  ***************
 125    2022-03-07 08:30    35  ****************
 ...    ..(  5 skipped).    ..  ****************
   3    2022-03-07 09:30    35  ****************
   4    2022-03-07 09:40    36  *****************
   5    2022-03-07 09:50    35  ****************
 ...    ..( 16 skipped).    ..  ****************
  22    2022-03-07 12:40    35  ****************
  23    2022-03-07 12:50    34  ***************
  24    2022-03-07 13:00    35  ****************
  25    2022-03-07 13:10    35  ****************
  26    2022-03-07 13:20    34  ***************
  27    2022-03-07 13:30    34  ***************
  28    2022-03-07 13:40    35  ****************
  29    2022-03-07 13:50    35  ****************
  30    2022-03-07 14:00    34  ***************
  31    2022-03-07 14:10    35  ****************
  32    2022-03-07 14:20    34  ***************
 ...    ..(  3 skipped).    ..  ***************
  36    2022-03-07 15:00    34  ***************
  37    2022-03-07 15:10    35  ****************
  38    2022-03-07 15:20    34  ***************
  39    2022-03-07 15:30    35  ****************
  40    2022-03-07 15:40    34  ***************
 ...    ..( 14 skipped).    ..  ***************
  55    2022-03-07 18:10    34  ***************
  56    2022-03-07 18:20    36  *****************
  57    2022-03-07 18:30    36  *****************
  58    2022-03-07 18:40    34  ***************
 ...    ..(  6 skipped).    ..  ***************
  65    2022-03-07 19:50    34  ***************
  66    2022-03-07 20:00    35  ****************
 ...    ..(  3 skipped).    ..  ****************
  70    2022-03-07 20:40    35  ****************
  71    2022-03-07 20:50    36  *****************
  72    2022-03-07 21:00    36  *****************

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               3  ---  Lifetime Power-On Resets
0x01  0x010  4             393  ---  Power-on Hours
0x01  0x018  6      3506605724  ---  Logical Sectors Written
0x01  0x020  6        11402302  ---  Number of Write Commands
0x01  0x028  6        93868863  ---  Logical Sectors Read
0x01  0x030  6         1048935  ---  Number of Read Commands
0x01  0x038  6         3519000  ---  Date and Time TimeStamp
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               2  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              36  ---  Current Temperature
0x05  0x020  1              47  ---  Highest Temperature
0x05  0x028  1              40  ---  Lowest Temperature
0x05  0x058  1              70  ---  Specified Maximum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               8  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               0  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            8  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            8  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
As for ddrescue, I let it run for about 24 hours and then stopped it. It reached 99.99% and stayed there for almost the entire time. If the log file is useful I can try to find it, although I might have deleted it.

Post by NeddySeagoon » Tue Mar 08, 2022 8:51 am

sysnull008,

A few things.

The first error is indeed

Code: Select all

[1227765.345170] ata8.00: exception Emask 0x10 SAct 0x7c200 SErr 0x0 action 0x6 frozen
[1227765.345176] ata8.00: irq_stat 0x08000000, interface fatal error
A few things to try.

1. Run the long test using smartctl.
That will read the entire drive but no data will pass over the interface.
If that fails,

Code: Select all

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    393 
with only 393 power on hours, look into a warranty return.

2. Replace the SATA Data cable.
Interface errors can be at the drive, the motherboard or the data cable. Poor quality cables are always a problem.

3. Try another SATA port on the motherboard. They fail too.
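
For step 1, a rough sketch of the smartctl calls involved. `-t long` and `-l selftest` are standard smartctl options; the wrapper function names are hypothetical, and /dev/sdd is simply the device name from this thread. The last helper assumes the usual self-test log layout, where the newest entry starts with `# 1`.

```shell
# Start the extended (long) self-test. The command returns at once and
# the drive runs the test in the background. Needs root.
start_long_test() {
    smartctl -t long "$1"
}

# Print the self-test log. The smartctl output posted above lists a
# recommended polling time of 160 minutes for the extended test.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Keep only the newest entry from self-test log text on stdin.
newest_selftest_entry() {
    grep -m1 '^# *1 '
}
```

Usage would be `start_long_test /dev/sdd`, then once the test has had time to finish, `show_selftest_log /dev/sdd | newest_selftest_entry`.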

Code: Select all

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               3  ---  Lifetime Power-On Resets
0x01  0x010  4             393  ---  Power-on Hours
0x01  0x018  6      3506605724  ---  Logical Sectors Written
0x01  0x020  6        11402302  ---  Number of Write Commands
0x01  0x028  6        93868863  ---  Logical Sectors Read
0x01  0x030  6         1048935  ---  Number of Read Commands 
That's a lot of writes in under 400 hours, and most of that data is never read back.
That suggests that your /var/tmp/portage should be in tmpfs, but it's not.
Regards,

NeddySeagoon


Post by sysnull008 » Wed Mar 09, 2022 1:50 am

NeddySeagoon wrote:1. Run the long test using smartctl.
I will try the long test and report back with the results as soon as it's done. As an aside, I *can* run this test while my filesystem is mounted and in use, right? Just want to make sure.
NeddySeagoon wrote:2. Replace the SATA Data cable.
I've replaced the SATA cable a few times now (with the previous drive), and now again with the new drive. I can do it again, but so far it hasn't helped. I'm worried it might be a hardware problem, as you suggested.
NeddySeagoon wrote:3. Try another SATA port on the motherboard. They fail too.
I've switched between 4 SATA ports on the motherboard and they all seem to have issues. That being said, I will power off the machine and switch again, since I haven't tried that since I bought the new drive.
NeddySeagoon wrote:That's a lot of writes in under 400 hours. Most of which are never read too. That suggests that your /var/tmp/portage should be in tmpfs but it's not.
Wow, that does look like a lot! I'm not sure if this is related, but my drive is encrypted via LUKS. All of them are. Just wanted to make sure I mentioned that in case it was relevant.

As for tmpfs, my mounts look like this:

Code: Select all

Filesystem            Size  Used Avail Use% Mounted on
none                   63G  2.0M   63G   1% /run
udev                   10M     0   10M   0% /dev
tmpfs                  63G  1.4G   62G   3% /dev/shm
/dev/dm-0             458G   94G  341G  22% /
cgroup_root            10M     0   10M   0% /sys/fs/cgroup
tmpfs                  13G   16K   13G   1% /run/user/1000
/dev/mapper/systemvms  1.8T  1.4T  347G  81% /mnt/systemvms
Should I have moved /var/tmp/* into tmpfs? How would I go about that? I must have missed that in the handbook installation somehow.

I mentioned my sata issues to a friend and he thinks my motherboard might be the culprit. The board is:
https://www.msi.com/Motherboard/X399-GA ... -CARBON-AC

It's for my Zen+ Threadripper.

If I can't solve this issue I plan to save over the next few months and consider building a new system based on AM4 and Zen3. I'd prefer not to have to rebuild my entire system though. I have many many VMs and data that are important.

Do you think there is any chance my board is the problem?

Thanks again for the help!

Post by Hu » Wed Mar 09, 2022 4:05 pm

sysnull008 wrote:As an aside, I *can* run this test while my filesystem is mounted and in use right? Just want to make sure.
Yes. SMART self-tests can be run safely while the drive is in use. Note that the firmware typically will abort the self-test early if the drive is powered down, so you should not halt, suspend, or hibernate the system until the test completes. Aborting the self-test will not harm the drive or its contents, but you would need to start it again from the beginning in order to get a valid result.
sysnull008 wrote:Wow that does look like a lot! I'm not sure if this is related but my drive is encrypted via LUKS. All of them are.
LUKS might influence the counts a little bit, but I would not expect a significant write amplification effect. Many people use LUKS-encrypted SSDs and do not experience early drive failure.
sysnull008 wrote:Should I have moved /var/tmp/* into tmpfs?
If you want to reduce unnecessary writes to the SSD, and have sufficient RAM, yes.
sysnull008 wrote:How would I go about that?
Add an fstab entry to mount a tmpfs there.
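
Something along these lines, as a sketch. The mount options follow what the Gentoo wiki suggests for putting Portage's build directory on tmpfs, as far as I recall; the 16G size is purely an assumption to adjust for your RAM and the largest packages you build. The snippet only builds and prints the line; it writes nothing to /etc/fstab.

```shell
# Assumption: 16G is a placeholder size, not a recommendation.
tmpfs_size="16G"

# fstab fields: device, mount point, fs type, options, dump, pass
fstab_line="tmpfs /var/tmp/portage tmpfs size=${tmpfs_size},uid=portage,gid=portage,mode=775,nosuid,noatime,nodev 0 0"

printf '%s\n' "$fstab_line"
```

Once you are happy with the line, append it to /etc/fstab as root and run `mount /var/tmp/portage`.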
sysnull008 wrote:Do you think there is any chance my board is the problem?
"Any" is quite loose, so yes, I think there is a non-zero chance that you have a defective motherboard. :) Whether that chance is high enough to speculatively replace the board, I can't say. I suggest you wait for other posters to comment on this before committing to the purchase of new hardware.

As for rebuilding your system, I would not worry about it. Nothing we have seen indicates that you would need to recreate your data files from scratch. At worst, your drive might die suddenly and force you to restore the data from the most recent backup. Assuming you have such a backup and can restore from it, the worst case I can see is that you are forced to replace all your hardware, and copy the data from backup. Don't buy anything yet though. We don't have damning evidence implicating any particular component.

Post by Anon-E-moose » Wed Mar 09, 2022 4:09 pm

sysnull008 wrote:I mentioned my sata issues to a friend and he thinks my motherboard might be the culprit. The board is:
https://www.msi.com/Motherboard/X399-GA ... -CARBON-AC
What version is your BIOS?

Post by sysnull008 » Thu Mar 10, 2022 1:46 am

I am giving the smartctl long test a go now and will update as soon as it completes (I'm guessing tomorrow).

As for my BIOS version, dmidecode gives the following (snipped) output:

Code: Select all

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 1.C0
        Release Date: 11/14/2018
That does look pretty old.

I'm hopeful about the upcoming long test, but I am very curious why my writes are so abnormally high. Perhaps a BIOS update is also a good idea.

EDIT

Adding the full dmidecode output in case it's useful:
http://dpaste.com/DTK7UVYDZ

EDIT 2

Actually, thinking about the high number of writes, I wonder if it's expected. This drive is a replacement for my last SSD, so I imaged the last SSD (via ddrescue, as the errors in dmesg showed it was failing) and then copied the imaged data onto the current drive. Wouldn't that explain the high number of writes but relatively low number of reads? I could be wrong, of course.
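
Putting numbers on that: with 512-byte logical sectors (per the smartctl output above), the counters work out to roughly 1.8 TB written and 48 GB read, so about 90% of the drive's 2 TB capacity has been written. That would be consistent with a single near-full-disk image copy. A small sketch of the arithmetic; the function name is just for illustration.

```shell
# Convert a count of logical sectors to bytes.
# $1 = sector count, $2 = optional sector size (defaults to 512).
sectors_to_bytes() {
    echo $(( $1 * ${2:-512} ))
}

written=$(sectors_to_bytes 3506605724)    # "Logical Sectors Written"
read_bytes=$(sectors_to_bytes 93868863)   # "Logical Sectors Read"
capacity=2000398934016                    # "User Capacity" in bytes

echo "written: $written bytes"       # prints 1795382130688
echo "read:    $read_bytes bytes"    # prints 48060857856
echo "written as % of capacity: $(( written * 100 / capacity ))"   # prints 89
```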

Post by sysnull008 » Fri Mar 11, 2022 1:18 am

After running the long test I've run the tool again to view the test results via:
smartctl -a /dev/sdd

I hope this output summarizes the long test but if not let me know.

https://dpaste.com/59B8SE4CK