Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Fri Mar 16, 2018 7:55 am Post subject: Time to put this drive to rest? |
|
|
My home server had a huge load spike. I went to investigate:
Code: | SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1817
3 Spin_Up_Time 0x0027 173 173 021 Pre-fail Always - 2308
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 79
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18885
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 78
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 34
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 177
194 Temperature_Celsius 0x0022 115 098 000 Old_age Always - 28
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 2
|
tail of /var/log/disk/btrfs/current: | 2018-03-16T09:35:03+0200 [kernel] [5929925.151666] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 842, flush 0, corrupt 0, gen 0
2018-03-16T09:35:03+0200 [kernel] [5929925.151674] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 843, flush 0, corrupt 0, gen 0
2018-03-16T09:35:03+0200 [kernel] [5929925.151677] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 844, flush 0, corrupt 0, gen 0
2018-03-16T09:35:03+0200 [kernel] [5929925.151681] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 845, flush 0, corrupt 0, gen 0
2018-03-16T09:35:03+0200 [kernel] [5929925.151684] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 846, flush 0, corrupt 0, gen 0
2018-03-16T09:35:03+0200 [kernel] [5929925.151688] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 847, flush 0, corrupt 0, gen 0
2018-03-16T09:35:03+0200 [kernel] [5929925.151691] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 848, flush 0, corrupt 0, gen 0
2018-03-16T09:35:07+0200 [kernel] [5929929.561469] BTRFS info (device sdc4): read error corrected: ino 353768 off 99848192 (dev /dev/sdc4 sector 1342830504)
2018-03-16T09:35:07+0200 [kernel] [5929929.561522] BTRFS info (device sdc4): read error corrected: ino 353768 off 99856384 (dev /dev/sdc4 sector 1342830520)
2018-03-16T09:35:07+0200 [kernel] [5929929.561528] BTRFS info (device sdc4): read error corrected: ino 353768 off 99852288 (dev /dev/sdc4 sector 1342830512)
2018-03-16T09:35:07+0200 [kernel] [5929929.561574] BTRFS info (device sdc4): read error corrected: ino 353768 off 99860480 (dev /dev/sdc4 sector 1342830528)
2018-03-16T09:35:07+0200 [kernel] [5929929.561642] BTRFS info (device sdc4): read error corrected: ino 353768 off 99864576 (dev /dev/sdc4 sector 1342830536)
2018-03-16T09:35:07+0200 [kernel] [5929929.561698] BTRFS info (device sdc4): read error corrected: ino 353768 off 99868672 (dev /dev/sdc4 sector 1342830544)
2018-03-16T09:35:07+0200 [kernel] [5929929.561770] BTRFS info (device sdc4): read error corrected: ino 353768 off 99872768 (dev /dev/sdc4 sector 1342830552) |
Also part of /var/log/everything: | [kernel] [5929925.150153] ata3.00: exception Emask 0x0 SAct 0x610 SErr 0x0 action 0x0
[kernel] [5929925.150156] ata3.00: irq_stat 0x40000008
[kernel] [5929925.150160] ata3.00: failed command: READ FPDMA QUEUED
[kernel] [5929925.150168] ata3.00: cmd 60/38:20:a8:ff:ed/00:00:58:00:00/40 tag 4 ncq dma 28672 in
[kernel] [5929925.150168] res 41/40:00:aa:ff:ed/00:00:58:00:00/00 Emask 0x409 (media error) <F>
[kernel] [5929925.150170] ata3.00: status: { DRDY ERR }
[kernel] [5929925.150172] ata3.00: error: { UNC }
[kernel] [5929925.151625] ata3.00: configured for UDMA/133
[kernel] [5929925.151647] sd 2:0:0:0: [sdc] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[kernel] [5929925.151651] sd 2:0:0:0: [sdc] tag#4 Sense Key : 0x3 [current]
[kernel] [5929925.151654] sd 2:0:0:0: [sdc] tag#4 ASC=0x11 ASCQ=0x4
[kernel] [5929925.151658] sd 2:0:0:0: [sdc] tag#4 CDB: opcode=0x28 28 00 58 ed ff a8 00 00 38 00
[kernel] [5929925.151661] blk_update_request: I/O error, dev sdc, sector 1491992490
[kernel] [5929925.151666] BTRFS error (device sdc4): bdev /dev/sdc4 errs: wr 0, rd 842, flush 0, corrupt 0, gen 0 | Then follows the usual btrfs errors.
It's a pretty old 1TB (I guess) WD Blue spinning platter. I can drop it out of the btrfs pool and the RAID-1 array too. No problem.
I'm more interested in the SMART data above.
Current_Pending_Sector with a value of 1 and Multi_Zone_Error_Rate with a value of 2 seem to indicate impending total failure of the drive. Right? _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
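[Editorial aside: the failing LBA reported by blk_update_request above (sector 1491992490) can be converted to a byte offset so the suspect spot can be probed directly with dd. A minimal sketch, assuming 512-byte logical sectors - check yours with blockdev --getss; the device path is just the one from this thread:]

```python
def sector_to_byte_offset(lba: int, sector_size: int = 512) -> int:
    """Convert an absolute LBA (as printed by the kernel) to a byte offset."""
    return lba * sector_size

def dd_probe_command(device: str, lba: int, sector_size: int = 512) -> str:
    """Build a dd command line that tries to read just the suspect sector."""
    return (f"dd if={device} of=/dev/null bs={sector_size} "
            f"skip={lba} count=1 iflag=direct")

if __name__ == "__main__":
    lba = 1491992490  # from the blk_update_request line above
    print(sector_to_byte_offset(lba))          # byte offset on the whole disk
    print(dd_probe_command("/dev/sdc", lba))   # read exactly one sector
```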
bunder Bodhisattva
Joined: 10 Apr 2004 Posts: 5934
|
Posted: Fri Mar 16, 2018 10:24 am Post subject: |
|
|
one pending sector isn't really a whole lot to worry about.
i'd be more concerned that smart found one problem but btrfs found many consecutive errors.
theoretically you could try wiping the drive and keep using it, but when in doubt throw it out. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Fri Mar 16, 2018 11:25 am Post subject: |
|
|
bunder wrote: | when in doubt throw it out. | I already made an order for 2TB Toshiba and 2TB WD RED.
I might as well grow my disk space at the same time... Or keep the other as a spare.
The Raw_Read_Error_Rate value of that drive is just too high for me to accept. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
Last edited by Zucca on Fri Mar 16, 2018 11:48 am; edited 1 time in total |
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Fri Mar 16, 2018 11:42 am Post subject: |
|
|
Quote: | Raw_Read_Error_Rate value of that drive is just too high for me to accept. |
You're kidding, aren't you? Look at the value on my Seagate ST32000644NS hard disk:
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 083 063 044 Pre-fail Always - 204787750
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 76
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
7 Seek_Error_Rate 0x000f 076 060 030 Pre-fail Always - 48133305
9 Power_On_Hours 0x0032 050 050 000 Old_age Always - 44329
|
The drive works perfectly fine. A high value for Raw_Read_Error_Rate means nothing - at least not on Seagate drives.
You could do a 'dd if=/dev/sdX of=/dev/null bs=10M' to test your drive. It will take a couple of hours, but if you don't get any errors, you'll know that the drive is ok. |
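[Editorial aside: the dd check above can also be scripted so you get a list of unreadable regions instead of a single pass/fail. A rough sketch, not a polished tool - run it against /dev/sdX as root; it works against any readable path:]

```python
import os

def read_test(path: str, chunk_size: int = 10 * 1024 * 1024) -> list[int]:
    """Read the whole device/file in chunks; return byte offsets that failed."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:           # a media error surfaces as EIO here
                bad_offsets.append(offset)
                data = b"\0"          # pretend progress so we skip past it
            if not data:              # pread returns b"" at end of device
                break
            offset += chunk_size
    finally:
        os.close(fd)
    return bad_offsets
```

An empty list means every chunk was readable, which is roughly what an error-free dd pass tells you.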
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Fri Mar 16, 2018 12:01 pm Post subject: |
|
|
Strange.
All the other drives I have (five more) have Raw_Read_Error_Rate between 0 and 2.
With one exception of 6, which is also a WD Blue 1TB. But it seems to be about half the age of the other...
Also one of my drives, a WD Blue 2TB, has a Load_Cycle_Count of 230395, while on the others it's under 500. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
|
Posted: Fri Mar 16, 2018 12:45 pm Post subject: |
|
|
Unfortunately, many of the SMART parameters and values are mostly meaningless, because they are not standardized.
The only SMART parameters that seem to be useful to (pre-) detect a drive failure are: Reallocated_Sector_Ct and Current_Pending_Sector.
A high value for Load_Cycle_Count may indicate trouble. Look at the data sheet of your drive; the maximum number of load cycles should be specified there. High values typically mean that the drive uses APM (Advanced Power Management) to park its heads aggressively. I try to avoid such drives, at least for servers. Use '/sbin/hdparm -B /dev/sdX' to check whether your drive supports APM. If you want, you can disable APM using '/sbin/hdparm -B 255 /dev/sdX'. After APM is disabled, Load_Cycle_Count should stop rising.
EDIT: I just looked at the specification sheet of WD Blue 2TB drives. It specifies '300.000' load cycles. If your current value is 230395, you definitely should do something! |
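[Editorial aside: mike155's warning can be turned into a back-of-the-envelope projection: given the rated load-cycle budget from the data sheet and the current count from SMART, estimate how long the drive has left at its historical parking rate. The 300,000 and 230,395 figures are the ones quoted in this thread; the power-on hours below are a hypothetical example value:]

```python
def cycles_remaining(rated: int, current: int) -> int:
    """Load cycles left before the data-sheet limit."""
    return max(rated - current, 0)

def hours_until_limit(rated: int, current: int, power_on_hours: int) -> float:
    """Naive linear projection: assumes the historical parking rate continues."""
    rate_per_hour = current / power_on_hours
    return cycles_remaining(rated, current) / rate_per_hour

if __name__ == "__main__":
    print(cycles_remaining(300_000, 230_395))              # 69605 cycles left
    # 20_000 power-on hours is an assumed example, not from the thread:
    print(round(hours_until_limit(300_000, 230_395, 20_000)))
```

The projection is crude (parking rate depends on workload and APM settings), but it makes "you definitely should do something" quantifiable.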
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Fri Mar 16, 2018 1:10 pm Post subject: |
|
|
mike155 wrote: | Unfortunately, many of the SMART parameters and values are mostly meaningless, because they are not standardized. | I've always wondered why. Every drive manufacturer supports SMART, but the values are some sort of guessing game. Bah! Luckily I have something to compare against: all my drives are WD.
mike155 wrote: | I just looked at the specification sheet of WD Blue 2TB drives. It specifies '300.000' load cycles. If your current value is 230395, you definitely should do something! |
Zucca wrote: | I already made an order for 2TB Toshiba and 2TB WD RED.
I might as well grow my disk space at the same time... Or keep the other as a spare. | ... It will be interesting to see how the SMART values on the Toshiba evolve...
I'll also recheck my hdparm configuration. Thanks. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
P.Kosunen Guru
Joined: 21 Nov 2005 Posts: 309 Location: Finland
|
Posted: Fri Mar 16, 2018 5:01 pm Post subject: |
|
|
mike155 wrote: | A high value for Load_Cycle_Count may indicate trouble. |
On Greens I have seen millions; I wouldn't worry about a couple hundred thousand.
http://idle3-tools.sourceforge.net/
You could try increasing the parking time a bit. |
frostschutz Advocate
Joined: 22 Feb 2005 Posts: 2977 Location: Germany
|
Posted: Fri Mar 16, 2018 5:11 pm Post subject: |
|
|
bunder wrote: | one pending sector isn't really a whole lot to worry about. |
that is what the hard drive vendors want to make you believe.
a hard drive is supposed to store data - not lose it. with one pending sector, it already lost data. that's not acceptable.
I'd replace the drive. If there is no backup, ddrescue. Once the drive is removed / ddrescue'd, you can do a destructive badblocks and decide whether it's worth giving it another shot or not. Either way, I would no longer trust it with important data.
idle3 is built into hdparm as well (-J) - I have used it on my WD Green drives and they have lived for a long time... (still running) ...but I don't know if that's just luck or in any way related to idle3. There is a lot of panic about this but no reports of massive failures (like the Deathstar et al.) |
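[Editorial aside: the destructive badblocks pass frostschutz mentions can be approximated in a few lines: write a known pattern across the whole target, read it back, and report regions that don't verify. A sketch in the spirit of badblocks -w - it destroys all data on the target, so it is demonstrated here against an ordinary file rather than a device:]

```python
import os

def pattern_test(path: str, size: int, pattern: bytes = b"\xaa",
                 chunk_size: int = 1024 * 1024) -> list[int]:
    """Overwrite `size` bytes of `path` with `pattern`, read back, and
    return byte offsets of chunks that failed to verify. DESTRUCTIVE."""
    bad = []
    block = pattern * chunk_size
    with open(path, "r+b") as f:
        for offset in range(0, size, chunk_size):      # write pass
            n = min(chunk_size, size - offset)
            f.seek(offset)
            f.write(block[:n])
        f.flush()
        os.fsync(f.fileno())                           # force it to the media
        for offset in range(0, size, chunk_size):      # verify pass
            n = min(chunk_size, size - offset)
            f.seek(offset)
            if f.read(n) != block[:n]:
                bad.append(offset)
    return bad
```

Real badblocks -w cycles through several patterns (0xaa, 0x55, 0xff, 0x00); this sketch does a single pass to show the idea.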
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
|
Posted: Fri Mar 16, 2018 6:36 pm Post subject: |
|
|
Zucca,
Code: | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1 |
The drive has lost data already and knows it.
Run the long self test. That Pending Sector count might get worse.
Raw values are often packed bit fields, so big numbers are not always a cause for concern.
The VALUE, WORST and THRESH columns are normalised.
If VALUE or WORST is <= THRESH, that SMART parameter has failed.
You have a drive that can't read its own writing. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
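[Editorial aside: NeddySeagoon's rule - the attribute has failed if the normalised VALUE or WORST has dropped to or below THRESH - is easy to check mechanically across the whole table. A sketch that parses the attribute lines of smartctl -A output; the sample is hand-typed, and the failing Reallocated_Sector_Ct line is synthetic, for illustration only:]

```python
def failed_attributes(smartctl_table: str) -> list[str]:
    """Return names of attributes whose normalised VALUE or WORST
    is at or below THRESH (ignoring attributes with THRESH 0)."""
    failed = []
    for line in smartctl_table.splitlines():
        fields = line.split()
        # Expect: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(fields[i]) for i in (3, 4, 5))
        if thresh > 0 and (value <= thresh or worst <= thresh):
            failed.append(name)
    return failed

sample = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1817
  5 Reallocated_Sector_Ct   0x0033   130   130   140    Pre-fail  Always       -       812
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""
print(failed_attributes(sample))   # ['Reallocated_Sector_Ct']
```

Note that by this rule the drive in the first post passes everywhere - the pending sector only shows up in the raw value, which is exactly why the raw columns are worth reading too.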
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
|
Posted: Fri Mar 16, 2018 7:14 pm Post subject: |
|
|
Here's the WD Green in my desktop for comparison -
Code: | SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 173 155 021 Pre-fail Always - 6308
4 Start_Stop_Count 0x0032 097 097 000 Old_age Always - 3307
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 042 042 000 Old_age Always - 42934
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 097 097 000 Old_age Always - 3287
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 35
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 3307
194 Temperature_Celsius 0x0022 117 105 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 |
The values that are non-zero on yours (attributes 1, 197 and 200) definitely point to a failing drive. The multi-zone errors could indicate it suffered a head crash, however unlikely those may be nowadays. Either way, it's improbable that the situation will get better from here. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Fri Mar 16, 2018 8:03 pm Post subject: |
|
|
Current_Pending_Sector is now at 0. Other critical numbers haven't changed.
I have done nothing yet. I'll wait till Monday/Tuesday for the new disks.
Meanwhile I'll start pulling that one disk out of the system... on the software side of things, I mean. I have redundancy for all the data, so pulling one from the system isn't much of a task. It just takes some time to rebalance. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
|
Posted: Sat Mar 17, 2018 12:24 pm Post subject: |
|
|
Zucca,
If the reallocated sector count did not change, the drive read the sector and was happy with the result.
If the reallocated sector count has increased, the drive got a good read and moved the data.
The reallocated sector count is supposed to increase as the drive ages and data from difficult to read sectors is moved.
The pending sector count should always be zero. That's a count of the sectors the drive knows it can't read.
A long test may be informative. The drive will read its entire data area without any host IO. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
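[Editorial aside: NeddySeagoon's point about watching whether the reallocated and pending counts move suggests keeping periodic SMART snapshots and diffing the critical raw values. A minimal sketch with hand-typed snapshot dicts - in practice you would fill them by parsing smartctl -A output on a schedule:]

```python
CRITICAL = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
            "Offline_Uncorrectable", "Reallocated_Event_Count")

def smart_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the critical attributes whose raw value grew between snapshots."""
    return {name: after[name] - before[name]
            for name in CRITICAL
            if name in before and name in after and after[name] > before[name]}

if __name__ == "__main__":
    # Example values only; a pending sector that clears (1 -> 0) is not
    # flagged, but a new reallocation (0 -> 1) is.
    yesterday = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 1,
                 "Offline_Uncorrectable": 0}
    today     = {"Reallocated_Sector_Ct": 1, "Current_Pending_Sector": 0,
                 "Offline_Uncorrectable": 0}
    print(smart_delta(yesterday, today))   # {'Reallocated_Sector_Ct': 1}
```

A growing Reallocated_Sector_Ct after a pending sector clears is the pattern Neddy describes: the drive gave up on the spot and moved the data.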
Jaglover Watchman
Joined: 29 May 2005 Posts: 8291 Location: Saint Amant, Acadiana
|
Posted: Sat Mar 17, 2018 2:32 pm Post subject: |
|
|
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 046 Pre-fail Always - 119150
2 Throughput_Performance 0x0005 100 100 030 Pre-fail Offline - 12910592
3 Spin_Up_Time 0x0003 100 100 025 Pre-fail Always - 1
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 120
5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0 (2000 0)
7 Seek_Error_Rate 0x000f 100 100 047 Pre-fail Always - 903
8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 0
9 Power_On_Hours 0x0032 007 007 000 Old_age Always - 46663
10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 120
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 65
193 Load_Cycle_Count 0x0032 071 071 000 Old_age Always - 580829
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 39 (Min/Max 22/57)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 (0 6924)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000f 100 100 060 Pre-fail Always - 7741
203 Run_Out_Cancel 0x0002 100 100 000 Old_age Always - 429512721134
240 Head_Flying_Hours 0x003e 200 200 000 Old_age Always - 0
|
This drive is on 24x7 and has been running like this for at least two years. I keep waiting for it to fail, but it keeps running. Shall I take a hammer to it? _________________ My Gentoo installation notes.
Please learn how to denote units correctly! |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Sat Mar 17, 2018 9:39 pm Post subject: |
|
|
frostschutz wrote: | idle3 is built into hdparm as well (-J) - I have used it on my WD Green drives and they have lived for a long time... (still running) ...but I don't know if that's just luck or in any way related to idle3. There is a lot of panic about this but no reports of massive failures (like the Deathstar et al.) | I have WD Greens too (head parking adjusted). They've been working flawlessly; the SMART data shows no signs of aging. I only see two WD Blues going down. The other one does not error out, but it has a head parking count of 230k.
I've now removed the faulty drive from the RAID-1 arrays, and the btrfs pool removal is running at the moment. I wonder if btrfs balances the data among the rest of the drives now, as the removal is taking a long time...
After that I can run the long test on the drive reporting errors. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
|
Posted: Sun Mar 18, 2018 11:08 pm Post subject: |
|
|
Finally.
I did a full balance of the btrfs pool. It started at 2018-03-17T22:40:04 and ended at 2018-03-19T00:40:57. I knew it would take some time, but I disregarded the warning. Silly me. :P
Next time I'll adjust the balancing filters. Anyway, this means I don't need to reach for my backups at the moment. Everything's fine. Next I'll run the long SMART tests. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |