Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Thu Mar 07, 2019 11:27 am Post subject: Hi. It's me again, with possible disk failures. |
|
|
Two of my spinning platters give me worrying signals: Code: | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7 |
and Code: | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1 |
I have one unused 3TB Toshiba ready here as a replacement.
I'm about to order another soon.
Anyway, I think I should replace the first one now.
Does anyone think the latter is anything to worry about at the moment?
I ran long SMART tests yesterday and none of the critical values changed. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
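For reference, the two attributes in question can be pulled out of a `smartctl -A` attribute table mechanically. A minimal sketch, using the rows from this post as sample input (on a live system you would pipe `smartctl -A /dev/sdX` into awk instead of the embedded sample):

```shell
# Sample rows copied from the post above; on a live system, replace the
# printf with: smartctl -A /dev/sdX
sample='  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1'

# Column 10 of a smartctl attribute row is RAW_VALUE.
printf '%s\n' "$sample" | awk '
    $2 == "Raw_Read_Error_Rate" || $2 == "UDMA_CRC_Error_Count" {
        printf "%s raw=%s\n", $2, $10
    }'
# prints:
# Raw_Read_Error_Rate raw=7
# UDMA_CRC_Error_Count raw=1
```

Logging these raw values periodically and watching whether they climb between runs is more informative than any single reading.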
Verdazil n00b
Joined: 14 Feb 2019 Posts: 47 Location: One small country ...
Posted: Thu Mar 07, 2019 1:13 pm Post subject: |
|
|
A change in a Pre-fail SMART attribute usually indicates a physical problem with the disk, and the disk should be replaced as soon as possible.
A change in an Old_age SMART attribute is not critical. _________________ GA-Z170X-UD3 / i7-6700K / DDR4 32GB / Radeon RX 570 / TL-WDN4800 / Samsung SSD 850 EVO 250 Gb + WD Green WDC 2 Tb / BenQ BL2711U + LG TV 42LF650V |
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
Posted: Thu Mar 07, 2019 2:03 pm Post subject: |
|
|
The value of Raw_Read_Error_Rate is 7? How cute!
Below is the output from my Seagate Constellation ES disks:
Code: | smartctl -a /dev/sda | egrep "(Raw_Read_Error_Rate|Hardware_ECC_Recovered)"
1 Raw_Read_Error_Rate 0x000f 082 063 044 Pre-fail Always - 195672011
195 Hardware_ECC_Recovered 0x001a 032 014 000 Old_age Always - 195672011
smartctl -a /dev/sdb | egrep "(Raw_Read_Error_Rate|Hardware_ECC_Recovered)"
1 Raw_Read_Error_Rate 0x000f 081 063 044 Pre-fail Always - 140065734
195 Hardware_ECC_Recovered 0x001a 032 014 000 Old_age Always - 140065734
|
The disks work perfectly fine. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Thu Mar 07, 2019 3:00 pm Post subject: |
|
|
Seagate must have a different method of reporting those values.
Come to think of it... I wonder if my single Seagate drive was actually fine after all... Although I think I did see read errors with that drive back then. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Sun Mar 10, 2019 12:45 am Post subject: |
|
|
*sigh*
https://superuser.com/questions/151288/why-do-different-manufacturers-have-different-s-m-a-r-t-value#153326
So yes. Seagate may report very high values for read error rates.
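The linked answer describes Seagate raw values as vendor-specific packed bit fields. One commonly repeated (but unofficial) interpretation of Raw_Read_Error_Rate is a 48-bit value with the error count in the upper 16 bits and the total operation count in the lower 32 bits. A quick sketch under that assumed layout, using mike155's number from above:

```shell
# Assumed (unofficial) Seagate layout: errors in bits 32-47, total
# operations in bits 0-31. This split is a common interpretation from
# forum/Q&A discussions, not vendor documentation.
raw=195672011   # mike155's Raw_Read_Error_Rate raw value from above

errors=$(( raw >> 32 ))
operations=$(( raw & 0xFFFFFFFF ))

# For this value the scary-looking number decodes to zero actual errors.
echo "errors=$errors operations=$operations"
# prints: errors=0 operations=195672011
```

If that interpretation is right, mike155's disks have logged ~195 million reads and no errors at all.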
I think I trust btrfs diagnostics a little more now. If the filesystem reports a read error, then it clearly hasn't got the data from the disk, thus indicating an error. Especially if the number of errors rises. So at the moment: Code: | # btrfs dev stats /dev/sda
[/dev/sda].write_io_errs 0
[/dev/sda].read_io_errs 1
[/dev/sda].flush_io_errs 0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0 | ... is indicating a possible pre-failure. I need to keep an eye on that drive. Meanwhile I think I'll buy another spare...
But not all of the SMART data is useless. At least the head parking count should be something to count on. I have one drive with over 200k parks, so I have moved that drive to the pile of disks from which I pick drives to put in my hard drive dock to back up files "off site". _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
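Those per-device counters are easy to watch mechanically. A small sketch that flags any nonzero counter, using the `btrfs dev stats` output above as sample input (on a live system you would pipe the command itself instead):

```shell
# Sample output copied from the post above; on a live system, replace the
# printf with: btrfs dev stats /dev/sdX
stats='[/dev/sda].write_io_errs 0
[/dev/sda].read_io_errs 1
[/dev/sda].flush_io_errs 0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0'

# Print a warning line for every counter that is not zero.
printf '%s\n' "$stats" | awk '$2 != 0 { print "WARN:", $1, "=", $2 }'
# prints: WARN: [/dev/sda].read_io_errs = 1
```

A cron job built around this (plus regular scrubs) gives an early warning the moment any counter moves.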
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
Posted: Sun Mar 10, 2019 1:29 am Post subject: |
|
|
Zucca,
Be very wary of big values in the RAW field. They may be vendor specific packed bit fields.
Post the smartctl -a output. Run a long test, then post it again.
The long test does a full surface scan. If it fails, or if the smartctl output changes between before and after the long test, those changes need to be understood. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
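A sketch of that before/after comparison. The live smartctl commands are left as comments, and two made-up snapshot lines (hypothetical values, not from this thread) stand in for real output so the mechanics are visible:

```shell
# The real workflow (run as root, and wait for the self-test to finish):
#   smartctl -A /dev/sda > before.txt
#   smartctl -t long /dev/sda
#   smartctl -A /dev/sda > after.txt
# Two made-up snapshot lines stand in for real output here; the raw
# value changing from 7 to 9 is a hypothetical example.
printf '%s\n' '  1 Raw_Read_Error_Rate 200 200 051 7' > before.txt
printf '%s\n' '  1 Raw_Read_Error_Rate 200 200 051 9' > after.txt

# Any line diff prints is a change that needs to be understood.
diff before.txt after.txt || true
rm -f before.txt after.txt
```

An empty diff after a passed long test is the reassuring case; any changed Pre-fail attribute is the one to investigate.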
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Sun Mar 10, 2019 12:07 pm Post subject: |
|
|
I did the long test before I started this topic. Since then the "critical" values have not changed. Also, all the tests have passed successfully. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
krinn Watchman
Joined: 02 May 2003 Posts: 7470
Posted: Sun Mar 10, 2019 2:29 pm Post subject: |
|
|
Don't forget the tests' limitations too.
On RAID1 it's possible to pass a filesystem test reading sector #n while one of the disks has that sector damaged, because the other disk answers the query, making the fs look all OK while the disk is not.
The fs only reports trouble in the part of the disk used by the fs, so btrfs reporting only 1 error means btrfs has only seen one error, while the disk may be damaged somewhere not used by that fs.
The same goes for the SMART test (the software test done by smartctl): for me it only tested the array, not the disks themselves. I once had a disk failure reported by my RAID card while smartctl was reporting that all was fine.
So for me, since you're not using hardware RAID, I would trust the smartctl results more, and assume the disk really has failed 7 times.
Up to you to judge whether 7 read failures are critical or not. (Think also: a dead sector read 7 times would report 7 errors, while only one sector is bad.) |
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
Posted: Mon Mar 11, 2019 4:56 am Post subject: |
|
|
Also check your cables. Sometimes the connectors seem to soften (for lack of a better word) over time and the connection isn't as solid as it used to be. Had that happen here recently, looked like a failing disk which "fixed itself" after it was pulled and placed in a different machine for analysis. Turned out the cable wasn't contacting well. _________________ Many think that Dilbert is a comic. Unfortunately it is a documentary. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
Posted: Mon Mar 11, 2019 1:35 pm Post subject: |
|
|
What Akkara said but don't forget the power cables.
I've had several SATA power connectors go high resistance and get hot to the point where they char and give off smoke.
A high resistance SATA power connector will play havoc with the dynamic voltage regulation on the drive.
They had probably been failing for a long time before I noticed; detection happened when one eventually went short circuit and took my server down.
Localisation with Mk1 nose was trivial :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Tue Mar 12, 2019 7:29 pm Post subject: |
|
|
krinn wrote: | the fs only reports trouble in the part of the disk used by the fs, so btrfs reporting only 1 error means btrfs has only seen one error, while the disk may be damaged somewhere not used by that fs. | I have all my disks in the same btrfs "pool". No partition table; btrfs is on the whole disk.
I also run regular btrfs scrubs to find any errors.
Cables and/or connectors might indeed play a part here. I need to check them the next time I clean the dust out of my server. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
krinn Watchman
Joined: 02 May 2003 Posts: 7470
Posted: Tue Mar 12, 2019 7:42 pm Post subject: |
|
|
Not all of the disk is used by a fs, even when you use the whole disk.
Seems you never had a disk with a dead sector 0. |
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
Posted: Tue Mar 12, 2019 9:50 pm Post subject: |
|
|
A small CRC error count is probably better than a count of zero: it means your drive's error detection and correction mechanisms are working.
Anything can cause those; a cosmic ray could've passed through the cable. SATA is *slightly* more reliable than old IDE in that regard due to the smaller surface area. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Wed Mar 13, 2019 12:03 am Post subject: |
|
|
krinn wrote: | Seems, you never had a sector 0 dead disk | Nope. I don't think I've had such luxury.
If that ever happens I'm still quite safe. Redundancy + backups.
Ant P. wrote: | A small CRC error count is probably better than a zero one. It means your drive's error detection and correction mechanisms are working. | Good point. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
Posted: Wed Mar 13, 2019 6:42 pm Post subject: |
|
|
Zucca,
Sector 0 used to be special. In the days before drives could hide bad sectors, a failed sector 0 meant the drive was scrap.
A failed and not relocated sector 0 still means the drive is scrap. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Wed Mar 13, 2019 10:46 pm Post subject: |
|
|
NeddySeagoon wrote: | A failed and not relocated sector 0 still means the drive is scrap. | Is that still valid today?
Anyways. I've been playing around with skdump from libatasmart...
(Btw provided to us by Mr. Poettering. ;)) Code: | # for d in /dev/sd{a,b,c,d,e}; do echo "${d##*/}: $(($(skdump --power-on "$d")/1000/60/60/24)) days - $(skdump --bad "$d") bad sectors"; done
sda: 1873 days - 0 bad sectors
sdb: 1823 days - 0 bad sectors
sdc: 2051 days - 0 bad sectors
sdd: 338 days - 0 bad sectors
sde: 331 days - 0 bad sectors |
So sda is the one with 7 raw read errors and sdc with one UDMA CRC error.
The bad thing is... sda and sdb are the same model, bought at the same time (judging by their age).
So... I think I'll replace sdc. Then buy two different drives to replace sda and/or sdb. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |