Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Thu Mar 07, 2019 11:27 am Post subject: Hi. It's me again, with possible disk failures. |
|
|
Two of my spinning platters give me worrying signals: Code: | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7 |
and Code: | ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1 |
I have one unused 3TB Toshiba ready here as a replacement.
I'm about to order another soon.
Anyway, I think I should replace the first one now.
Does anyone think the latter is anything to worry about at the moment?
I ran long SMART tests yesterday and none of the critical values changed. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
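For reference, the two attributes in question can be pulled out of a `smartctl -A` attribute table mechanically. A minimal sketch, using the rows from this post as sample input (on a live system you would pipe `smartctl -A /dev/sdX` into awk instead of the embedded sample):

```shell
# Sample rows copied from the post above; on a live system, replace the
# printf with: smartctl -A /dev/sdX
sample='  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1'

# Column 10 of a smartctl attribute row is RAW_VALUE.
printf '%s\n' "$sample" | awk '
    $2 == "Raw_Read_Error_Rate" || $2 == "UDMA_CRC_Error_Count" {
        printf "%s raw=%s\n", $2, $10
    }'
# prints:
# Raw_Read_Error_Rate raw=7
# UDMA_CRC_Error_Count raw=1
```

Logging these raw values periodically and watching whether they climb between runs is more informative than any single reading.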
Verdazil n00b
Joined: 14 Feb 2019 Posts: 47 Location: One small country ...
Posted: Thu Mar 07, 2019 1:13 pm Post subject: |
|
|
A change in a Pre-fail SMART attribute usually indicates a physical problem with the disk, and the disk should be replaced as soon as possible.
A change in an Old_age SMART attribute is not critical. _________________ GA-Z170X-UD3 / i7-6700K / DDR4 32GB / Radeon RX 570 / TL-WDN4800 / Samsung SSD 850 EVO 250 Gb + WD Green WDC 2 Tb / BenQ BL2711U + LG TV 42LF650V |
mike155 Advocate
Joined: 17 Sep 2010 Posts: 4438 Location: Frankfurt, Germany
Posted: Thu Mar 07, 2019 2:03 pm Post subject: |
|
|
The value of Raw_Read_Error_Rate is 7? How cute!
Below is the output from my Seagate Constellation ES disks:
Code: | smartctl -a /dev/sda | egrep "(Raw_Read_Error_Rate|Hardware_ECC_Recovered)"
1 Raw_Read_Error_Rate 0x000f 082 063 044 Pre-fail Always - 195672011
195 Hardware_ECC_Recovered 0x001a 032 014 000 Old_age Always - 195672011
smartctl -a /dev/sdb | egrep "(Raw_Read_Error_Rate|Hardware_ECC_Recovered)"
1 Raw_Read_Error_Rate 0x000f 081 063 044 Pre-fail Always - 140065734
195 Hardware_ECC_Recovered 0x001a 032 014 000 Old_age Always - 140065734
|
The disks work perfectly fine. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Thu Mar 07, 2019 3:00 pm Post subject: |
|
|
Seagate must have a different method of reporting those values.
Come to think of it... I wonder if my single Seagate drive was actually fine after all... Although I think I did see read errors with that drive back then. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Sun Mar 10, 2019 12:45 am Post subject: |
|
|
*sigh*
https://superuser.com/questions/151288/why-do-different-manufacturers-have-different-s-m-a-r-t-value#153326
So yes. Seagate may report very high values for read error rates.
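The linked answer describes Seagate raw values as vendor-specific packed bit fields. One commonly repeated (but unofficial) interpretation of Raw_Read_Error_Rate is a 48-bit value with the error count in the upper 16 bits and the total operation count in the lower 32 bits. A quick sketch under that assumed layout, using mike155's number from above:

```shell
# Assumed (unofficial) Seagate layout: errors in bits 32-47, total
# operations in bits 0-31. This split is a common interpretation from
# forum/Q&A discussions, not vendor documentation.
raw=195672011   # mike155's Raw_Read_Error_Rate raw value from above

errors=$(( raw >> 32 ))
operations=$(( raw & 0xFFFFFFFF ))

# For this value the scary-looking number decodes to zero actual errors.
echo "errors=$errors operations=$operations"
# prints: errors=0 operations=195672011
```

If that interpretation is right, mike155's disks have logged ~195 million reads and no errors at all.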
I think I trust btrfs diagnostics a little more now. If the filesystem reports a read error, then it clearly hasn't got the data from the disk, thus indicating an error. Especially if the number of errors rises. So at the moment: Code: | # btrfs dev stats /dev/sda
[/dev/sda].write_io_errs 0
[/dev/sda].read_io_errs 1
[/dev/sda].flush_io_errs 0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0 | ... is indicating a possible pre-failure. I need to keep an eye on that drive. Meanwhile I think I'll buy another spare...
But not all of the SMART data is useless. At least the head parking count should be something to count on. I have one drive with over 200k parks, so I have moved that drive to the pile of disks from which I pick drives to put in my hard drive dock to back up files "off site". _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
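Those per-device counters are easy to watch mechanically. A small sketch that flags any nonzero counter, using the `btrfs dev stats` output above as sample input (on a live system you would pipe the command itself instead):

```shell
# Sample output copied from the post above; on a live system, replace the
# printf with: btrfs dev stats /dev/sdX
stats='[/dev/sda].write_io_errs 0
[/dev/sda].read_io_errs 1
[/dev/sda].flush_io_errs 0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0'

# Print a warning line for every counter that is not zero.
printf '%s\n' "$stats" | awk '$2 != 0 { print "WARN:", $1, "=", $2 }'
# prints: WARN: [/dev/sda].read_io_errs = 1
```

A cron job built around this (plus regular scrubs) gives an early warning the moment any counter moves.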
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
Posted: Sun Mar 10, 2019 1:29 am Post subject: |
|
|
Zucca,
Be very wary of big values in the RAW field. They may be vendor specific packed bit fields.
Post the smartctl -a output. Run a long test, then post it again.
The long test does a full surface scan. If it fails, or if the smartctl output changes between before and after the long test, those changes need to be understood. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
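A sketch of that before/after comparison. The live smartctl commands are left as comments, and two made-up snapshot lines (hypothetical values, not from this thread) stand in for real output so the mechanics are visible:

```shell
# The real workflow (run as root, and wait for the self-test to finish):
#   smartctl -A /dev/sda > before.txt
#   smartctl -t long /dev/sda
#   smartctl -A /dev/sda > after.txt
# Two made-up snapshot lines stand in for real output here; the raw
# value changing from 7 to 9 is a hypothetical example.
printf '%s\n' '  1 Raw_Read_Error_Rate 200 200 051 7' > before.txt
printf '%s\n' '  1 Raw_Read_Error_Rate 200 200 051 9' > after.txt

# Any line diff prints is a change that needs to be understood.
diff before.txt after.txt || true
rm -f before.txt after.txt
```

An empty diff after a passed long test is the reassuring case; any changed Pre-fail attribute is the one to investigate.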
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Sun Mar 10, 2019 12:07 pm Post subject: |
|
|
I did the long test before I started this topic. Since then the "critical" values have not changed. Also, all the tests have passed successfully. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
krinn Watchman
Joined: 02 May 2003 Posts: 7470
Posted: Sun Mar 10, 2019 2:29 pm Post subject: |
|
|
Don't forget the tests' limitations too.
On RAID1 it's possible to pass a filesystem test reading sector #n while one of the disks has that sector damaged, because the other disk answers the query, making the fs look all OK while the disk is not.
The fs only reports trouble in the part of the disk used by the fs, so btrfs reporting only 1 error means btrfs has only seen one error, while the disk may be damaged somewhere not used by that fs.
The same goes for the SMART test (the software test done by smartctl): for me it only tested the array, not the disks themselves. I once had a disk failure reported by my RAID card while smartctl was reporting that all was fine.
So for me, since you're not using hardware RAID, I would trust the smartctl results more, and assume the disk really has failed 7 times.
Up to you to judge whether 7 read failures are critical or not. (Think also: a dead sector read 7 times would report 7 errors, while only one sector is bad.) |
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
Posted: Mon Mar 11, 2019 4:56 am Post subject: |
|
|
Also check your cables. Sometimes the connectors seem to soften (for lack of a better word) over time and the connection isn't as solid as it used to be. Had that happen here recently, looked like a failing disk which "fixed itself" after it was pulled and placed in a different machine for analysis. Turned out the cable wasn't contacting well. _________________ Many think that Dilbert is a comic. Unfortunately it is a documentary. |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
Posted: Mon Mar 11, 2019 1:35 pm Post subject: |
|
|
What Akkara said but don't forget the power cables.
I've had several SATA power connectors go high resistance and get hot to the point where they char and give off smoke.
A high resistance SATA power connector will play havoc with the dynamic voltage regulation on the drive.
They had probably been failing for a long time before I noticed; detection happened when one eventually went short circuit and took my server down.
Localisation with Mk1 nose was trivial :) _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Tue Mar 12, 2019 7:29 pm Post subject: |
|
|
krinn wrote: | the fs only reports trouble in the part of the disk used by the fs, so btrfs reporting only 1 error means btrfs has only seen one error, while the disk may be damaged somewhere not used by that fs. | I have all my disks in the same btrfs "pool". No partition table; btrfs is on the whole disk.
I also run regular btrfs scrubs to find any errors.
Cables and/or connectors might indeed play a part here. I need to check them the next time I clean the dust out of my server. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
krinn Watchman
Joined: 02 May 2003 Posts: 7470
Posted: Tue Mar 12, 2019 7:42 pm Post subject: |
|
|
Not all of the disk is used by a fs, even when you use the whole disk.
Seems you never had a disk with a dead sector 0. |
Ant P. Watchman
Joined: 18 Apr 2009 Posts: 6920
Posted: Tue Mar 12, 2019 9:50 pm Post subject: |
|
|
A small CRC error count is probably better than a count of zero: it means your drive's error detection and correction mechanisms are working.
Anything can cause those; a cosmic ray could've passed through the cable. SATA is *slightly* more reliable than old IDE in that regard due to the smaller surface area. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Wed Mar 13, 2019 12:03 am Post subject: |
|
|
krinn wrote: | Seems, you never had a sector 0 dead disk | Nope. I don't think I've had such luxury.
If that ever happens I'm still quite safe. Redundancy + backups.
Ant P. wrote: | A small CRC error count is probably better than a zero one. It means your drive's error detection and correction mechanisms are working. | Good point. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54236 Location: 56N 3W
Posted: Wed Mar 13, 2019 6:42 pm Post subject: |
|
|
Zucca,
Sector 0 used to be special. In the days before drives could hide bad sectors, a failed sector 0 meant the drive was scrap.
A failed and not relocated sector 0 still means the drive is scrap. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3343 Location: Rasi, Finland
Posted: Wed Mar 13, 2019 10:46 pm Post subject: |
|
|
NeddySeagoon wrote: | A failed and not relocated sector 0 still means the drive is scrap. | Is that still valid today?
Anyways. I've been playing around with skdump from libatasmart...
(Btw provided to us by Mr. Poettering. ;)) Code: | # for d in /dev/sd{a,b,c,d,e}; do echo "${d##*/}: $(($(skdump --power-on "$d")/1000/60/60/24)) days - $(skdump --bad "$d") bad sectors"; done
sda: 1873 days - 0 bad sectors
sdb: 1823 days - 0 bad sectors
sdc: 2051 days - 0 bad sectors
sdd: 338 days - 0 bad sectors
sde: 331 days - 0 bad sectors |
So sda is the one with 7 raw read errors and sdc with one UDMA CRC error.
The bad thing is... sda and sdb are the same model, bought at the same time (judging by their age).
So... I think I'll replace sdc. Then buy two different drives to replace sda and/or sdb. _________________ ..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote: | I am NaN! I am a man! |