Gentoo Forums
Hi. It's me again, with possible disk failures.
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Thu Mar 07, 2019 11:27 am    Post subject: Hi. It's me again, with possible disk failures.

Two of my spinning platters give me worrying signals:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       7
and
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1


I have one unused 3TB Toshiba ready here as a replacement.
I'm about to order another soon.

Anyway. I think I should change the first one now.

Does anyone think the latter is anything to worry about at the moment?
I ran long SMART tests yesterday and none of the critical values changed.
_________________
..: Zucca :..
Gentoo IRC channels reside on Libera.Chat.
--
Quote:
I am NaN! I am a man!
Verdazil
n00b

Joined: 14 Feb 2019
Posts: 47
Location: One small country ...

Posted: Thu Mar 07, 2019 1:13 pm

A change in a Pre-fail SMART attribute usually indicates a physical problem with the disk and suggests replacing the disk as soon as possible.
A change in an Old_age SMART attribute is not critical.
_________________
GA-Z170X-UD3 / i7-6700K / DDR4 32GB / Radeon RX 570 / TL-WDN4800 / Samsung SSD 850 EVO 250 Gb + WD Green WDC 2 Tb / BenQ BL2711U + LG TV 42LF650V
mike155
Advocate

Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Thu Mar 07, 2019 2:03 pm

The value of Raw_Read_Error_Rate is 7? How cute!

Below is the output of my Seagate Constellation ES disks:
Code:
smartctl -a /dev/sda | egrep "(Raw_Read_Error_Rate|Hardware_ECC_Recovered)"
  1 Raw_Read_Error_Rate     0x000f  082  063  044  Pre-fail  Always  -  195672011
195 Hardware_ECC_Recovered  0x001a  032  014  000  Old_age   Always  -  195672011

smartctl -a /dev/sdb | egrep "(Raw_Read_Error_Rate|Hardware_ECC_Recovered)"
  1 Raw_Read_Error_Rate     0x000f  081  063  044  Pre-fail  Always  -  140065734
195 Hardware_ECC_Recovered  0x001a  032  014  000  Old_age   Always  -  140065734

The disks work perfectly fine.
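A hedged aside on reading tables like these: smartctl treats an attribute as failing when its normalized VALUE drops to or below THRESH, so a huge RAW_VALUE is not a failure signal by itself. The sketch below filters `smartctl -A` output down to attributes at or below their threshold. The function name smart_below_threshold is made up for illustration, and it assumes the ten-column ATA attribute table shown in this thread.

```shell
# Hypothetical helper (not part of smartmontools): read `smartctl -A` output
# on stdin and print only attributes whose normalized VALUE is <= THRESH.
# Assumes the ten-column ATA table: ID# NAME FLAG VALUE WORST THRESH ...
smart_below_threshold() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) {
        print $1, $2, "value=" $4, "thresh=" $6
    }'
}

# Usage (needs root and smartmontools; /dev/sda is just an example):
#   smartctl -A /dev/sda | smart_below_threshold
```

Fed the two Seagate lines above (VALUE 082 and 032 against THRESH 044 and 000), it prints nothing, which fits with the disks working fine.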
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Thu Mar 07, 2019 3:00 pm

Seagate must have a different method of reporting those values.

Come to think of it... I wonder if my single Seagate drive was actually fine after all... Although I think I had read errors with the drive back then.
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Sun Mar 10, 2019 12:45 am

*sigh*

https://superuser.com/questions/151288/why-do-different-manufacturers-have-different-s-m-a-r-t-value#153326

So yes. Seagate may report very high values of read error rates.
I think I trust btrfs diagnostics a little more now. If the filesystem reports a read error, then it clearly hasn't got the data from a disk, which indicates an error, especially if the number of errors rises. So at the moment:
Code:
# btrfs dev stats /dev/sda
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     1
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  0
[/dev/sda].generation_errs  0
... is indicating possible pre-failure. I need to keep my eye on that drive. Meanwhile I think I'll buy another spare...
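Watching whether those counters rise can be scripted. A minimal sketch, assuming the two-column output format shown above; btrfs_nonzero_stats is a made-up name, not a btrfs-progs command:

```shell
# Hypothetical helper: read `btrfs device stats` output on stdin and print
# only the counters that are not zero.
btrfs_nonzero_stats() {
    awk 'NF == 2 && ($2 + 0) > 0 { print $1, $2 }'
}

# Usage (needs root; /dev/sda is just an example):
#   btrfs device stats /dev/sda | btrfs_nonzero_stats
```

Saving that output from time to time and comparing it with the previous run shows whether the error count is growing.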

But not all the SMART data is useless. At least the head parking count should be something to count on. I have one drive with over 200k parks, so I have put that drive on the pile of disks that I pick from for my hard drive dock to back up files "off site".
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54236
Location: 56N 3W

Posted: Sun Mar 10, 2019 1:29 am

Zucca,

Be very wary of big values in the RAW field. They may be vendor specific packed bit fields.

Post the smartctl -a output. Run a long test, then post it again.
The long test does a full surface scan. If it fails, or there are changes in the smartctl output between before and after a long test, the changes need to be understood.
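The before/after comparison described above can be sketched like this. It is only an illustration: smart_raw_values is a made-up helper name, the file names are arbitrary, and it assumes the ATA attribute table layout seen earlier in the thread.

```shell
# Hypothetical helper: reduce `smartctl -A` output (on stdin) to
# "ID NAME RAW_VALUE" lines so that two snapshots are easy to diff.
smart_raw_values() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        raw = $10
        for (i = 11; i <= NF; i++) raw = raw " " $i   # raw values may contain spaces
        print $1, $2, raw
    }'
}

# Usage (needs root and smartmontools; /dev/sda is just an example):
#   smartctl -A /dev/sda | smart_raw_values > before.txt
#   smartctl -t long /dev/sda        # starts the long self-test in the background
#   ... wait until `smartctl -l selftest /dev/sda` shows it has finished ...
#   smartctl -A /dev/sda | smart_raw_values > after.txt
#   diff before.txt after.txt        # no output means no raw value changed
```

Any line that diff does print is one of the changes that needs to be understood.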
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Sun Mar 10, 2019 12:07 pm

I did the long test before I started this topic. Since then the "critical" values have not changed. Also, all the tests have passed successfully.
krinn
Watchman

Joined: 02 May 2003
Posts: 7470

Posted: Sun Mar 10, 2019 2:29 pm

Don't forget the tests' limitations too.
It's possible to pass a test on a fs reading sector # on RAID1 while one of the disks has that sector damaged and the other disk answers the query, making the fs look all OK while the disk is not.

The fs only reports trouble for the part of the disk used by the fs, so btrfs reporting only 1 error means btrfs has only seen one error, while the disk may be damaged somewhere not used by that fs.

Same for the SMART test (the software test done by smartctl): for me it only tested the array, and not the disks themselves. I had a disk failure reported by my card while smartctl was reporting all was fine.

So for me, if you're not using hardware RAID, I would rather trust the smartctl results, and assume the disk really has failed 7 times.
Up to you to judge if 7 read failures are critical or not. (Think also: a dead sector read 7 times would report 7 errors, while only one sector is bad.)
Akkara
Bodhisattva

Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

Posted: Mon Mar 11, 2019 4:56 am

Also check your cables. Sometimes the connectors seem to soften (for lack of a better word) over time and the connection isn't as solid as it used to be. Had that happen here recently, looked like a failing disk which "fixed itself" after it was pulled and placed in a different machine for analysis. Turned out the cable wasn't contacting well.
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54236
Location: 56N 3W

Posted: Mon Mar 11, 2019 1:35 pm

What Akkara said but don't forget the power cables.

I've had several SATA power connectors go high resistance and get hot to the point where they char and give off smoke.

A high resistance SATA power connector will play havoc with the dynamic voltage regulation on the drive.
They probably failed a long time before I noticed as the detection was by eventually going short circuit and taking my server down.

Localisation with Mk1 nose was trivial :)
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Tue Mar 12, 2019 7:29 pm

krinn wrote:
The fs only reports trouble for the part of the disk used by the fs, so btrfs reporting only 1 error means btrfs has only seen one error, while the disk may be damaged somewhere not used by that fs.
I have all my disks in the same btrfs "pool". No partition table. btrfs is on the whole disk.
I also run regular btrfs scrubs to find any errors.

Cables and/or connectors, indeed, might play a part here. I need to check them next time I clean the dust out of my server.
krinn
Watchman

Joined: 02 May 2003
Posts: 7470

Posted: Tue Mar 12, 2019 7:42 pm

Not all of the disk is used by a fs, even when you use the whole disk.

Seems you never had a disk with a dead sector 0 :)
Ant P.
Watchman

Joined: 18 Apr 2009
Posts: 6920

Posted: Tue Mar 12, 2019 9:50 pm

A small CRC error count is probably better than a zero one. It means your drive's error detection and correction mechanisms are working.

Anything can cause those; a cosmic ray could've passed through the cable. SATA is *slightly* more reliable than old IDE in that regard due to the smaller surface area.
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Wed Mar 13, 2019 12:03 am

krinn wrote:
Seems you never had a disk with a dead sector 0 :)
Nope. I don't think I've had such luxury. :P
If that ever happens I'm still quite safe. ;) Redundancy + backups.

Ant P. wrote:
A small CRC error count is probably better than a zero one. It means your drive's error detection and correction mechanisms are working.
Good point.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54236
Location: 56N 3W

Posted: Wed Mar 13, 2019 6:42 pm

Zucca,

Sector 0 used to be special. In the days before drives could hide bad sectors, a failed sector 0 meant the drive was scrap.
A failed and not relocated sector 0 still means the drive is scrap.
Zucca
Moderator

Joined: 14 Jun 2007
Posts: 3343
Location: Rasi, Finland

Posted: Wed Mar 13, 2019 10:46 pm

NeddySeagoon wrote:
A failed and not relocated sector 0 still means the drive is scrap.
... still valid today?

Anyways. I've been playing around with skdump from libatasmart...
(Btw provided to us by Mr. Poettering. ;))
Code:
# for d in /dev/sd{a,b,c,d,e}; do echo "${d##*/}: $(($(skdump --power-on "$d")/1000/60/60/24)) days - $(skdump --bad "$d") bad sectors"; done

sda: 1873 days - 0 bad sectors
sdb: 1823 days - 0 bad sectors
sdc: 2051 days - 0 bad sectors
sdd: 338 days - 0 bad sectors
sde: 331 days - 0 bad sectors

So sda is the one with 7 raw read errors and sdc with one UDMA CRC error.
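For anyone puzzling over the chained divisions in the one-liner: it treats the skdump --power-on figure as milliseconds and turns it into whole days. A standalone sketch of just that arithmetic, where ms_to_days is a made-up name and the input number is back-computed from the 1873 days shown for sda, not a real reading:

```shell
# Hypothetical helper: convert milliseconds to whole days (integer division),
# using the same /1000/60/60/24 chain as the loop above.
ms_to_days() {
    echo $(( $1 / 1000 / 60 / 60 / 24 ))
}

ms_to_days 161827200000    # input is 1873 days * 86400 s * 1000 ms
```

Integer division rounds down, so a drive only counts a day once the full 24 hours have passed.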

Bad thing is... sda and sdb are the same model. Bought at the same time (based on the age).
So... I think I'll replace sdc. Then buy two different drives to replace sda and/or sdb.
All times are GMT