doublehp Guru
Joined: 11 Apr 2005 Posts: 473 Location: FRANCE
|
Posted: Mon Jan 26, 2015 8:06 pm Post subject: Strange hard disk behaviour |
|
|
I have 5 identical disks in RAID6, and several minor issues. The system is sometimes slow, and sometimes has short freezes (from 2 to 45 s). All disks were bought the same day, hence the near-consecutive serial numbers. Still, there are odd details when you compare the numbers:
Code: | for i in a b c d e ; do smartctl -a /dev/sd$i ; done >/tmp/logsmart
# for i in Model: Serial Firmware Capacity: Standard Local Raw_Read_Error_Rate Start_Stop_Count Reallocated_Sector_Ct Seek_Error_Rate Power_On_Hours Spin_Retry_Count Power_Cycle_Count End-to-End_Error Reported_Uncorrect Command_Timeout High_Fly_Writes Airflow_Temperature_Cel G-Sense_Error_Rate Power-Off_Retract_Count Load_Cycle_Count Temperature_Celsius Current_Pending_Sector Offline_Uncorrectable UDMA_CRC_Error_Count ; do cat /tmp/logsmart | grep $i ; done
Device Model: ST3000VX000-1CU166
Device Model: ST3000VX000-1CU166
Device Model: ST3000VX000-1CU166
Device Model: ST3000VX000-1CU166
Device Model: ST3000VX000-1CU166
Serial Number: W1F4VK**
Serial Number: W1F4YC**
Serial Number: W1F4YC**
Serial Number: W1F4YJ**
Serial Number: W1F500**
Firmware Version: CV23
Firmware Version: CV23
Firmware Version: CV23
Firmware Version: CV23
Firmware Version: CV23
User Capacity: 3,000,592,982,016 bytes
User Capacity: 3,000,592,982,016 bytes
User Capacity: 3,000,592,982,016 bytes
User Capacity: 3,000,592,982,016 bytes
User Capacity: 3,000,592,982,016 bytes
ATA Standard is: ATA-8-ACS revision 4
ATA Standard is: ATA-8-ACS revision 4
ATA Standard is: ATA-8-ACS revision 4
ATA Standard is: ATA-8-ACS revision 4
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Jan 26 20:51:17 2015 CET
Local Time is: Mon Jan 26 20:51:17 2015 CET
Local Time is: Mon Jan 26 20:51:17 2015 CET
Local Time is: Mon Jan 26 20:51:17 2015 CET
Local Time is: Mon Jan 26 20:51:17 2015 CET
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 216042608
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 124000288
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 227983400
1 Raw_Read_Error_Rate 0x000f 114 099 006 Pre-fail Always - 65078888
1 Raw_Read_Error_Rate 0x000f 105 099 006 Pre-fail Always - 7971288
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 209
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 157
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 157
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 339
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 162
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 161651019
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 159918983
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 161245470
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 162518232
7 Seek_Error_Rate 0x000f 081 060 030 Pre-fail Always - 162816713
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3981
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3931
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3931
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3887
9 Power_On_Hours 0x0032 046 046 000 Old_age Always - 47821
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 092 092 020 Old_age Always - 8411
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 4295032833
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 1207
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 995
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 983
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 467
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 1127
190 Airflow_Temperature_Cel 0x0022 068 059 045 Old_age Always - 32 (Lifetime Min/Max 12/33)
190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age Always - 31 (Lifetime Min/Max 12/33)
190 Airflow_Temperature_Cel 0x0022 068 057 045 Old_age Always - 32 (Lifetime Min/Max 12/34)
190 Airflow_Temperature_Cel 0x0022 069 059 045 Old_age Always - 31 (Lifetime Min/Max 11/32)
190 Airflow_Temperature_Cel 0x0022 066 051 045 Old_age Always - 34 (Lifetime Min/Max 12/36)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 72
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 202
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 209
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 157
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 157
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 339
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 162
194 Temperature_Celsius 0x0022 032 041 000 Old_age Always - 32 (0 12 0 0)
194 Temperature_Celsius 0x0022 031 041 000 Old_age Always - 31 (0 12 0 0)
194 Temperature_Celsius 0x0022 032 043 000 Old_age Always - 32 (0 12 0 0)
194 Temperature_Celsius 0x0022 031 041 000 Old_age Always - 31 (0 11 0 0)
194 Temperature_Celsius 0x0022 034 049 000 Old_age Always - 34 (0 12 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
|
There are values like Raw_Read_Error_Rate whose meaning I have never understood, so I can't comment on them at all. But for some values, it seems very strange to me that the disks have different records:
- Start_Stop_Count: how the hell could the disks differ by more than 3 units?
- Power_On_Hours: sde is 10 times older than the other disks?
- Power_Cycle_Count: sde shows 8411 where the others all show 157?
Firmware is the same; ATA is the same.
It is usually sde that behaves strangely, but for Command_Timeout it is sdb.
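For what it's worth, one possible reading of sdb's Command_Timeout raw value (4295032833) is that it is not a single count. The sketch below assumes, without confirmation from anything in this thread, that the drive packs three 16-bit counters into the 48-bit raw field, and simply splits the number that way. The function name `split_raw16` is made up for illustration.

```shell
#!/usr/bin/env bash
# Split a 48-bit SMART raw value into three 16-bit words: high, middle, low.
# ASSUMPTION: the raw field holds three packed counters. This is a guess
# about the drive's firmware, not something the smartctl output proves.
split_raw16() {
    local raw="$1"
    printf '%d %d %d\n' \
        $(( (raw >> 32) & 0xFFFF )) \
        $(( (raw >> 16) & 0xFFFF )) \
        $((  raw        & 0xFFFF ))
}

split_raw16 4295032833   # sdb's Command_Timeout raw value
split_raw16 1            # sda's Command_Timeout raw value
```

Under that assumption sdb would have three small counters rather than four billion timeouts. smartctl's `-v 188,raw16` option should display the attribute split the same way.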
I have not set any kind of energy saving or retract timeout. Anyway, even if I had, the RAID should keep all disks busy in the same way, and any request on one disk should imply a similar request on the others.
Do I have bad hardware? A bad SATA cable? Is sde going to die sooner than the other disks? Why do sdb and sde have different issues?
When I bought them, they had similar, low values. _________________ DEMAINE Benoît-Pierre (aka DoubleHP ) http://www.demaine.info/
>o_/ Coin coin coin \_o<
to contact me (MSN,ICQ, JABBER, Skype ... ) http://benoit.demaine.info/contact.png |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54237 Location: 56N 3W
|
Posted: Mon Jan 26, 2015 8:41 pm Post subject: |
|
|
doublehp,
VALUE WORST and THRESH are normalised values. If VALUE or WORST is less than or equal to THRESH, the parameter has failed.
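The rule above can be checked mechanically. This is a minimal sketch, assuming the usual `smartctl -A` column order (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, ...); the function name `failed_attrs` is invented for the example.

```shell
#!/usr/bin/env bash
# Print the name of every attribute whose normalised VALUE or WORST
# is less than or equal to THRESH.
# ASSUMPTION: input lines follow the standard smartctl attribute layout,
# i.e. field 1 is the numeric ID and field 3 is the 0x.... flag.
failed_attrs() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (value <= thresh || worst <= thresh) print $2
        }
    ' "$@"
}
```

Called as `failed_attrs /tmp/logsmart`, it reads the log gathered in the first post; with no argument it reads standard input. No output means no attribute has crossed its threshold.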
Your SMART log looks OK.
Run Code: | echo repair > /sys/block/md2/md/sync_action | Change the md2 to suit your raid.
Once that completes, post your logs again so we can see before and after.
The command is much like a sync, except that the entire raid will be read and writes will only be performed if a read fails, or the raid set does not agree on the parity, in which case the odd drive will be rewritten.
This process may cause sectors to be reallocated.
Be aware that sector reallocation can be a long process. In some instances, mdadm may kick a drive from the raid set while it is in progress.
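To keep an eye on the repair while it runs, something like the following may help. It is only a sketch: `md_status` is a made-up helper, and the default path uses md2 purely because that is the array name in the command above.

```shell
#!/usr/bin/env bash
# Show what the md array is currently doing and how many mismatches
# the last check/repair pass counted.
# ASSUMPTION: the array's sysfs directory is passed as $1; the default
# (md2) is only the example name used earlier in this post.
md_status() {
    local md_dir="${1:-/sys/block/md2/md}"
    printf 'sync_action:  %s\n' "$(cat "$md_dir/sync_action")"
    printf 'mismatch_cnt: %s\n' "$(cat "$md_dir/mismatch_cnt")"
}
```

Run `md_status` (or `md_status /sys/block/md0/md`) now and then; `sync_action` goes back to `idle` when the pass has finished, and `/proc/mdstat` shows the progress in between.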
sde seems to be older than the rest - was it used when you got it? _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
doublehp Guru
Joined: 11 Apr 2005 Posts: 473 Location: FRANCE
|
Posted: Mon Jan 26, 2015 10:02 pm Post subject: |
|
|
As you say ... for Power_On_Hours and Power_Cycle_Count, the number in the VALUE column is really different for sde than for the other disks:
Code: |
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3981
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3931
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3931
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3887
9 Power_On_Hours 0x0032 046 046 000 Old_age Always - 47821
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 157
12 Power_Cycle_Count 0x0032 092 092 020 Old_age Always - 8411 |
I have run a monthly repair since I bought them (May 2014).
They were compared at first install and have always been used the same way, even during install. The power cycle counts should not differ by more than 2 units, and the runtimes should match within 30 min (OK, maybe 2 h at most). I am already surprised by the disparity among sda, sdb, sdc and sdd: almost 100 h difference looks huge to me. 100 h out of roughly 3900 h would make sense only if the time is counted by an internal clock; then clock drift could explain a 2.5% variation in the measurement. But 2.5% is already HUGE for a clocking error. The topic of the day, though, is sde: a 10:1 ratio!
As for the power cycle count, sda, sdb, sdc and sdd are exactly equal. Does that mean sde has an issue with its power cable?
If the disk has supply issues and counts started hours (as opposed to completed hours), then each power cycle would add one to Power_On_Hours, so I should have 3900 + 8400 = 12300 ... and I have more than 47000 hours. So that is not the right explanation.
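The two back-of-the-envelope checks above can be redone with the exact figures from today's log. This is just a sketch of the arithmetic; `spread_pct` is a name invented for the example.

```shell
#!/usr/bin/env bash
# Spread between the largest and smallest Power_On_Hours among sda..sdd,
# in hours and as a percentage of the largest value.
spread_pct() {
    awk -v max="$1" -v min="$2" \
        'BEGIN { printf "%d h (%.1f%%)\n", max - min, (max - min) * 100 / max }'
}

# Figures copied from the smartctl output quoted in this thread.
spread_pct 3981 3887

# "One extra hour per power cycle" hypothesis for sde (rounded inputs).
echo "hypothesis: $(( 3900 + 8400 )) h, observed on sde: 47821 h"
```

The spread among the four normal disks comes out a little under the 2.5% estimated above, and the hypothesis falls far short of what sde reports.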
Ah ... I knew I had the proof, let me find the old logs ... I also keep a weekly smartctl dump for every single disk.
ST3000VX000-1CU166_W1F50***__3.000.592.982.016_bytes__3.000.G__sdf__2014-05-27_20-36-47.smart
Code: | SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 110 099 006 Pre-fail Always - 25639096
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 253 030 Pre-fail Always - 92811
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 139
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 3
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 362
190 Airflow_Temperature_Cel 0x0022 059 051 045 Old_age Always - 41 (Lifetime Min/Max 41/41)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
194 Temperature_Celsius 0x0022 041 049 000 Old_age Always - 41 (0 24 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
|
ST3000VX000-1CU166_W1F50***__3.000.592.982.016_bytes__3.000.G__sde__2014-05-27_23-39-59.smart
Code: | ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 110 099 006 Pre-fail Always - 26307088
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 253 030 Pre-fail Always - 93019
9 Power_On_Hours 0x0032 050 050 000 Old_age Always - 43983
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 092 092 020 Old_age Always - 8256
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 362
190 Airflow_Temperature_Cel 0x0022 064 051 045 Old_age Always - 36 (Lifetime Min/Max 35/41)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 7
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 036 049 000 Old_age Always - 36 (0 24 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
|
What the hell could have happened in less than 4 h to cause this?
Code: | 0 0 2015-01-26_22-48-27 22:48:27 @pts/3 root@uranus:/var/log/smartctl
522# cat ST3000VX000-1CU166_W1F50***__3.000.592.982.016_bytes__3.000.G__sdf__2014-05-27_20-36-47.smart | grep Power_Cycle_Count
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 3
0 0 2015-01-26_22-49-01 22:49:01 @pts/3 root@uranus:/var/log/smartctl
523# cat ST3000VX000-1CU166_W1F50***__3.000.592.982.016_bytes__3.000.G__sde__2014-05-27_23-11-08.smart | grep Power_Cycle_Count
12 Power_Cycle_Count 0x0032 092 092 020 Old_age Always - 8256
0 0 2015-01-26_22-49-24 22:49:24 @pts/3 root@uranus:/var/log/smartctl
|
I am reading my bash history for that day, and I can't find what I did wrong.
But at least it explains today's strange values. The difference that appeared over those 4 h is the same as the delta between disks today: 43844 (+/- 1) and 8253 (+/- 100).
So, whatever happened that day, it is probably unrelated to the bug I have been trying to track down today.
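The comparison can be reproduced from the numbers already quoted. A small sketch, where `delta` is a throwaway helper and today's figures compare sde with sda:

```shell
#!/usr/bin/env bash
# Subtract two counters; all inputs below are copied from the logs above.
delta() { echo $(( $1 - $2 )); }

# Jump between the two snapshots taken on 2014-05-27 (same serial number).
echo "Power_On_Hours    jump on 2014-05-27: $(delta 43983 139)"
echo "Power_Cycle_Count jump on 2014-05-27: $(delta 8256 3)"

# Gap between sde and sda in today's log (2015-01-26).
echo "Power_On_Hours    sde - sda today:    $(delta 47821 3981)"
echo "Power_Cycle_Count sde - sda today:    $(delta 8411 157)"
```

This prints 43844 and 8253 for the jump, and 43840 and 8254 for today's gap, so the offset has stayed essentially constant since that day.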
EDIT: since the device changed name (going by serial number), there was some hardware manipulation around that time. I have also found more than 3 reboots in that time span. Maybe some cable was badly connected.
Last edited by doublehp on Mon Jan 26, 2015 10:18 pm; edited 2 times in total |
|
doublehp Guru
Joined: 11 Apr 2005 Posts: 473 Location: FRANCE
|
Posted: Mon Jan 26, 2015 10:04 pm Post subject: |
|
|
* do your weekly backups
* do your monthly checkups
* log everything, including things you wouldn't think of (partition tables, SMART history, bash history)
* make offline backups
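For the SMART history part, a snapshot per disk keyed by serial number is what made the comparison above possible. Below is a rough sketch of the naming side only; `snapshot_name` and its simplified `serial__device__timestamp.smart` layout are hypothetical, not the exact scheme of the files quoted earlier.

```shell
#!/usr/bin/env bash
# Build a snapshot file name from serial number, device name and timestamp.
# ASSUMPTION: this layout is a simplified illustration; the timestamp
# defaults to "now" when no third argument is given.
snapshot_name() {
    local serial="$1" dev="$2"
    local stamp="${3:-$(date +%Y-%m-%d_%H-%M-%S)}"
    printf '%s__%s__%s.smart\n' "$serial" "$dev" "$stamp"
}

# Usage sketch (needs root and smartmontools, so it is left commented out):
#   for dev in /dev/sd[a-e]; do
#       serial=$(smartctl -i "$dev" | awk -F': *' '/^Serial Number/ { print $2 }')
#       smartctl -a "$dev" > "/var/log/smartctl/$(snapshot_name "$serial" "${dev##*/}")"
#   done
```

Keeping the serial number in the file name matters because, as seen in this thread, the sdX name of a disk can change between boots.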
|