lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Sun Jun 10, 2012 11:15 am Post subject: HDD problem or not ? |
|
|
Smartctl reports errors on one of my drives:
| Code: |
root@server:~# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-24-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS721010DLE630
Serial Number: MSE5235V0K8ZKU
LU WWN Device Id: 5 000cca 37cc7dc0a
Firmware Version: MS2OA5R0
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun Jun 10 19:09:23 2012 WST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 8283) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 138) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 099 099 016 Pre-fail Always - 65537
2 Throughput_Performance 0x0005 140 140 054 Pre-fail Offline - 76
3 Spin_Up_Time 0x0007 113 113 024 Pre-fail Always - 200 (Average 204)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 219
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 113 113 020 Pre-fail Offline - 35
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 525
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 218
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 219
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 219
194 Temperature_Celsius 0x0002 230 230 000 Old_age Always - 26 (Min/Max 18/46)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 34 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 34 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 19 08 00 Error: UNC at LBA = 0x00081960 = 530784
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 00 60 19 08 40 00 05:01:24.913 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 05:01:24.913 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 05:01:24.912 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:01:24.912 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:01:24.912 SET FEATURES [Set transfer mode]
Error 33 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 19 08 00 Error: UNC at LBA = 0x00081960 = 530784
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 00 60 19 08 40 00 05:01:21.709 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 05:01:21.709 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 05:01:21.709 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:01:21.708 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:01:21.708 SET FEATURES [Set transfer mode]
Error 32 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 19 08 00 Error: UNC at LBA = 0x00081960 = 530784
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 00 60 19 08 40 00 05:01:18.505 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 05:01:18.505 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 05:01:18.505 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:01:18.504 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:01:18.504 SET FEATURES [Set transfer mode]
Error 31 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 19 08 00 Error: UNC at LBA = 0x00081960 = 530784
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 00 60 19 08 40 00 05:01:15.302 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 05:01:15.301 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 05:01:15.301 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:01:15.301 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:01:15.300 SET FEATURES [Set transfer mode]
Error 30 occurred at disk power-on lifetime: 464 hours (19 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 19 08 00 Error: UNC at LBA = 0x00081960 = 530784
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 00 60 19 08 40 00 05:01:12.098 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 05:01:12.098 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 05:01:12.097 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:01:12.097 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:01:12.097 SET FEATURES [Set transfer mode]
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 520 -
# 2 Extended offline Completed without error 00% 306 -
# 3 Extended offline Completed without error 00% 135 -
# 4 Short offline Completed without error 00% 132 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
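The five logged errors all point at the same address, and the "LBA = 0x00081960 = 530784" figure can be reproduced from the register dump. The following is only a minimal sketch: it assumes the usual 28-bit layout (low nibble of DH, then CH, CL, SN), and the helper name `decode_lba28` is made up for illustration. It also checks how that LBA sits relative to the 4096-byte physical sectors reported under "Sector Sizes".

```python
def decode_lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA (hypothetical helper)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the "Error 34" entry in the log above.
lba = decode_lba28(sn=0x60, cl=0x19, ch=0x08, dh=0x00)
print(hex(lba), lba)  # 0x81960 530784

# 512-byte logical sectors on 4096-byte physical sectors.
LOGICAL_PER_PHYSICAL = 4096 // 512
print(lba % LOGICAL_PER_PHYSICAL == 0)  # True: the LBA starts a physical sector
```

If that reading is right, every entry is the same 8-sector read starting on a physical-sector boundary, which would suggest a single trouble spot rather than errors scattered across the disk.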
Looking online, the general advice seems to be to replace the drive. The drive is only a couple of months old and still under warranty, so replacing it would most likely not be a problem. Digging a little deeper with badblocks, however, I am not seeing any issue. The drive is part of an mdadm RAID1 array. I tried fsck /dev/sda, but for some reason fsck is not able to check RAID members, so I used badblocks instead, and it didn't find any issue either.
| Code: | root@server:~# badblocks -v /dev/sda
Checking blocks 0 to 976762583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
|
The error messages from smartctl are concerning, but since the self-test log does not show any errors I am even more confused. I read the smartctl page on bad blocks, which suggests marking the bad blocks and working around the issue. Since the drives are still relatively new, I am reluctant to simply ignore the issue.
Any ideas what to do about this? |
|
Logicien Guru


Joined: 16 Sep 2005 Posts: 491 Location: Montréal
|
Posted: Sun Jun 10, 2012 3:10 pm Post subject: |
|
|
Hello,
Some BIOSes have a built-in test for hard drives. It could be useful to run it. Some also have an interface to display and analyse SMART data.
If you can run badblocks from a live CD with the -n (non-destructive) or -w (destructive, erases data) option (read man badblocks first), that would be the best test to find out whether there are bad blocks on the hard drive.
The 34 errors logged by the drive's SMART feature are not necessarily all related to bad blocks, if any can be found.
Someone with deeper knowledge than mine may be able to work out what led to those last five errors and whether RAID has something to do with them.
On my hard drive, smartctl -a /dev/sda does not report any SMART errors, even though many attributes are listed with Pre-fail and Old_age TYPE. _________________ Paul |
|
eccerr0r Advocate

Joined: 01 Jul 2004 Posts: 3037 Location: USA
|
Posted: Sun Jun 10, 2012 11:53 pm Post subject: |
|
|
It could have been a prefetch read that failed, after which the sector was rewritten and has passed just fine since. It's hard to say what to do at this point, but the HD manufacturer will likely claim the disk is just fine and not a candidate for warranty service...
I know that Hitachi GST requires you to run their standalone Drive Fitness Test (DFT) disk to obtain an error code, which is then submitted with the RMA. But DFT has passed disks that have errors on them... _________________ Core-i7-2700K@4.1GHz/8GB RAM/180GB SSD/Intel HD3000 graphics
What the heck am I advocating? |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Mon Jun 11, 2012 12:15 am Post subject: |
|
|
| I am busy running badblocks -n on both drives in the array. Will report back in 3 days' time. |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Fri Jun 15, 2012 11:34 pm Post subject: Never ending badblocks |
|
|
Last Sunday I started running badblocks -v -n on /dev/sda and /dev/sdb, and it is still running. Using iotop I see that it is reading / writing at about 3M/s. Taking this into account, and given a 1TB drive, I figure the whole process should take about 100 hours (4 days). However, I am now on day 5 and it is still running. Looking online, it has been suggested that a 1TB drive should take around 72 hours.
Should I be patient and let this continue?
Are there any other options to check the drives and verify whether they are faulty? |
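The "about 100 hours" estimate can be checked with a quick calculation. This is a back-of-the-envelope sketch only: it takes the capacity from the smartctl output in the first post and assumes iotop's "3M/s" means 3 * 10**6 bytes per second of net progress across the disk (the function name is illustrative).

```python
def hours_to_scan(capacity_bytes: int, bytes_per_second: float) -> float:
    """Time for one sequential pass over the whole device, in hours."""
    return capacity_bytes / bytes_per_second / 3600


CAPACITY = 1_000_204_886_016  # "User Capacity" from smartctl -a
RATE = 3_000_000              # assumption: 3M/s read as 3 * 10**6 bytes/s

hours = hours_to_scan(CAPACITY, RATE)
print(f"{hours:.1f} h = {hours / 24:.1f} days")  # 92.6 h = 3.9 days
```

Since badblocks -n reads and writes each block more than once (read original, write pattern, read back, restore original), a single-pass figure like this is probably best treated as a lower bound.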
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Sun Jun 17, 2012 2:29 am Post subject: Impatience got the better of me |
|
|
| Being a little impatient, I stopped badblocks and ran some other diagnostics. I verified my RAID array, which still seemed fine. Then I read the man page for badblocks and came across the -s option, which shows progress. Now I am running badblocks again with it enabled. In 13:45 h it completed just over 10%, so checking the entire drive will take around 5.7 days. I should have just waited on the first run. |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Mon Jun 18, 2012 11:42 pm Post subject: Difference performance |
|
|
I started running badblocks on both drives at roughly the same time, yet sdb seems to be running a lot faster. Any ideas why that could be? Given that sda and sdb are the same model, purchased at the same time, I would have expected closer results. It seems that sdb is speeding up as the test progresses. Both drives are part of the same RAID1 array and should contain the same data.
| Quote: | root@panda:~# badblocks -n -v -s /dev/sda
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: 43.76% done, 59:00:16 elapsed. (0/0/0 errors) |
| Quote: | root@panda:~# badblocks -n -v -s /dev/sdb
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: 49.25% done, 58:52:24 elapsed. (0/0/0 errors) |
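To put the gap into numbers, the two progress lines can be extrapolated linearly to a projected total run time. This is a rough sketch under the assumption that throughput stays constant for the rest of the pass, which a real disk will not guarantee; the function names are made up for illustration.

```python
def parse_hms(text: str) -> int:
    """Convert badblocks' 'hh:mm:ss' elapsed time to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def projected_total_hours(percent_done: float, elapsed: str) -> float:
    """Linear extrapolation: total time = elapsed time / fraction completed."""
    return parse_hms(elapsed) / (percent_done / 100) / 3600


# Progress figures copied from the two badblocks runs quoted above.
for device, percent, elapsed in [("sda", 43.76, "59:00:16"),
                                 ("sdb", 49.25, "58:52:24")]:
    total = projected_total_hours(percent, elapsed)
    print(f"{device}: projected total {total:.1f} h")
```

On these figures sda projects to roughly 135 hours and sdb to roughly 120 hours, so the difference is already more than half a day at this point in the run.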
|
|
eccerr0r Advocate

Joined: 01 Jul 2004 Posts: 3037 Location: USA
|
Posted: Tue Jun 19, 2012 5:15 am Post subject: |
|
|
I've got disks that have SMART errors logged yet the disk has still lasted a long time...
It's possible that the disk had some trouble reading sectors and hence got behind (see if more errors have popped up in the SMART logs!), but you also have to consider that the disk scheduler sometimes isn't completely fair, and one disk may get the lion's share of the load...
If you're paranoid about it, go ahead and download HGST's drive fitness test. It will give you a code that will enable RMA if applicable. _________________ Core-i7-2700K@4.1GHz/8GB RAM/180GB SSD/Intel HD3000 graphics
What the heck am I advocating? |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Tue Jun 19, 2012 5:25 am Post subject: |
|
|
| I tried running Hitachi DFT, but the software didn't recognise my drives / controller. In fact, it didn't find any drives present on the problem PC. I tested the DFT disk on another computer, and there it found the drives. Maybe they don't have support for my controller. |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Wed Jun 20, 2012 3:51 am Post subject: |
|
|
The difference in time taken between the drives is growing. sdb is almost 10% ahead of sda.
| Code: | root@panda:~# badblocks -n -v -s /dev/sda
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: 62.42% done, 87:10:31 elapsed. (0/0/0 errors) |
| Code: | root@panda:~# badblocks -n -v -s /dev/sdb
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: 71.57% done, 87:10:58 elapsed. (0/0/0 errors) |
|
|
eccerr0r Advocate

Joined: 01 Jul 2004 Posts: 3037 Location: USA
|
Posted: Wed Jun 20, 2012 1:11 pm Post subject: |
|
|
You might have to take your controllers out of AHCI mode temporarily for DFT to work and place them in legacy mode. This should be a BIOS option.
Make sure you change it back after trying DFT. _________________ Core-i7-2700K@4.1GHz/8GB RAM/180GB SSD/Intel HD3000 graphics
What the heck am I advocating? |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Fri Jun 22, 2012 1:46 am Post subject: |
|
|
Thanks for that suggestion. I will try this next, once badblocks has completed.
sda is still busy and only at 90%. sdb has already finished, after around 130 hours.
| Code: | root@panda:~# badblocks -n -v -s /dev/sdb
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976762583
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: done
Pass completed, 0 bad blocks found. (0/0/0 errors)
|
|
|
Ant P. Veteran

Joined: 18 Apr 2009 Posts: 1992 Location: UK
|
Posted: Fri Jun 22, 2012 2:10 am Post subject: |
|
|
| All attribute counters are showing good values and the only errors there seem to be error responses from an unsupported command. Were you playing around with hdparm by any chance? |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Mon Jun 25, 2012 4:52 am Post subject: |
|
|
| Neither Hitachi DFT nor badblocks found any errors. I guess I can (will have to) live with the drives a little longer. For DFT to work I had to change the BIOS from AHCI to IDE. Only then did it recognise the drives correctly. Thanks for the pointer. |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Fri Jun 29, 2012 8:20 am Post subject: |
|
|
Today smartctl found this :
Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors |
|
Herring42 Guru


Joined: 10 Mar 2004 Posts: 361 Location: Buckinghamshire
|
Posted: Fri Jun 29, 2012 2:03 pm Post subject: |
|
|
Looking at your original post, I'd say that your drive is beginning to fail.
I don't know if you are aware of how drives work, but I'll assume not.
The drive has a store of spare sectors that are transparently mapped in place of bad sectors as the drive finds them. The drive monitors the CRC on each sector, as well as the read current, and so can tell when a sector is about to fail or has failed. The sector is marked bad, and the data is copied to the spare sector. The relevant SMART line is this:
| Code: | | 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 |
This would indicate that zero sectors had been reallocated at the time of that report!
Note that this process happens transparently to programs such as badblocks, though running badblocks will cause the drive to discover any further bad sectors. You can only detect the reallocations using smartctl.
You can get the drive to perform a surface scan itself using smartctl:
| Code: | | smartctl -t long /dev/sdX |
This will be far faster than badblocks! My 1TB drive completes in just over four hours.
Once the spare sectors have been used up, the drive will start reporting bad sectors that badblocks will pick up. It should be noted that once this stage has been reached, the drive will die very rapidly.
Personally, I like to be safe rather than sorry, and replace the drive once it starts to reallocate sectors, though if you monitor the situation, the drive would probably last a good long time after its first reallocation. When a drive is accessing sectors that have been reallocated, it will of course be slower, as they are read out of sequence.
Other relevant lines are:
| Code: |
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
|
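For keeping an eye on these counters over time, the RAW_VALUE column can be pulled out of saved smartctl -a output. What follows is a minimal sketch, not a robust parser: it assumes the ten-column attribute layout shown in this thread, and the names `WATCHED` and `parse_raw_values` are purely illustrative.

```python
WATCHED = {
    5: "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_raw_values(report: str) -> dict:
    """Map attribute ID -> leading integer of RAW_VALUE for each table row."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            values[int(fields[0])] = int(fields[9])
    return values


# Attribute rows copied from the original smartctl output in this thread.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
"""

raw = parse_raw_values(SAMPLE)
nonzero = {WATCHED[i]: v for i, v in raw.items() if i in WATCHED and v > 0}
print(nonzero or "no reallocated or pending sectors reported")
```

Running the same function on a later report would show at a glance whether any of the four counters has moved off zero.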
Read more here: http://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes _________________ "The problem with quotes on the internet is that it is difficult
to determine whether or not they are genuine." -- Abraham Lincoln |
|
eccerr0r Advocate

Joined: 01 Jul 2004 Posts: 3037 Location: USA
|
Posted: Fri Jun 29, 2012 2:41 pm Post subject: |
|
|
Keep in mind that smartctl -t will not remap sectors; it merely flags bad ones.
The hard drive does not know which sectors are in use and which are unused (deleted). If it suddenly cannot read a sector, how could it remap that sector without losing data? The best plan is to leave a bad sector as it is until the user is notified (meaning something tried to read it). The drive makes a note of this sector in its logs... why? Read on...
Badblocks in *destructive* write mode, however, tells the disk that you don't care about any data on it, because you're writing junk to the disk. When you *write* to a bad block, NOW the hard disk knows you don't care about the old contents of that sector, so it initiates the remap, and the user won't notice that the sector got swapped out from under him/her... The drive uses its record of the earlier failed read to know that this sector has failed in the past when deciding whether or not to remap. _________________ Core-i7-2700K@4.1GHz/8GB RAM/180GB SSD/Intel HD3000 graphics
What the heck am I advocating? |
|
lostinspace2011 Apprentice

Joined: 09 Sep 2005 Posts: 161 Location: UK
|
Posted: Fri Jun 29, 2012 3:20 pm Post subject: |
|
|
| I would have thought that after running badblocks the drive would have hit every sector and done the remapping. |
|
eccerr0r Advocate

Joined: 01 Jul 2004 Posts: 3037 Location: USA
|
Posted: Fri Jun 29, 2012 3:23 pm Post subject: |
|
|
More may have shown up... this is not a good sign.
But make sure you're in *DESTRUCTIVE* mode...
Nondestructive mode can do some strange stuff because it's actually reading. _________________ Core-i7-2700K@4.1GHz/8GB RAM/180GB SSD/Intel HD3000 graphics
What the heck am I advocating? |
|
Herring42 Guru


Joined: 10 Mar 2004 Posts: 361 Location: Buckinghamshire
|
Posted: Fri Jun 29, 2012 8:09 pm Post subject: |
|
|
| lostinspace2011 wrote: | | I would have thought that after running badblocks the drive would have hit every sector and done the remapping. |
Indeed so, which is probably why you are now seeing some errors. What does 'smartctl -a /dev/sda' say now? _________________ "The problem with quotes on the internet is that it is difficult
to determine whether or not they are genuine." -- Abraham Lincoln |
|