dufeu l33t
Joined: 30 Aug 2002 Posts: 924 Location: US-FL-EST
Posted: Mon May 02, 2011 9:12 pm Post subject: Reading SMART hard drives through USB |
This post is the result of my search for more information on the status of the various SMART-capable hard drives across several of my systems. For a write-up on SMART hard drive technology, see the Wikipedia article.
As noted in the article, there isn't a standard for accessing SMART information from drives connected via USB-based controllers. This is because the ATA command set used to gather SMART information is not part of the USB standards for interacting with hard drives. This was clearly a major oversight on the part of the relevant standards body - 'nuff said.
To determine what SMART-capable hard drives are available on a system, you would normally execute: and receive results similar to: Code: | /dev/sda -d scsi [SCSI]
/dev/sdb -d scsi [SCSI]
/dev/sdc -d scsi [SCSI]
/dev/sdd -d scsi [SCSI] |
Yet, on the very same system, executing: reveals: Code: | Filesystem Size Used Avail Use% Mounted on
rootfs 363G 114G 232G 33% /
/dev/root 363G 114G 232G 33% /
rc-svcdir 1.0M 132K 892K 13% /lib64/rc/init.d
udev 10M 292K 9.8M 3% /dev
shm 3.7G 292K 3.7G 1% /dev/shm
/dev/sdb1 917G 730G 188G 80% /home
/dev/sda4 559G 478G 80G 86% /pub00
/dev/sdc1 1.8T 1.5T 347G 82% /pub01
/dev/sdd1 1.8T 1.7T 95G 95% /pub02
/dev/sde1 917G 196G 722G 22% /pubu01
/dev/sdf1 1.8T 1.8T 35G 99% /pubu02
/dev/sdg1 1.4T 1.1T 299G 79% /pubu03
/dev/sda1 31M 6.5M 23M 23% /boot |
This is quite the discrepancy.
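Comparing the two listings by hand works for a handful of drives, but it can also be done mechanically. The following is only a sketch: it assumes the two listings were saved to files (for instance from `smartctl --scan` and `df -h`, which is my assumption about how such listings are produced), and `missing_from_scan` is a hypothetical helper name, not part of smartmontools.

```shell
# Hypothetical helper: print whole-disk /dev/sdX names that appear in a
# df-style listing (file $2) but not in a scan-style listing (file $1).
# Assumes the scan listing is non-empty and has the device path in column 1.
missing_from_scan() {
    awk '
        NR == FNR { seen[$1] = 1; next }        # first file: remember scanned devices
        $1 ~ /^\/dev\/sd[a-z]+[0-9]*$/ {        # second file: mounted /dev/sdXN entries
            disk = $1
            sub(/[0-9]+$/, "", disk)            # strip the partition number
            if (!(disk in seen) && !(disk in reported)) {
                print disk
                reported[disk] = 1
            }
        }
    ' "$1" "$2"
}
```

With the two listings above saved as scan.txt and df.txt, `missing_from_scan scan.txt df.txt` would list /dev/sde, /dev/sdf and /dev/sdg.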
Hard drives /dev/sde, /dev/sdf and /dev/sdg are attached through USB ports. They are respectively 1T Seagate, 2T Seagate and 1.5T WD external USB 2.0 based hard drives. So how can we get SMART status information from these hard drives?
While there is no standard for doing so, several of the USB chip manufacturers support the ability to pass through raw ATA commands to any hard drives attached to them. To be completely clear, the bottleneck for SMART status information is not the USB chips/logic which reside on your motherboard, but rather the USB chips which reside on the external device at hand. Fortunately, a fair amount of effort has been expended documenting which USB chips permit the pass through of raw ATA commands and which devices these chips are present in. In addition, there are modifiers for the 'smartctl' command which will enable you to tell 'smartctl' to pass these commands through in raw form and retrieve the resulting status information. A list of the known capable devices resides at the smartmontools wiki.
Also, execute: to read the instructions for using smartctl.
The modifiers to the smartctl command follow the name of the device and can be one of these: Code: | -d usbcypress
-d usbjmicron
-d sat | The Cypress USB chip uses a format they refer to as ATACB for supporting the passing of raw ATA commands through USB.
I didn't see what JMicron calls their method.
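SAT stands for SCSI/ATA Translation. '-d sat' tells smartctl that the attached device presents itself to the host as SCSI but will pass ATA commands through to the drive behind the bridge {i.e. it is not 'Standard AT' and has nothing to do with PATA vs. SATA}.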
The full command format might then look like: Code: | # smartctl -a /dev/sde -d usbcypress | or perhaps Code: | # smartctl -a /dev/sde -d sat |
These are typical 'fail' results: Code: | # smartctl -a /dev/sdg -d usbcypress
smartctl 5.40 2010-10-16 r3189 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options. |
These are typical 'success' results: Code: | # smartctl -a /dev/sdg -d sat
smartctl 5.40 2010-10-16 r3189 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green family
Device Model: WDC WD15EADS-11R6B1
Serial Number: WD-WCAVY2024420
Firmware Version: 80.00A80
User Capacity: 1,500,301,910,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon May 2 16:31:12 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (30000) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 142 141 021 Pre-fail Always - 9858
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1564
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1516
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 6
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 197 197 000 Old_age Always - 9014
194 Temperature_Celsius 0x0022 094 092 000 Old_age Always - 58
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay. |
In addition to the above 'smartctl' command modifiers, the modifiers themselves can take a comma separated suffix. For some device types the suffix selects a specific drive, e.g. if you're using a USB connected box with multiple drives. According to the smartctl man page, 'usbjmicron' accepts a port number {0 or 1} for dual drive bridges. For 'sat' the suffix is not a drive number: '12' or '16' selects the length in bytes of the ATA pass-through command sent to the bridge, with 16 being the default, and some bridges only respond to the 12 byte form. Such a command might look like: Code: | # smartctl -a /dev/sdg -d usbjmicron,1
# smartctl -a /dev/sdg -d sat,12 |
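Since there is no way to know in advance which method a given enclosure accepts, trying each type in turn is a reasonable approach. Here is a minimal sketch, assuming that `smartctl -i` exits with status 0 only when it could identify the drive. `probe_smart_type` and the `SMARTCTL` override variable are hypothetical names of my own.

```shell
# Hypothetical helper: try each -d type mentioned in this post against one
# device and print the first type for which "smartctl -i" succeeds.
# SMARTCTL may be set to an alternative command (useful for dry runs).
probe_smart_type() {
    dev=$1
    for dtype in sat sat,12 usbcypress usbjmicron; do
        if "${SMARTCTL:-smartctl}" -i -d "$dtype" "$dev" >/dev/null 2>&1; then
            echo "$dtype"
            return 0
        fi
    done
    echo "no working -d type found for $dev" >&2
    return 1
}
```

Run as root, e.g. `probe_smart_type /dev/sdg`, then use the printed type with `smartctl -a`. The order of the list is arbitrary; put the type you consider most likely first.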
So .. now you may be wondering why I was so interested in finding this information out? OK, I know you're not really interested, but I'll tell you anyway.
It's my opinion that the single most causative factor in electronics failure is excess heat. This is more true for hard disks than for any other piece of electronic equipment. I happened to notice, while installing some 2T hard drives and subsequently re-arranging older drives, that some of the drives were pretty darn hot to the touch. A quick command: Code: | # smartctl -a /dev/sdb | grep Temper | displayed: Code: | 194 Temperature_Celsius 0x0032 046 253 000 Old_age Always - 56 | 56 Celsius {OUCH!!!} That's more than 'pretty darned hot'!
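grep shows the whole attribute row. If only the number is wanted, the raw value is the tenth whitespace separated column in the rows shown in this post. A small sketch that relies on that column layout {an assumption, since vendors format the raw field differently}; `smart_temp` is a hypothetical helper name:

```shell
# Hypothetical helper: read smartctl attribute output on stdin and print
# the raw value (column 10) of attribute 194, Temperature_Celsius.
smart_temp() {
    awk '$1 == 194 { print $10 }'
}
```

Usage would be along the lines of `smartctl -A /dev/sdb | smart_temp`. Any trailing detail such as '(Min/Max 26/32)' falls into later columns and is ignored.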
This particular case is configured such that the hard drives are spaced with 1/2 gaps minimum. I usually consider this sufficient for adequate cooling. It turned out that all the non-Samsung drives in that case were running 55-56 C while the Samsung drives were running 47-48 C. This is much too hot, so it was back to the stripped parts bin looking for a suitable fan to stick in front of the hard drive bay. These are the final results: Code: | # smartctl -a /dev/sda -d ata | grep Temper
190 Airflow_Temperature_Cel 0x0022 054 032 045 Old_age Always In_the_past 46
194 Temperature_Celsius 0x0022 104 082 000 Old_age Always - 46
pyrotekk ~ # smartctl -a /dev/sdb -d ata | grep Temper
194 Temperature_Celsius 0x0032 046 253 000 Old_age Always - 40
pyrotekk ~ # smartctl -a /dev/sdc -d ata | grep Temper
194 Temperature_Celsius 0x0032 046 253 000 Old_age Always - 40
pyrotekk ~ # smartctl -a /dev/sdd -d ata | grep Temper
190 Airflow_Temperature_Cel 0x0022 067 050 000 Old_age Always - 33
194 Temperature_Celsius 0x0022 139 088 000 Old_age Always - 33
pyrotekk ~ # smartctl -a /dev/sdf -d ata | grep Temper
190 Airflow_Temperature_Cel 0x0022 067 048 000 Old_age Always - 33
194 Temperature_Celsius 0x0022 139 082 000 Old_age Always - 33
pyrotekk ~ # smartctl -a /dev/sdg -d ata | grep Temper
190 Airflow_Temperature_Cel 0x0022 069 057 000 Old_age Always - 31 (Min/Max 26/31)
194 Temperature_Celsius 0x0022 068 056 000 Old_age Always - 32 (Min/Max 26/32) |
Things are much improved! Note that /dev/sda is in a drive bay separate from the other drives, hence the higher temperature.
BTW - the examples displayed here to show temperature readings are from a different system than the original examples used to display USB query results. The similar temperature results from that system are: Code: | pyrodyno pubroot # smartctl -a /dev/sda | grep Tempera
190 Airflow_Temperature_Cel 0x0022 065 055 045 Old_age Always - 35 (Min/Max 34/38)
194 Temperature_Celsius 0x0022 035 045 000 Old_age Always - 35 (0 23 0 0)
pyrodyno pubroot # smartctl -a /dev/sdb | grep Tempera
190 Airflow_Temperature_Cel 0x0022 072 069 000 Old_age Always - 28 (Min/Max 27/29)
194 Temperature_Celsius 0x0022 072 067 000 Old_age Always - 28 (Min/Max 26/30)
pyrodyno pubroot # smartctl -a /dev/sdc | grep Tempera
194 Temperature_Celsius 0x0002 064 062 000 Old_age Always - 34 (Min/Max 28/38)
pyrodyno pubroot # smartctl -a /dev/sdd | grep Tempera
194 Temperature_Celsius 0x0002 064 063 000 Old_age Always - 32 (Min/Max 27/37)
pyrodyno pubroot # smartctl -a /dev/sdf -d sat | grep Tempera
190 Airflow_Temperature_Cel 0x0022 057 042 045 Old_age Always In_the_past 43 (0 122 50 37)
194 Temperature_Celsius 0x0022 043 058 000 Old_age Always - 43 (0 19 0 0)
pyrodyno pubroot # smartctl -a /dev/sdg -d sat | grep Tempera
194 Temperature_Celsius 0x0022 100 092 000 Old_age Always - 52 |
Note how much warmer the external hard drives /dev/sdf and /dev/sdg are. As I noted earlier, /dev/sdf is a Seagate and /dev/sdg is a WD. Both drives are standalone, close to but not next to each other, with plenty of available air flow. The 2T Seagate external drives don't appear to be accessible for SMART status information. I suspect the Cypress USB chip might have issues supporting such large drives and Seagate may have used a different USB chip. You can see in the wiki page of known devices that the 2T Seagate external drive is still an open question.
{edit: same day 2 hours later} - The 2T Seagate ended up readable by adding the ',12' modifier: Code: | pyrodyno pubroot # smartctl -a /dev/sde -d sat,12 | grep Tempera
190 Airflow_Temperature_Cel 0x0022 055 042 045 Old_age Always In_the_past 45 (0 122 50 37)
194 Temperature_Celsius 0x0022 045 058 000 Old_age Always - 45 (0 19 0 0) |
It's running at 45C. Acceptable, though I'd prefer under 40C. The WD external drive was unchanged at 52C, which is still quite disappointing since the drives were purchased within days of each other and are directly comparable in terms of technology generation.
{end edit}
{edit: same day 3 hours later} - Points of information:
- The SMART technology Wikipedia article referenced in the first paragraph includes a list of SMART attributes and what they {most likely} mean. {different manufacturers may not ascribe identical meanings to the same numbered attributes}
- Some of these SMART attributes are highlighted to indicate those attributes which are more relevant in terms of predicting imminent failure.
- Temperature is a good attribute to monitor when assembling a system or adding a new drive. Air flow is important and you don't want a drive inadvertently stuck in a local hot spot.
{end edit}
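For anyone who wants to keep an eye on temperature after a rebuild, a check along these lines could be run from a script. It is a sketch only: `temp_status` is a hypothetical helper name, it assumes the raw temperature is in column 10 as in the rows shown above, and the default limit of 46 is simply the figure I mention later in this thread, not a manufacturer specification.

```shell
# Hypothetical helper: read smartctl attribute output on stdin and report
# whether the raw value of attribute 194 exceeds a limit in degrees C.
# The default of 46 is an assumption taken from this thread, not a vendor spec.
temp_status() {
    limit=${1:-46}
    awk -v limit="$limit" '
        $1 == 194 {
            temp = $10 + 0                       # force a numeric comparison
            status = (temp > limit) ? "HOT" : "OK"
            print status, temp
        }
    '
}
```

For example, `smartctl -A -d sat /dev/sdg | temp_status 40` would flag the WD external drive at the 52C reading shown above.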
I hope you find the information here helpful. _________________ People whom think M$ is mediocre, don't know the half of it.
Last edited by dufeu on Mon May 02, 2011 10:36 pm; edited 3 times in total |
BradN Advocate
Joined: 19 Apr 2002 Posts: 2391 Location: Wisconsin (USA)
Posted: Mon May 02, 2011 9:29 pm Post subject: |
I remember reading an article about Google's hard drive management statistics (failure rates of drives compared to various pieces of data, including drive temperature).
Check it out here: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/disk_failures.pdf
On page 6 there are a couple graphs about temperature vs failure rate. Their findings seem to show that temperatures *below* 30C increase failure rates in the first couple years of operation, and temperatures above 40C increase failure rates in years 3-4 of operation.
I would suspect an increase in low temperature failures would be due to mechanical failure (bearing/lubrication, etc not being as effective when colder), as controller chips are generally quite happy at low temperatures, but I could be wrong.
Also, note that since these are statistics across their entire range of drives (many brands/models), that doesn't automatically mean that a particular drive is more likely to fail at a given temperature (some might be perfectly capable of reliable operation at 55C, who knows), just that the aggregate of their drive pool exhibits those rates. |
dufeu l33t
Joined: 30 Aug 2002 Posts: 924 Location: US-FL-EST
Posted: Mon May 02, 2011 9:53 pm Post subject: |
BradN wrote: | On page 6 there are a couple graphs about temperature vs failure rate. Their findings seem to show that temperatures *below* 30C increase failure rates in the first couple years of operation, and temperatures above 40C increase failure rates in years 3-4 of operation. |
Absolutely correct.
I usually don't consider the 'too cool' case because I don't ever have anything running less than 26C {regular room temps}.
FWIW - I've experienced a lesser failure rate with Samsung drives due, I believe, to their cooler running temperatures. In all of my systems with mixed drives, the Samsungs typically run 28-34C while everything else typically runs 38-46C. I usually ensure at least 1/2 spacing gaps between drives or forced airflow for drive bays with close spacing.
In this particular case {pyrotekk}, I noticed, only with this incident, that the area of the front panel in front of the drive bay is a solid sheet. There are no breather holes at all. I cut some slots and stuck a scrapped 120mm fan there. Whatever works. Dropping down from 56C to 33C is a big plus!
FWIW, I'd try to avoid running any hard drive over 46C .. which means I'm not really happy with the WD external drive. |