Author |
Message |
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Thu Sep 29, 2011 9:27 am Post subject: mdadm error log -- does it exist? |
|
|
One HDD crashed a few minutes ago. It's RAID5, so no worries (yet) about the data. Right now I have only remote SSH access to the server.
I'd like to know what caused the crash and whether the drive will be usable again after a power cycle (restart) of the computer (it may just as well be a broken SATA cable, which would not be the first time for me). The faulty HDD got kicked out of the RAID, and it is also not responding to anything:
Code: | Sep 29 10:41:57 [kernel] md/raid:md0: Disk failure on sdg1, disabling device.
Sep 29 10:41:57 [kernel] md/raid:md0: Operation continuing on 5 devices.
Sep 29 10:41:57 [sSMTP] Creating SSL connection to host
Sep 29 10:41:57 [sSMTP] SSL connection using DHE_RSA_AES_128_CBC_SHA1
Sep 29 10:41:59 [sSMTP] Sent mail for aaaa(221 2.0.0 Bye) uid=0 username=root outbytes=1015
Sep 29 10:41:59 [mdadm] Fail event detected on md device /dev/md0
Sep 29 10:41:59 [sSMTP] Creating SSL connection to host
Sep 29 10:41:59 [sSMTP] SSL connection using DHE_RSA_AES_128_CBC_SHA1
Sep 29 10:42:00 [sSMTP] Sent mail for aaaa (221 2.0.0 Bye) uid=0 username=root outbytes=1069
Sep 29 10:42:00 [mdadm] Fail event detected on md device /dev/md0, component device /dev/sdg1
|
Code: | smartctl -a /dev/sdg
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /8:0:0:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
|
I wouldn't say no to 600 PB... unfortunately it's in an alternate universe now.
Code: | cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md2 : active raid0 sdc1[0] sdb1[1]
1953518848 blocks super 1.2 128k chunks
md0 : active raid5 sde1[0] sdd1[6] sdg1[3](F) sdf1[4] sdh1[1] sda1[5]
7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
bitmap: 2/11 pages [8KB], 65536KB chunk
unused devices: <none>
|
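The mdstat output above can also be read mechanically. What follows is a minimal sketch, assuming the usual /proc/mdstat layout shown in this post: a member marked `(F)` is faulty, and `[6/5]` means 6 devices are expected but only 5 are active. The function name `parse_mdstat` and the sample constant are illustrative only, not part of mdadm.

```python
import re

# Sample copied from the /proc/mdstat output quoted in this post.
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid6] [raid5] [raid4]
md2 : active raid0 sdc1[0] sdb1[1]
      1953518848 blocks super 1.2 128k chunks
md0 : active raid5 sde1[0] sdd1[6] sdg1[3](F) sdf1[4] sdh1[1] sda1[5]
      7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 2/11 pages [8KB], 65536KB chunk
unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"failed", "expected", "active", "degraded"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"\s*(md\d+)\s*:\s*(.*)", line)
        if header:
            current = header.group(1)
            # Members look like "sdg1[3]", with "(F)" appended when faulty.
            members = re.findall(r"(\w+)\[\d+\](\(F\))?", header.group(2))
            arrays[current] = {
                "failed": [name for name, flag in members if flag],
                "expected": None,
                "active": None,
                "degraded": False,
            }
            continue
        # Redundant levels add "[expected/active] [UU_UUU]" on the next line.
        counts = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if counts and current is not None:
            expected, active = int(counts.group(1)), int(counts.group(2))
            arrays[current].update(
                expected=expected, active=active, degraded=active < expected
            )
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
        print(name, info)
```

On a live system the text would come from reading `/proc/mdstat` instead of the sample string.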
But I also noticed:
Code: | cat /sys/block/md0/md/dev-sdg1/errors
16
|
Is there any way to check what kind of errors those 16 were? This would help me decide whether I have to buy a new drive on my way home. |
|
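To compare the `errors` counter across all members of an array rather than reading one file at a time, something like the following could be used. It is a sketch under stated assumptions: the sysfs layout is the `/sys/block/<array>/md/dev-<member>/errors` path shown in the post, and the helper name `member_error_counts` is made up for illustration. As far as the kernel md documentation describes it, the file holds only an approximate count of read errors, so this does not reveal what kind of errors they were.

```python
from pathlib import Path


def member_error_counts(array, sysfs_root="/sys/block"):
    """Map each member of an md array to its sysfs 'errors' counter."""
    counts = {}
    md_dir = Path(sysfs_root) / array / "md"
    for errors_file in sorted(md_dir.glob("dev-*/errors")):
        # Directory names look like "dev-sdg1"; strip the "dev-" prefix.
        member = errors_file.parent.name[len("dev-"):]
        try:
            counts[member] = int(errors_file.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or non-numeric content: record it as unknown.
            counts[member] = None
    return counts


if __name__ == "__main__":
    for member, count in member_error_counts("md0").items():
        print(member, count)
```

The `sysfs_root` parameter exists only so the function can be pointed at a test directory; on a real machine the default applies.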
|
|
jbest n00b
Joined: 29 Sep 2011 Posts: 3
|
Posted: Thu Sep 29, 2011 4:32 pm Post subject: |
|
|
My raid5 array and my raid1 array failed this morning, too, in a very similar way to yours:
Code: | smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /3:0:1:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
|
Code: | smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /3:0:0:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
|
What's different, though, is I don't have any errors on any of the partitions that were part of the array:
Code: | # cat /sys/block/md127/md/dev-sda2/errors
0
# cat /sys/block/md127/md/dev-sdb2/errors
0
# cat /sys/block/md127/md/dev-sdc2/errors
0
|
And the raid1 array:
Code: | # cat /sys/block/md126/md/dev-sda1/errors
0
# cat /sys/block/md126/md/dev-sdb1/errors
0
# cat /sys/block/md126/md/dev-sdc1/errors
0 |
This is the second time in a week that this has happened, and I can't figure out why. I'd call it plain drive failure, but sdb and sdc (the drives that "failed") are on the same SATA controller. I purchased sda and sdc at the same time, and sdb is a two-week-old drive. That doesn't really add up for me...
dmesg has a lot of information, but I have no idea where to go from here with it:
http://pastebin.com/UuUq2mg1
Raid info:
http://pastebin.com/SM8d3d0w |
|
|
|
drescherjm Advocate
Joined: 05 Jun 2004 Posts: 2790 Location: Pittsburgh, PA, USA
|
Posted: Thu Sep 29, 2011 4:45 pm Post subject: |
|
|
I see this when a drive has too many UREs and goes completely offline trying to fix them, or when a drive dies outright. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives. I have now moved all arrays to RAID 6 and monitor 5 key SMART parameters to better predict drive failure. BTW, when I have had a failure such as more than 2 drives kicked out of a RAID 6, I was able to recover by using ddrescue (to recover the readable parts) to copy the kicked-out drives to new disks. _________________ John
My gentoo overlay
Instructions for overlay |
|
|
|
jbest n00b
Joined: 29 Sep 2011 Posts: 3
|
Posted: Thu Sep 29, 2011 4:49 pm Post subject: |
|
|
drescherjm wrote: | I see this when a drive has too many UREs and the drive goes completely offline trying to fix them. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives. |
Forgive my ignorance, but URE? I'm guessing "RE" is read error, but I can't figure out what the "U" is for. It's too early in the morning for me.
FWIW, these are all 2TB Seagate 5900RPM "green" drives. |
|
|
|
drescherjm Advocate
Joined: 05 Jun 2004 Posts: 2790 Location: Pittsburgh, PA, USA
|
|
|
|
drescherjm Advocate
Joined: 05 Jun 2004 Posts: 2790 Location: Pittsburgh, PA, USA
|
Posted: Thu Sep 29, 2011 5:06 pm Post subject: |
|
|
As for the OP's question: I do not believe a log exists for this. However, most of the time you will see errors for the drive in dmesg, more than what you posted. |
|
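Since the kernel log is the suggested place to look, a small filter can pull out the lines that mention one disk. This is only a sketch: `lines_mentioning` is a hypothetical helper, and the sample lines are the ones quoted in the first post. Note that libata often reports link problems under an `ataN` port name rather than `sdX`, so a search by disk name alone may miss the most useful messages.

```python
import re

# Sample lines copied from the log excerpt in the first post.
SAMPLE_LOG = """\
Sep 29 10:41:57 [kernel] md/raid:md0: Disk failure on sdg1, disabling device.
Sep 29 10:41:57 [kernel] md/raid:md0: Operation continuing on 5 devices.
Sep 29 10:41:59 [mdadm] Fail event detected on md device /dev/md0
Sep 29 10:42:00 [mdadm] Fail event detected on md device /dev/md0, component device /dev/sdg1
"""


def lines_mentioning(log_text, disk):
    """Return log lines that mention a disk (e.g. 'sdg') or its partitions."""
    # "\d*" lets "sdg" also match "sdg1"; "\b" avoids matching longer names.
    pattern = re.compile(rf"\b{re.escape(disk)}\d*\b")
    return [line for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    for line in lines_mentioning(SAMPLE_LOG, "sdg"):
        print(line)
```

On a live system the input would be the output of `dmesg` or the contents of the syslog file instead of the sample string.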
|
|
jbest n00b
Joined: 29 Sep 2011 Posts: 3
|
Posted: Thu Sep 29, 2011 5:33 pm Post subject: |
|
|
Excellent! Thanks!
I just bought a backup drive, I'll rsync all of the data off with a livecd tonight and go from there.
Cheers! |
|
|
|
drescherjm Advocate
Joined: 05 Jun 2004 Posts: 2790 Location: Pittsburgh, PA, USA
|
Posted: Thu Sep 29, 2011 5:37 pm Post subject: |
|
|
A note about that script: for some manufacturers (like Seagate) some of the parameters may be bogus. You will know when a value is something like 5443455 and you are expecting 10.
BTW, I did not explain exactly what the script does. It enumerates all /dev/sd devices, checks whether each device is in any of your mdadm arrays, and prints 5 key SMART parameters for the drive. I use this at work for my 75 to 100 drives in mdadm arrays. I also use Nagios to monitor the temperature and the reallocated sector count of each drive. |
|
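The script itself is not reproduced in this thread, so the following is not it. It is a rough sketch of just the parsing step, assuming the attribute table format that `smartctl -A` prints. The five attributes in `WATCHED` are an assumption chosen because they come up in this thread (reallocated sectors, temperature, pending and uncorrectable sectors, CRC errors), not necessarily the five the original script prints.

```python
# Sample rows in the layout printed by "smartctl -A" (values from this thread).
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   068   054   000    Old_age   Always       -       32 (Min/Max 21/32)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       27
"""

# Assumed selection of attribute IDs; adjust to taste.
WATCHED = (5, 194, 197, 198, 199)


def parse_attributes(text):
    """Parse attribute rows into {id: {"name", "value", "raw"}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit() or not fields[3].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "raw": fields[9].strip(),
        }
    return attrs


def watched_raw_values(text, watched=WATCHED):
    """Return {attribute_name: leading integer of the raw value, or None}."""
    result = {}
    attrs = parse_attributes(text)
    for attr_id in watched:
        if attr_id not in attrs:
            continue
        first_token = attrs[attr_id]["raw"].split()[0]
        result[attrs[attr_id]["name"]] = int(first_token) if first_token.isdigit() else None
    return result


if __name__ == "__main__":
    print(watched_raw_values(SAMPLE_SMART))
```

As the post warns, raw values are vendor-specific, so an implausibly large number for some attribute is more likely an encoding quirk than a real count.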
|
|
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Fri Sep 30, 2011 4:20 am Post subject: |
|
|
Thanks for all your replies!
Fortunately for me, it was only a matter of reseating the SATA cable of the failed drive. After that it showed up fine and smartctl did not find any errors. Also, the write-intent bitmap saved me from 1.5 days of resync. |
|
|
|
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Fri Sep 30, 2011 6:43 am Post subject: |
|
|
jbest wrote: | This is the second time in a week that this has happened, can't figure out why this is going on. I'd call it just plain drive failures, but I know sdb and sdc (the drives that "failed") are on the same SATA controller |
I'm almost sure it's the SATA controller or cable that causes your trouble. |
|
|
|
drescherjm Advocate
Joined: 05 Jun 2004 Posts: 2790 Location: Pittsburgh, PA, USA
|
Posted: Fri Sep 30, 2011 11:28 am Post subject: |
|
|
mbar wrote: | Thanks for all your replies!
Fortunately for me, it was only a matter of reseating the SATA cable of the failed drive. After that it showed up fine and smartctl did not find any errors. |
Not even UDMA_CRC_Error_Count?
Quote: | Also, the write-intent bitmap saved me from 1.5 days of resync. |
Slow machine? At work it takes me less than 9 hours to resync a 9-drive (2TB 7200 RPM Hitachi 7K2000) mdadm RAID 6 on a 3-year-old Core 2 Quad. However, most of the drives are connected to an Intel SASUC8I HBA card. |
|
|
|
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Mon Oct 03, 2011 11:44 am Post subject: |
|
|
Yeah, you are right:
Code: | UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 23 |
Highest of all my drives (the others are in the 0 to 4 range for UDMA errors).
Slow machine? It has a simple desktop motherboard (Nvidia + AMD Phenom X3, onboard SATA + 2 PCIe SATA "dumb" Silicon Image controllers) and the hard drives are Samsung HD154UI (5400 RPM), so it's not exactly an I/O speed demon. |
|
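For a sense of scale, the 1.5-day figure can be turned into an implied throughput. This is back-of-the-envelope arithmetic under stated assumptions: the "blocks" value in /proc/mdstat is in 1 KiB units, md0 is a 6-member RAID5 (so 5 members' worth of data), a full resync has to cover each member's whole data area, and 1.5 days is taken as 36 hours. The function name is illustrative.

```python
def implied_resync_rate_kib_s(array_kib, data_disks, hours):
    """Per-member throughput (KiB/s) needed to finish a full resync in 'hours'."""
    per_member_kib = array_kib / data_disks
    return per_member_kib / (hours * 3600)


# md0 size as reported in /proc/mdstat earlier in the thread.
MD0_KIB = 7325680640

if __name__ == "__main__":
    rate = implied_resync_rate_kib_s(MD0_KIB, data_disks=5, hours=36)
    print(f"{rate / 1024:.1f} MiB/s per member")
```

With these inputs the implied rate works out to roughly 11 MiB/s per member, well below what a single 5400 RPM drive can stream sequentially, which suggests the controllers or bus rather than the disks were the bottleneck.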
|
|
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Tue Oct 04, 2011 10:29 am Post subject: |
|
|
fcuk, it happened again, in the same funny way:
Code: | cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid5 sdg1[3](F) sdh1[1] sda1[5] sdf1[4] sdd1[6] sde1[0]
7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
bitmap: 11/11 pages [44KB], 65536KB chunk
md2 : active raid0 sdb1[1] sdc1[0]
1953518848 blocks super 1.2 128k chunks
|
Code: | smartctl -a /dev/sdg
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo-r2] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: /8:0:0:0
Product:
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
|
Time to replace those dumb Silicon Image controllers; it seems that one of them is failing.
Last time the failed drive made it through the SMART test with a "drive good" result. |
|
|
|
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Tue Oct 04, 2011 5:17 pm Post subject: |
|
|
This time maybe not so "good health":
Code: | 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 12 |
but:
Code: | 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0 |
Should I be very concerned? In the past I mainly had Offline_Uncorrectable failures, and those are definite.
Anyway, the almost full log:
Code: | === START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x03) Offline data collection activity
is in progress.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (19188) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 33) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   072   072   011    Pre-fail  Always       -       9140
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1386
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       11089
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       7815
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       268
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       12
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   056   000    Old_age   Always       -       26 (Min/Max 21/26)
194 Temperature_Celsius     0x0022   068   054   000    Old_age   Always       -       32 (Min/Max 21/32)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       40019
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       27
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Offline Self-test routine in progress 80% 7815 -
# 2 Offline Aborted by host 90% 7814 -
# 3 Short offline Completed without error 00% 7732 -
# 4 Offline Completed without error 00% 7626 -
# 5 Extended offline Interrupted (host reset) 40% 7522 -
# 6 Extended offline Completed without error 00% 6225 -
# 7 Extended offline Completed without error 00% 3942 -
# 8 Extended offline Completed without error 00% 3649 -
# 9 Short offline Completed without error 00% 3631 -
#10 Offline Completed without error 00% 3138 -
#11 Offline Completed without error 50% 3112 -
#12 Offline Aborted by host 10% 1866 -
#13 Offline Completed without error 00% 1181 -
#14 Short offline Completed without error 00% 425 -
#15 Short offline Completed without error 00% 425 -
#16 Short offline Completed without error 00% 13 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
|
|
|
|
mbar Veteran
Joined: 19 Jan 2005 Posts: 1990 Location: Poland
|
Posted: Tue Jan 22, 2013 8:23 am Post subject: |
|
|
mbar wrote: | Time to replace those dumb Silicon Image controllers, seems that one of them is failing.
|
Yes, the SATA controller was faulty; a new one solved the problem. |
|
|
|
|