Gentoo Forums
mdadm error log -- does it exist?

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Thu Sep 29, 2011 9:27 am    Post subject: mdadm error log -- does it exist?

One of my HDDs crashed a few minutes ago. It's RAID5, so no worries (yet ;)) about the data. Right now I have only remote SSH access to the server.
I'd like to know what caused the crash and whether the drive is recoverable after a power cycle (restart) of the computer. It may just as well be a broken SATA cable; that would not be the first time for me. The faulty HDD got kicked out of the RAID, and it is also not responding to anything:

Code:
Sep 29 10:41:57 [kernel] md/raid:md0: Disk failure on sdg1, disabling device.
Sep 29 10:41:57 [kernel] md/raid:md0: Operation continuing on 5 devices.
Sep 29 10:41:57 [sSMTP] Creating SSL connection to host
Sep 29 10:41:57 [sSMTP] SSL connection using DHE_RSA_AES_128_CBC_SHA1
Sep 29 10:41:59 [sSMTP] Sent mail for aaaa(221 2.0.0 Bye) uid=0 username=root outbytes=1015
Sep 29 10:41:59 [mdadm] Fail event detected on md device /dev/md0
Sep 29 10:41:59 [sSMTP] Creating SSL connection to host
Sep 29 10:41:59 [sSMTP] SSL connection using DHE_RSA_AES_128_CBC_SHA1
Sep 29 10:42:00 [sSMTP] Sent mail for aaaa (221 2.0.0 Bye) uid=0 username=root outbytes=1069
Sep 29 10:42:00 [mdadm] Fail event detected on md device /dev/md0, component device /dev/sdg1


Code:
 smartctl -a /dev/sdg
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /8:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

I wouldn't say no to 600 PB... unfortunately it's in an alternative universe now ;)

Code:
 cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md2 : active raid0 sdc1[0] sdb1[1]
      1953518848 blocks super 1.2 128k chunks

md0 : active raid5 sde1[0] sdd1[6] sdg1[3](F) sdf1[4] sdh1[1] sda1[5]
      7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 2/11 pages [8KB], 65536KB chunk

unused devices: <none>


But I also noticed:
Code:
 cat /sys/block/md0/md/dev-sdg1/errors
16


Is there any way to check what kind of errors those 16 were? This would help me decide if I have to buy a new drive on my way home :)
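
A minimal sketch for collecting that counter across every array member at once, assuming the sysfs layout shown in the post (`/sys/block/mdN/md/dev-XXX/errors`). As far as I know this file holds only a count, not the type of each error, so it will not answer the "what kind" question by itself. The function name `md_error_counts` and its optional root argument are made up for illustration; the argument exists only so the function can be pointed at a fake directory tree.

```shell
#!/bin/sh
# Print "array member error_count" for every md component device.
# $1 (hypothetical, optional): alternative sysfs root, default /sys/block.
md_error_counts() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/dev-*/errors; do
        [ -r "$f" ] || continue               # glob matched nothing, or unreadable
        member_dir="${f%/errors}"             # e.g. /sys/block/md0/md/dev-sdg1
        member="${member_dir##*/dev-}"        # e.g. sdg1
        array_dir="${member_dir%/md/dev-*}"   # e.g. /sys/block/md0
        array="${array_dir##*/}"              # e.g. md0
        printf '%s %s %s\n' "$array" "$member" "$(cat "$f")"
    done
}

md_error_counts
```

A member with a non-zero count that the others do not share is a reasonable place to start looking, but the kernel log is still needed to see what actually went wrong.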

jbest
n00b
Joined: 29 Sep 2011
Posts: 3

Posted: Thu Sep 29, 2011 4:32 pm

My raid5 array and my raid1 array failed this morning, too, in a very similar way to yours:

Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /3:0:1:0
Product:             
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0-ARCH] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /3:0:0:0
Product:             
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


What's different, though, is I don't have any errors on any of the partitions that were part of the array:

Code:
# cat /sys/block/md127/md/dev-sda2/errors
0
# cat /sys/block/md127/md/dev-sdb2/errors
0
# cat /sys/block/md127/md/dev-sdc2/errors
0


And the raid1 array:
Code:
# cat /sys/block/md126/md/dev-sda1/errors
0
# cat /sys/block/md126/md/dev-sdb1/errors
0
# cat /sys/block/md126/md/dev-sdc1/errors
0



This is the second time in a week that this has happened, and I can't figure out why. I'd call it plain drive failure, but I know sdb and sdc (the drives that "failed") are on the same SATA controller. I purchased sda and sdc at the same time, and sdb is a two-week-old drive. That doesn't really add up for me...

dmesg has a lot of information, but I have no idea where to go from here with it:
http://pastebin.com/UuUq2mg1

Raid info:
http://pastebin.com/SM8d3d0w

drescherjm
Advocate
Joined: 05 Jun 2004
Posts: 2790
Location: Pittsburgh, PA, USA

Posted: Thu Sep 29, 2011 4:45 pm

I see this when a drive has too many UREs and the drive goes completely offline trying to fix them. Or when a drive totally dies. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives. I have now moved all arrays to RAID 6 and monitor 5 key SMART parameters to better predict drive failure. Btw, when I have had failures where more than 2 drives were kicked out of a RAID 6, I was able to recover by using ddrescue (to recover the readable parts) to copy the kicked-out drives to new disks.
_________________
John

My gentoo overlay
Instructions for overlay

jbest
n00b
Joined: 29 Sep 2011
Posts: 3

Posted: Thu Sep 29, 2011 4:49 pm

drescherjm wrote:
I see this when a drive has too many UREs and the drive goes completely offline trying to fix them. This seems to happen a few times a year with my arrays based on Seagate 7200.10 and 7200.11 drives.


Forgive my ignorance, but URE? I'm guessing "RE" is read error, but I can't figure out what the "U" is for. It's too early in the morning for me.

FWIW, these are all 2TB Seagate 5900RPM "green" drives.

drescherjm
Advocate
Joined: 05 Jun 2004
Posts: 2790
Location: Pittsburgh, PA, USA

Posted: Thu Sep 29, 2011 5:03 pm

I believe it stands for "unrecoverable read error". These show up as Current_Pending_Sector and/or Offline_Uncorrectable in SMART.
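
A small sketch for pulling just those two attributes out of the attribute table, assuming the ten-column layout that `smartctl -A` prints for ATA drives (as in the SMART dumps elsewhere in this thread). The helper name `pending_sectors` is hypothetical. It reads the table on stdin, so it can be tried on saved output as well as on a live drive.

```shell
#!/bin/sh
# Read smartctl attribute lines on stdin and print "NAME RAW_VALUE"
# for the two attributes named in the post. Column 2 is the attribute
# name and column 10 the raw value in the ATA attribute table.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" { print $2, $10 }'
}

# Demo on two lines copied from a SMART dump in this thread.
pending_sectors <<'EOF'
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
EOF
```

On a live system the same filter would be used as `smartctl -A /dev/sdX | pending_sectors`, where `/dev/sdX` is a placeholder for the drive in question.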


BTW, here is a link to my script that checks the smart params:

https://raw.github.com/drescherjm/jmdgentoooverlay/master/Other/shell-scripts/examine_mdraid.sh

drescherjm
Advocate
Joined: 05 Jun 2004
Posts: 2790
Location: Pittsburgh, PA, USA

Posted: Thu Sep 29, 2011 5:06 pm

As for the OP's question: I do not believe a log exists for this. However, most of the time you will see errors for the drive in your dmesg, and more of them than what you posted.
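
A rough sketch of how one might narrow the kernel log down to the lines that matter, assuming the failed member is known by name (sdg in the original post) and that the SATA layer logs its errors with an `ataN:` or `ataN.NN:` prefix. The helper name `disk_kernel_lines` is hypothetical, and the pattern is deliberately broad: it keeps every libata line, not only those for the port the disk is attached to.

```shell
#!/bin/sh
# Filter kernel log lines on stdin: keep lines that mention the given
# disk name, plus any line carrying an "ataN:" / "ataN.NN:" prefix.
disk_kernel_lines() {
    dev="${1:?usage: disk_kernel_lines sdX}"
    grep -E "${dev}|ata[0-9]+(\.[0-9]+)?:"
}

# Demo on the two md lines from the original post.
printf '%s\n' \
    'md/raid:md0: Disk failure on sdg1, disabling device.' \
    'md/raid:md0: Operation continuing on 5 devices.' |
    disk_kernel_lines sdg
```

Typical live use would be `dmesg | disk_kernel_lines sdg`. Link resets or failed commands on one port shortly before the md "Disk failure" line would point at the cable or controller rather than the platters.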

jbest
n00b
Joined: 29 Sep 2011
Posts: 3

Posted: Thu Sep 29, 2011 5:33 pm

drescherjm wrote:
BTW, here is a link to my script that checks the smart params:

https://raw.github.com/drescherjm/jmdgentoooverlay/master/Other/shell-scripts/examine_mdraid.sh


Excellent! Thanks!

I just bought a backup drive, I'll rsync all of the data off with a livecd tonight and go from there.

Cheers!

drescherjm
Advocate
Joined: 05 Jun 2004
Posts: 2790
Location: Pittsburgh, PA, USA

Posted: Thu Sep 29, 2011 5:37 pm

A note about that script: for some manufacturers (like Seagate) some of the params may be bogus. You will know that when a value is something like 5443455 and you are expecting 10.

BTW, I did not explain exactly what the script does. It enumerates all /dev/sd devices, checks whether each device is in any of your mdadm arrays, and prints 5 key SMART params for the drive. I use this at work for my 75 to 100 drives in mdadm arrays. I also use Nagios to monitor the temps and the reallocated sector count for each drive.
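
The following is not the linked script, only a sketch of the same idea under stated assumptions: members are taken from `/proc/mdstat` instead of by scanning `/dev/sd*`, and the five attribute IDs (5, 187, 197, 198, 199) are my own pick from the attributes that appear in this thread, since the post does not say which five the script prints. The names `md_member_disks` and `report` are hypothetical.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the unique whole-disk
# names of all array members. Member tokens look like "sdg1[3]" or
# "sdg1[3](F)"; only the leading "sdX" part is kept.
md_member_disks() {
    grep -oE 'sd[a-z]+[0-9]*\[[0-9]+\]' | sed -E 's/^(sd[a-z]+).*/\1/' | sort -u
}

# Print name and raw value of the chosen attributes for each member disk.
# The ID list is an assumption, not taken from the original script.
report() {
    for d in $(md_member_disks < /proc/mdstat); do
        echo "== /dev/$d"
        smartctl -A "/dev/$d" | awk '$1 ~ /^(5|187|197|198|199)$/ { print $2, $10 }'
    done
}

# Only attempt the report where mdstat and smartctl are both available.
if [ -r /proc/mdstat ] && command -v smartctl >/dev/null 2>&1; then
    report
fi
```

`smartctl` normally needs root to query a drive, so the report is meant to be run as root or via cron. As the note above says, raw values from some manufacturers can be misleading, so the output is best compared across drives of the same model.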

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Fri Sep 30, 2011 4:20 am

Thanks for all your replies!
Fortunately for me it was only a matter of reseating the SATA cable for the failed drive. After that it showed up fine and smartctl did not find any errors. Also, the write-intent bitmap saved me from 1,5 days of resync :)

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Fri Sep 30, 2011 6:43 am

jbest wrote:
This is the second time in a week that this has happened, and I can't figure out why. I'd call it plain drive failure, but I know sdb and sdc (the drives that "failed") are on the same SATA controller


I'm almost sure it's the SATA controller or cable that causes your trouble.

drescherjm
Advocate
Joined: 05 Jun 2004
Posts: 2790
Location: Pittsburgh, PA, USA

Posted: Fri Sep 30, 2011 11:28 am

mbar wrote:
Thanks for all your replies!
Fortunately for me it was only a matter of reseating the SATA cable for the failed drive. After that it showed up fine and smartctl did not find any errors.


Not even UDMA_CRC_Error_Count?

Quote:
Also write intent bitmap saved me from 1,5 days of resync :)


Slow machine? At work it takes me less than 9 hours to resync a 9-drive (2TB 7200 RPM Hitachi 7K2000) mdadm RAID 6 on a 3-year-old Core 2 Quad. However, most of the drives are connected to an Intel sascui8 HBA card.

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Mon Oct 03, 2011 11:44 am

Yeah, you are right:
Code:
UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       23

That is the highest of all my drives (the others are in the range of 0 to 4 UDMA CRC errors).

Slow machine? It has a simple desktop motherboard (Nvidia + AMD Phenom X3, onboard SATA + 2 PCIe SATA "dumb" Silicon Image controllers) and the hard drives are Samsung HD154UI (5400 RPM), so it's not exactly an I/O speed demon :)

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Tue Oct 04, 2011 10:29 am

fcuk, it happened again, in the same funny way:

Code:
cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid5 sdg1[3](F) sdh1[1] sda1[5] sdf1[4] sdd1[6] sde1[0]
      7325680640 blocks super 1.2 level 5, 256k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 11/11 pages [44KB], 65536KB chunk

md2 : active raid0 sdb1[1] sdc1[0]
      1953518848 blocks super 1.2 128k chunks


Code:
smartctl -a /dev/sdg
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.4-gentoo-r2] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /8:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


Time to replace those dumb Silicon Image controllers, seems that one of them is failing.
Last time, the failed drive made it through a SMART test with a "drive good" result.

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Tue Oct 04, 2011 5:17 pm

This time maybe not so "good health":

Code:
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       12


but:

Code:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0


Should I be very concerned? In the past I mainly had Offline_Uncorrectable failures, and those are definite ;).

Anyway, here is the almost complete log:
Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x03)   Offline data collection activity
               is in progress.
               Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)   The previous self-test routine completed
               without error or no self-test has ever
               been run.
Total time to complete Offline
data collection:       (19188) seconds.
Offline data collection
capabilities:           (0x7b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 255) minutes.
Conveyance self-test routine
recommended polling time:     (  33) minutes.
SCT capabilities:           (0x003f)   SCT Status supported.
               SCT Error Recovery Control supported.
               SCT Feature Control supported.
               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   072   072   011    Pre-fail  Always       -       9140
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1386
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       11089
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       7815
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       268
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       12
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   056   000    Old_age   Always       -       26 (Min/Max 21/26)
194 Temperature_Celsius     0x0022   068   054   000    Old_age   Always       -       32 (Min/Max 21/32)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       40019
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       27
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Self-test routine in progress 80%      7815         -
# 2  Offline             Aborted by host               90%      7814         -
# 3  Short offline       Completed without error       00%      7732         -
# 4  Offline             Completed without error       00%      7626         -
# 5  Extended offline    Interrupted (host reset)      40%      7522         -
# 6  Extended offline    Completed without error       00%      6225         -
# 7  Extended offline    Completed without error       00%      3942         -
# 8  Extended offline    Completed without error       00%      3649         -
# 9  Short offline       Completed without error       00%      3631         -
#10  Offline             Completed without error       00%      3138         -
#11  Offline             Completed without error       50%      3112         -
#12  Offline             Aborted by host               10%      1866         -
#13  Offline             Completed without error       00%      1181         -
#14  Short offline       Completed without error       00%       425         -
#15  Short offline       Completed without error       00%       425         -
#16  Short offline       Completed without error       00%        13         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

mbar
Veteran
Joined: 19 Jan 2005
Posts: 1990
Location: Poland

Posted: Tue Jan 22, 2013 8:23 am

mbar wrote:
Time to replace those dumb Silicon Image controllers, seems that one of them is failing.


Yes, the SATA controller was faulty; a new one solved the problem.

All times are GMT