View previous topic :: View next topic |
Author |
Message |
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Thu Sep 18, 2014 8:12 am Post subject: [Solved] SATA drives mess |
|
|
I faced with an amazing error. It even isn't an error, I do not know how to classify it.
OK, I have a server that running samba. Box has two sata drives onboard - sda and sdb.
I listened to music over the samba share and suddenly amarok stoped to play. I checked for shares and found that one of two shares disappears, mount doesn't show share mounted. OK, I logged into server over ssh and checked samba - seems it works. Dmesg doesn't show any fault. But running blkid I noticed very strange thing: one of my drives for some strange reason became /dev/sdc... Initially it was /dev/sda.
Code: | server ~ # blkid
/dev/sdb1: LABEL="GENTOO" UUID="63731ef5-a607-45c0-8cfb-36b11ad07a1f" TYPE="ext4" PARTUUID="0d310d30-01"
/dev/sdb2: LABEL="TORR" UUID="31d9a576-545f-41f1-a859-e5482b2a421c" TYPE="ext4" PARTUUID="0d310d30-02"
/dev/sdc1: LABEL="SHARE" UUID="b53719e3-20c0-468b-95eb-f54b4b7b9844" TYPE="ext4" PARTUUID="805607c7-01" |
fdisk -l shows the same devices: sdb & sdc instead of sda & sdb.
I rebooted the server and checked it again - it is sda & sdb as it should be:
Code: | server ~ # blkid
/dev/sda1: LABEL="SHARE" UUID="b53719e3-20c0-468b-95eb-f54b4b7b9844" TYPE="ext4" PARTUUID="805607c7-01"
/dev/sdb1: LABEL="GENTOO" UUID="63731ef5-a607-45c0-8cfb-36b11ad07a1f" TYPE="ext4" PARTUUID="0d310d30-01"
/dev/sdb2: LABEL="TORR" UUID="31d9a576-545f-41f1-a859-e5482b2a421c" TYPE="ext4" PARTUUID="0d310d30-02"
|
Last week I got such a joke 5 or 6 times.
Who and why shifts drives enumeration (from sda to sdc)?
Last edited by creaker on Sun Sep 21, 2014 5:01 pm; edited 1 time in total |
|
Back to top |
|
|
lagalopex Guru
Joined: 16 Oct 2004 Posts: 562
|
Posted: Thu Sep 18, 2014 9:32 am Post subject: |
|
|
Have a look at the output of dmesg when this happens.
You may also check your sda/sdc's smart values. (sys-apps/smartmontools => smartctl -a /dev/sda) and perhaps running some tests. |
|
Back to top |
|
|
Anon-E-moose Watchman
Joined: 23 May 2008 Posts: 6098 Location: Dallas area
|
Posted: Thu Sep 18, 2014 10:49 am Post subject: Re: SATA drives mess |
|
|
creaker wrote: | I faced with an amazing error. It even isn't an error, I do not know how to classify it
...
Who and why shifts drives enumeration (from sda to sdc)? |
That can happen when the sata controller quits talking to the device (whatever reason) then reconnects.
There should have been a corresponding sata message in dmesg if that was the case.
I've seen it when the power supply wasn't working properly (I replaced it and problem went away)
or if one is overclocking the bus (sata controllers).
I suppose it could also happen if some component of the mb is going bad (south bridge IIRC) _________________ PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland |
|
Back to top |
|
|
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Thu Sep 18, 2014 12:36 pm Post subject: |
|
|
ОК, have to wait for it occurs once more to check dmesg, though I checked dmesg and doesn't notice any message added to it since boot. |
|
Back to top |
|
|
lagalopex Guru
Joined: 16 Oct 2004 Posts: 562
|
Posted: Thu Sep 18, 2014 1:03 pm Post subject: |
|
|
The smart values can be checked at any time, the scores are saved within the disks firmware.
Your sig mentions 250Gb hdd... this size is quite old. Maybe the disks just fail? (=> look at the smart values!) |
|
Back to top |
|
|
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Thu Sep 18, 2014 1:31 pm Post subject: |
|
|
I do not see any errors in smartctl output. Complete smart report:
http://pastebin.com/Ba11HCXp
No any fault or slowness on data accessing/filesystem navigation, nothing unusual. |
|
Back to top |
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54239 Location: 56N 3W
|
Posted: Thu Sep 18, 2014 7:27 pm Post subject: |
|
|
creaker,
The failure mechanism is that the drive is lost to the system but udev has not noticed the disconnect before the drive comes back.
As /dev/sda han not been released, the drive is allocated /dev/sdc. When udev catches up, it removes /dev/sda
I would expect dmesg to contain a similar set of messages as it produces for a USB drive disconnect and reconnect, unless the box rebooted.
Is the uptime correct?
The smart log looks good. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
Back to top |
|
|
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Fri Sep 19, 2014 7:09 am Post subject: |
|
|
I've caught it. The dmesg tail:
Code: | [ 958.706670] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x6
[ 958.706673] ata1.00: BMDMA stat 0x25
[ 958.706676] ata1: SError: { 10B8B Dispar BadCRC }
[ 958.706679] ata1.00: failed command: READ DMA
[ 958.706684] ata1.00: cmd c8/00:00:88:61:1a/00:00:00:00:00/ef tag 0 dma 131072 in
res 51/84:f0:98:61:1a/00:00:00:00:00/ef Emask 0x10 (ATA bus error)
[ 958.706686] ata1.00: status: { DRDY ERR }
[ 958.706688] ata1.00: error: { ICRC ABRT }
[ 958.706694] ata1: hard resetting link
[ 958.706695] ata1: nv: skipping hardreset on occupied port
[ 959.190575] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 959.250708] ata1.00: model number mismatch 'ST3160318AS' != '3160318AS \xffffff80'
[ 959.250712] ata1.00: revalidation failed (errno=-19)
[ 959.250716] ata1: limiting SATA link speed to 1.5 Gbps
[ 959.250718] ata1.00: limiting speed to UDMA/133:PIO3
[ 964.193775] ata1: hard resetting link
[ 964.193776] ata1: nv: skipping hardreset on occupied port
[ 964.684112] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 964.704224] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 964.704228] ata1.00: revalidation failed (errno=-5)
[ 964.704229] ata1.00: disabled
[ 969.687311] ata1: hard resetting link
[ 970.597921] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 970.618030] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 975.601123] ata1: hard resetting link
[ 981.554949] ata1: link is slow to respond, please be patient (ready=0)
[ 985.637577] ata1: SRST failed (errno=-16)
[ 985.637580] ata1: hard resetting link
[ 991.591413] ata1: link is slow to respond, please be patient (ready=-19)
[ 995.674039] ata1: SRST failed (errno=-16)
[ 995.674042] ata1: hard resetting link
[ 1001.627872] ata1: link is slow to respond, please be patient (ready=-19)
[ 1030.746621] ata1: SRST failed (errno=-16)
[ 1030.746624] ata1: limiting SATA link speed to 1.5 Gbps
[ 1030.746627] ata1: hard resetting link
[ 1031.627525] ata1: SATA link down (SStatus 100 SControl 310)
[ 1031.627542] sd 0:0:0:0: [sda]
[ 1031.627543] Result: hostbyte=0x00 driverbyte=0x08
[ 1031.627545] sd 0:0:0:0: [sda]
[ 1031.627546] Sense Key : 0xb [current] [descriptor]
[ 1031.627549] Descriptor sense data with sense descriptors (in hex):
[ 1031.627550] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 1031.627555] 0f 1a 61 98
[ 1031.627558] sd 0:0:0:0: [sda]
[ 1031.627559] ASC=0x47 ASCQ=0x0
[ 1031.627562] sd 0:0:0:0: [sda] CDB:
[ 1031.627562] cdb[0]=0x28: 28 00 0f 1a 61 88 00 01 00 00
[ 1031.627568] end_request: I/O error, dev sda, sector 253387144
[ 1031.627579] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627581] sd 0:0:0:0: killing request
[ 1031.627586] ata1: EH complete
[ 1031.627594] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627596] sd 0:0:0:0: [sda] killing request
[ 1031.627599] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627606] sd 0:0:0:0: [sda] Unhandled error code
[ 1031.627608] sd 0:0:0:0: [sda]
[ 1031.627609] Result: hostbyte=0x01 driverbyte=0x00
[ 1031.627610] sd 0:0:0:0: [sda] CDB:
[ 1031.627611] cdb[0]=0x28: 28 00 0f 1a 62 88 00 01 00 00
[ 1031.627615] end_request: I/O error, dev sda, sector 253387400
[ 1031.627637] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627641] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627645] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627647] sd 0:0:0:0: [sda] READ CAPACITY(16) failed
[ 1031.627649] sd 0:0:0:0: [sda]
[ 1031.627650] Result: hostbyte=0x01 driverbyte=0x00
[ 1031.627652] sd 0:0:0:0: [sda] Sense not available.
[ 1031.627655] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627658] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627661] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627663] sd 0:0:0:0: [sda] READ CAPACITY failed
[ 1031.627665] sd 0:0:0:0: [sda]
[ 1031.627665] Result: hostbyte=0x01 driverbyte=0x00
[ 1031.627667] sd 0:0:0:0: [sda] Sense not available.
[ 1031.627670] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627673] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627678] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627681] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627684] sd 0:0:0:0: [sda] Asking for cache data failed
[ 1031.627686] sd 0:0:0:0: [sda] Assuming drive cache: write through
[ 1031.627691] sda: detected capacity change from 160041885696 to 0
[ 1031.627715] ata1.00: detaching (SCSI 0:0:0:0)
[ 1031.630278] sd 0:0:0:0: [sda] Stopping disk
[ 1031.630303] sd 0:0:0:0: [sda] START_STOP FAILED
[ 1031.630306] sd 0:0:0:0: [sda]
[ 1031.630307] Result: hostbyte=0x04 driverbyte=0x00
[ 1091.849488] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.849936] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.850304] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.850687] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.851056] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.851434] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.851789] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7602198: comm smbd: reading directory lblock 0
|
sda drive lost, but was not reassigned as sdc this time. blkid shows only sdb.
I did not reboot machine and trying to get drive recognized again. |
|
Back to top |
|
|
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Fri Sep 19, 2014 7:29 am Post subject: |
|
|
Scanned for lost drive at all available hosts:
Code: | server ~ # echo "0 0 0" > /sys/class/scsi_host/host0/scan
server ~ # echo "0 0 0" > /sys/class/scsi_host/host1/scan
server ~ # echo "0 0 0" > /sys/class/scsi_host/host2/scan
server ~ # echo "0 0 0" > /sys/class/scsi_host/host3/scan |
it produced these lines in dmesg:
Code: | [ 2161.534570] ata1: exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xf
[ 2161.534576] ata1: SError: { PHYRdyChg CommWake 10B8B Dispar LinkSeq TrStaTrns }
[ 2161.534585] ata1: hard resetting link
[ 2162.435243] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2162.455350] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 2167.438442] ata1: hard resetting link
[ 2168.349050] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2168.369152] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 2168.369157] ata1: limiting SATA link speed to 1.5 Gbps
[ 2173.352248] ata1: hard resetting link
[ 2174.262857] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2174.282964] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 2179.266060] ata1: hard resetting link
[ 2180.176662] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2180.176669] ata1: EH complete
[ 2210.359627] ata2: hard resetting link
[ 2210.359632] ata2: nv: skipping hardreset on occupied port
[ 2210.906448] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 2210.966826] ata2.00: configured for UDMA/133
[ 2210.966829] ata2: EH complete
[ 2234.335838] ata3: soft resetting link
[ 2234.502662] ata3: EH complete
[ 2249.458048] ata4: port disabled--ignoring
[ 2249.458053] ata4: EH complete |
Added:
Tried to scan few minutes later, this time scaned successfully, drive found:
Code: | [ 3181.449101] ata1: hard resetting link
[ 3182.351889] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 3182.372186] ata1.00: ATA-8: ST3160318AS, CC38, max UDMA/133
[ 3182.372190] ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)
[ 3182.412202] ata1.00: configured for UDMA/133
[ 3182.412207] ata1: EH complete
[ 3182.412304] scsi 0:0:0:0: Direct-Access ATA ST3160318AS CC38 PQ: 0 ANSI: 5
[ 3182.412508] sd 0:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
[ 3182.412605] sd 0:0:0:0: [sdc] Write Protect is off
[ 3182.412609] sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[ 3182.412643] sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 3182.423360] sdc: sdc1
[ 3182.423645] sd 0:0:0:0: [sdc] Attached SCSI disk |
|
|
Back to top |
|
|
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
|
Posted: Fri Sep 19, 2014 8:24 am Post subject: |
|
|
I've seen that happen when a SATA cable is loose. Try jiggling them while watching dmesg _________________ Many think that Dilbert is a comic. Unfortunately it is a documentary. |
|
Back to top |
|
|
Anon-E-moose Watchman
Joined: 23 May 2008 Posts: 6098 Location: Dallas area
|
Posted: Fri Sep 19, 2014 9:42 am Post subject: |
|
|
Everything between the exception and the hard reset means that the sata controller and the sata device lost communication.
As Akkara said it could be a loose cable.
Or it could be the other things I mentioned earlier.
I've swapped out my non-locking sata cables for the ones that lock, in the past.
creaker wrote: | I did not reboot machine and trying to get drive recognized again. |
I know you wanted to see if it would reconnect with a scan but
I recommend a reboot when the disconnect and reconnect happens.
I've had strange errors show up on the filesystems when I've had that happen in the past and not rebooted
especially when the filesystem was mounted when it disconnected/reconnected. _________________ PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland |
|
Back to top |
|
|
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Fri Sep 19, 2014 11:58 am Post subject: |
|
|
OK, I've replaced both power and data cords. Let see will it solve a problem.
Though I doubt it is a cable issue due to two reasons:
1. If it's a data cable failure it should be mentioned in smart output, but seems no data transmission issues there:
Code: | ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
...
...
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0 |
2. If it's a power cord issue dmesg will not show attempts to reconnect drive.
I've searched on web for 'exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xf' message and found thait may be kernel issue.
Anyway I replaced cables, have to see. |
|
Back to top |
|
|
lagalopex Guru
Joined: 16 Oct 2004 Posts: 562
|
Posted: Fri Sep 19, 2014 12:14 pm Post subject: |
|
|
smartctl also mentions a possible firmware update CC49:
you could run a long smart test. (But it should be successfull, as the error behaviour points to somewhere between hdd and kernel.) |
|
Back to top |
|
|
creaker l33t
Joined: 14 Jul 2012 Posts: 651
|
Posted: Sun Sep 21, 2014 5:00 pm Post subject: |
|
|
Server used intensively for the last two days and seems cords replacing solved a problem.
Thanks for helping, guys! |
|
Back to top |
|
|
|