Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
[Solved] SATA drives mess
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware
View previous topic :: View next topic  
Author Message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Thu Sep 18, 2014 8:12 am    Post subject: [Solved] SATA drives mess Reply with quote

I faced with an amazing error. It even isn't an error, I do not know how to classify it.
OK, I have a server that running samba. Box has two sata drives onboard - sda and sdb.
I listened to music over the samba share and suddenly amarok stoped to play. I checked for shares and found that one of two shares disappears, mount doesn't show share mounted. OK, I logged into server over ssh and checked samba - seems it works. Dmesg doesn't show any fault. But running blkid I noticed very strange thing: one of my drives for some strange reason became /dev/sdc... Initially it was /dev/sda.
Code:
server ~ # blkid
/dev/sdb1: LABEL="GENTOO" UUID="63731ef5-a607-45c0-8cfb-36b11ad07a1f" TYPE="ext4" PARTUUID="0d310d30-01"
/dev/sdb2: LABEL="TORR" UUID="31d9a576-545f-41f1-a859-e5482b2a421c" TYPE="ext4" PARTUUID="0d310d30-02"
/dev/sdc1: LABEL="SHARE" UUID="b53719e3-20c0-468b-95eb-f54b4b7b9844" TYPE="ext4" PARTUUID="805607c7-01"

fdisk -l shows the same devices: sdb & sdc instead of sda & sdb.
I rebooted the server and checked it again - it is sda & sdb as it should be:
Code:
server ~ # blkid
/dev/sda1: LABEL="SHARE" UUID="b53719e3-20c0-468b-95eb-f54b4b7b9844" TYPE="ext4" PARTUUID="805607c7-01"
/dev/sdb1: LABEL="GENTOO" UUID="63731ef5-a607-45c0-8cfb-36b11ad07a1f" TYPE="ext4" PARTUUID="0d310d30-01"
/dev/sdb2: LABEL="TORR" UUID="31d9a576-545f-41f1-a859-e5482b2a421c" TYPE="ext4" PARTUUID="0d310d30-02"

Last week I got such a joke 5 or 6 times.
Who and why shifts drives enumeration (from sda to sdc)?


Last edited by creaker on Sun Sep 21, 2014 5:01 pm; edited 1 time in total
Back to top
View user's profile Send private message
lagalopex
Guru
Guru


Joined: 16 Oct 2004
Posts: 562

PostPosted: Thu Sep 18, 2014 9:32 am    Post subject: Reply with quote

Have a look at the output of dmesg when this happens.
You may also check your sda/sdc's smart values. (sys-apps/smartmontools => smartctl -a /dev/sda) and perhaps running some tests.
Back to top
View user's profile Send private message
Anon-E-moose
Watchman
Watchman


Joined: 23 May 2008
Posts: 6098
Location: Dallas area

PostPosted: Thu Sep 18, 2014 10:49 am    Post subject: Re: SATA drives mess Reply with quote

creaker wrote:
I faced with an amazing error. It even isn't an error, I do not know how to classify it
...
Who and why shifts drives enumeration (from sda to sdc)?


That can happen when the sata controller quits talking to the device (whatever reason) then reconnects.
There should have been a corresponding sata message in dmesg if that was the case.

I've seen it when the power supply wasn't working properly (I replaced it and problem went away)
or if one is overclocking the bus (sata controllers).
I suppose it could also happen if some component of the mb is going bad (south bridge IIRC)
_________________
PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland
Back to top
View user's profile Send private message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Thu Sep 18, 2014 12:36 pm    Post subject: Reply with quote

ОК, have to wait for it occurs once more to check dmesg, though I checked dmesg and doesn't notice any message added to it since boot.
Back to top
View user's profile Send private message
lagalopex
Guru
Guru


Joined: 16 Oct 2004
Posts: 562

PostPosted: Thu Sep 18, 2014 1:03 pm    Post subject: Reply with quote

The smart values can be checked at any time, the scores are saved within the disks firmware.
Your sig mentions 250Gb hdd... this size is quite old. Maybe the disks just fail? (=> look at the smart values!)
Back to top
View user's profile Send private message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Thu Sep 18, 2014 1:31 pm    Post subject: Reply with quote

I do not see any errors in smartctl output. Complete smart report:
http://pastebin.com/Ba11HCXp
No any fault or slowness on data accessing/filesystem navigation, nothing unusual.
Back to top
View user's profile Send private message
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54239
Location: 56N 3W

PostPosted: Thu Sep 18, 2014 7:27 pm    Post subject: Reply with quote

creaker,

The failure mechanism is that the drive is lost to the system but udev has not noticed the disconnect before the drive comes back.
As /dev/sda han not been released, the drive is allocated /dev/sdc. When udev catches up, it removes /dev/sda

I would expect dmesg to contain a similar set of messages as it produces for a USB drive disconnect and reconnect, unless the box rebooted.
Is the uptime correct?

The smart log looks good.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Back to top
View user's profile Send private message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Fri Sep 19, 2014 7:09 am    Post subject: Reply with quote

I've caught it. The dmesg tail:
Code:
[  958.706670] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x380000 action 0x6
[  958.706673] ata1.00: BMDMA stat 0x25
[  958.706676] ata1: SError: { 10B8B Dispar BadCRC }
[  958.706679] ata1.00: failed command: READ DMA
[  958.706684] ata1.00: cmd c8/00:00:88:61:1a/00:00:00:00:00/ef tag 0 dma 131072 in
         res 51/84:f0:98:61:1a/00:00:00:00:00/ef Emask 0x10 (ATA bus error)
[  958.706686] ata1.00: status: { DRDY ERR }
[  958.706688] ata1.00: error: { ICRC ABRT }
[  958.706694] ata1: hard resetting link
[  958.706695] ata1: nv: skipping hardreset on occupied port
[  959.190575] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  959.250708] ata1.00: model number mismatch 'ST3160318AS' != '3160318AS                             \xffffff80'
[  959.250712] ata1.00: revalidation failed (errno=-19)
[  959.250716] ata1: limiting SATA link speed to 1.5 Gbps
[  959.250718] ata1.00: limiting speed to UDMA/133:PIO3
[  964.193775] ata1: hard resetting link
[  964.193776] ata1: nv: skipping hardreset on occupied port
[  964.684112] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  964.704224] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[  964.704228] ata1.00: revalidation failed (errno=-5)
[  964.704229] ata1.00: disabled
[  969.687311] ata1: hard resetting link
[  970.597921] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  970.618030] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[  975.601123] ata1: hard resetting link
[  981.554949] ata1: link is slow to respond, please be patient (ready=0)
[  985.637577] ata1: SRST failed (errno=-16)
[  985.637580] ata1: hard resetting link
[  991.591413] ata1: link is slow to respond, please be patient (ready=-19)
[  995.674039] ata1: SRST failed (errno=-16)
[  995.674042] ata1: hard resetting link
[ 1001.627872] ata1: link is slow to respond, please be patient (ready=-19)
[ 1030.746621] ata1: SRST failed (errno=-16)
[ 1030.746624] ata1: limiting SATA link speed to 1.5 Gbps
[ 1030.746627] ata1: hard resetting link
[ 1031.627525] ata1: SATA link down (SStatus 100 SControl 310)
[ 1031.627542] sd 0:0:0:0: [sda] 
[ 1031.627543] Result: hostbyte=0x00 driverbyte=0x08
[ 1031.627545] sd 0:0:0:0: [sda] 
[ 1031.627546] Sense Key : 0xb [current] [descriptor]
[ 1031.627549] Descriptor sense data with sense descriptors (in hex):
[ 1031.627550]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 1031.627555]         0f 1a 61 98
[ 1031.627558] sd 0:0:0:0: [sda] 
[ 1031.627559] ASC=0x47 ASCQ=0x0
[ 1031.627562] sd 0:0:0:0: [sda] CDB:
[ 1031.627562] cdb[0]=0x28: 28 00 0f 1a 61 88 00 01 00 00
[ 1031.627568] end_request: I/O error, dev sda, sector 253387144
[ 1031.627579] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627581] sd 0:0:0:0: killing request
[ 1031.627586] ata1: EH complete
[ 1031.627594] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627596] sd 0:0:0:0: [sda] killing request
[ 1031.627599] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627606] sd 0:0:0:0: [sda] Unhandled error code
[ 1031.627608] sd 0:0:0:0: [sda] 
[ 1031.627609] Result: hostbyte=0x01 driverbyte=0x00
[ 1031.627610] sd 0:0:0:0: [sda] CDB:
[ 1031.627611] cdb[0]=0x28: 28 00 0f 1a 62 88 00 01 00 00
[ 1031.627615] end_request: I/O error, dev sda, sector 253387400
[ 1031.627637] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627641] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627645] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627647] sd 0:0:0:0: [sda] READ CAPACITY(16) failed
[ 1031.627649] sd 0:0:0:0: [sda] 
[ 1031.627650] Result: hostbyte=0x01 driverbyte=0x00
[ 1031.627652] sd 0:0:0:0: [sda] Sense not available.
[ 1031.627655] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627658] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627661] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627663] sd 0:0:0:0: [sda] READ CAPACITY failed
[ 1031.627665] sd 0:0:0:0: [sda] 
[ 1031.627665] Result: hostbyte=0x01 driverbyte=0x00
[ 1031.627667] sd 0:0:0:0: [sda] Sense not available.
[ 1031.627670] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627673] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627678] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627681] sd 0:0:0:0: rejecting I/O to offline device
[ 1031.627684] sd 0:0:0:0: [sda] Asking for cache data failed
[ 1031.627686] sd 0:0:0:0: [sda] Assuming drive cache: write through
[ 1031.627691] sda: detected capacity change from 160041885696 to 0
[ 1031.627715] ata1.00: detaching (SCSI 0:0:0:0)
[ 1031.630278] sd 0:0:0:0: [sda] Stopping disk
[ 1031.630303] sd 0:0:0:0: [sda] START_STOP FAILED
[ 1031.630306] sd 0:0:0:0: [sda] 
[ 1031.630307] Result: hostbyte=0x04 driverbyte=0x00
[ 1091.849488] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.849936] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.850304] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.850687] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.851056] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.851434] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7864811: comm smbd: reading directory lblock 0
[ 1091.851789] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7602198: comm smbd: reading directory lblock 0

sda drive lost, but was not reassigned as sdc this time. blkid shows only sdb.
I did not reboot machine and trying to get drive recognized again.
Back to top
View user's profile Send private message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Fri Sep 19, 2014 7:29 am    Post subject: Reply with quote

Scanned for lost drive at all available hosts:
Code:
server ~ # echo "0 0 0" > /sys/class/scsi_host/host0/scan
server ~ # echo "0 0 0" > /sys/class/scsi_host/host1/scan
server ~ # echo "0 0 0" > /sys/class/scsi_host/host2/scan
server ~ # echo "0 0 0" > /sys/class/scsi_host/host3/scan

it produced these lines in dmesg:
Code:
[ 2161.534570] ata1: exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xf
[ 2161.534576] ata1: SError: { PHYRdyChg CommWake 10B8B Dispar LinkSeq TrStaTrns }
[ 2161.534585] ata1: hard resetting link
[ 2162.435243] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2162.455350] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 2167.438442] ata1: hard resetting link
[ 2168.349050] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2168.369152] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 2168.369157] ata1: limiting SATA link speed to 1.5 Gbps
[ 2173.352248] ata1: hard resetting link
[ 2174.262857] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2174.282964] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[ 2179.266060] ata1: hard resetting link
[ 2180.176662] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 2180.176669] ata1: EH complete
[ 2210.359627] ata2: hard resetting link
[ 2210.359632] ata2: nv: skipping hardreset on occupied port
[ 2210.906448] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 2210.966826] ata2.00: configured for UDMA/133
[ 2210.966829] ata2: EH complete
[ 2234.335838] ata3: soft resetting link
[ 2234.502662] ata3: EH complete
[ 2249.458048] ata4: port disabled--ignoring
[ 2249.458053] ata4: EH complete


Added:
Tried to scan few minutes later, this time scaned successfully, drive found:
Code:
[ 3181.449101] ata1: hard resetting link
[ 3182.351889] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 3182.372186] ata1.00: ATA-8: ST3160318AS, CC38, max UDMA/133
[ 3182.372190] ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)
[ 3182.412202] ata1.00: configured for UDMA/133
[ 3182.412207] ata1: EH complete
[ 3182.412304] scsi 0:0:0:0: Direct-Access     ATA      ST3160318AS      CC38 PQ: 0 ANSI: 5
[ 3182.412508] sd 0:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
[ 3182.412605] sd 0:0:0:0: [sdc] Write Protect is off
[ 3182.412609] sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[ 3182.412643] sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 3182.423360]  sdc: sdc1
[ 3182.423645] sd 0:0:0:0: [sdc] Attached SCSI disk
Back to top
View user's profile Send private message
Akkara
Bodhisattva
Bodhisattva


Joined: 28 Mar 2006
Posts: 6702
Location: &akkara

PostPosted: Fri Sep 19, 2014 8:24 am    Post subject: Reply with quote

I've seen that happen when a SATA cable is loose. Try jiggling them while watching dmesg
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
Back to top
View user's profile Send private message
Anon-E-moose
Watchman
Watchman


Joined: 23 May 2008
Posts: 6098
Location: Dallas area

PostPosted: Fri Sep 19, 2014 9:42 am    Post subject: Reply with quote

Everything between the exception and the hard reset means that the sata controller and the sata device lost communication.
As Akkara said it could be a loose cable.
Or it could be the other things I mentioned earlier.

I've swapped out my non-locking sata cables for the ones that lock, in the past.


creaker wrote:
I did not reboot machine and trying to get drive recognized again.

I know you wanted to see if it would reconnect with a scan but
I recommend a reboot when the disconnect and reconnect happens.
I've had strange errors show up on the filesystems when I've had that happen in the past and not rebooted
especially when the filesystem was mounted when it disconnected/reconnected.
_________________
PRIME x570-pro, 3700x, 6.1 zen kernel
gcc 13, profile 17.0 (custom bare multilib), openrc, wayland
Back to top
View user's profile Send private message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Fri Sep 19, 2014 11:58 am    Post subject: Reply with quote

OK, I've replaced both power and data cords. Let see will it solve a problem.
Though I doubt it is a cable issue due to two reasons:
1. If it's a data cable failure it should be mentioned in smart output, but seems no data transmission issues there:
Code:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
...
...
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0


2. If it's a power cord issue dmesg will not show attempts to reconnect drive.

I've searched on web for 'exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xf' message and found thait may be kernel issue.
Anyway I replaced cables, have to see.
Back to top
View user's profile Send private message
lagalopex
Guru
Guru


Joined: 16 Oct 2004
Posts: 562

PostPosted: Fri Sep 19, 2014 12:14 pm    Post subject: Reply with quote

smartctl also mentions a possible firmware update CC49:
Quote:
Firmware Version: CC38

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213891en


you could run a long smart test. (But it should be successfull, as the error behaviour points to somewhere between hdd and kernel.)
Back to top
View user's profile Send private message
creaker
l33t
l33t


Joined: 14 Jul 2012
Posts: 651

PostPosted: Sun Sep 21, 2014 5:00 pm    Post subject: Reply with quote

Server used intensively for the last two days and seems cords replacing solved a problem.
Thanks for helping, guys!
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Kernel & Hardware All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum