Gentoo Forums

Hard disk problem: "Logical unit not ready"
antu456
n00b
Joined: 11 Sep 2013
Posts: 9

PostPosted: Wed Sep 11, 2013 9:01 pm    Post subject: Hard disk problem: "Logical unit not ready"

Hey,

On one of my Gentoo systems I regularly get errors in dmesg which I'm a bit worried about.

The system has 7 disks in a software RAID6 (mdraid), which go to standby after a few minutes of inactivity to save energy. When I access the filesystem on the RAID while the disks are in standby mode, I sometimes (but not always) get these messages in dmesg:
Code:

[1221060.356007] sd 0:0:1:0: [sdb] Unhandled error code
[1221060.356013] sd 0:0:1:0: [sdb] 
[1221060.356015] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[1221060.356018] sd 0:0:1:0: [sdb] CDB:
[1221060.356019] Read(16): 88 00 00 00 00 00 1b 66 7e 00 00 00 00 08 00 00
[1221060.356031] end_request: I/O error, dev sdb, sector 459701760
[1221060.356050] sd 0:0:1:0: [sdb] Device not ready
[1221060.356051] sd 0:0:1:0: [sdb] 
[1221060.356052] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1221060.356053] sd 0:0:1:0: [sdb] 
[1221060.356054] Sense Key : Not Ready [current]
[1221060.356055] sd 0:0:1:0: [sdb] 
[1221060.356057] Add. Sense: Logical unit not ready, initializing command required
[1221060.356058] sd 0:0:1:0: [sdb] CDB:
[1221060.356058] Read(16): 88 00 00 00 00 00 1b 66 7e 08 00 00 00 08 00 00
[1221060.356063] end_request: I/O error, dev sdb, sector 459701768
[1221060.356066] sd 0:0:1:0: [sdb] Device not ready
[1221060.356066] sd 0:0:1:0: [sdb] 
[1221060.356067] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1221060.356068] sd 0:0:1:0: [sdb] 
[1221060.356069] Sense Key : Not Ready [current]
[1221060.356070] sd 0:0:1:0: [sdb] 
[1221060.356070] Add. Sense: Logical unit not ready, initializing command required
[1221060.356071] sd 0:0:1:0: [sdb] CDB:
[1221060.356072] Read(16): 88 00 00 00 00 00 1b 66 7e 10 00 00 00 08 00 00
[1221060.356076] end_request: I/O error, dev sdb, sector 459701776
[1221060.356080] sd 0:0:1:0: [sdb] Device not ready
[1221060.356081] sd 0:0:1:0: [sdb] 
[1221060.356082] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1221060.356082] sd 0:0:1:0: [sdb] 
[1221060.356083] Sense Key : Not Ready [current]
[1221060.356084] sd 0:0:1:0: [sdb] 
[1221060.356085] Add. Sense: Logical unit not ready, initializing command required
[1221060.356086] sd 0:0:1:0: [sdb] CDB:
[1221060.356086] Read(16): 88 00 00 00 00 00 1b 66 7e 18 00 00 00 08 00 00
[1221060.356090] end_request: I/O error, dev sdb, sector 459701784
[1221060.356093] sd 0:0:1:0: [sdb] Device not ready
[1221060.356094] sd 0:0:1:0: [sdb] 
[1221060.356095] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1221060.356096] sd 0:0:1:0: [sdb] 
[1221060.356096] Sense Key : Not Ready [current]
[1221060.356097] sd 0:0:1:0: [sdb] 
[1221060.356098] Add. Sense: Logical unit not ready, initializing command required
[1221060.356099] sd 0:0:1:0: [sdb] CDB:
[1221060.356099] Read(16): 88 00 00 00 00 00 1b 66 7e 20 00 00 00 08 00 00
[1221060.356104] end_request: I/O error, dev sdb, sector 459701792
[1221060.356107] sd 0:0:1:0: [sdb] Device not ready
[1221060.356108] sd 0:0:1:0: [sdb] 
[1221060.356108] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1221060.356109] sd 0:0:1:0: [sdb] 
[1221060.356110] Sense Key : Not Ready [current]
[1221060.356111] sd 0:0:1:0: [sdb] 
[1221060.356112] Add. Sense: Logical unit not ready, initializing command required
[1221060.356113] sd 0:0:1:0: [sdb] CDB:
[1221060.356113] Read(16): 88 00 00 00 00 00 1b 66 7e 28 00 00 00 08 00 00
[1221060.356117] end_request: I/O error, dev sdb, sector 459701800

But after a few seconds I can access and use the filesystem normally; it doesn't show any errors or problems. The messages are always the same, only the sector number and the affected disk differ (and only sda, sdb and sdc are affected; the other disks never produce these messages).
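
For reference, this is how the power state of a single disk can be checked (hdparm again), to confirm it really was in standby when the error appears:
Code:
hdparm -C /dev/sdb   # reports "drive state is:  standby" or "active/idle"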

Hardware is:
Mainboard: Supermicro X9SCM-F
Disks: 7x HGST Deskstar IDK 4TB (0S03356)
HBA: IBM ServeRAID M1015 (flashed to IT mode so it just passes the disks through)

Some of the disks (I think 3) are connected to the mainboard, some to the M1015.

Should I be worried about this or can I ignore these messages? Could this have to do with staggered spin-up? What do I need to do to fix this?
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54119
Location: 56N 3W

PostPosted: Wed Sep 11, 2013 10:14 pm

antu456,

Welcome to Gentoo.

Yes, you should be worried and you should have current validated backups.

At face value, you have at least one hardware problem. Several problems at a time are rare, so to keep the options down, we'll work on the idea that it's a single problem.

Look for one thing all the affected disks have in common. HDD controller? PSU?
Is your PSU good enough to support 7 HDDs?
Do you have more than 2 drives on the same power cable from the PSU?

As you allow the drives to spin down, which is a very bad thing for drive life, it might be a spin-up issue too, depending on how spin-up control is implemented.
SCSI allows drives to be spun up when they are addressed. This feature does not have to be used and I'm unsure if it made it into the SATA spec. It makes for slow spin-up times for a group of drives, but almost eliminates the start-up current spike caused by, in your case, 7 drives trying to spin up at the same time.
The spin-up current is about 5x the normal drive motor run current. Can your PSU provide the +12V current to start all your drives at the same time?
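
As a rough back-of-envelope example (the per-drive figures are assumptions - check your drives' datasheet):
Code:
# Assumed typical 3.5" drive figures (datasheet-dependent):
#   spin-up current on +12V : ~2.0 A per drive
#   running current on +12V : ~0.6 A per drive
# 7 drives spinning up together: 7 * 2.0 A = 14 A   ->  ~170 W on +12V alone
# 7 drives already spinning:     7 * 0.6 A = 4.2 A  ->  ~50 W on +12V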

There are no solutions there - just some things to look at.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
antu456
n00b
Joined: 11 Sep 2013
Posts: 9

PostPosted: Thu Sep 12, 2013 1:01 am

Thanks for your answer! :-)

The disks are inside 2 Cremax ICY DOCK MB455SPF cages, 5 disks in the first and 3 in the second one (I have 8 hard disks total, but the first isn't part of the RAID; additionally there is an SSD for the system, but it's not inside one of the cages). They use the same PSU (the system has only one), but different cables if I remember correctly.

The PSU (Enermax Platimax 500 Watt ATX 2.3) should be able to support 7 disks spinning up. It's a 500W PSU; when idle the system uses around 40-50W, when the disks are spinning up it's 150-200W for a few seconds, and then 90-100W while they are running.

I know that spinning disks up/down often is bad for the drives, but they are only needed 2-3 times a day, so they just spin up a few times per day, which should be fine (according to HGST they can be spun up 600,000 times). According to SMART, the spin-up time and load cycle count are fine (for sda):

Code:
  3 Spin_Up_Time            0x0007   126   126   024    Pre-fail  Always       -       611 (Average 615)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       490
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       490
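
(Those lines are from smartctl; for completeness, this is roughly how I pull just those attributes:)
Code:
smartctl -A /dev/sda | egrep 'Spin_Up_Time|Power-Off_Retract_Count|Load_Cycle_Count'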


I've looked at what the affected disks (sda, sdb, and sdc) have in common and found this:

Code:
# ls /sys/block/sd* -lah
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sda -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdb -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdc -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:2/end_device-0:2/target0:0:2/0:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdd -> ../devices/pci0000:00/0000:00:1f.2/ata1/host1/target1:0:0/1:0:0:0/block/sdd
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sde -> ../devices/pci0000:00/0000:00:1f.2/ata2/host2/target2:0:0/2:0:0:0/block/sde
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdf -> ../devices/pci0000:00/0000:00:1f.2/ata3/host3/target3:0:0/3:0:0:0/block/sdf
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdg -> ../devices/pci0000:00/0000:00:1f.2/ata4/host4/target4:0:0/4:0:0:0/block/sdg
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdh -> ../devices/pci0000:00/0000:00:1f.2/ata5/host5/target5:0:0/5:0:0:0/block/sdh
lrwxrwxrwx 1 root root 0 29. Aug 00:24 /sys/block/sdi -> ../devices/pci0000:00/0000:00:1f.2/ata6/host6/target6:0:0/6:0:0:0/block/sdi


The 3 affected disks use the same controller, the disks using the other controller are not affected.
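
To double-check which driver sits behind each SCSI host, the sysfs proc_name attribute should tell (host0 should be the mpt2sas/M1015, the rest ahci):
Code:
cat /sys/class/scsi_host/host0/proc_name   # mpt2sas
cat /sys/class/scsi_host/host1/proc_name   # ahci (onboard SATA)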

Code:
# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/Ivy Bridge DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 05)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation C204 Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
04:03.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)


So it looks like the controller could be the problem.

I found a post on the Linux kernel mailing list where someone has a very similar problem here. Note that he has the same controller (they have different names but are the same hardware) and is also using SATA disks. The only difference is that I don't get errors from mdraid. Later in that thread someone posted a patch which was included in the kernel (I'm using 3.9.9-gentoo, and the patch is included there, I checked). It's this patch here. But it doesn't seem to be the same problem I have (my disks do spin up and there are no errors besides those posted above).

Not sending the disks to standby isn't a solution for me, because that would be a big waste of energy; they are only needed 2-3 hours per day when backups are made or when I access some big files on the NAS.

Could this be a bug in the driver (I'll update to the latest stable kernel later to check if that fixes the problem)? Or a configuration problem? What should I do next?

Edit: There is a firmware update for the controller, I'll install the new firmware tomorrow and see if that helps. :-)
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54119
Location: 56N 3W

PostPosted: Thu Sep 12, 2013 12:46 pm

antu456,

You have done your research very well.

Yes, it could be a bug in the firmware, the driver or even the kernel RAID code, not checking that all members of the set are online before it tries a disk access.
That 'all' is tricky - a RAID6 needs only n-2 drives to provide access and you don't want to prevent access if you have sufficient drives.

Can you move one or more of the affected drives to the other controller and see if the moved drives are no longer affected?
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
antu456
n00b
Joined: 11 Sep 2013
Posts: 9

PostPosted: Thu Sep 12, 2013 8:34 pm

Unfortunately it's not possible to move the affected drives to the other controller as there are no more free SATA ports on it (there are only 6).

I have updated the BIOS/firmware of the controller to the latest versions.
I updated the kernel to version 3.10.7-gentoo.

I've gone through the controller's BIOS and changed the settings:

  • Changed Report Device Missing Delay and IO Device Missing Delay from 0 to 15 seconds.
  • Changed IO Timeout [...] values from 10 to 20 seconds.

Screenshots:
http://i.imgur.com/AuP607p.png
http://i.imgur.com/NVA5iBQ.png

And then I checked again, but the error is still occurring.

The next thing I found in the dmesg output is this:

Code:
[    1.026520] mpt2sas version 14.100.00.00 loaded
[    1.042013] mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (8144424 kB)
[    1.058775] mpt2sas 0000:01:00.0: irq 41 for MSI/MSI-X
[    1.058787] mpt2sas0-msix0: PCI-MSI-X enabled: IRQ 41
[    1.067429] mpt2sas0: iomem(0x00000000df600000), mapped(0xffffc90000070000), size(16384)
[    1.083939] mpt2sas0: ioport(0x000000000000e000), size(256)
[    1.329563] mpt2sas0: sending message unit reset !!
[    1.342888] mpt2sas0: message unit reset: SUCCESS
[    1.476110] mpt2sas0: Allocated physical memory: size(7418 kB)
[    1.483214] mpt2sas0: Current Controller Queue Depth(3307), Max Controller Queue Depth(3432)
[    1.498147] mpt2sas0: Scatter Gather Elements per IO(128)
[    1.699724] mpt2sas0: LSISAS2008: FWVersion(17.00.01.00), ChipRevision(0x03), BiosVersion(07.33.00.00)
[    1.716001] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[    1.736901] mpt2sas0: sending port enable !!
[    1.742405] mpt2sas0: host_add: handle(0x0001), sas_addr(0x500605b00604f420), phys(8)
[    1.749451] mpt2sas0: port enable: SUCCESS


The driver version is 14.100.00.00, but in the current Linux git tree there is a newer version (16), so I'll try the latest git kernel now.
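
(For the record, the version of the loaded mpt2sas driver can also be read directly, e.g.:)
Code:
modinfo mpt2sas | grep -i '^version'
cat /sys/module/mpt2sas/version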

Apart from that, I'm running out of ideas. :(
antu456
n00b
Joined: 11 Sep 2013
Posts: 9

PostPosted: Sat Jun 14, 2014 2:31 pm

I tried some things to solve the problem:

  • Replaced the cage of the affected disks with a Supermicro CSE-M35T-1.
  • Updated to gentoo-sources-3.15.0-r1.
  • Updated the firmware of the controller to the latest version (19), BIOS update was not possible because the checksum of the new BIOS was invalid (I tried downloading it again, but always get the checksum error).
  • Switched off power management for the controller(s).
    Code:
    echo 'max_performance' > /sys/class/scsi_host/hostX/link_power_management_policy # For every host; max_performance disables link power management.
    echo 'on' > /sys/bus/pci/devices/0000:01:00.0/power/control

  • I changed the controller settings and set:

    • Changed Report Device Missing Delay and IO Device Missing Delay to 60 seconds.
    • Changed IO Timeout [...] values to 120 seconds.

  • I changed the device timeout for all disks to 120 seconds.
    Code:
    for i in /sys/block/sd?/device/timeout; do echo 120 > "$i"; done
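    (A quick sanity check that the sysfs values actually stuck:)
    Code:
    grep . /sys/block/sd?/device/timeout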


But the problem is still there.
Quote:

[ 8207.757438] sd 0:0:3:0: [sdd] Unhandled error code
[ 8207.757444] sd 0:0:3:0: [sdd]
[ 8207.757446] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 8207.757448] sd 0:0:3:0: [sdd] CDB:
[ 8207.757450] Read(16): 88 00 00 00 00 00 15 99 ad 08 00 00 00 08 00 00
[ 8207.757462] end_request: I/O error, dev sdd, sector 362392840
[ 8207.757486] sd 0:0:3:0: [sdd] Device not ready
[ 8207.757488] sd 0:0:3:0: [sdd]
[ 8207.757490] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8207.757492] sd 0:0:3:0: [sdd]
[ 8207.757493] Sense Key : Not Ready [current]
[ 8207.757496] sd 0:0:3:0: [sdd]
[ 8207.757499] Add. Sense: Logical unit not ready, initializing command required
[ 8207.757501] sd 0:0:3:0: [sdd] CDB:
[ 8207.757502] Read(16): 88 00 00 00 00 00 15 99 ad 10 00 00 00 08 00 00
[ 8207.757511] end_request: I/O error, dev sdd, sector 362392848
[ 8207.757517] sd 0:0:3:0: [sdd] Device not ready
[ 8207.757519] sd 0:0:3:0: [sdd]
[ 8207.757520] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8207.757522] sd 0:0:3:0: [sdd]
[ 8207.757523] Sense Key : Not Ready [current]
[ 8207.757526] sd 0:0:3:0: [sdd]
[ 8207.757527] Add. Sense: Logical unit not ready, initializing command required
[ 8207.757529] sd 0:0:3:0: [sdd] CDB:
[ 8207.757530] Read(16): 88 00 00 00 00 00 15 99 ad 18 00 00 00 08 00 00
[ 8207.757539] end_request: I/O error, dev sdd, sector 362392856
[ 8207.757546] sd 0:0:3:0: [sdd] Device not ready
[ 8207.757548] sd 0:0:3:0: [sdd]
[ 8207.757549] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8207.757551] sd 0:0:3:0: [sdd]
[ 8207.757552] Sense Key : Not Ready [current]
[ 8207.757554] sd 0:0:3:0: [sdd]
[ 8207.757556] Add. Sense: Logical unit not ready, initializing command required
[ 8207.757558] sd 0:0:3:0: [sdd] CDB:
[ 8207.757559] Read(16): 88 00 00 00 00 00 15 99 ad 20 00 00 00 08 00 00
[ 8207.757567] end_request: I/O error, dev sdd, sector 362392864
[ 8207.757574] sd 0:0:3:0: [sdd] Device not ready
[ 8207.757575] sd 0:0:3:0: [sdd]
[ 8207.757577] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8207.757578] sd 0:0:3:0: [sdd]
[ 8207.757579] Sense Key : Not Ready [current]
[ 8207.757582] sd 0:0:3:0: [sdd]
[ 8207.757583] Add. Sense: Logical unit not ready, initializing command required
[ 8207.757585] sd 0:0:3:0: [sdd] CDB:
[ 8207.757586] Read(16): 88 00 00 00 00 00 15 99 ad 28 00 00 00 08 00 00
[ 8207.757595] end_request: I/O error, dev sdd, sector 362392872
[ 8207.757601] sd 0:0:3:0: [sdd] Device not ready
[ 8207.757603] sd 0:0:3:0: [sdd]
[ 8207.757605] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8207.757606] sd 0:0:3:0: [sdd]
[ 8207.757607] Sense Key : Not Ready [current]
[ 8207.757610] sd 0:0:3:0: [sdd]
[ 8207.757611] Add. Sense: Logical unit not ready, initializing command required
[ 8207.757613] sd 0:0:3:0: [sdd] CDB:
[ 8207.757614] Read(16): 88 00 00 00 00 00 15 99 ad 30 00 00 00 08 00 00
[ 8207.757623] end_request: I/O error, dev sdd, sector 362392880
[ 8234.946594] sd 0:0:2:0: [sdc] Unhandled error code
[ 8234.946600] sd 0:0:2:0: [sdc]
[ 8234.946602] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 8234.946604] sd 0:0:2:0: [sdc] CDB:
[ 8234.946606] Read(16): 88 00 00 00 00 00 44 19 ad 08 00 00 00 08 00 00
[ 8234.946617] end_request: I/O error, dev sdc, sector 1142533384
[ 8234.946634] sd 0:0:2:0: [sdc] Device not ready
[ 8234.946637] sd 0:0:2:0: [sdc]
[ 8234.946638] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8234.946640] sd 0:0:2:0: [sdc]
[ 8234.946641] Sense Key : Not Ready [current]
[ 8234.946644] sd 0:0:2:0: [sdc]
[ 8234.946646] Add. Sense: Logical unit not ready, initializing command required
[ 8234.946648] sd 0:0:2:0: [sdc] CDB:
[ 8234.946650] Read(16): 88 00 00 00 00 00 44 19 ad 10 00 00 00 08 00 00
[ 8234.946659] end_request: I/O error, dev sdc, sector 1142533392
[ 8234.946664] sd 0:0:2:0: [sdc] Device not ready
[ 8234.946666] sd 0:0:2:0: [sdc]
[ 8234.946668] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8234.946669] sd 0:0:2:0: [sdc]
[ 8234.946670] Sense Key : Not Ready [current]
[ 8234.946673] sd 0:0:2:0: [sdc]
[ 8234.946674] Add. Sense: Logical unit not ready, initializing command required
[ 8234.946676] sd 0:0:2:0: [sdc] CDB:
[ 8234.946677] Read(16): 88 00 00 00 00 00 44 19 ad 18 00 00 00 08 00 00
[ 8234.946686] end_request: I/O error, dev sdc, sector 1142533400


Does somebody have another idea what I could try?

I want the disks to suspend after ~8 minutes, and I use this command to configure the standby settings of the disks:
Code:
hdparm -B 127 -S 100 <Path to disk>
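
(For reference, this is how I read those values from the hdparm man page - hopefully I got it right:)
Code:
# -B 127 : highest APM level that still allows spin-down (128-254 would prevent it)
# -S 100 : standby timeout = 100 * 5 s = 500 s, i.e. roughly the ~8 minutes I want
hdparm -B /dev/sdb   # without a value, prints the current APM_level for verification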

Do I have to configure something else in LVM/mdraid/Linux so that it recognizes that the disks could be in standby and waits until they have woken up before sending commands?

I can reproduce the problem by:
Code:

hdparm -y /dev/<Disk> # Send disk to standby.
# Wait a little...
ls -lah /mnt/<Path on RAID>
skywalker67
n00b
Joined: 23 Jul 2014
Posts: 1

PostPosted: Sat Jul 26, 2014 8:15 am

Hi antu456, I am experiencing the same problem as you, on an LSI 9211-8i and 64-bit Debian with a kernel from testing, currently version 3.14-1.

Like in your case the disk actually wakes up; I get one hostbyte=DID_OK driverbyte=DRIVER_OK and a couple of hostbyte=DID_OK driverbyte=DRIVER_SENSE, and in dmesg I can see end_request: I/O error, but no error at a higher level. I set the standby timeout on the drives with hdparm -S 150, and sometimes I wake them up with the same command. (I won't go into the specifics of why I do that, it is probably not important.) I am not yet sure whether the problem occurs with normal I/O from the system or when I wake a drive up with the -S 150 command.

The problem occurs only sometimes. I have had this controller for only a couple of weeks and I have experienced it fewer than 10 times. The drives go to standby and wake up a couple of times a day. Until now the errors were always on the same drive on the last port of the card, marked as slot 7 in the dmesg / kernel messages. I have not yet tested reconnecting the drives to other ports. What is your setup? Which exact controller do you have, on which drive does it happen and how often?
antu456
n00b
Joined: 11 Sep 2013
Posts: 9

PostPosted: Sat Oct 18, 2014 12:32 pm

Unfortunately the problem persists.

I've now updated to gentoo-sources-3.17.1 but still get the errors. I'm just ignoring them, as they don't seem to cause any problems or corruption.
frostschutz
Advocate
Joined: 22 Feb 2005
Posts: 2977
Location: Germany

PostPosted: Sat Oct 18, 2014 1:48 pm

I'm using a 7-disk RAID5 and sending the disks to standby works fine for me. Nothing in dmesg. It even does "staggered spinup" when I try to access the HDD filesystems: it waits for the first disk to wake up and return data, then proceeds to the next disk and so on... (with 7 disks this makes the wakeup process kinda slow). No RAID controller involved though, all onboard. So I assume this is an issue with your controller(s)...?
antu456
n00b
Joined: 11 Sep 2013
Posts: 9

PostPosted: Mon Oct 27, 2014 12:25 pm

Yes, it seems to be a problem with the LSI 9211-8i controller (the IBM ServeRAID M1015 uses the same chip). It's probably a driver/firmware problem, so there is not much we can do about it. :/
jgehring123
n00b
Joined: 12 Dec 2014
Posts: 1
Location: United States

PostPosted: Fri Dec 12, 2014 11:49 pm

Did anyone find out anything more about this? I am hitting a very similar situation and can add some data. I have output below from dmesg when this case is hit. There are a few extra lines of debugging that I inserted. The sense data logs are from executing:

~ # echo 0x9411 > /proc/sys/dev/scsi/logging_level
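
(To switch the extra sense-data logging back off afterwards, echoing 0 into the same file should do it:)

~ # echo 0 > /proc/sys/dev/scsi/logging_level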

I'm running on a Supermicro 6036ST-6LR (http://www.supermicro.com/products/system/3U/6036/SYS-6036ST-6LR.cfm). This system sports two storage processors with joint access to an array of 16 dual-ported SAS drives.
The reproduction recipe:

* Start raid6 array of 8 drives (controller A)
* Begin long running dd read against the array (controller A)
* On controller B, begin loop that calls sdparm once per second against a drive included in the array started on controller A.
sdparm --page=po --clear=IDLE_B,IDLE_C,STANDBY_Y -S /dev/sdd
* Note on controller A that the drive sdd is kicked from the raid array with the dmesg output listed below. (in this case sdd happens to refer to the same drive on both controllers)

This is with kernel 3.16.7. If I run the same test with kernel 3.4.60, I don't hit the problem at all. Same mpt2sas driver with both tests:
[ 147.692212] mpt2sas0: LSISAS2008: FWVersion(16.00.01.00), ChipRevision(0x03), BiosVersion(07.31.00.00)

So I would say that this is not a firmware issue, per se, but rather a change in the kernel scsi code. It's as if the change being made by the sdparm call from the B controller is locking access to the drive long enough that IO on the A side times out (errors out). Just guessing at that right now.

Code:

[16881.590592] sd 3:0:1:0: [sdd] CDB:
[16881.590593] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[16881.686363] sd 3:0:1:0: Mode parameters changed
[16881.686367] sd 3:0:1:0: [sdd] Done:
[16881.686369] SUCCESS
[16881.686371] sd 3:0:1:0: [sdd]
[16881.686373] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[16881.686376] sd 3:0:1:0: [sdd] CDB:
[16881.686377] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[16881.686386] sd 3:0:1:0: [sdd]
[16881.686388] Sense Key : Unit Attention [current]
[16881.686391] sd 3:0:1:0: [sdd]
[16881.686394] Add. Sense: Mode parameters changed: constants.c
[16881.686398] drivers/scsi/scsi_lib.c:scsi_io_completion:746 - error = -5 - result = 134217730
[16881.686400] drivers/scsi/scsi_lib.c:scsi_io_completion:747 - response_code = 0x70
[16881.686407] CPU: 11 PID: 0 Comm: swapper/11 Tainted: P           O  3.16.7.NGS1 #14
[16881.686409] Hardware name: Supermicro X8DTS/X8DTS, BIOS 2.1     06/26/2014
[16881.686411]  ffff880be6a2ae20 ffff88183fca3cb8 ffffffff815503d4 ffffffff817e9244
[16881.686414]  ffff88183fca3cf8 ffffffff812939bd fffffffb00000001 ffff880faf3cd140
[16881.686418]  00000000fffffffb 0000000000000000 0000000000000001 ffff880be6a2b3d0
[16881.686421] Call Trace:
[16881.686424]  <IRQ>  [<ffffffff815503d4>] dump_stack+0x45/0x56
[16881.686439]  [<ffffffff812939bd>] blk_update_request+0x19d/0x320
[16881.686444]  [<ffffffff81293b5c>] blk_update_bidi_request+0x1c/0x80
[16881.686448]  [<ffffffff81294587>] __blk_end_bidi_request+0x17/0x40
[16881.686451]  [<ffffffff8129468f>] __blk_end_request_all+0x1f/0x30
[16881.686455]  [<ffffffff8129662d>] blk_flush_complete_seq+0x34d/0x360
[16881.686458]  [<ffffffff8129681b>] flush_end_io+0x12b/0x200
[16881.686462]  [<ffffffff81293da1>] blk_finish_request+0x71/0x100
[16881.686465]  [<ffffffff81293e72>] blk_end_bidi_request+0x42/0x60
[16881.686469]  [<ffffffff81293ea0>] blk_end_request+0x10/0x20
[16881.686476]  [<ffffffff8139d537>] scsi_io_completion+0x107/0x7b0
[16881.686480]  [<ffffffff81392cb3>] scsi_finish_command+0xb3/0x110
[16881.686485]  [<ffffffff8139d347>] scsi_softirq_done+0x137/0x160
[16881.686490]  [<ffffffff81299e53>] blk_done_softirq+0x73/0x90
[16881.686494]  [<ffffffff8105015d>] __do_softirq+0xed/0x290
[16881.686497]  [<ffffffff8105052d>] irq_exit+0xad/0xc0
[16881.686500]  [<ffffffff81004588>] do_IRQ+0x58/0xf0
[16881.686503]  [<ffffffff8155782a>] common_interrupt+0x6a/0x6a
[16881.686504]  <EOI>  [<ffffffff8143605c>] ? cpuidle_enter_state+0x4c/0xc0
[16881.686511]  [<ffffffff81436187>] cpuidle_enter+0x17/0x20
[16881.686516]  [<ffffffff8108c18d>] cpu_startup_entry+0x2bd/0x3e0
[16881.686522]  [<ffffffff81030c02>] start_secondary+0x192/0x200
[16881.686524] end_request: I/O error, dev sdd, sector 0

This is eventually followed by:
Code:

[16882.851261] md/raid:md11: Disk failure on sdd2, disabling device.
mt_undershirt
n00b
Joined: 20 Dec 2014
Posts: 4

PostPosted: Sat Dec 20, 2014 1:02 pm    Post subject: FYI I have it too

Hello,

I am running a DL380 G7 with 1x 9211-8i (which replaces the SmartArray P410i) and 2x 9207-8e which are connected to a D2700.
Before, the external enclosure was connected to a P411.

Ever since I have had hard disks in the D2700 (I also have SSDs in the enclosure), I have been getting sporadic dmesg output like this:

Code:
[ 8736.127044] Add. Sense: Logical unit not ready, initializing command required
[ 8736.127045] sd 2:0:14:0: [sdo] CDB:
[ 8736.127054] Read(10): 28 00 39 44 08 48 00 00 08 00
[ 8736.127055] end_request: I/O error, dev sdo, sector 960759880
[ 8736.127059] sd 2:0:14:0: [sdo]
[ 8736.127060] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8736.127061] sd 2:0:14:0: [sdo]
[ 8736.127064] Sense Key : Not Ready [current]
[ 8736.127065] sd 2:0:14:0: [sdo]
[ 8736.127068] Add. Sense: Logical unit not ready, initializing command required
[ 8736.127069] sd 2:0:14:0: [sdo] CDB:
[ 8736.127082] Read(10): 28 00 39 44 08 50 00 00 08 00
[ 8736.127083] end_request: I/O error, dev sdo, sector 960759888
[ 8736.127087] sd 2:0:14:0: [sdo]
[ 8736.127088] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 8736.127089] sd 2:0:14:0: [sdo]
[ 8736.127092] Sense Key : Not Ready [current]
[ 8736.127093] sd 2:0:14:0: [sdo]
[ 8736.127095] Add. Sense: Logical unit not ready, initializing command required
[ 8736.127096] sd 2:0:14:0: [sdo] CDB:
[ 8736.127101] Read(10): 28 00 39 44 08 58 00 00 08 00
[ 8738.427279] sd 2:0:13:0: [sdn]
[ 8738.427362] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 8738.427444] sd 2:0:13:0: [sdn] CDB:
[ 8738.427520] Read(10): 28 00 39 44 08 00 00 00 08 00



My situation is similar to that of antu456. I am also running 3.17.7 now (before that, 3.16.5 for the last few months).
The disks are used for backup and storage of less important things (driver downloads, CD-ROM images etc.) and they only get accessed intensely about twice per day, for about an hour, when the backups kick in.
Like you, I was at first worried sick that I might be bleeding data, i.e. introducing errors that were not immediately obvious, and with every kernel release I hope the problem will disappear (which it sometimes does, for days).
I am only getting the errors on the real hard disks. In the normal setup that means on the enclosure (which I became suspicious of because I bought it used), but for debugging I have moved some disks onto the 8-slot cage with the 9211 and the same problem arises there.

What I did

  • Tested with smartctl in the server (roughly the commands shown after this list)
  • Played with around 6 firmware and bios versions of the LSI controllers (even did downgrades)
  • Moved disks around between external 9207 and internal 9211
  • Moved the disks to a DL380 G6 with another LSI 9211-8i and a Z800 with an LSI 9211-8i as well
  • Like antu456, I have tried to tweak the LSI firmware settings concerning error reporting.
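
(By "tested with smartctl" I mean nothing fancy, roughly the usual checks, e.g.:)
Code:
smartctl -H /dev/sdo          # overall health assessment
smartctl -l error /dev/sdo    # drive error log
smartctl -t short /dev/sdo    # short self-test; read the results later with smartctl -a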


Results:

  • Everything works fine, but the messages still crop up.
  • No real errors anywhere. I verified several backups and they are perfect. mdraid etc. does not show any problem.
  • And as I said, it only happens since I put some magnetic spinning disks onto the controllers. With a pure SSD setup it was running error-free, which supports the theory that it is a sleep/spin-down issue.
  • My conclusion so far is that there is no real problem and that it is one of those cases where Linux is rather "too strict and eager" in reporting every hiccup (which I still consider a good thing).
  • Yet, I am not saying there is no problem; but I have been living with it for almost a year now and I have not lost any data so far. I'll keep running smartctl and keep an eye open, but I think it's harmless really. I think there is a kind of "disconnect" between the driver, disk and controller states, so that access is attempted too soon.


good luck
mt
TrollGentoo
n00b
Joined: 19 Feb 2015
Posts: 4

PostPosted: Thu Feb 19, 2015 2:07 pm

Hi all,

I am also using an LSI 9211-8i card:
- 4 HDDs on one of the miniSAS-to-SATA cables; they are in RAID5 with mdadm
- 3 HDDs on the other cable, no RAID involved.

My kernel is version 3.16 and I am running Debian.
The board firmware/BIOS versions are up to date and I use the IT firmware.
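
(The versions the driver actually sees show up in dmesg, e.g.:)
Code:
dmesg | grep -i fwversion   # e.g. "mpt2sas0: LSISAS2008: FWVersion(...), BiosVersion(...)"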

I also get from time to time the following errors:

Code:

[68928.461416] sd 0:0:6:0: attempting task abort! scmd(ffff88010995f040)
[68928.461994] sd 0:0:6:0: [sdo] CDB:
[68928.462538] ATA command pass through(16): 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
[68928.463128] scsi target0:0:6: handle(0x000f), sas_address(0x4433221107000000), phy(7)
[68928.463691] scsi target0:0:6: enclosure_logical_id(0x500605b0013d1480), slot(4)
[68932.372146] sd 0:0:6:0: task abort: SUCCESS scmd(ffff88010995f040)


and

Code:

[53173.816763] sd 0:0:0:0: [sdi] Unhandled error code
[53173.816776] sd 0:0:0:0: [sdi] 
[53173.816783] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[53173.816793] sd 0:0:0:0: [sdi] CDB:
[53173.816801] Read(16): 88 00 00 00 00 01 b4 59 5d 00 00 00 00 08 00 00
[53173.816828] end_request: I/O error, dev sdi, sector 7320722688


It also happens when coming back from sleep mode...

The first one is not really a concern, the second one a bit more!
My mdadm array was never forced to resync though...

I hope we find something soon!

Thanks
mt_undershirt
n00b
Joined: 20 Dec 2014
Posts: 4

PostPosted: Mon Jun 01, 2015 10:29 am    Post subject: Hmmm, it was power management with my disks

Hello,

I am now free from those messages for over two weeks; it has never been this long, so I assume my last change was actually successful.

And it is nothing spectacular. I simply (almost) disabled power management.

From the hdparm man page:
Quote:

-B Get/set Advanced Power Management feature, if the drive supports it. [...] The highest degree of power management is attained with a setting of 1, and the highest I/O performance with a setting of 254. A value of 255 tells hdparm to disable Advanced Power Management altogether on the drive (not all drives support disabling it, but most do).



So:
Code:
hdparm -B 254 /dev/drive_device


Note 1: It of course also works with 255. But as I understand it, a minimal level of drive APM remains with 254, so I have chosen that setting.
Note 2: I have tried this approach before (I set hdparm -B with a systemd boot service), but it failed because I was also running tuned! Even though I gave tuned
Code:
apm=254
and
Code:
alpm=max_performance
it would gradually change the APM value back to something like 128 or lower (good intentions, I understand). Rather stupid of me not to check that before...
Note 3: The SSDs of course were not affected, because they don't have anything to spin down or up and don't change their visible behaviour (no delays, etc.).
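
If you want the -B setting to survive reboots and drive re-plugs without a dedicated boot service, a udev rule is another option (an untested sketch; adjust the hdparm path and the device match to your setup):
Code:
# /etc/udev/rules.d/69-hdparm-apm.rules  (sketch)
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", RUN+="/sbin/hdparm -B 254 /dev/%k"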


Hope this helps somebody. Be aware, though, that your disks will be running almost constantly. People more in the know than I am (at least I assume so) have stated that this is actually better for disks that are meant for 24/7 operation, because start/stop cycles put more of a strain on the mechanics than just keeping them running. I guess I'll find out (that's what backups are for).

Cheers
mtu
TrollGentoo
n00b
Joined: 19 Feb 2015
Posts: 4

PostPosted: Wed Jun 10, 2015 2:34 pm    Post subject: Re: Hmmm, it was power management with my disks

Hi,

Things are finally moving! :)

Well, if you don't spin down the hard drives at all, the messages will certainly not appear any more...
In my case it only happens when coming back from sleep mode.

Even if it may harm the lifetime of the HDDs, spinning them down reduces the ambient noise + temperature, so it's a must for me.

Thanks
mseed
n00b
Joined: 03 Jul 2015
Posts: 4

PostPosted: Fri Jul 03, 2015 12:33 am    Post subject: Re: Hmmm, it was power management with my disks

TrollGentoo wrote:
Hi,

Things are finally moving! :)

Well, if you don't spin down the hard drives at all, the messages will certainly not appear any more...
In my case it only happens when coming back from sleep mode.

Even if it may harm the lifetime of the HDDs, spinning them down reduces the ambient noise + temperature, so it's a must for me.

Thanks



I have a similar problem and nothing has worked so far. For now I am testing some sysctl tuning:

Code:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
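
(Applied at runtime with sysctl -w, or put into /etc/sysctl.conf and reloaded with sysctl -p:)
Code:
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10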
TrollGentoo
n00b
Joined: 19 Feb 2015
Posts: 4

PostPosted: Fri Jul 03, 2015 10:32 pm

Hi,

Does it improve things?
mseed
n00b
Joined: 03 Jul 2015
Posts: 4

PostPosted: Fri Jul 03, 2015 11:46 pm

TrollGentoo wrote:
Hi,

Does it improve things?


I am waiting to see if the RAID fails again. Let me test it for longer before I say it helped.
TrollGentoo
n00b
Joined: 19 Feb 2015
Posts: 4

PostPosted: Tue Jul 07, 2015 3:27 pm

Hi,

But do you still get the warnings when the HDDs wake up from sleep mode?
mseed
n00b
Joined: 03 Jul 2015
Posts: 4

PostPosted: Thu Jul 09, 2015 3:25 pm

TrollGentoo wrote:
Hi,

But do you still get the warnings when the HDDs wake up from sleep mode?


So my server has been running for almost 7 days without any warnings or errors. The RAID is stable.
Code:

/etc/sysctl.conf

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10


If it does not fail in the next 4 days, I will say it definitely helped. For now it looks promising.
mseed
n00b
Joined: 03 Jul 2015
Posts: 4

PostPosted: Tue Jul 14, 2015 2:18 am

mseed wrote:

If it does not fail in the next 4 days, I will say it definitely helped. For now it looks promising.


Sysctl tuning solved my problem. :D