Gentoo Forums

Samba + sleeping drives. Can we wake up faster? Solved

DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Tue Aug 05, 2014 6:13 pm    Post subject: Samba + sleeping drives. Can we wake up faster? Solved

Success!!

I posted this at the beginning of the thread to guide everyone who follows.

Why?
I wanted to put my MD (mdadm) software RAID 6 to sleep when not in use. At 10 watts per drive, it adds up! I did not want to wait 10 seconds per drive, in series, for the array to come to life, and I was tired of my Windows desktop hanging while waiting for a simple directory lookup on my NAS.

Disclaimer:
Do not come crying to me when you destroy a hard drive, lose all your data, fry a power supply, or cause a small country to be erased from the face of the Earth.

The key points covered below:
Drive Controller
Bcache
Inotify

Drive Controller
My server/NAS was running three LSI SAS 1068E controllers to drive my 7-drive RAID 6. Turns out the cards are hard-coded to spin drives up in series. No way to get around it; it just is. This applies to ANY card running the LSI 1068E chipset, such as a Dell PERC 6/i or HP P400, and may even apply to all LSI-based cards. To make matters worse, the cards are smart and will only spin up one drive at a time across all 3 cards. My 7-disk RAID 6 was taking 50 seconds to spin up (10 seconds per drive). That dropped to 40 seconds when I moved one drive to the onboard SATA controller. That was my first clue. Thanks to the linux-raid mailing list for the help isolating this one.

So I was on the Internets looking for a new, cheap, 12~16 port SATA II controller card. I found a very strange card on eBay: a "Ciprico Inc. RAIDCore" 16-port card. I can't even find any good pictures or links to add to this post so you can see it. It is basically 4 Marvell controllers and a PCIe bridge strapped onto a single card. No brains, no nothing. Just a pure, dumb controller without any spin-up stupidity. Same chipset (88SE6445) found on some RocketRAID cards. It was EXACTLY what I was looking for, and at a cost of $60 I was thrilled. In Linux it shows up as a bridge plus controller chips:
Code:
07:00.0 PCI bridge: Integrated Device Technology, Inc. PES24T6 PCI Express Switch (rev 0d)
08:02.0 PCI bridge: Integrated Device Technology, Inc. PES24T6 PCI Express Switch (rev 0d)
08:03.0 PCI bridge: Integrated Device Technology, Inc. PES24T6 PCI Express Switch (rev 0d)
08:04.0 PCI bridge: Integrated Device Technology, Inc. PES24T6 PCI Express Switch (rev 0d)
08:05.0 PCI bridge: Integrated Device Technology, Inc. PES24T6 PCI Express Switch (rev 0d)
09:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SE6440 SAS/SATA PCIe controller (rev 02)
0a:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SE6440 SAS/SATA PCIe controller (rev 02)
0b:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SE6440 SAS/SATA PCIe controller (rev 02)
0c:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SE6440 SAS/SATA PCIe controller (rev 02)

Bcache https://www.kernel.org/doc/Documentation/bcache.txt
Now that I had the total spin-up time down from 50 seconds ((number_of_drives - 2) * 10) to 10 seconds, I was able to address the remaining 10 seconds with caching. In this case I am using bcache. My operating system disks are two OCZ Deneva 240GB SSDs set up in a basic mirror. I partitioned those drives and used 24GB as a caching device for my RAID. I quickly found out that bcache is unstable on the 3.16 kernel and was forced back to the 3.14 LTS series. After I landed on the 3.14.15 kernel everything has been running great. The basic bcache settings work, but I wanted more:
Code:
#Setup bcache just the way I like it, hun-hun, hun-hun
#Get involved in read and write activities
echo "writeback" > /sys/block/bcache0/bcache/cache_mode

#Allow the bcache to put data in the cache, but get it out as fast as possible
echo "0" > /sys/block/bcache0/bcache/writeback_percent
echo "0" > /sys/block/bcache0/bcache/writeback_delay
echo $((16*1024)) > /sys/block/bcache0/bcache/writeback_rate

#Clean up jerky read performance on files that have never been cached.
echo "16M" > /sys/block/bcache0/bcache/readahead

I put all of the above settings in rc.local so my system picks them up on boot. Writes still need to wake the array, but reads from cache don't even wake up the drives.
Code:
root@nas:/data# time (dd if=/dev/zero of=foo.dd bs=4096k count=16 ; sync)
16+0 records in
16+0 records out
67108864 bytes (67 MB) copied, 0.0963405 s, 697 MB/s

real    0m10.656s  #######Array spin up time#########
user    0m0.000s
sys     0m0.128s

root@nas:~# ./sleeping_raid_status.sh
/dev/sdc standby
...
/dev/sdd standby
root@nas:/data#  time (dd if=foo.dd of=/dev/null iflag=direct)
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.118975 s, 564 MB/s

real    0m0.121s  ########Array never even woke up#########
user    0m0.024s
sys     0m0.096s
root@nas:~# ./sleeping_raid_status.sh
/dev/sdc standby
/dev/sdj standby
...
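
For anyone following along from scratch: the one-time creation of the cache/backing pair is not shown above. A minimal sketch, with the device names as placeholders for my layout (substitute your own, and note that make-bcache reformats both devices, so this destroys any existing data):
Code:
#One-time bcache setup (hypothetical device names)
make-bcache -B /dev/md125      #format the RAID array as the backing device
make-bcache -C /dev/sda5       #format the 24GB SSD partition as the cache device
#udev normally registers both devices; if not, register them by hand:
#  echo /dev/md125 > /sys/fs/bcache/register
#  echo /dev/sda5  > /sys/fs/bcache/register
#Attach the cache set to the backing device using its UUID:
bcache-super-show /dev/sda5 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
#The filesystem then lives on /dev/bcache0 rather than directly on /dev/md125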

Inotify
Wait... the array did not spin up because the read came from cache?! Working exactly as expected, but not quite what I want. I have the file metadata in cache, but what happens when I want to read the file itself... 10 seconds later... Normally when I find a media file, I want to read/watch/listen to it right away. I accessed the metadata, so why not a preemptive spin-up? Time for a fun script using inotify.

I actually took this script one step further than just preemptive spin-up and have it do all drive power management. It turns out different drive manufacturers interpret `hdparm -S 84 $DRIVE` (go to sleep in 7 minutes) differently. This whole NAS was built on the cheap and I have 4 different types of drives in my array.
Code:
#!/bin/bash
WATCH_PATH="/data"
ARRAY_NAME="data"
SLEEPING_TIME_S="600"
#Resolve the underlying md device name (e.g. md125) behind /dev/md/$ARRAY_NAME
ARRAY=`ls -la /dev/md/$ARRAY_NAME | awk -F"../" '{print $5}'`

#Component drive names with partition numbers stripped (e.g. sdb1 -> sdb)
PARTS=`ls /sys/block/$ARRAY/slaves | sed 's/[^a-z]*//g'`

set -m

while [ 1 ];do
  inotifywait $WATCH_PATH -qq -t $SLEEPING_TIME_S
  if [ $? = "0" ];then
    #echo -n "Start waking: "
    for i in $PARTS; do
      (hdparm -S 0 /dev/$i) &
    done
    #echo "Done"
  else
    #echo -n "Make go sleep: "
    for i in $PARTS; do
      STATE=`hdparm -C /dev/$i | grep "drive state is" | awk '{print $4}'`
      #Really should check that the array is not doing something block related, like a check or rebuild
      if [ "$STATE" != "standby" ];then
        hdparm -y /dev/$i > /dev/null 2>&1
      fi
    done
    #echo "Done"
  fi
  sleep 1s
done


A few other key points have been addressed in this thread. There is much greater detail in the posts below:
Spinning drives up/down puts wear on drives, but it is more cost effective to sleep the drives and wear them out than it is to pay for the power.
Spinning up X drives at once puts a huge load on the PSU (Power Supply Unit). According to Western Digital, their 7200RPM drives spike at 30 watts during spin-up. You have been warned.
Warning: formatting a drive for bcache will remove ALL your data. There is no way to remove bcache without reformatting the device.
5400RPM drives take about 10 seconds to spin up; 7200RPM drives take about 14 seconds.











####Original starting post####
Everything is working as expected, which is really frustrating.

I have a home NAS with 6X 2TB drives in a software RAID 6 configuration. The array is formatted with XFS and holds all my media, such as movies and music. After 7 minutes all my drives fall asleep. I don't want to run six drives at 10 watts each 24/7.

Code:
#Sleep time in inc's of 5s. (84*5)/60=7m
for disks in `ls -1 /dev/sd?`
do
  hdparm -S 84 $disks
done

I share all my media to my Windows and Android systems with Samba. When I first go to access my media, everything hangs (in Windows or Android) for about 15~30 seconds while the drives spin up.
Code:
# Global parameters
[global]
log file = /var/log/samba/log.%m
server string = nas
workgroup = lan
max log size = 50
read raw = yes
write raw = yes

#Showing up on the network
local master = yes
os level = 255
preferred master = yes

[Media]
path = /home/public/Media
mangled names = no
read only = no


Is there any way to mitigate, mask, or cache the drives so the spin up time does not seem so painful?


Last edited by DingbatCA on Thu Aug 28, 2014 8:55 pm; edited 1 time in total
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Sun Aug 10, 2014 6:15 pm

I think you're pretty much asking for two conflicting desires. If the data you want is not in cache, it has to spin up the disks, which means you wait. So pretty much, if you don't want to wait, keep the disks spinning or keep the data you access frequently on a disk that remains spinning.

(Even if your PSU is very hefty, I don't know if there's a way to get mdraid to spin up all disks simultaneously; currently it staggers the spin-up, which is much less wear and tear on your system.)

I end up having to run my 4x500GB RAID5 spun up 24/7 since it's being used so randomly, albeit lightly - the spin up/down would get annoying as well as start eating into the lifetime of the disks. Which may or may not be the case for you...
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Sun Aug 10, 2014 8:31 pm

I very much do have two conflicting desires.

My PSU has the power. I am running an Ablecom SP762-TS, which is a 3-way redundant power supply. My whole system is a re-purposed server.

I was not aware that mdraid did a staggered spin-up by default. I will hunt around and see if I can find out how to disable/adjust that.
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Sun Aug 10, 2014 9:50 pm

This is just strange. So I wrote a simple script to look at the state of my drives as they spin up:
Code:
while [ 1 ]; do
  date
  hdparm -C /dev/sdb1 | grep "drive state"
  hdparm -C /dev/sdc1 | grep "drive state"
  hdparm -C /dev/sdd1 | grep "drive state"
  hdparm -C /dev/sde1 | grep "drive state"
  hdparm -C /dev/sdf1 | grep "drive state"
  hdparm -C /dev/sdj1 | grep "drive state"
  hdparm -C /dev/sdi1 | grep "drive state"
  sleep 0.1
done


But it looks like hdparm freezes when the drives go to spin up. Nothing in the logs about it.
Code:
Sun Aug 10 14:34:34 PDT 2014
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
Sun Aug 10 14:34:34 PDT 2014
 drive state is:  standby
 drive state is:  active/idle
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
Sun Aug 10 14:34:44 PDT 2014
 drive state is:  standby
 drive state is:  active/idle
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  standby
 drive state is:  active/idle
Sun Aug 10 14:35:01 PDT 2014
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
Sun Aug 10 14:35:20 PDT 2014
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle
 drive state is:  active/idle


I might have to take this question over to the mdraid guys for help.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Sun Aug 10, 2014 11:43 pm

I think the IDE commands are serialized, so yes, they will stall when there's an outstanding request to spin up the disk...
Also, it is possible for two disks to spin up at once, but eventually all of them need to be spun up.

I have to say it's not "staggered" but rather "serialized" - it will fetch from the disks as needed but this has the effect of staggered startup as getting all the requests out at the same time isn't likely...

Also keep in mind "server quality" means "24/7 99.999% availability" not "spin up spin down as needed" - so you are still using it in an unintended manner :D
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Sun Aug 10, 2014 11:54 pm

Wow. If it is truly "serialized" then this is going to become a big problem as I add more drives to grow the array. Any way of caching the file system's metadata? I am trying to give the drives time to spin up in the background without completely hanging the client's request.
Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Mon Aug 11, 2014 12:27 pm

The problem I found was that even if you had gigantic caches, everything would still hang as soon as you requested something outside the cache, as requests that hit the cache don't necessarily wake the disks up.

I have yet to find a nice way around what you describe.

In the end I just got some low-RPM WD Greens and let them stay spinning! They automatically park the heads when not in use but keep the platters spinning, so you save some power while idling - not as much as a fully sleeping drive, but obviously the recovery is a lot faster!

(On a slight tangent, I recently switched to the newer Reds; they run 15C cooler and draw slightly less power than the 1st-gen Greens!)
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Mon Aug 11, 2014 2:47 pm

Time to have some fun! This is Linux, we can solve this.

My main array runs 7X 2TB Western Digital Caviar Green drives. I have two other arrays in the same system. The OS array is a mirror running 2X OCZ Deneva 240GB SSDs. The archive array is a mirror running 2X Hitachi Deskstar 7K500 drives with btrfs and compression.

What type of gigantic caches were you able to put in place? Here is my idea: set up inotify to watch the cache. When it is accessed, start all disks in the array. This falls apart if the cache can't be watched by inotify, like the generic system cache, or if the cache is global and not per array. In a worst-case scenario this trick might be employed against the array itself to start all drives up in parallel, but that would only save a few seconds.
Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Mon Aug 11, 2014 5:50 pm

That's the spirit! :D

Well 'gigantic' was about 2GB on my old server :lol:. I haven't played with it much on my new one (Currently the cache is 12GB :lol: ) since all the disks just spin perpetually (I find running a torrent server with 400+ seeds keeps it busy and random enough that it never gets to sleep!)

One thing to watch out for is that the IO system tends to block while it waits for the disk to spin up. I know the Explorer threads on my Windows machines would lock up until any sleeping disks woke up and started doing Samba's bidding.

I just had a thought tho' - IIRC Linux 'recently' added the ability to use other devices as an intermediary cache. I wonder if you could set up a small, fast SSD as that intermediary cache - theoretically it would be easier to monitor for access than the cache in RAM - and then use that to trigger the disk wakeup?
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Mon Aug 11, 2014 6:03 pm

I have my OS on 2X 240GB SSDs. There are lots of ways I can cut a chunk of SSD out for an intermediary cache. I think, in this case, you are referring to bcache (http://en.wikipedia.org/wiki/Bcache).

A RAM-based cache also works, as long as it is treated as read-only. I don't want a power outage causing data loss or corruption.

I have the RAM, or the SSD storage. I would rather use a non-persistent RAM cache, something like a cache in tmpfs.

Using the SSD mirror works, but kinda defeats the point of my RAID 6.

Ideas?
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Mon Aug 11, 2014 6:14 pm

I'd say just keep the disks spinning and at least allow the heads to unload; you'll get some savings there. The I/O blocking is indeed very annoying during interactive use.

No matter how big your cache, chances are, you'll always be fetching something that's not in cache (why would you be reading the same thing over and over again?)...

(As a side issue, I hate my RAID5; IOPS is awful for some reason or another... the drives I have are not Blacks or Reds, I have two WD "Blue" and three Hitachi disks in my 4+1 hot-spare system and it bogs down badly during NFS use...)
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Mon Aug 11, 2014 6:36 pm

I don't want to be burning 70 watts of power 24/7. At least, that is the power draw of my 7 disks when spinning, according to my power strip.

The cache would only be in place to serve read requests. I am really only after caching the FS metadata. This comes into play when I walk the file system from Windows: I need to get to the correct directory before I can watch a movie or listen to music, and I am tired of Windows Explorer hanging until the drives spin up. In this case I think a cache of 16MB would be plenty! But I can fling GBs at it.

RAID5 performance: in the case of Linux software RAID, you really need 4 drives to get the equivalent of one solo drive's speed. This is due to the fact that there are no write-through cache capabilities. RAID 6 requires 5 drives before you will get the equivalent performance of one stand-alone drive. Most of the time the drives are not even the problem; people like to run too many drives on a slow PCI interface. There is also one basic tweak that most n00bs forget to set: stripe cache size. If it is left at the default, your system will run like trash.
Code:
root@nas:~# cd /sys/block/md125/md
root@nas:/sys/block/md125/md# cat stripe_cache_size
256
root@nas:/sys/block/md125/md# echo $((16*1024)) > stripe_cache_size
root@nas:/sys/block/md125/md# cat stripe_cache_size
16384

And if you really want to have fun, watch the cache during a big write.
Code:
root@nas:~# cd /sys/block/md125/md
watch -n 0.1  cat stripe_cache_active

But Linux software RAID performance is a very large subject that should be on a different thread.
Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Mon Aug 11, 2014 9:12 pm

Yea, I remember messing around with a bunch of settings to try and speed up my old mdadm RAID5.
I had stuff like this in my local.start for a while :lol:
Code:

blockdev --setra 8192 /dev/md0
blockdev --setra 2048 /dev/sda /dev/sdb /dev/sdc /dev/sdd
echo 8192 > /sys/block/md0/md/stripe_cache_size


btrfs RAID5 speed seems to be pretty good; I can hit 100MB/s (!!?!) on each RAID element whereas before I'd be lucky to get 150MB/s off the whole array! Beefier CPU and faster bus probably helps, but I also suspect btrfs isn't actually doing real RAID5 at the moment... :(


I forgot about tmpfs; That should work!

I wonder if caching the metadata will be enough tho', if this is to avoid pausing in Windows - Windows doesn't just pull directory table data; like the bloatier Linux DEs, it reads the contents of a lot of the files it touches to generate previews and thumbnails.
That said, I think they split that off into a worker thread in Vista+ so you might be able to get away with it...

Come to think of it, doesn't Linux already prioritise caching the file tables?


Maybe it'd be easier to just set the spindown for like, an hour or two, then it'll spin down when you aren't using it, but stay spinning when you are?
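
For reference, the -S values for longer timeouts are a bit non-obvious: per hdparm(8), values 1-240 are units of 5 seconds and 241-251 are units of 30 minutes. So a two-hour spindown would look something like this (hypothetical example):
Code:
#241-251 = 1 to 11 units of 30 minutes, so 244 = 4 x 30min = 2 hours
hdparm -S 244 /dev/sdb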
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Mon Aug 11, 2014 9:23 pm

As a rule, when I am using my array it does not spin down. The primary job of the array is media (music and movies), and in the case of a movie there is almost always disk IO going on. So setting the spin-down to 7 minutes or 2 hours won't really make a difference.

I am good with tmpfs and building the inotify script, but I don't know how to build the metadata cache. Can you point me in the right direction?

I wish btrfs RAID5/6 was more stable. :-(
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Mon Aug 11, 2014 11:40 pm

Just adding some more info. Spin-up takes about 9.6 seconds per drive, and I need at least 5 of the 7 drives spinning to access data: 9.6 x 5 = 48 seconds. I need to find a fix for this... When I fill my drive cage with 15 drives (13 of which must spin up before data is accessible), the spin-up time will be about 125 seconds. OUCH!!!
Code:
root@nas:/data# smartctl -a /dev/sdd | grep Spin_Up
  3 Spin_Up_Time            0x0027   150   137   021    Pre-fail  Always       -       9608

root@nas:/data# time (touch foo ; sync)

real    0m49.004s
user    0m0.000s
sys     0m0.004s
root@nas:/data# time (touch foo ; sync)

real    0m50.647s
user    0m0.000s
sys     0m0.008s
root@nas:/data# df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md125      9.1T  3.8T  5.4T  42% /data
root@nas:/data# mdadm -D /dev/md125
/dev/md125:
        Version : 1.2
  Creation Time : Wed Jun 18 07:54:38 2014
     Raid Level : raid6
     Array Size : 9766909440 (9314.45 GiB 10001.32 GB)
  Used Dev Size : 1953381888 (1862.89 GiB 2000.26 GB)
   Raid Devices : 7
  Total Devices : 7
    Persistence : Superblock is persistent

    Update Time : Mon Aug 11 16:30:16 2014
          State : clean
 Active Devices : 7
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : nas:data  (local to host nas)
           UUID : 74f9ce7a:df1c2698:c8ec7259:5fdb2618
         Events : 1038642

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       3       8       49        2      active sync   /dev/sdd1
       4       8       65        3      active sync   /dev/sde1
       5       8       81        4      active sync   /dev/sdf1
       7       8      145        5      active sync   /dev/sdj1
       6       8      129        6      active sync   /dev/sdi1
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Tue Aug 12, 2014 1:26 am

I don't find the power draw a big deal; then again I only have four disks, and service requests are not only local, so I can't control who powers the disks up. I've been running a RAID5 for quite a while now, though I was running an Athlon as the server CPU; now I'm running a Core2 Quad, mostly as this machine is a shell box/VM server/webserver/mailserver. I have another machine that far exceeds the power draw of these disks... and another machine whose GPU alone eats more power than the HDDs.

The problem with any cache is that it's still LRU, and if you use the cache enough it will evict things. I don't think there is a metadata-only cache available... that would be interesting but potentially wasteful.

Perhaps something easier is just to monitor the network: if you see an SMB packet come by and the disks are sleeping, go ahead and try to spin all of the disks up?
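
A rough sketch of that idea (untested; the interface and drive names are assumptions, and the drives get kicked the same way the script in the first post does it):
Code:
#!/bin/bash
#Hypothetical sketch: wake the array whenever SMB traffic shows up on the wire.
DRIVES="sdb sdc sdd sde sdf sdi sdj"   #assumed array members
IFACE="eth0"                           #assumed LAN interface
while true; do
  #Block until a single SMB packet (TCP port 445) is seen
  tcpdump -i $IFACE -c 1 'tcp port 445' > /dev/null 2>&1
  #Kick every member in parallel, same as the inotify script
  for d in $DRIVES; do
    hdparm -S 0 /dev/$d &
  done
  wait
  sleep 5
done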

Maybe another way is to break up your raid so you don't have to pay the penalty for spinning up all disks when you only need to use one volume? Then again this complicates other things...

All of my RAID members are on an ICH10 onboard PCIe SATA 3Gbit controller. Disk sequential read is fine on the server - on the order of 2-3x a single disk's speed (around 150 MB/sec) - but random I/O over NFS is awful, even if it's NFS to a VM on the same machine. And yeah, I was setting the readahead and stripe cache larger. The readahead and stripe size (64K) may actually be hurting the performance of small files - I recall my 32K stripe system being marginally better than the 64K stripe setup, but it definitely helped hdparm -t /dev/md1 speeds...
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Tue Aug 12, 2014 2:09 am

Still trying to find a good solution for getting to the data faster, and I think I am getting close to an acceptable one. I asked for help on the linux-raid mailing list and Larkin was kind enough to give me the idea of writing a daemon that controls all sleeping/waking of the array.

So I am currently just playing with ideas.
Code:
root@nas:~# inotifywait /data/
Setting up watches.
Watches established.
/data/ OPEN foo

This worked. The second I touch a file on the array it responds, even though the array itself does not respond for 50 seconds. In the morning I will roll this into a script/daemon that keeps track of the array's activity and, most importantly, issues sleep/wake commands in parallel, NOT in series.
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Tue Aug 12, 2014 3:27 pm

As far as wear and tear on the disks goes: yes, starting and stopping the drives shortens their life span. I don't trust my disks regardless of starting/stopping; that is why I run RAID 6.

Let's say I use my NAS with its 7 disks for 2 hours a day, 7 days a week, at 10 watts per drive. The current price for power in my area is $0.11 per kilowatt-hour. That comes out to $5.62 per year to run my drives for 2 hours daily. If I ran the drives 24/7 it would cost me $67.45/year - basically an extra $61.83/year. The 2TB 5400RPM SATA drives I have been picking up from local surplus or auction websites cost me $40~$50 including shipping and tax. In other words, I could buy a new disk every 8~10 months to replace failures and it would cost the same. Drives don't fail that fast, even if I were starting/stopping them 10 times daily. This also completely ignores the fact that drive prices are falling. Sorry to disappoint, but I am going to spin down my array and save some money.
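
For reference, the arithmetic behind those numbers (a quick sanity check, assuming exactly 10 watts per drive and $0.11/kWh as stated above):
Code:
awk 'BEGIN {
  w = 7 * 10                                    # 70 W for the whole array
  print "2h/day:", w *  2 * 365 / 1000 * 0.11   # ~ 5.62 USD/yr
  print "24/7:  ", w * 24 * 365 / 1000 * 0.11   # ~67.45 USD/yr
}'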
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9679
Location: almost Mile High in the USA

Posted: Wed Aug 13, 2014 1:15 am

But is it worth ripping your hair out getting annoyed at waiting for the disks? :D

It's really a quality-of-life issue then: replace a disk every year, or never be annoyed at disk spin-up because the array is always available.

I think it's the same cost either way, really. Well, for me at least, as I don't have as many disks.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Wed Aug 13, 2014 1:29 pm

Well it's definitely not worth the zots required to do this, but it is a fun little experiment :)

Who knows, we might see a paper on The DingbatCA Early Pre-emptive Midline-Storage Wakeup Algorithm in the future :D

It'll be cool to see what you come up with and how well it performs!

The inotify thingy looks to be a good start; the next tricky bit will be caching enough stuff to give the disks time to spin up.
I wonder if you can cache the filesystem metadata entirely, but also have some sort of learning predictor cache that tries to spot access patterns and caches enough relevant stuff to give the array time to spin up.


This really is the sort of thing that a Linux hacker should be doing for a final year project or something :lol:
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Wed Aug 13, 2014 3:34 pm

I am good with waiting 10 seconds, and with a little bit of caching I could mitigate even that if I can get the array to spin up as one unit. But I agree with eccerr0r that my quality of life is not worth waiting a minute every single time I want to use the array. Most of the media devices I have connected to the array will fail before waiting that long.

So, back to hacking, and my latest problem. Inotify works perfectly and responds within 0.01 seconds of my array being accessed (watching the mount point /data). But I cannot get the disks to spin up in parallel.
Code:
root@nas:~# hdparm -C /dev/sdh /dev/sdg                                         
/dev/sdh:
 drive state is:  standby

/dev/sdg:
 drive state is:  standby

#Two terminal windows dd'ing sdg and sdh.
root@nas:~/dm_drive_sleeper# time dd if=/dev/sdh of=/dev/null bs=4096 count=1 iflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 14.371 s, 0.3 kB/s

real   0m28.139s  ############# WHY?! ################
user   0m0.000s
sys   0m0.000s

#A single drive spin-up
root@nas:~/dm_drive_sleeper# time dd if=/dev/sdh of=/dev/null bs=4096 count=1 iflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 14.4212 s, 0.3 kB/s

real   0m14.424s
user   0m0.000s
sys   0m0.000s


I need a way to spin up the drives with an ATA command, not through the Linux block layer. This is starting to feel like I am running into a problem with the kernel itself?!
Cyker
Veteran

Joined: 15 Jun 2006
Posts: 1746

Posted: Wed Aug 13, 2014 6:42 pm

Possibly relevant?

http://linux.slashdot.org/story/14/04/12/1833244/linux-315-will-suspend-resume-much-faster


Also, what's your PSU like? HDD spinups, esp. 3.5" disks, have a surprisingly high amp draw and I'm slightly concerned your PSU might blow if it gets repeatedly spiked like that...!
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Wed Aug 13, 2014 9:35 pm

Well. Thanks for the tip Cyker!
Quote:
The Linux 3.15 kernel ... ensured the kernel is no longer blocked by waiting for ATA devices to resume.

Power is not an issue.
Quote:
I am running an Ablecom SP762-TS, which is a 3-way redundant power supply. My whole system is a re-purposed server.

Off to play with a new kernel. I will report back soon.
DingbatCA
Guru

Joined: 07 Jul 2004
Posts: 384
Location: Portland Or

Posted: Wed Aug 13, 2014 10:12 pm

Running the shiny new 3.16 kernel.
Code:
root@nas:~# time dd if=/dev/sdg of=/dev/null bs=4096 count=1 iflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 13.8612 s, 0.3 kB/s

real   0m27.819s  #################Still blocking###########
user   0m0.000s
sys   0m0.000s

I also tried this same test against my 7 disk array.
Code:
#Reading directly from the 7-disk md array
root@nas:~# time dd if=/dev/md125 of=/dev/null bs=512k count=7 iflag=direct
7+0 records in
7+0 records out
3670016 bytes (3.7 MB) copied, 47.8668 s, 76.7 kB/s

real   0m47.869s
user   0m0.004s
sys   0m0.000s


Failure is always an option? :-(
John R. Graham
Administrator

Joined: 08 Mar 2005
Posts: 10589
Location: Somewhere over Atlanta, Georgia

Posted: Wed Aug 13, 2014 10:49 pm

There should be a non-blocking ioctl that could be issued against all drives to spin them up. Let me do some experimentation on my 4-drive RAID5 setup.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.
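
One userspace way to issue such a spin-up command outside the normal block I/O path is sg_start from sg3_utils; a sketch only, untested here, with the member names assumed from the earlier posts:
Code:
#!/bin/bash
#Hypothetical sketch: send START STOP UNIT to every member at once via sg_start,
#bypassing the filesystem/md layers entirely.
for d in sdb sdc sdd sde sdf sdi sdj; do
  sg_start --start /dev/$d &
done
wait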
Page 1 of 2