

It seems I’m under the influence …

Posted in General, Storage Applications, Platforms, Storage Interconnects & RAID, Storage Management, Application Environments, Advisor - Neil by Neil

So the wife was right after all these years (I can say that here because I can definitely, exactly and almost confidently say that (a) she doesn’t read this and (b) doesn’t listen to a word I say about “blogs” and other computer related stuff).

http://searchstorage.techtarget.com.au/articles/35314-EMC-Cisco-accused-of-FUDfest-Apple-hammered-NetApp-accused-of-PR-BS-as-IBM-and-HDS-fight
(you’ll have to read down the page to get my point)

There is nothing that makes you think on your feet faster than answering questions from a journo. So what’s the future of RAID? Is it RAID 4 as per my colleague from NetApp? Don’t think so. NetApp is one of the few vendors doing it, and if you ask Joe Public he probably hasn’t even heard of it.

6 and 10 are my bet. RAID 6 is the go these days because of faster processing power on RAID cards … the performance hit we once took for running so much parity has pretty much gone and yes, you can survive two disk failures at once.

RAID 10 is also a favourite amongst the paranoid among us because it “sounds” very safe, provides great performance and doesn’t cost the earth these days.

There is, however, a difference between 6 and 10 that needs to be understood (and often isn’t in my experience). Both 6 and 10 can survive two simultaneous drive failures … but with 10 you have to be very, very lucky as to which drives fail to survive the two-drive calamity. 6 doesn’t care … two drives go and you’re still good. With 10, if both drives in the one RAID leg (the same mirrored pair) go, you’re gone.

So don’t think about RAID 10 as being able to survive two drive failures … just think about it as a very fast way to use up a lot of disk space and money at the same time.
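To put a rough number on “lucky”, here is a minimal Python sketch that enumerates every possible two-drive failure and counts how many a RAID 10 survives. It assumes the drives are paired off as mirrors (0,1), (2,3) and so on, and that every pair of failures is equally likely. Both are simplifying assumptions for illustration, not a description of any particular controller, and the function names are made up.

```python
from itertools import combinations


def raid10_survives(failed, mirrors):
    """True if no mirrored pair has lost both of its drives."""
    return not any(a in failed and b in failed for a, b in mirrors)


def two_failure_survival(n_drives):
    """Fraction of all two-drive failures that an n-drive RAID 10 survives."""
    # Assumed layout: drives (0,1), (2,3), ... are the mirrored pairs.
    mirrors = [(i, i + 1) for i in range(0, n_drives, 2)]
    pairs = list(combinations(range(n_drives), 2))
    survived = sum(raid10_survives(set(pair), mirrors) for pair in pairs)
    return survived / len(pairs)


for n in (4, 8, 16):
    print(f"{n}-drive RAID 10 survives {two_failure_survival(n):.1%} "
          "of two-drive failures; RAID 6 survives all of them")
```

On those assumptions the fatal fraction works out to 1 in (n - 1): the second failure has to land on the one surviving partner out of the n - 1 drives that are left.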

Ciao
Neil

6 Responses to “It seems I’m under the influence …”

  1. SH Says:

    Beyond a 4-drive RAID configuration, this point loses salience. RAID 6 can only ever recover from two drive failures. RAID 10 can theoretically suffer n/2 drive failures provided no two of them are in the same mirror. The larger the array, the more likely you are to see multiple drive failures.

    Do you have statistics showing that these are more likely to happen in a single mirror than scattered across the array?

  2. Neil Says:

    SH

    “Can only ever recover from two drive failures” … that’s a pretty blasé attitude towards your data.

    Yes, RAID 10 can theoretically survive a large number of drive failures, but the chances of that are somewhat slim. As for the “odds” of two drives failing in the one leg vs two drives failing in two separate legs of an array … I have absolutely no statistical data - you could probably find someone in Las Vegas to give you odds but that’s about as close as it gets.

    The point I was trying to make is exactly that. RAID 6 “will” survive two drives failing at the same time (which is pretty unusual), whereas RAID 10 “will” survive the same catastrophe only if you are “lucky”.

    So if it’s my data, I want my sysadmin to have some certainty, not rely on luck. Of course, the hot spare will always rebuild the array very quickly, but my main fear is the failure of a second drive during a rebuild. I’ve seen that many times on older systems when old drives are placed under heavy load during a rebuild.

    I’m not trying to say that you should use RAID 6 instead of RAID 10 … if you are a database administrator you’d be nuts to do this, but just don’t be under the impression that the ability to survive catastrophic drive failures is the same.

    Ciao
    Neil

  3. Ethan Says:

    I recently had to decide between raid-10 and raid-6 for our tier-1 file server (my decision is still pending, we need fast and reliable and cheap… yeah I know.), it was a fun exercise for a newbie like myself. I still have a few questions about it even though your blog answered most of them. Speaking of which, thank you. I really enjoy your blog, it’s a breath of fresh air in this industry.

    Here’s a random train of thought. Let’s suppose an 8-drive raid-10 and an 8-drive raid-6.

    FIRST drive failure:
    Raid-6 is safe
    Raid-10 is safe

    SECOND drive failure:
    Raid-6 is still safe
    Raid-10 has a 1/7 = 14% chance of data loss

    THIRD drive failure:
    Raid-6 is lost
    Raid-10 has a 2/6 = 33% chance of data loss (assuming it survived the second failure)

    Logically I assume I would end up with more or less the same numbers with any array size, as long as the raid-6 is striped in groups of 8 drives. I don’t like the idea of having a heart attack while waiting for the first rebuild, so I will most probably go with raid-6 despite the performance hit. That 14% is very scary, it’s a single dice roll.
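The arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes each failure hits one of the remaining drives uniformly at random, and `raid10_loss_chances` is a made-up name. For each successive failure it reports the chance that this particular failure kills the array (given it survived so far) and the cumulative chance of still being alive.

```python
from fractions import Fraction


def raid10_loss_chances(n_drives, failures):
    """Return one (conditional_loss, cumulative_survival) pair per failure."""
    if n_drives % 2 or not 0 <= failures <= n_drives // 2 + 1:
        raise ValueError("need an even drive count and at most n/2 + 1 failures")
    results = []
    survival = Fraction(1)
    for already_failed in range(failures):
        remaining = n_drives - already_failed  # drives still running
        # If the array is still alive, every earlier failure hit a different
        # mirror, so each one left exactly one unprotected partner behind.
        exposed = already_failed
        loss = Fraction(exposed, remaining)
        survival *= 1 - loss
        results.append((loss, survival))
    return results


for number, (loss, survival) in enumerate(raid10_loss_chances(8, 4), start=1):
    print(f"failure {number}: loss chance {float(loss):.0%}, "
          f"still alive with probability {float(survival):.0%}")
```

For the 8-drive case this gives 1/7 for the second failure and 2/6 for the third, and a 4/7 chance of surviving all three, where an 8-drive RAID 6 is certain to be lost. It says nothing about how likely a third overlapping failure is in the first place.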

    However, it got me thinking….

    In theory, suppose we have an environment where the drives have a high failure rate, such that a third failure becomes a clear possibility. Could raid-10 actually become safer than raid-6 then? For example, a fan failure makes the drives overheat, a firmware bug causes a higher probability of losing drives, or maybe the array is very, very old, etc.

    Can a statistically short time between successive drive failures affect the difference in reliability between raid-10 and raid-6 in such a way that the choice between the two becomes non-obvious on the reliability aspect?

    Obviously I assume not in MY environment, and if I actually expected such an extreme failure rate, my problem would definitely be elsewhere and no raid level would save my data. But I enjoyed theorizing about this, and I think I answered my own question.

    Here’s my REAL question :)

    Besides the obvious trade-off between “available space” and “reliability”, is there any performance impact from going for a higher or lower count of drives per raid-6 in a raid-60? Is raid-6 optimal at a certain count?

    Thanks again !

  4. Neil Says:

    Ethan,

    Glad there was a question in there :-)

    As for your theorizing … I don’t think I can think that hard. Will RAID 10 become more reliable than RAID 6 after a certain number of drive failures? I’m not sure I want to find out. It’s one thing to gamble and calculate your odds of losing data … it’s another thing to use a method where you are “planning” not to lose data (not “chancing” it based on odds).

    Think about it this way. If 3 of your 8 drives fail simultaneously and you explain that one to your boss, you’ll probably be OK. But try explaining to your boss that you calculated your own odds, thought them to be within a suitable range of risk, and it didn’t quite work out the way you thought. Next week’s pay packet will probably be a bit hard to come by (and rightly so).

    RAID 6 (which is your best mix on a general fileserver) will run sweetly at around 8-12 drives. I wouldn’t bother with RAID 60 until the 16-drive mark because it gets somewhat expensive before that. If you are general fileserving just run RAID 6 and save the brainwaves for the really important things in life :-)

    Oh, and don’t gamble with your data.

    Ciao
    Neil

  5. Ethan Says:

    If I go to my boss, and explain my above theory about the third drive failure, I’d be fired on the spot :)

    What I want to tell him is that with raid-10 we have, say, a 1% chance per year of losing the array, while with raid-60 it’s down to 0.1% but we lose 30% performance. I can use MTBF/AFR to calculate the MTTDL, but I’m getting paranoid about all the talk of correlated failures. Listening to everyone’s anecdotes on the net, it sounds like all drives will fail quickly after the first failure. Is there a rule of thumb? Like 2x or 100x the AFR after the first failure?
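For what it’s worth, the MTBF/AFR to MTTDL calculation mentioned above can be sketched roughly as follows. These are the classic textbook approximations, which assume independent, exponentially distributed drive failures and a fixed rebuild time. That is exactly the assumption correlated failures break, so treat the results as optimistic. The 3% AFR and 24-hour rebuild are hypothetical inputs, and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def mtbf_from_afr(afr):
    """Convert an annualised failure rate (0..1) into an MTBF in hours."""
    return -HOURS_PER_YEAR / math.log(1 - afr)


def mttdl_raid10(n_drives, mtbf_h, mttr_h):
    """Approximate MTTDL (hours) for n_drives arranged as mirrored pairs."""
    # One pair lasts about MTBF^2 / (2 * MTTR); n/2 pairs divide that by n/2.
    return mtbf_h ** 2 / (n_drives * mttr_h)


def mttdl_raid60(n_groups, drives_per_group, mtbf_h, mttr_h):
    """Approximate MTTDL (hours) for n_groups RAID 6 groups striped together."""
    n = drives_per_group
    per_group = mtbf_h ** 3 / (n * (n - 1) * (n - 2) * mttr_h ** 2)
    return per_group / n_groups


def annual_loss_probability(mttdl_h):
    """Chance of at least one data-loss event in a year, given the MTTDL."""
    return -math.expm1(-HOURS_PER_YEAR / mttdl_h)


# Hypothetical inputs: 3% AFR per drive, 24-hour rebuild, 96 drives in total.
mtbf = mtbf_from_afr(0.03)
for name, mttdl in [("RAID 10", mttdl_raid10(96, mtbf, 24)),
                    ("RAID 60 (8 x 12)", mttdl_raid60(8, 12, mtbf, 24))]:
    print(f"{name}: MTTDL about {mttdl / HOURS_PER_YEAR:,.0f} years, "
          f"annual loss chance {annual_loss_probability(mttdl):.4%}")
```

The model has no term for “drives fail faster after the first failure”, so it cannot answer the rule-of-thumb question. The closest it gets is letting you lower the MTBF and rerun it to see how sensitive each layout is.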

    Thanks,

    p.s.: My 8 drives above was an example. The array I’m planning is a 96-drive SAS array, so it’ll be raid-60. Actually it will possibly be a raid-600, with 2 controllers in software raid-0. I need every ounce of performance I can get without sacrificing reliability (our access patterns are not those of the usual file server). Falling back to raid-10 will be a last resort if I can’t tweak raid-6 to have adequate performance.

  6. Neil Says:

    Ethan,

    Let’s not tell the boss …

    I have not seen statistical data that says when one drive fails the rest are all going to toss up their legs shortly afterwards. Use RAID 60 and have at least 1 hot spare in the system.

    Note that when defining a RAID 60 you can define the number of “sub-level” arrays. This means that you can determine the number of actual RAID 6’s that the RAID 60 is made of.

    If you are running 96 drives and want to make one large file system, I’d recommend something like 8 sub-level RAID 6’s. This means that you will make up your 96-drive RAID 60 from 8 x 12-drive RAID 6’s.

    This would give you 10 drives of usable capacity per RAID 6 (12 drives less 2 drives’ worth of parity), multiplied by 8, so 80 drives of usable capacity. You can work that out for whatever size drive you are using.
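If you want to play with the numbers, here is a small Python sketch of that capacity sum. It assumes equal-sized sub-arrays and ignores hot spares. The 300 GB drive size and the function name are placeholders for illustration only.

```python
def raid60_usable_drives(total_drives, sub_arrays):
    """Usable capacity of a RAID 60, counted in whole drives."""
    if total_drives % sub_arrays:
        raise ValueError("drives must divide evenly across the sub-arrays")
    per_array = total_drives // sub_arrays
    if per_array <= 2:
        raise ValueError("each RAID 6 needs more than two drives")
    # Every RAID 6 sub-array gives up two drives' worth of space to parity.
    return sub_arrays * (per_array - 2)


DRIVE_SIZE_GB = 300  # hypothetical drive size, substitute your own

for sub_arrays in (6, 8, 12):
    usable = raid60_usable_drives(96, sub_arrays)
    print(f"{sub_arrays} x {96 // sub_arrays}-drive RAID 6: "
          f"{usable} drives usable, about {usable * DRIVE_SIZE_GB:,} GB")
```

Fewer, larger sub-arrays give back more usable space, while more, smaller ones spend more drives on parity.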

    Note that this will be one very fast and large RAID config. Of course you might want to make multiple RAID configs depending on your OS environment, and there are many, many variables you can play with here.

    Good luck and keep us in the loop on your progress.

    Ciao
    Neil
