Is RAID-6 made of wood?
Posted in Storage Interconnects & RAID, Advisor - Tom Treadway by Tom Treadway
After my last post, Real-life RAID reliability, you were probably wondering whether the story line could get more interesting. The naked campfire sing-along at SNW paled in comparison to the ribald explanation of how to use MTBF and BER to calculate MTTDL. Whew! That’s just darn good reading.
But it gets better! Yesterday, during our lunch basketball game in the back parking lot, someone asked me, “Tom, your treatment of the drive reliability subject was remarkable, but I came away feeling ungratified, with a yearning in my groins. Perchance, could you explain how bad SATA would have been without the benefit of RAID-6?”
Sure, Ethan. This one’s for you.
So I just finished explaining how a 12-drive SATA RAID-6 has 171X the reliability of a 12-drive SAS RAID-5.
MTTDL total:
SAS RAID-5 = 506 years
SATA RAID-6 = 86,695 years
But what if the SATA array was a 12-drive RAID-5? How far would reliability drop? Pretty far, it turns out.
MTTDL total:
SAS RAID-5 = 506 years
SATA RAID-5 = 38 years
It dropped by a factor of over 2000!! Now it’s clear that the reliability of SAS is actually 13X that of SATA. RAID-6 can take even the worst of drives and make them reliable.
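For anyone who wants to check the multipliers, here is a minimal Python sketch that recomputes them from the three MTTDL figures quoted above. The dictionary keys and the `ratio` helper are just illustrative names; the sketch does not derive the MTTDL values themselves, it only divides them.

```python
# MTTDL figures (in years) as quoted in the post, all for 12-drive arrays.
mttdl_years = {
    "SAS RAID-5": 506,
    "SATA RAID-6": 86_695,
    "SATA RAID-5": 38,
}


def ratio(a: str, b: str) -> float:
    """Return how many times longer configuration a's MTTDL is than b's."""
    return mttdl_years[a] / mttdl_years[b]


raid6_vs_sas = ratio("SATA RAID-6", "SAS RAID-5")      # the "171X" claim
raid6_vs_raid5 = ratio("SATA RAID-6", "SATA RAID-5")   # the "over 2000" claim
sas_vs_sata = ratio("SAS RAID-5", "SATA RAID-5")       # the "13X" claim

print(f"SATA RAID-6 vs SAS RAID-5:  {raid6_vs_sas:.0f}x")
print(f"SATA RAID-6 vs SATA RAID-5: {raid6_vs_raid5:.0f}x")
print(f"SAS RAID-5 vs SATA RAID-5:  {sas_vs_sata:.0f}x")
```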
Stay tuned for my next installment, where I explore the ramifications of using witchcraft in arrays by seeing if a SATA RAID-6 array weighs more than a duck. (Remember, only very small rocks float.)
Yep, darn good reading.
TT
November 3rd, 2005 at 11:32 am
This is all very well and good, but won’t things like ILM change the landscape here????? Because I don’t really expect my data to sit on a single storage subsystem for ever. When the big beautiful ILM picture gets painted, aren’t I just going to continually move stuff around between tiers of storage with ever decreasing prices?????
Because you’re talking big numbers in any case. Even 38 years is a big number relative to the time that I expect any of my data to sit on a particular storage subsystem. In the next 38 years someone, somewhere is going to come up with a new storage methodology and I’ll be migrating my data to that.
Obviously 506 and 86,000 are bigger numbers….. but you’re right. Small rocks float. But even smaller rocks float, and really tiny rocks float too.
So if all you’re looking for is a rock that floats well enough to get you over to the shore, well……. isn’t the RAID-5 good enough when you connect it to the anticipated revolution in management software?
November 3rd, 2005 at 12:02 pm
Yep, you’re right, J. You’re going to have to determine the MTTDL of all the locations where data may reside.
Regarding the 38+ years, sure that’s a long time. And with one storage subsystem you would stand a very, very small chance of losing data. But obviously a big company with lots of sites stands a higher chance. It’s all just risk mitigation. And I’d agree that disk failure may not be your #1 concern. I’d put my money on power supplies and hurricanes. I just like to talk about RAID.
RAID-5 might be good enough, but my point is that RAID-6 is better with hardly any downside. See my previous post about parrots and the fjords.
TT
November 3rd, 2005 at 12:16 pm
Well, even if I had a bunch of drives, and assuming that a hurricane doesn’t come through … it takes a bunch o’storage to get to the point where replacing a drive is at least as common as buying, let’s say, cheese.
At 38 years, don’t I have to have about 2000 drives before I’m replacing one a week? And at 500GB each, that’s a petabyte. Of course, if I were going to use the 500GB drives, I’d have to migrate the data off the one drive I’ve got now, which would restart the clock, which was kind of my earlier point.
At once a month, I gotta have 250TB, so even if I only want to visit Fry’s on the old heathen festival dates, I’m good for nearly 100TB. Assuming my math is close.
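The math above can be checked with a rough sketch. It follows the same simplification the comment makes, namely treating the 38-year figure as if it were the mean life of one drive (the post quotes it as the MTTDL of a whole 12-drive array, so this is an assumption, not the post's model). The function name is made up for illustration.

```python
MEAN_LIFE_YEARS = 38   # figure from the post, treated here as per-drive life
DRIVE_GB = 500         # drive size used in the comment


def drives_for_failure_rate(failures_per_year: int) -> int:
    """Drives needed so that, on average, this many fail each year."""
    return MEAN_LIFE_YEARS * failures_per_year


weekly_drives = drives_for_failure_rate(52)    # one replacement a week
monthly_drives = drives_for_failure_rate(12)   # one replacement a month

weekly_tb = weekly_drives * DRIVE_GB / 1000
monthly_tb = monthly_drives * DRIVE_GB / 1000

print(f"One a week:  {weekly_drives} drives, about {weekly_tb:.0f} TB")
print(f"One a month: {monthly_drives} drives, about {monthly_tb:.0f} TB")
```

Under that simplification the numbers land close to the estimates in the comment: roughly a petabyte for weekly replacements and a bit under 250TB for monthly ones.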
November 3rd, 2005 at 2:02 pm
Cheese? OK, I see your point.
If you’re talking about Fry’s then you’re coming from a user point of view. And I agree with your points. Heck, I don’t run RAID-5 at home, and I could probably “borrow” as many Adaptec RAID controllers as will fit under my shirt.
But if you’re a business, how much do you want to pay to avoid down time? How much do you lose by having your business off the web, or your employees not producing. We are just talking about adding another drive to your array. That seems like relatively cheap insurance.
TT
November 3rd, 2005 at 2:06 pm
Ah. J Lumber. Jack Lumber. Lumber Jack. I get it. I’m a slow learner.
And I bet you just don’t care, what with you sleeping all night and working all day. Yeah, you’re okay.
It’s good to see another Python fan out there. I don’t think anyone else gets my jokes.
TT
November 4th, 2005 at 11:29 am
Still not quite sure you’re seeing my point. (Look, look, I’m being repressed!).
Whether I buy my drives at Fry’s or not really isn’t the issue. I’m just trying to see what the impact of losing a drive would be. My outfit certainly doesn’t have 100TB of storage (yet), but according to your numbers, odds are that with RAID-5 I’m only going to replace a couple of drives a year due to failures. And I certainly know that I’m going to be buying and deploying and configuring and managing a lot more new drives than that just to handle normal growth.
So even though RAID-6 doesn’t cost me much, it also doesn’t seem to give me much. It’s a tiny bump on my storage management workload, and if I go that path I have to believe that the math crack-heads have done the algorithms right on some bleeding-edge stuff…….
Must be my paranoia, but it just doesn’t seem like a particularly good change for my company to make at this point.
Personally I’d like to see the storage guys work on making the other management problems disappear so that I don’t have to fire up the ugly GUIs and command lines all the time.
November 4th, 2005 at 1:58 pm
Ah, yes. The “ugly GUIs”. I assume you’re talking about how the IT guy has to understand the nuances of nine different RAID levels, each with variable stripe sizes, as well as the effect of the cache tunables, and how all that relates to his access pattern, to properly set up the controller through a vendor-unique GUI. And if it’s external storage you force him to set up switches, HBAs, failover, etc., and all those pieces, assuming they play together, have their own vendor-unique GUI. What’s so hard about that?
It makes just as much sense as allowing strange women lying in ponds distributing swords to be the basis for a system of government. After all, supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. But I digress…
I wonder what percentage of controllers are set up correctly. And if they are set up incorrectly, do the IT guys know? If they know the IO access pattern and what the performance should be, then they could run IOMeter - but that’s just as techy as setting up RAID to begin with. These guys are running real world Exchange and Oracle applications where it’s a little more difficult to determine where the performance bottleneck is.
So it sounds like you’re looking for just a bunch of storage that’s just always there, always reliable and always fast. You may not care where it is or how it’s built. You want it to survive hurricanes. Egads. You’re going to make me say the V word. But this is probably the one place where I’d happily say it - Virtualization. Storage adapts to the access pattern and can grow or shrink as necessary. Commonly accessed data migrates inward to the faster storage devices that are closer to the server, while cold reference data migrates away to slower storage further away. There’s also some remote mirroring magic involved. And all of this is set up with a simple, industry-standard GUI with as few questions as possible. Yes, this is what virtualization should be all about.
There are a lot of products out there that have little pieces of this puzzle. SMI-S was supposed to be the glue that would bring it all together. Now the word on the street is that the upstart Aperi folks may tackle this problem. I’m not sure which answer is right, but we’d probably all benefit from the competition.
I have no idea if that’s what you were asking, but I thought it was a darn fine answer.
Am I getting closer?
TT
November 4th, 2005 at 2:48 pm
Wow. I was thinking of heading to Fry’s to see if I could pick up another drive, and now you’re off down the path of $100,000’s worth of virtualization gear.
It was a darn fine answer, especially the bit about the watery bint with the sword, but I think we’re getting a bit far from my budget.
On the other hand — the question about “do the IT guys know what they’ve got?” is interesting. If Iometer’s so hard to run …. why not bundle it or something like it with your HBAs with a little “wizard” button that does a run and tells me how well I’m doing? Look at all the “Internet speed tests” that let you know how fast your cable modem/DSL is running without having to be a network guru. Why don’t I get one of those with my RAID controller????? (Hey, maybe you guys should pay me for this brilliant idea!?)
And if I’ve got Exchange running on my system why don’t you just set it up right for me from the get go. You can tell that Exchange is running, can’t you??????? (That’s a twofer….)
Enough already…… beware or I may taunt you some more.
November 4th, 2005 at 4:00 pm
I believe it was a “scimitar”, not a sword.
That level of virtualization shouldn’t be expensive. “It’s just software.” But I’d agree that it’s currently that expensive - if you could even find something that did all those things.
Regarding IOMeter and performance monitoring software, yep, I know exactly what you mean. But I can’t comment on future products.
And regarding Exchange, yep again. Change the GUI checkboxes from RAID-1, 5, 6, etc., to Exchange, Oracle, etc. Say no more.
Add a little auto-morphing (changing RAID levels as work load changes), and you’ve got that $100,000 product that you referenced at a fraction of the cost.
But we still need a management infrastructure to glue all the other pieces together. Hopefully SMI-S or Aperi will step up.
TT
November 4th, 2005 at 6:56 pm
Just remove the darn checkboxes altogether.
As you correctly said in the earlier post ….. you RAID guys know a lot more about what’s going to be best for me and my applications than I ever will. And I sure as heck don’t have the time to go off and try different things to find the best result.
Why don’t you just do it right for me out of the box? And sure, I could confuse the issue by going ahead and installing a new app later on. But I’d be no worse off than I am now even if your initial guess wasn’t the best for my new app. And if you were watching you could tell I’d installed it anyway. We’re only talking about three or four apps. Jeez, just check for them being installed every day or two. If I install Exchange (perish the thought) you could pop up a nice friendly box that said “We detected that you installed a Behemoth Email Application. We’ve reconfigured your storage to handle it”. Can you imagine how many emails of delight you’d get from folk????????
It’d be like getting spam every day for a year.
[spam, spam, spam, spam, …. .beautiful spam, wonderful spam!]
November 5th, 2005 at 5:43 am
Sounds like you’re talking about BDS - Brain Dead Simple.
Your idea about changing the RAID level when new apps are installed is interesting - and not very difficult.
But I wonder what we should do if we detect two apps on the same volume, for example G:, that require different RAID levels. Making one volume have two RAID levels based on block number (yikes, HP AutoRAID?!) makes my head hurt, and automatically creating an H: volume and moving one of the apps is very scary. I suppose at a minimum we can just tell the user what we’ve detected and what we recommend that they do.
This has been a very interesting exchange, and you’ve definitely given me some new things to think about. You may have just brought BDS to a new level. I’d like to buy you lunch and talk if we ever run into each other at an industry event. I think I’ll know what you’ll be wearing.
TT
November 5th, 2005 at 7:16 am
With drives being as inexpensive as they are today, it seems like most customers would be better off with a RAID-1 or RAID-10 array, as this avoids the write penalty associated with the RAID-5 and RAID-6 write-back algorithms. Regardless of the array type though, if the Windows operating system typically fails within a few years, does it really matter whether the disk array fails in 38 years or a few thousand years? If the operating system fails and the data files get corrupted, having double redundancy instead of single redundancy will not make a difference.
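The trade-off behind that suggestion can be sketched in a few lines. The I/O counts below are the usual textbook figures for a small random write (mirror: two writes; RAID-5: read data and parity, then write both; RAID-6: the same with two parities), and they assume no write cache or full-stripe writes are helping. The names are illustrative only.

```python
# Disk I/Os generated by one small random host write (textbook values,
# assuming read-modify-write and no caching or full-stripe optimization).
SMALL_WRITE_IOS = {"RAID-10": 2, "RAID-5": 4, "RAID-6": 6}


def usable_drives(level: str, total_drives: int) -> int:
    """Number of drives' worth of capacity left after redundancy."""
    if level == "RAID-10":
        return total_drives // 2      # every drive is mirrored
    if level == "RAID-5":
        return total_drives - 1       # one drive's worth of parity
    if level == "RAID-6":
        return total_drives - 2       # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


for level, ios in SMALL_WRITE_IOS.items():
    print(f"{level}: {usable_drives(level, 12)} of 12 drives usable, "
          f"{ios} disk I/Os per small write")
```

So a 12-drive RAID-10 writes faster but gives up half its capacity, while RAID-5 and RAID-6 keep 11 and 10 drives' worth at the cost of extra I/Os per small write.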
November 5th, 2005 at 9:14 am
Wayne,
Yeah, in a perfect world everyone would use RAID-10. I remember describing RAID-5 to my boss back in the early 90’s and he thought I was crazy. “Why suffer the performance penalty of RAID-5 when drives are so big and cheap.” (I think they were 100MB! Wow!) And then every year afterwards, while the engineers were fixing yet another tough RAID-5 bug, they would say, “Why mess with RAID-5 when drives are so big and cheap.” And then drives hit 1GB (inconceivable!), and they said… I think you get the point.
And all that time, the IT Nazis (I mean that affectionately, George) could never give us enough drive space to do our jobs. It seemed that we spent more time zipping and deleting than coding. I guess they kept running out of drive bays, etc. RAID-5 would have given us a lot more space than RAID-10.
So if performance is the goal, and cost is no object, RAID-10 is the way to go.
But something bugs me about the SATA MTBF calculations. I stand by all my previous comments about 38 years, 1000 years, whatever. And I’m comfortable that the equations are correct. But are the MTBF numbers that the drive guys are giving us correct? I have four machines at home with one or two ATA hard drives in each one. My unscientific gut feel is that a drive seldom lasts more than four years. That’s an MTBF of roughly 35,000 hours, quite a bit less than the 1,000,000 hours quoted by the drive guys.
Hmm. Maybe RAID-6 isn’t even good enough. Did I just talk myself into RAID-n? What am I overlooking?
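The unit conversion is simple enough to sketch. Note that equating "a drive seldom lasts more than four years" with an MTBF is the gut-feel shortcut used above, not a proper statistical estimate; the helper names are made up.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def years_to_hours(years: float) -> float:
    return years * HOURS_PER_YEAR


def hours_to_years(hours: float) -> float:
    return hours / HOURS_PER_YEAR


observed_hours = years_to_hours(4)          # gut-feel drive life
quoted_years = hours_to_years(1_000_000)    # vendor-quoted MTBF

print(f"4 years is {observed_hours:,.0f} hours")
print(f"1,000,000 hours is about {quoted_years:.0f} years")
```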
TT
November 7th, 2005 at 7:52 am
Wow, you’re right, Seagate is claiming an MTBF of 1.0 million hours for their NL35 Series disk drives! Without having any scientific proof to back it up, I agree with your estimate, Tom. While 4 years sounds about right, and 10 years seems possible, 114 years is way way out there. I know of no mechanical system that will last this long without maintenance, such as replacing bearings. Even if you ran a thousand disk drives for a full year under abnormally heavy load and none failed, I don’t see how you can project this. Eventually, the disk drive seals will deteriorate, heat will affect electrical components, and the cycles of expansion and contraction will crack solder joints.
According to one site I checked, a one million MTBF means that out of a million hours of testing only one drive failed. However, the site goes on to say that since “disk drives are typically tested only a few hours” it is “unlikely for a failure to occur during this short testing period. Because of this, MTBF ratings are also predicted based on product experience or by analyzing known factors such as raw data supplied by the manufacturer.”
Given that Seagate only warranties their NL35 series of drives for 5 years, I’m inclined to think that the manufacturer really believes that these drives will fail shortly thereafter. Perhaps what the disk drive industry needs is another reliability number, as MTBF doesn’t appear to be useful in calculating a product’s expected lifetime.
November 7th, 2005 at 9:21 am
I’m beginning to think we don’t know what MTBF means. I’m going to try pulling one of my old QA friends into the discussion.
Regarding the 5 years, desktop SATA drives are 0.8-1.0 million hours and that warranty has been dropped to just 1 year.
Maybe the issue is the number of start/stops. Or perhaps the 1 million hours is in a perfect setting. I know my home workstations are pretty dang hot.
It’s a mystery.
TT
November 8th, 2005 at 5:43 am
New info: While reviewing some SAS documentation (I think it came from the STA), I came across a table that indicated an MTBF of only 500K hours for desktop SATA drives - and that was only if they were powered on for a maximum of 8 hours a day!! I can also assume that it means they’re powered on just 5 days out of each week. That’s just 40 hours a week out of the 168 hours in a week. So if the drive is always on (like in my home) is the actual MTBF only 119K hours? That’s still 13.5 years, but getting closer to the ~4 years that I’ve experienced.
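Here is that derating as a small sketch. It assumes, as the comment does, that MTBF scales linearly with powered-on hours, which is a guess rather than anything the drive vendors state; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365
HOURS_PER_WEEK = 24 * 7   # 168


def derated_mtbf(rated_mtbf_hours: float, rated_hours_per_week: float) -> float:
    """Scale a duty-cycle-rated MTBF down to always-on use (linear assumption)."""
    return rated_mtbf_hours * rated_hours_per_week / HOURS_PER_WEEK


always_on_hours = derated_mtbf(500_000, rated_hours_per_week=8 * 5)
always_on_years = always_on_hours / HOURS_PER_YEAR

print(f"Always-on MTBF: about {always_on_hours:,.0f} hours "
      f"({always_on_years:.1f} years)")
```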
TT
November 14th, 2005 at 8:00 am
MTBF/MTTF stats from all vendors are entirely untrustworthy, for pretty much the same reason that analysts are untrustworthy: 86% of statistics are made up on the spot.
Reality depends on your actual workload characteristics, which tend to be deflated in lab testing. AFR is a much better predictor of drive resiliency in my experience.
November 16th, 2005 at 5:51 am
And the other 14% are lies and damn lies.
AFR (Annual Failure Rate) is typically defined as the inverse of MTBF, making it just as untrustworthy as MTBF. But I agree that it’s an easier number to “fit in your head”. Who the hell knows how long 119K hours is? If you invert it, adjusting for a change in units, you’ll get 7.4% which is much easier to understand. For comparison, an enterprise drive with an MTBF of 1.4M hours has an AFR of 0.6%. Seems low.
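The inversion looks like this as a sketch. The first function is the simple "hours in a year divided by MTBF" form used above; the second is the form you get if you assume a constant failure rate (an exponential model), which is my addition for comparison and gives a slightly lower number.

```python
import math

HOURS_PER_YEAR = 24 * 365


def afr_simple(mtbf_hours: float) -> float:
    """AFR as the inverse of MTBF, adjusted from hours to years."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours: float) -> float:
    """Chance a drive fails within a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


desktop = afr_simple(119_000)        # derated desktop SATA figure
enterprise = afr_simple(1_400_000)   # enterprise drive figure

print(f"Desktop:    {desktop:.1%} (exponential: {afr_exponential(119_000):.1%})")
print(f"Enterprise: {enterprise:.1%}")
```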
TT
February 13th, 2006 at 10:26 am
MTBF is only a reliable number within the expected service life of a hard drive (generally 5 years or so). No one, not even manufacturers, expects a single hard drive to last 100 years, but they do expect about one failure a year out of every 100 hard drives. After five years, all bets are off, and all those 100 drives might explode into flames!
I’ve got a question about SBOD improvements to availability. A vendor is suggesting to me that their SCSI RAID5 SBOD can provide a better system MTBF than most RAID6 JBODs because of the improvement in predicting future hard drive failure. Supposedly the SBOD switch can notice a few soft errors on a drive, and suggest that you replace it before it actually blows up.
Based on SMART data, I’ve seen algorithms to predict drive failure with a warning accuracy of 20%-60% with false alarm rates of 2%-4% (close to the annual average drive failure rate).
Supposedly SBOD traffic latency analysis can do an even better job, but I haven’t found the data yet!
(SMART data from “Improved Disk Drive Failure Warnings”, Hughes/Murray/Kreutz-Delgado, IEEE Transactions on Reliability, Sept. 2002)
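To see what those rates would mean in practice, here is a rough sketch. It assumes a hypothetical population of 1,000 drives with a 3% annual failure rate, and it reads "false alarm rate" as the fraction of healthy drives flagged in a year; both are my assumptions, not figures from the paper. The two cases simply pair the ends of the quoted ranges.

```python
def warning_outcomes(drives, annual_failure_rate, detection_rate, false_alarm_rate):
    """Return (true warnings, false warnings, fraction of warnings that are real)."""
    failing = drives * annual_failure_rate
    healthy = drives - failing
    true_warnings = failing * detection_rate
    false_warnings = healthy * false_alarm_rate
    precision = true_warnings / (true_warnings + false_warnings)
    return true_warnings, false_warnings, precision


# Optimistic corner: 60% of failures predicted, 2% false alarms.
best = warning_outcomes(1000, 0.03, 0.60, 0.02)
# Pessimistic corner: 20% of failures predicted, 4% false alarms.
worst = warning_outcomes(1000, 0.03, 0.20, 0.04)

for label, (hits, false_alarms, precision) in (("best", best), ("worst", worst)):
    print(f"{label}: {hits:.0f} real warnings, {false_alarms:.0f} false alarms, "
          f"{precision:.0%} of warnings are real")
```

Under these assumptions, even the optimistic case has about as many false alarms as real warnings, which is what you would expect when the false alarm rate is close to the failure rate itself.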
February 17th, 2006 at 6:13 am
Thomas,
Yep, I agree that a drive won’t last 100 years. MTBF is just a statistical measurement that becomes invalid after so many years.
Regarding predictive failure analysis, also known as PFA, or SMART:
A vendor saying that RAID-5 with SMART is better than RAID-6 is a little misleading. Of course RAID-6 can also take advantage of SMART.
However I think I see the point they’re trying to make. Using SMART you can detect a bad drive before it fails and replace it, hopefully avoiding the condition of two simultaneous failures and thereby not requiring RAID-6.
But that’s not why we should use RAID-6. The main purpose of RAID-6 is to protect against bit errors detected during a rebuild. SMART can’t protect against that.
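To put a rough number on that rebuild risk, here is a back-of-the-envelope sketch. It assumes a 12-drive RAID-5 of 500GB drives (the sizes mentioned earlier in this thread) and an unrecoverable read error rate of 1 in 10^14 bits, a figure commonly quoted on desktop SATA data sheets. Neither that rate nor the independence of bit errors comes from this post, so treat both as assumptions.

```python
import math


def rebuild_error_probability(drives: int, drive_bytes: float, ber: float) -> float:
    """Chance of at least one unrecoverable read error while rebuilding a
    degraded RAID-5, which has to read every surviving drive in full."""
    bits_read = (drives - 1) * drive_bytes * 8
    # 1 - (1 - ber) ** bits_read, computed in a numerically stable way.
    return -math.expm1(bits_read * math.log1p(-ber))


# Assumed values: 12 drives, 500GB each, 1 error per 1e14 bits read.
p_sata = rebuild_error_probability(drives=12, drive_bytes=500e9, ber=1e-14)

print(f"Chance of a read error during rebuild: {p_sata:.0%}")
```

With RAID-6, a read error hit while rebuilding a single failed drive can still be corrected from the second parity, which is the protection described above.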
BTW, you mention an SBOD switch detecting SMART errors. I’ve never seen a switch that does that. RAID is done in an HBA or external controller. Why would a switch be checking for SMART errors, and if it saw one, wouldn’t it just inform the RAID stack anyway? Or maybe I’ve misunderstood how they’re defining an SBOD. I usually think of an SBOD as being a JBOD with a switch - and therefore it’s typically FC and soon SAS.
If you can tell me which vendor you’re talking about, I’d love to look into what they have.
Thanks for reading. I hope you find the posts useful.
TT
February 17th, 2006 at 7:39 am
Tom,
If you look at this paper:
http://www.emulex.com/products/white/fcswitch/Advancing_Storage_Reliability_wp.pdf
you will see the typical claims for enhanced SBOD reliability through monitoring.
Here are the claims:
1) JBODs use serially connected FC-AL topologies. A rogue drive on an FC-AL segment may cause one or both FC-AL channels to become unavailable, potentially making all drives unavailable. SBOD can isolate a rogue drive more effectively.
2) SBOD allows a newly inserted drive to be dynamically tested before it is ever placed into service.
3) SBOD controllers are better set up to monitor traffic trends to see potential bottlenecks going into individual drives, which may reveal drives that are about to go bad, but do not yet report CRC errors.
4) SBOD controllers examine words on FC links to each drive looking for CRC errors. In JBOD, the source of CRC errors on the FC-AL can be elusive because of the shared, serial nature of the topology. SBOD can track down the exact port and Source ALPA of frames with CRC errors.
5) SBOD can perform a clock check comparing the relative frequency of attached devices (both drives and initiators). Identifying devices with out-of-tolerance clock frequencies allows them to be replaced under optimal system conditions before failure.
6) SBOD provides enhanced MTTRs because it can tell you right away which cabling error occurred during repair.
The question is, does all that really matter, and if so, how much?
July 25th, 2006 at 2:39 pm
So I guess my 2 raid 6 arrays mirrored together with ZFS should be OK…
July 25th, 2006 at 2:46 pm
Uh, yeah. I think you’re pretty safe.
July 25th, 2006 at 2:52 pm
Using commodity non-Enterprise drives for cost purposes with HS configured on each — I need utter reliability…
I was planning on a straight RAID 6 until I got enthused by ZFS…
Commodity desktop drives might fail slightly more often than the enterprise ones, but they cost less, so I can get a few extras, and in this configuration they should work fine.
Thanks for the interesting read!