Real-life RAID reliability
Posted in Storage Interconnects & RAID, Advisor - Tom Treadway by Tom Treadway
Now that we know how to calculate the Mean Time To Data Loss (MTTDL) for RAID-5 and RAID-6 arrays, let’s take a look at some real-life cases using actual SATA and SAS/SCSI arrays. For this exercise I decided to use Seagate drives because (1) they’re popular, and (2) they’re more open on their website about Mean Time Between Failures (MTBF) and Bit Error Rate (BER).
Seagate’s enterprise drives fall into two categories: high-reliability Cheetah and Savvio SAS/SCSI/FC drives and nearline NL35 SATA drives. Within each category, the drives share the same reliability characteristics.
Seagate Cheetah 10K.7 300GB SAS/SCSI/FC
- MTBF = 1.4M hours
- BER = 1 out of 10^15
Seagate NL35 400GB SATA/FATA
- MTBF = 1.0M hours
- BER = 1 out of 10^14
Next, I’m going to assume a 12-disk array. The Cheetah SAS drives will be configured as a RAID-5 and the NL35 SATA drives as a RAID-6. The SATA drives have a higher capacity, but RAID-6 also requires an additional drive for data protection, so the final capacity comes out close enough.
I’ll leave the math as an exercise for the user, but the result is the following for MTTDL taking only MTBF into account:
MTTDL due to MTBF
- SAS RAID-5 = 52,933 years
- SATA RAID-6 = 6,694,442 years
Wow. That’s a pretty convincing 126X difference in favor of RAID-6 SATA. Now let’s look at only the effect of BER.
MTTDL due to BER
- SAS RAID-5 = 511 years
- SATA RAID-6 = 87,832 years
Yikes. That’s a 172X difference in favor of RAID-6 SATA. And they’re both much smaller numbers than the MTBF-only calculations.
If you combine the two you’ll see that the BER has the predominant effect on the final MTTDL.
MTTDL total
- SAS RAID-5 = 506 years
- SATA RAID-6 = 86,695 years
This boils down to a 171X difference in favor of RAID-6 SATA over RAID-5 SAS.
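The post does not spell out how the two MTTDL figures are combined. A minimal sketch, assuming the two loss mechanisms are independent so that their rates (the reciprocals of the MTTDLs) simply add; the function name `combine_mttdl` is made up for illustration:

```python
def combine_mttdl(mttdl_mtbf_years, mttdl_ber_years):
    """Combine two data-loss mechanisms by adding their failure rates.

    Assumption (not stated in the post): the mechanisms are independent,
    so the combined MTTDL is the reciprocal of the summed reciprocals.
    """
    return 1.0 / (1.0 / mttdl_mtbf_years + 1.0 / mttdl_ber_years)


# Inputs are the MTBF-only and BER-only figures quoted in the post.
sas_raid5 = combine_mttdl(52_933, 511)
sata_raid6 = combine_mttdl(6_694_442, 87_832)

print(f"SAS RAID-5  : {sas_raid5:,.0f} years")
print(f"SATA RAID-6 : {sata_raid6:,.0f} years")
print(f"Ratio       : {sata_raid6 / sas_raid5:.0f}X")
```

Under that assumption the sketch lands on the 506-year and 86,695-year totals listed above, and it shows why the smaller BER-only figure dominates the result.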
So what have we learned today? First, the oft-quoted drive MTBF seems to have very little to do with reliability in arrays. It’s all about Bit Error Rates.
Second, RAID-6 can make a set of cheap drives with 10X the bit error rate of high-end expensive drives reliable enough for the enterprise environment.
One last thought: It’s very possible that SATA drives are even better than we think. There’s been a persistent rumor in the industry that the drive guys lower the specs on their ATA/SATA drives just to make the SCSI/SAS/FC drives look better – and to justify their higher cost. So maybe the BER on all these drives is closer than we think. And maybe the same thing’s true for MTBF. Who knows? These drive guys are very, very sneaky.
TT
January 4th, 2006 at 12:26 am
Tom,
trying to do my homework and use your formulas for my business case,
I need to verify that my Excel gives the same results as yours.
However, you do not provide the MTTR for SAS or SATA, so I am unable to check MTTDL_MTBF.
Could you clarify your assumptions, and tell me whether you used your MTTR formula or real-world measurements?
Thanks anyway for your great job on this blog
January 4th, 2006 at 8:47 am
Jacques,
The MTBF and MTTR have very little effect on MTTDL; the BER is more significant. Therefore I simply used a calculated value for MTTR. Sorry for not being clear.
Here is a link to a spreadsheet that you can use to calculate all of this data, as well as performance and cost data, in numerical and graphical format. You’ll notice that the spreadsheet currently has specs for some older drives, but it should be obvious how to plug in new specs.
Let me know if you have any problems with the spreadsheet.
And thanks for reading. It’s not a thrilling topic, but hopefully I can help others make sense of the Storage World.
And feel free to post questions if there are any new topics you would like to see us cover.
TT
January 5th, 2006 at 5:59 am
Thanks a lot for sharing the spreadsheet; I will let you know of any problems.
The main issue obviously is the MTBF we feed into the model.
As you explain somewhere else, 1M hours is not consistent with what people measure in the field, even for SCSI.
For example, the Terraserver people report 24 broken disks in 3 years (2000-2003) with 78 SCSI disks! That works out to an MTBF of around 85,000 hours!
http://arxiv.org/ftp/cs/papers/0502/0502010.pdf
So one may wonder where the truth is …
Jacques
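The 85,000-hour figure in the comment above can be reproduced with a back-of-the-envelope estimate. This is only a sketch, assuming all 78 disk slots were powered on for the full three years and failed disks were replaced immediately; the function name `field_mtbf_hours` is made up for illustration:

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def field_mtbf_hours(disk_count, years, failures):
    """Estimate MTBF as total powered-on disk-hours divided by failures.

    Assumption: the population stays constant (failed disks are replaced)
    and every disk runs 24/7 for the whole period.
    """
    return disk_count * years * HOURS_PER_YEAR / failures


estimate = field_mtbf_hours(disk_count=78, years=3, failures=24)
print(f"Field MTBF estimate: {estimate:,.0f} hours")
```

With those numbers the estimate comes to 85,410 hours, more than ten times lower than the rated 1.0M to 1.4M hours.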
May 24th, 2006 at 9:34 am
Tom,
Thanks for providing your spreadsheet - makes things so much easier. Could you elaborate on the formulas you use for calculating the Write IO/sec/drive and per array? I’m new to this field and I’d like to know the reasoning behind these. Are they applicable for all the popular RAID levels (including the ones using mirroring)? Also, is there a way to reflect the controller cache in any of the calculations?
Keep the great posts coming!
Ivo
June 28th, 2006 at 4:26 am
Ivo,
I’ve uploaded a document here that goes into more detail on how the equations were derived, including the performance estimates. Enjoy!
TT
P.S. It’s actually not too hard to estimate the effect of the cache on performance. You “simply” have to figure out how the cache changes the IO pattern on the drive. For example, with a good write-back cache the short sequential writes from the host will become a smaller number of long sequential writes. And with a read-ahead cache, short sequential reads will become long sequential reads. And lastly, with a very large queue depth, or a large write-back cache, random access will become slightly less random, i.e., the seek times will be reduced. A more detailed analysis of how cache affects IO is frankly just real dang hard.
July 25th, 2006 at 6:01 pm
After reading a paper on BER [1] I did some very rough calculations on the numbers they reported. What I came up with was a mean BER of 1 in 10^15.6 bits, or 1 in 450 terabytes. 10^16.4 bits were read in total and I counted a total of 6 bit errors. The drives in question are a WD2500JD (rated at 1 in 10^14) and a 7Y250M0 (rated at 1 in 10^15).
[1] http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-166
July 26th, 2006 at 5:09 am
Great test. I find it very cool that you got within 10 times the rated BER. That may seem like a lot, but it actually sounds pretty close to me for such a small sample size. The drive vendors probably supply a conservative value rather than an average.
One comment in the paper struck me as odd. To paraphrase, “… the BER is not relevant … a more meaningful metric is MTTDL …” But aren’t bit errors during rebuild one of the biggest causes for data loss?
Thanks for reading and commenting.
TT
July 26th, 2006 at 2:22 pm
I was being conservative when I stated 10^15.6. Not all the errors I counted were true bit errors.
“From the point of view of the programmer, we have seen 3 uncorrectable read errors. From the point of view of the operating system and the disk drive, there have been 30 uncorrectable read errors. There are actually at least 12 disk extents which encountered errors that were unrecoverable by the disk; 3 of these remained unreadable after four retries by the file system and imply data loss.”
If I read that correctly only 3 true unrecoverable bit read errors happened. If we calculate the new BER with 3 errors we get:
10^16.4 / 3 = 10^15.923 bits, or one error in about 952 terabytes. I believe this is why they say “MTTDL is a more meaningful metric”. I think the only thing that’s clear is someone else needs to redo the test with better controls and a larger sample size. If it is true that the BER of SATA = SCSI then it does change a few things.
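For anyone who wants to check the arithmetic in this comment, here is a rough sketch. It assumes "terabyte" means 2^40 bytes, which is the reading that matches the quoted figure of roughly 952 terabytes; the helper name `bits_per_error` is made up for illustration:

```python
import math

TOTAL_BITS_READ = 10 ** 16.4  # total bits read, as quoted from the paper
BYTES_PER_TERABYTE = 2 ** 40  # assumption: binary terabytes


def bits_per_error(total_bits, error_count):
    """Average number of bits read per observed unrecoverable error."""
    return total_bits / error_count


for error_count in (6, 3):
    bits = bits_per_error(TOTAL_BITS_READ, error_count)
    exponent = math.log10(bits)
    terabytes = bits / 8 / BYTES_PER_TERABYTE
    print(f"{error_count} errors: 1 in 10^{exponent:.3f} bits "
          f"(about {terabytes:,.0f} terabytes per error)")
```

Note that the 450-terabyte figure in the earlier comment comes from rounding the exponent down to 15.6 before converting, so the 6-error line printed here comes out somewhat higher.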
——————–
BTW, what units are you using to calculate MTTR? I can’t get your equations to match your examples here. Where are you getting M from, and is C the sum total of all drives in the array? What would the MTTDL_BER_R6 be if you used 10^15.923?
July 27th, 2006 at 5:08 am
The drive industry has historically listed SCSI (and SAS) drives as having a 10X lower BER than ATA (and SATA) drives - 1 in 10^15 vs. 1 in 10^14, if I remember correctly. I’ve never been able to determine if it’s actually true. I’ve always suspected that they do this to justify the higher price for SCSI and SAS.
As far as MTTR, you’re making me use the way-back machine. I’ll have to go off and figure out how it was calculated. I’ll get back to you…
July 27th, 2006 at 6:18 am
I checked the MTTR calculations, and I “think” they’re right. Here’s what I did:
I started with the media rate - let’s say it’s 80MB/s. Then I figured that the RAID card would read all the drives in parallel, waste a revolution doing an XOR, and then write the parity. Note that the card won’t actually read and write in nice, neat whole tracks, but with the drive cache enabled (both read-ahead and write-back) the result should feel like entire tracks are read and written. This all results in the 80MB/s being cut to a third, or ~27MB/s. I then convert this to MB per hour to match the MTBF numbers used later. So this would be 97,200MB/h.
Then I used the capacity of just one of the drives, i.e., the drive being rebuilt, which in this example is 300GB. Converting 300GB to MB gives me 307,200MB. Then I simply divide capacity (MB) by rate (MB/h) to get 3.2 hours.
And that seems to match the spreadsheet.
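The steps described above can be sketched in a few lines. This follows the comment’s own numbers (80MB/s media rate, throughput cut to a third, 300GB drive at 1024MB per GB); the function name and the `slowdown` parameter are made up for illustration:

```python
def rebuild_hours(capacity_gb, media_rate_mb_s, slowdown=3):
    """Estimate rebuild time (MTTR) for one drive, in hours.

    Assumptions taken from the comment: read, XOR and write each cost a
    pass, so the effective rate is the media rate divided by `slowdown`,
    rounded to whole MB/s (80 / 3 -> ~27 MB/s).
    """
    effective_mb_s = round(media_rate_mb_s / slowdown)
    mb_per_hour = effective_mb_s * 3600  # 27 MB/s -> 97,200 MB/h
    capacity_mb = capacity_gb * 1024     # 300 GB -> 307,200 MB
    return capacity_mb / mb_per_hour


mttr = rebuild_hours(capacity_gb=300, media_rate_mb_s=80)
print(f"Estimated MTTR: {mttr:.1f} hours")
```

Raising `slowdown` is a simple way to model a less efficient RAID card or a disabled write-back cache.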
Now, in real life, it’s hard to find a RAID card that is that efficient. Also, many OEMs choose to run with the drive write-back cache disabled, causing missed revolutions on the write. Also, it’s possible that these large writes could cause the read-ahead buffer to be flushed, causing another missed revolution. And if the RAID card transfers in chunks much smaller than a track, you’ve got even more missed revolutions.
So I suppose I should have squinted a little and arbitrarily doubled or tripled the number. But the MTTR plays such a small part of the MTTDL that I figured it didn’t matter. I arbitrarily tripled the number as a test, and it only lowered the MTTDL by ~1%.
Does this match your calculations? It’s certainly possible that I screwed something up…
October 20th, 2006 at 8:12 pm
The minimum effect is slow throughput, although the more serious problem is random array failures, with the drives testing fine afterwards.
October 22nd, 2006 at 9:31 am
Agreed. That’s a frustrating problem for a user. And there is always plenty of finger-pointing when perfectly good drives fail. Is it the controller, the drive, the cable, …? Unfortunately this happens too often when new interfaces are released. I think we’re past that for SATA and SAS.
November 30th, 2006 at 1:30 pm
Tom,
Thanks for doing the math for us. How would I come up with a BER for a RAID-5 array? You have Mean Time To Data Loss from BER, but I kinda need to take time out of it and say a 7-drive RAID-5 array of 146GB Cheetahs will have an unrecoverable read error every 10^Y bits.
Thanks
Howard
December 1st, 2006 at 11:07 am
Howard, good question. The BER of a RAID-5 is simply the total of the BERs of all the drives in a degraded array, or BER*(N-1), where the RAID-5 drive count is N. Optimal arrays don’t have BERs because the errors can be corrected. I suppose there is a really, really tiny chance that a bit error could exist on the same block of two drives during a rebuild, but it’s so small that it’s not worth putting into the calculation.
Does that make sense?
TT
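To put a number on Howard’s 7-drive example, here is a minimal sketch that applies the BER*(N-1) rule from the reply literally. It assumes the 146GB Cheetahs carry the same 1-in-10^15 rating as the Cheetah 10K.7 in the post; the function name is made up for illustration:

```python
import math


def degraded_raid5_error_rate(per_drive_rate, drive_count):
    """Array error rate per the reply: per-drive BER times (N - 1).

    Only the N - 1 surviving drives of a degraded RAID-5 contribute,
    because errors on an optimal array can be corrected from parity.
    """
    return per_drive_rate * (drive_count - 1)


# Assumption: each drive is rated at 1 error per 10^15 bits.
array_rate = degraded_raid5_error_rate(per_drive_rate=1e-15, drive_count=7)
exponent_y = math.log10(1 / array_rate)
print(f"Roughly one unrecoverable read error every 10^{exponent_y:.2f} bits")
```

So under these assumptions the "Y" Howard asks about would be a little over 14.2 rather than 15.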
December 16th, 2006 at 5:44 pm
Has anyone considered the positive effect of ZFS? Apart from its other contributions to data integrity, with regular scrubbing, bit errors can almost be eliminated from consideration…
January 2nd, 2007 at 6:38 am
Sure, ZFS has plenty of positive aspects. But it’s not any better than standard RAID-6 with background data scrub, is it? To me it just seems that the data layout is more obfuscated.
February 8th, 2007 at 10:41 pm
SATA/RAID 6 seems to do just fine with MTBF. But how can we estimate the total performance of a RAID set in terms of bandwidth and availability in 24/7 operation? As Tom described in his nice “Seagate’s definition of nearline drives” article, there are a number of other factors affecting the total performance.
February 9th, 2007 at 6:15 am
JPuu, good question. As I mentioned in the nearline post, a few things that can affect performance significantly are (1) error handling, (2) rotational vibration, and (3) workload management.
I guess we can discount error handling because it “shouldn’t” happen often. And once you get an error it should be repaired and shouldn’t be a source of continuing performance degradation.
Rotational vibration is a good one. About the only way to see if you’re affected is by running a random, full-bore seek benchmark one drive at a time, and then re-running it with all drives active. Assuming that the RAID controller scales correctly as drives are added, there should be no drop-off. We ran some internal benchmarks with SAS and SATA drives interleaved in an enclosure. Wow! The SAS drives were clanking away making one hell of a racket, with no impact to performance, but the SATA drives were a different story. They dropped to around 10% of their normal performance. We were quite surprised that RV really had that effect.
Workload management is interesting in that full rotation delays (4-17ms) will be introduced when the drives are running hot. I would hope that a SMART error would be returned to indicate that this was happening. Those delays would KILL a random write application.
TT
February 9th, 2007 at 11:31 am
“SATA drives … dropped to around 10% of their normal performance”. Wow! However, mixing SAS and SATA drives in one box runs against popular wisdom. Did you run a similar test with all SAS and all SATA drives?
February 10th, 2007 at 7:26 am
Ernst, I think mixing drives can make sense for folks that need one volume of cheap, slow storage and one volume of expensive, fast storage. Most enclosures support up to 12 drives, so combining the two drive types is probably reasonable. But I’d certainly agree that combining SAS and SATA in one array is crazy.
We did this test for a variety of enclosures just to see how well (or poorly) they could isolate each drive from the RV effects of other drives. We were frankly surprised at the difference. I forget which drives we used, but we didn’t really try too many different drive types. The test was all about enclosures. But maybe we can re-run these tests with some of the new nearline drives. I’ll see if I can talk someone into doing that…
September 26th, 2007 at 5:38 am
Hi Tom.
I was very interested in seeing your spreadsheet as referenced in the comments above, but the link appears to be broken. Is there any way to repost or resend?
Thanks,
Chris
October 5th, 2007 at 8:43 am
It looks like the link to the spreadsheet is broken. Do you still have it around somewhere?
November 1st, 2007 at 10:24 am
Oops. Link is fixed. Sorry about that.
And here it is again just so you don’t have to scroll back to find it.
Enjoy!
TT