RAID reliability calculations
Posted in Storage Interconnects & RAID, Advisor - Tom by Tom Treadway
In an earlier post Joe mentioned that the reliability of SATA could be improved by using RAID-6. That’s very true, and there’s a good story to be told about the reliability of SATA with RAID-6 being much higher than the reliability of SCSI/SAS with RAID-5. So I thought I would start by reviewing how reliability is determined for both types of arrays.
MTTR (Mean Time To Replace) is the total time the array is in the degraded mode. It is calculated by adding two time periods. The first is H, the time it takes for the technician to notice that the drive has failed and to replace it. Note that H is approximately 0 if a hotspare is used. The second is the time for the rebuild to complete. In an optimum system with a controller that can (a) keep up with drive media rates, (b) perform the XOR within one rotation, and (c) rebuild in track multiples, this rebuild time can be calculated as the time to read the remaining good drives, XOR/rotate and write the replaced drives. This results in the media rebuild rate effectively being cut to a third of the actual rate. Therefore with C being Capacity and M being Media rate, MTTR is calculated as follows:
- MTTR = H + C / (M / 3)
Note that reading the next set of data while performing XOR operations on the previous set could reduce the time to rebuild the entire array, however the method presented here is more commonly used on RAID controllers.
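As a rough sketch of the MTTR equation above, the following Python function takes H in hours, C in megabytes and M in megabytes per second. The function name, the units and the sample values (a 1 TB drive, 100 MB/s, hotspare present) are assumptions chosen for illustration, not figures from this post.

```python
def mttr_hours(capacity_mb: float, media_rate_mb_s: float,
               replace_hours: float = 0.0) -> float:
    """MTTR = H + C / (M / 3), returned in hours.

    replace_hours is H (about 0 with a hotspare); the divide-by-3 models
    the read, XOR/rotate, write sequence described in the text.
    """
    rebuild_seconds = capacity_mb / (media_rate_mb_s / 3.0)
    return replace_hours + rebuild_seconds / 3600.0


# Illustrative values only: 1 TB drive, 100 MB/s media rate, hotspare (H = 0).
print(f"MTTR with hotspare: {mttr_hours(1_000_000, 100.0):.2f} hours")
# Same drive, but a technician takes a day to swap it (H = 24).
print(f"MTTR without hotspare: {mttr_hours(1_000_000, 100.0, 24.0):.2f} hours")
```

Keeping H as a separate argument makes it easy to see how much of the degraded window comes from human response time rather than from the rebuild itself.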
The reliability of an array is directly related to the reliability of the drives and the number of drive failures that can be tolerated before the array becomes inaccessible. The reliability of the drives is characterized by their MTBF (Mean Time Between Failures).
It can be easily shown that the reliability of a non-redundant RAID-0 is equal to the MTBF of the drives divided by the number of drives, D. This is referred to as the MTTDL_DF (Mean Time To Data Loss due to Disk Failure).
- MTTDL_DF_R0 = MTBF / D
Since a RAID-5 array can tolerate a single drive failure, the reliability of the remaining drives can be calculated.
- MTTDL_DF_R5_DEGRADED = MTBF / (D-1)
Combining these equations, and taking into account the rebuild time, will result in the reliability of a RAID-5 array.
- MTTDL_DF_R5 = MTBF^2 / [ D * (D-1) * MTTR ]
This can be further extended to show the reliability of RAID-6.
- MTTDL_DF_R6 = MTBF^3 / [ D * (D-1) * (D-2) * MTTR^2 ]
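The three disk-failure equations so far can be written down directly in Python. This is only a sketch: the function names are made up here, MTBF and MTTR are both assumed to be in hours, and the sample inputs (1,000,000-hour MTBF, 8 drives, 10-hour MTTR) are illustrative rather than taken from any drive datasheet.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_df_r0(mtbf: float, d: int) -> float:
    """RAID-0: MTBF / D."""
    return mtbf / d


def mttdl_df_r5(mtbf: float, d: int, mttr: float) -> float:
    """RAID-5: MTBF^2 / [D * (D-1) * MTTR]."""
    return mtbf ** 2 / (d * (d - 1) * mttr)


def mttdl_df_r6(mtbf: float, d: int, mttr: float) -> float:
    """RAID-6: MTBF^3 / [D * (D-1) * (D-2) * MTTR^2]."""
    return mtbf ** 3 / (d * (d - 1) * (d - 2) * mttr ** 2)


# Illustrative inputs only (assumptions, not measured values).
mtbf, d, mttr = 1_000_000.0, 8, 10.0
for name, hours in (("RAID-0", mttdl_df_r0(mtbf, d)),
                    ("RAID-5", mttdl_df_r5(mtbf, d, mttr)),
                    ("RAID-6", mttdl_df_r6(mtbf, d, mttr))):
    print(f"{name}: {hours / HOURS_PER_YEAR:.3g} years")
```

Each extra tolerated failure multiplies the result by roughly MTBF / (D * MTTR), which is why the numbers grow so quickly from RAID-0 to RAID-6.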
It is sometimes argued that the probability of a second disk failure after the first is much greater due to hardware or environmental problems. This is referred to as a correlated disk failure. Other failures can occur early or late in the life of a drive, and sometimes can be related to manufacturing lots. One way of taking this into account is for the MTBF of the second drive to be one-tenth the MTBF of the first drive. Likewise, the MTBF of the third drive is 1/100 of the first drive, etc. Taking correlated disk failures into account results in the following equations.
- MTTDL_DF_R5 = [ MTBF * (MTBF/10) ] / [ D * (D-1) * MTTR ]
- MTTDL_DF_R6 = [ MTBF * (MTBF/10) * (MTBF/100) ] / [ D * (D-1) * (D-2) * MTTR^2 ]
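One way to sketch the correlated-failure equations is a single Python function parameterized by the number of tolerated failures (1 for RAID-5, 2 for RAID-6). The function name and the `factor` argument are assumptions introduced here; `factor=10` reproduces the one-tenth and 1/100 derating described above, and `factor=1` falls back to the uncorrelated equations.

```python
def mttdl_df_correlated(mtbf: float, d: int, mttr: float,
                        tolerated: int, factor: float = 10.0) -> float:
    """MTTDL due to disk failure with each successive failure derated.

    The k-th failure (k = 0, 1, 2, ...) uses MTBF / factor**k, and the
    array is lost on failure number tolerated + 1.
    """
    numerator = 1.0
    denominator = 1.0
    for k in range(tolerated + 1):
        numerator *= mtbf / factor ** k   # MTBF, MTBF/10, MTBF/100, ...
        denominator *= d - k              # D, D-1, D-2, ...
    return numerator / (denominator * mttr ** tolerated)


# Illustrative inputs only: 1,000,000-hour MTBF, 8 drives, 10-hour MTTR.
print(f"RAID-5 correlated: {mttdl_df_correlated(1e6, 8, 10.0, 1):.3g} hours")
print(f"RAID-6 correlated: {mttdl_df_correlated(1e6, 8, 10.0, 2):.3g} hours")
```

With `tolerated=0` the same function reduces to the RAID-0 result MTBF / D, which is a convenient sanity check.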
A third factor in MTTDL is the failure to read one or more blocks from the remaining good drives during the rebuild of the failed drive(s). The chance of this happening depends on the bit error rate (BER), expressed here as the number of bits that have to be read on average to have one unrecoverable bit error, so a larger BER means fewer errors. Given a sector size of 512 bytes, the number of sectors that have to be read to have on average one bad sector can be calculated.
- Sector Error Rate = BER / (512 * 8)
Assuming that all errors are random, the probability PDISK of successfully reading all sectors on a disk can be calculated.
- PDISK = (1 - 1 / (BER/(512 * 8))) ^ (C / 512)
Lastly, the probability PARRAY of not being able to read all the sectors in the array can be calculated.
- PARRAY = 1 - PDISK^D
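The sector error rate, PDISK and PARRAY steps can be sketched in Python as follows. The function names are invented for this example, capacity is assumed to be in bytes, and BER is the number of bits read per unrecoverable error, as in the text. Because 1 / (sector error rate) is tiny, the sketch evaluates the power through `math.log1p` and `math.exp`; that is a numerical convenience chosen here, algebraically the same as the PDISK formula. The sample values (BER of 10^14 bits, 500 GB drives, 8 drives) are assumptions.

```python
import math

SECTOR_BYTES = 512


def p_disk(ber_bits: float, capacity_bytes: float) -> float:
    """Probability of reading every sector on one disk without error."""
    sectors_per_error = ber_bits / (SECTOR_BYTES * 8)   # Sector Error Rate
    sectors_on_disk = capacity_bytes / SECTOR_BYTES     # C / 512
    # (1 - 1/SER) ** sectors, computed via logs to limit rounding error.
    return math.exp(sectors_on_disk * math.log1p(-1.0 / sectors_per_error))


def p_array_fail(ber_bits: float, capacity_bytes: float, disks: int) -> float:
    """PARRAY: probability that at least one sector on `disks` drives is unreadable."""
    return 1.0 - p_disk(ber_bits, capacity_bytes) ** disks


# Illustrative inputs only: BER of 1e14 bits, 500 GB drives, 8 drives.
print(f"PDISK  = {p_disk(1e14, 500e9):.4f}")
print(f"PARRAY = {p_array_fail(1e14, 500e9, 8):.4f}")
```

With these sample inputs a full read of one drive covers about 4% of the bits expected between errors, so PDISK comes out close to 0.96.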
From this the MTTDL due to a disk failure in a RAID-5 and a bit error during the rebuild can be calculated.
- MTTDL_BER_R5 = MTBF / [ D * (1 - PDISK^(D-1)) ]
Likewise, the MTTDL due to two disk failures and a bit error during rebuild can be calculated for a RAID-6.
- MTTDL_BER_R6 = [ MTBF * (MTBF/10) ] / [ D * (D-1) * (1 - PDISK^(D-2)) * MTTR ]
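A minimal Python sketch of the two bit-error equations, taking PDISK as an input so it can come from any source. The function names and the `factor` argument (the one-tenth derating of the second drive) are assumptions made for this example, MTBF and MTTR are assumed to be in hours, and the sample values are illustrative.

```python
def mttdl_ber_r5(mtbf: float, d: int, p_disk: float) -> float:
    """RAID-5: one disk failure, then a read error on the D-1 survivors."""
    return mtbf / (d * (1.0 - p_disk ** (d - 1)))


def mttdl_ber_r6(mtbf: float, d: int, mttr: float, p_disk: float,
                 factor: float = 10.0) -> float:
    """RAID-6: two disk failures, then a read error on the D-2 survivors."""
    return (mtbf * (mtbf / factor)) / (
        d * (d - 1) * (1.0 - p_disk ** (d - 2)) * mttr)


# Illustrative inputs only: 1,000,000-hour MTBF, 8 drives, 10-hour MTTR,
# and a 96% chance of reading one whole drive cleanly.
print(f"MTTDL_BER_R5 = {mttdl_ber_r5(1e6, 8, 0.96):.3g} hours")
print(f"MTTDL_BER_R6 = {mttdl_ber_r6(1e6, 8, 10.0, 0.96):.3g} hours")
```

Note that both functions divide by zero if `p_disk` is exactly 1.0, which corresponds to drives that never return a read error.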
Lastly, the MTTDL due to either disk failure or bit errors can be calculated by adding the two failure rates, that is, by taking the reciprocal of the sum of the reciprocals of MTTDL_DF and MTTDL_BER (half their harmonic mean).
- MTTDL = 1 / [ (1/MTTDL_DF) + (1/MTTDL_BER) ]
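The combination step is short enough to sketch in a few lines of Python. The function name is an assumption, and it accepts any number of MTTDL terms even though the equation above uses two; the sample values are illustrative only.

```python
def combined_mttdl(*mttdls: float) -> float:
    """MTTDL = 1 / [ (1/MTTDL_DF) + (1/MTTDL_BER) + ... ]."""
    return 1.0 / sum(1.0 / m for m in mttdls)


# Illustrative inputs only: a large disk-failure MTTDL and a much smaller
# bit-error MTTDL, both in hours.
mttdl_df, mttdl_ber = 1.0e9, 5.0e5
print(f"Combined MTTDL = {combined_mttdl(mttdl_df, mttdl_ber):.4g} hours")
```

The combined value is always below the smallest input, so when one term is far smaller than the other, as in this sample, it dominates the result.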
This MTTDL value can be used to compare the reliability of arrays with different drive types, drive counts, and RAID types.
In conclusion, it is important to note that the reliability of a system is not determined solely by the reliability of the disk subsystem. In many cases, it is the operating system that will cause frequent data loss. Also, the MTBF of other components, such as fans and power supplies, especially if not replaced promptly, can contribute to failure at a much higher rate than disk drives. However, for this comparison, it should be assumed that the MTBF of these other components is the same for both drive types and can therefore be left out of these calculations.
In my next post I will show how the MTBF and BER of SATA and SCSI/SAS drives vary and how those parameters affect the reliability of RAID-5 and RAID-6 arrays.
TT
November 2nd, 2005 at 3:19 am
Tom, great analysis!
First, it seems the storage industry uses MTTR (Mean Time to Repair) and MTBF (Mean Time Between Failures) in a non-standard way compared to ANSI standard definitions. But, no matter, the intent is similar.
Anyway, besides the array failure and rebuild issues, another of the biggest problems we face is data accessibility. With some manufacturers, you may not access the array while it is rebuilding. With others, system performance while operating in rebuilding mode is almost as bad as not having the array online at all.
How can we facilitate user access to data with *reasonable* performance while the repair (replacement and rebuild) is taking place? With arrays of multiple TBs in size, it can take quite some time. For example, using your formula on a 1 TB array:
MTTR = H + C / (M / 3)
MTTR = 0 + (1,000,000 MB / (10 MB/s / 3))
MTTR = 300000 seconds
MTTR = 83.3 hours
This is a REAL, DAMN LONG TIME !! What am I missing? How will users accept being actually or effectively down through poor performance that long? Can we do anything to give reasonable access during this time?
November 2nd, 2005 at 1:22 pm
Mark, what are the standard ANSI terms? MTTR is definitely non-standard, but I would have thought MTBF was fairly standard. Just curious.
Regarding the data accessibility during rebuild, make sure you send back or destroy any RAID card that won’t let you access your data. But there’s unfortunately not much you can do about the performance hit during rebuild. I suppose that you should stay away from RAID-5. RAID-1 rebuilds shouldn’t be too bad. Also, some RAID cards allow you to tune the rebuild rate to be less intrusive. But of course your window of vulnerability (MTTR) would be greater.
Regarding the rebuild time, your media rate is a little dated. Have you got an array of 5.25″ floppy drives?
The number should be closer to 80-100MB/s for the modern, inexpensive drives available today. So the rebuild time should be around 8-10 hours.
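The rebuild-time arithmetic in this exchange is easy to check with a few lines of Python. This is only a sketch of the C / (M / 3) term with H = 0; the function name is made up, and the 1 TB capacity and the three media rates are the figures quoted in the comments above.

```python
def rebuild_hours(capacity_mb: float, media_rate_mb_s: float) -> float:
    """Rebuild time C / (M / 3) in hours, with H assumed to be 0."""
    return capacity_mb / (media_rate_mb_s / 3.0) / 3600.0


# 1 TB at the 10 MB/s from the question and the 80-100 MB/s from the reply.
for rate in (10, 80, 100):
    print(f"{rate:>3} MB/s -> {rebuild_hours(1_000_000, rate):.1f} hours")
```

This gives roughly 83 hours at 10 MB/s, and about 10.4 and 8.3 hours at 80 and 100 MB/s.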
Thanks for your posts, Mark. These are very good comments and questions.
TT
November 3rd, 2005 at 1:06 pm
From my perspective, RAID levels affect AVAILABILITY of the array, and NOT reliability. Hard drives and enclosure reliability are pretty much fixed; RAID levels do not improve reliability.
November 3rd, 2005 at 2:09 pm
Isn’t a measure of reliability the ability to keep something available?
But I see your point. I suppose a RAID-5 with a failed drive is still available even though it isn’t very reliable.
TT
November 5th, 2005 at 7:29 am
The longer rebuild times for RAID-5 and RAID-6 arrays are another argument in favor of using RAID-1 or RAID-10 arrays, where a rebuild requires just a simple copy. If performance is more critical than rack space, slot space, or power consumption, then RAID-10 arrays are a better customer fit, as the cost differential for the extra disk drives is minimal.
November 5th, 2005 at 9:22 am
Hmm. I wonder what the rebuild times are. Let’s break it down.
The number of writes is the same on RAID-1 and RAID-5.
The number of reads on RAID-1 is one, and the number of reads on RAID-5 is the number of drives minus one. But all the RAID-5 reads are done in parallel. The drives are doing read-ahead, so the data should easily stream at media rate, assuming that the controller can keep up.
The big difference in RAID-1 and RAID-5 rebuild is obviously the XOR, and is therefore highly dependent on memory speeds.
We’ll go off and do some real measurements on RAID-1 and RAID-5 rebuilds and post the results here.
Good comment.
TT
July 31st, 2006 at 12:25 pm
I liked your summary of the RAID 5 and RAID 6 MTTDL. It was very helpful for me to use your MTTDL_DF and MTTDL_BER portions. I am now looking for similar equations for RAID 10. Can you help?
Thanks,
Mike
August 7th, 2006 at 7:00 am
Hi, Mike. Glad I could help. Regarding RAID-10, here’s how you can create the equations:
First, remember that a RAID-10 is a stripe of RAID-1 components. So start by calculating the MTBF of a RAID-1. If x is the MTBF of a single drive, then the MTBF of a RAID-1 is proportional to x^2; using the RAID-5 equation from the post with D = 2, it is x^2 / (2 * MTTR). Just think of these RAID-1 components as highly-reliable disks.
Next, the MTBF of a RAID-0 is simply the MTBF of the individual components divided by the number of components, n.
So the MTBF of a RAID-10 built from n RAID-1 pairs is x^2 / (2 * n * MTTR). It’s that simple.
The rebuild time of a RAID-1, or a RAID-10, is simply the amount of time to copy one drive to another. There should be no complications due to missed rotations or XOR calculations, so the rebuild rate is simply the media transfer rate. And don’t get distracted by the question of what would happen if multiple RAID-1’s were rebuilding because they’re each rebuilding in parallel.
Lastly, the total BER is calculated by using the BER of each drive being read, and in the case of a RAID-1 this would be simply the BER of one drive.
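The RAID-10 recipe above can be sketched in Python. One assumption is made explicit here: the RAID-1 component is modelled with the post's RAID-5 equation at D = 2, which adds the 2 * MTTR term that the shorthand x^2 leaves out. The function names, the units (hours, MB, MB/s) and the sample values are all illustrative.

```python
def mttdl_df_raid1(mtbf: float, mttr: float) -> float:
    """One mirrored pair: the RAID-5 equation with D = 2."""
    return mtbf ** 2 / (2 * 1 * mttr)


def mttdl_df_raid10(mtbf: float, pairs: int, mttr: float) -> float:
    """A stripe of `pairs` mirrored pairs: RAID-1 result divided by n."""
    return mttdl_df_raid1(mtbf, mttr) / pairs


def raid1_rebuild_hours(capacity_mb: float, media_rate_mb_s: float) -> float:
    """Straight copy at the media rate, with no divide-by-3 penalty."""
    return capacity_mb / media_rate_mb_s / 3600.0


# Illustrative inputs only: 1 TB drives at 100 MB/s, 1,000,000-hour MTBF,
# 4 mirrored pairs (8 drives), hotspare present so H = 0.
mttr = raid1_rebuild_hours(1_000_000, 100.0)
print(f"RAID-1 rebuild: {mttr:.2f} hours")
print(f"RAID-10 MTTDL_DF: {mttdl_df_raid10(1e6, 4, mttr):.3g} hours")
```

For the bit-error term, only one drive is read during a mirror rebuild, so the single-drive PDISK would take the place of PDISK^(D-1).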
I hope that helps. You should be able to modify the spreadsheet that I referenced in the comments section of this post.
Let me know if you have problems.
TT
August 7th, 2006 at 11:31 am
Tom,
Many thanks! Used your equation as a basis for my RAID 10 model that also includes MTTR and BER. Much appreciated. You guys are great.
Thanks again,
Mike
October 23rd, 2006 at 9:28 am
> This results in the media rebuild rate effectively being cut to a third of the actual rate.
I don’t understand this part. The drive being rebuilt only has to do sequential writes AFAIK, resulting in a ‘full’ media rate.
October 24th, 2006 at 6:19 am
Most of the drives need to be read, the data is XOR’ed (causing a missed rotation) and then the data is written. That results in three rotations. As I think I said earlier (or maybe in the whitepaper), the reads, XORs and writes “could” be overlapped to reduce this 3X drop closer to 1X. Typical RAID controllers are somewhere in between. BTW, a drive’s read and write cache helps get this closer to 2X.
November 6th, 2006 at 1:18 pm
Tom,
I have another question. In your PDISK calculation you use sector error rate. Why don’t you merely use BER divided by capacity? In other words, why do you suggest using sectors in this equation?
Many thanks,
Mike
November 6th, 2006 at 1:39 pm
Mike, that’s true. There’s no need to convert to sectors since the end result is a whole disk failure rate. I guess my mind just works at the sector level.
BTW, it would be BER “multiplied” by capacity, right? The larger the capacity the larger the chance of getting a bit error somewhere in the disk.
TT
November 13th, 2006 at 11:42 am
Tom,
Could you inform me what the MTBF specification is on a typical RAID card (controller)? I am attempting to roll-up a system level MTBF.
Thanks,
Mike
November 13th, 2006 at 3:04 pm
Wow. I haven’t considered that question for years. I know our quality guys used to calculate MTBF by some complicated formula involving the number of chips, number of pins, and other odd-ball factoids. And until you asked the question, I guess I had assumed that we no longer calculated MTBF for controllers.
So I found one of our U320 SCSI HBAs (non-RAID) and it had a calculated MTBF of over 1.5MHours. I admit that that seems awfully low - it’s almost the same as an enterprise drive that has wildly spinning parts. How could a simple controller have the same MTBF? Hmm. Different math?
I struck out trying to find someone to give me an MTBF for our RAID cards. They have a few more chips than a simple HBA, so if 1.5MH is right for an HBA then a RAID card is probably close to 1.0MH.
But, again, that just doesn’t seem right. I suppose I could keep digging but I don’t trust these numbers.
I don’t suppose you found any MTBF for other components, did you? Are they in this same ballpark?
TT
November 14th, 2006 at 9:30 am
The answers to the question of controller MTBF are quite amazing. Google found http://www.intel.com/support/motherboards/server/sb/cs-020405.htm, which shows MTBF for various Intel controllers at 40 degrees Celsius to run from a low 0.284 Mh (million hours) to 2.2 Mh. Both the low values and the difference of almost an order of magnitude are surprising. These numbers suggest that in calculating the time to data loss (TTDL) the RAID hardware must be taken into account.
November 15th, 2006 at 7:55 am
Tom,
Thanks for checking. I too am a bit surprised over the low MTBF rating. The component reliability assessment thing probably uses a Bellcore tool that adds up the component FIT ratings. Disk drives certainly have come a long way in terms of reliability and I guess I expected RAID cards to keep pace. I found this additional info. The published MTBF for the new LSI SAS RAID controller is 200,000 hrs. Wow is that low! I did see others anywhere from 1M hrs to 1.75M hrs MTBF. I may go ahead and use 1M hrs in my model for now.
Thanks,
Mike
February 9th, 2007 at 1:13 pm
Just wanted to drop a thank you note to Ernst. Much appreciated.
Thanks,
Mike
October 16th, 2009 at 11:00 pm
Good article. I’ve always been a fan of RAID 1 on my primary work machine and for each of my web servers. It seems to be the sweet spot in terms of reliability and cost benefit.
October 18th, 2009 at 2:33 pm
Coldfusion …
Yes RAID 1 is a very valid RAID level, and with the size of disks available today it can supply a useful system to a large number of organisations. My only concern with this RAID level is performance (which is not terribly good due to only 2 spindles in the system).
However, take a look at our new MAXIQ caching module. In combination with a RAID 1 it can give your webservers a ginormous kick in the pants when it comes to performance without compromising on the integrity (safety) of your data.
Ciao
Neil