RAID-5: Don’t be a hater
Posted in Storage Interconnects & RAID, Advisor - Tom by Tom Treadway
Network World posted an article last week called The High Price of RAID-5, written by Mike Karp.
First, I don’t know Mike. I’m sure he’s a great guy, smart as a tack, experienced in the storage field, and kind to animals. He wrote a good article but I do want to address some of his comments. I’ve reproduced it below in its entirety, with my snarky comments embedded.
- There are many levels of IT hell. Surely, one of the worst of those involves coping with the looming torture of RAID 5.
Two sentences into the article and I’m already wincing. (I have to admit I also make outlandish claims just to get people’s attention, so I won’t get too righteous about this one.)
To set up a RAID-5, the IT admin should point to a handful of drives, or sections of drives, click “Make a RAID-5”, and then forget about it. Someday a light may come on, an alarm may sound, an e-mail may be sent, or a log entry may be made, indicating that a drive has failed. The IT admin gets up from his comfy chair, replaces the bad drive, and then goes back to playing Solitaire.
That doesn’t sound like torture.
- RAID has been with us for more than 20 years, and during that time has saved the corporate bacon for many an admin. RAID is a splendid thing when it works but when it fails, depending on the circumstances, the result may be anything from an inconvenience to a disaster.
I’d suggest that when anything fails the results are often bad, but I’d certainly agree that RAID-5 is in that category of Things That Just Better Work.
- Most of us are familiar with RAID 0 (data striping) and RAID 1 (mirroring). The first provides enhanced read performance but has no protection against data loss and is awkward to scale due to the need to re-stripe whenever capacity is added;
The re-striping as drives are added should be completely transparent to the user, the OS and the applications. Just like the RAID-5 setup above, the IT admin should just point to one or more new drives and then click “Add to RAID-0”. It should be that easy.
- …the second protects data, but at the expense of reduced write performance and, of course, the necessity of buying twice as much disk capacity to support the mirrors.
- RAID 0 and RAID 1 - sometimes used in the same RAID set (RAID 10) - have typically appealed to managers using less expensive systems.
FYI, RAID-10 is just as expensive as RAID-1. They both require twice the storage of RAID-0 or JBOD. Together with RAID-1, RAID-10 is the most expensive RAID level.
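To put rough numbers on the cost argument, here is a minimal sketch of the usable fraction of raw capacity per RAID level. It assumes equal-sized drives and two-way mirroring; the function name `usable_fraction` is made up for illustration.

```python
def usable_fraction(level: str, drives: int) -> float:
    """Fraction of raw capacity left for data, assuming equal-sized drives."""
    if level in ("RAID-0", "JBOD"):
        return 1.0                      # no redundancy at all
    if level in ("RAID-1", "RAID-10"):
        return 0.5                      # two-way mirror: every block stored twice
    if level == "RAID-5":
        return (drives - 1) / drives    # one drive's worth of parity
    if level == "RAID-6":
        return (drives - 2) / drives    # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    for level in ("RAID-0", "RAID-10", "RAID-5", "RAID-6"):
        print(f"{level:8s} {usable_fraction(level, 8):.3f}")
```

With eight drives, the mirrored levels give up half the raw capacity while RAID-5 gives up one eighth, which is why mirroring is the pricey option per usable byte.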
- The preferred alternative when it comes to first-tier storage with high performance disks has been RAID 5. But we have now arrived at the point where we ought to ask ourselves if RAID 5 continues to be a good choice for first tier data.
- RAID 5 stripes data and distributes parity information (used for error correction) across all drives within the array. This yields most of the read performance advantages of RAID 0, but comes at the expense of slower writes due to the parity calculations that accompany each write operation. Like RAID 0, adding more disks to the array involves parity recalculation and imposes a significant performance penalty.
Short random writes to a RAID-5 are VERY slow, as Mike points out, but long sequential writes to a RAID-5 “should” be pretty fast, even approaching the performance of RAID-0. Folks rarely use RAID-5 for databases with a heavy write component. RAID-10 makes more sense. But RAID-5 is great for video servers or read-mostly storage.
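The difference between the two write patterns can be sketched with plain XOR parity. This is a toy illustration, not any particular controller's implementation; the helper names are made up. A full-stripe (sequential) write has all the data in hand and computes parity without reading anything, while a small (random) write must first read the old data and old parity.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks):
    """Sequential case: the whole stripe is being written, so no reads are needed."""
    return xor_blocks(*data_blocks)


def small_write_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Random case: two reads (old data, old parity) before two writes."""
    return xor_blocks(old_parity, old_data, new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = full_stripe_parity(stripe)

# Overwrite one block and update parity the read-modify-write way.
new_block = b"\xaa\x55"
updated_parity = small_write_parity(stripe[1], new_block, parity)
stripe[1] = new_block

# Both routes must agree on the resulting parity.
assert updated_parity == full_stripe_parity(stripe)
```

The small write costs four disk operations to change one block, which is where the random-write penalty comes from.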
- Beyond that, with RAID 5 we find that one high-risk element is added. When a second drive fails, the consequence is always catastrophic data loss.
Huh? When a second disk fails in a RAID-1 you also get catastrophic data loss. If you want to survive a two-disk failure then RAID-6 is the answer. (I’ve made a ton of posts on this topic and I recommend checking them out.)
- As a result, replacing failed RAID 5 drives immediately is crucial and hot spares are often kept on hand so that rebuilds can begin right away. The expense of owning a spare Fibre Channel or SCSI drive, admittedly costly, is pretty easy to swallow compared to the potential disaster of losing access to high performance, online data.
Yep, agreed. But replacing a drive immediately is just as critical with a RAID-1 or RAID-10. Only RAID-6 will allow a slothful IT admin to wait weeks before replacing a drive.
- Unfortunately, the need to re-stripe the parity imposes essentially the same calculation penalty on a system as would a simple expansion of the array.
First, re-stripe is often defined as reconstructing the parity on all the drives – also known by the clumsy name of “Verify with Fix”. A re-build is different. It is the reconstruction of bits on a failed drive, including both parity and data. I assume Mike is referring to a rebuild here. There is little reason to ever do a re-stripe of parity.
Second, re-building a degraded array is NOT the same as an array expansion. They both involve reading data and XOR’ing data, but the rebuild only involves a write to the replaced drive. The array expansion involves writing to all the drives. Reads typically come out of the drive’s cache without loss of rotation, but writes must go straight to disk since write cache is turned off in any serious storage system. Writes will miss rotations, killing performance, tying up the drives for ~10ms at a time, and therefore the time to do an expansion is often several times that of a rebuild.
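A rebuild, in its simplest form, looks something like the sketch below: for each stripe, read every surviving block, XOR them together, and write the result to the replacement drive. This is a toy model with made-up names and in-memory "drives", just to show that the only write goes to the new drive.

```python
def xor_blocks(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild_block(surviving_blocks):
    """Recover the missing block of a stripe from all survivors (data and parity alike)."""
    return xor_blocks(surviving_blocks)


# One stripe across four drives: three data blocks plus parity.
data = [b"RAID", b"five", b"test"]
stripe = data + [xor_blocks(data)]

failed = 1  # pretend the second drive died
survivors = [block for i, block in enumerate(stripe) if i != failed]

# N-1 reads, one write (to the replacement drive).
assert rebuild_block(survivors) == stripe[failed]
```

An expansion, by contrast, has to rewrite blocks on every drive because the stripe layout itself changes.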
- While all that calculation is taking place, system performance degrades markedly, frequently degrading things so much that I/O performance during the rebuild renders the system unusable. If you’ve never been through a RAID 5-rebuild process while users continue to access the system, you can capture the flavor of it by using a demanding application at the same time as your system is running a backup.
I don’t disagree with this, but it’s not quite as bad as it sounds. First, if the disks were 100% consumed by OS and application IO before the failure, then the rebuild will definitely be felt by the user. But if there is enough dead time between IO bursts, such as at night, then a good RAID controller will try to schedule the rebuild between IOs to reduce any disruption to the user IO.
But with all that said, a RAID-5 rebuild is often noticeable. This is just one of the factors to be taken into account when selecting a RAID controller and RAID level. I recommend that you measure the performance level during degraded and rebuilding modes to make sure you’ve got enough horsepower to allow your business to continue running.
- Have a good book handy.
I don’t recommend that you sit next to the array and watch it being rebuilt. This isn’t laundry. Once the failed drive is replaced your job is done. Check back later to make sure the rebuild completed, but never, ever just sit there and stare at your RAID controller. This is viewed as a sign of hostility.
- Performance penalties during rebuilds have always been part of the price of using RAID 5, but we may be getting to the point where that price has gotten to be too high. With small disks, performance during rebuilds was certainly exasperating, but at least it was a relatively short-lived problem. As systems get larger however, the situation becomes radically different, and not in a good way.
The basic idea, I suppose, is that a user might have been able to stomach poor performance for a few hours with small drives, but not necessarily a full day with large drives. I can agree with that.
But you may want to just tune your RAID card to do the rebuild a little slower and affect performance less. The downside of this is that the array is in degraded mode longer, where a second drive failure before the rebuild is complete will cause you to lose data. But the chance of a second drive failure (due to MTBF) in this several-hour to several-day window is extremely low. If it’s an environmental failure, like high temp, then you’re probably doomed anyway, regardless of the size of the rebuild window.
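For a feel of how low, here is a back-of-the-envelope sketch. It assumes independent drive failures at a constant rate of 1/MTBF, and the numbers plugged in (four surviving drives, a 1,000,000 hour MTBF) are illustrative assumptions, not figures from any datasheet.

```python
import math


def second_failure_probability(surviving_drives: int,
                               window_hours: float,
                               mtbf_hours: float) -> float:
    """P(at least one surviving drive fails during the rebuild window),
    assuming independent failures at constant rate 1/MTBF."""
    return 1.0 - math.exp(-surviving_drives * window_hours / mtbf_hours)


if __name__ == "__main__":
    for hours in (4, 24, 72):
        p = second_failure_probability(4, hours, 1_000_000)
        print(f"{hours:3d} h rebuild window: {p:.4%}")
```

Stretching the window from hours to days raises the risk roughly in proportion, but under these assumptions it stays a small fraction of a percent. This model says nothing about correlated failures such as heat or a bad batch of drives.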
- Next time, I’ll show you why.
Thanks, Mike. If you read this I hope you take it as constructive criticism. There was certainly no malice intended.
TT
August 8th, 2006 at 8:28 am
Tom,
RAID5 and big drives is a hot topic, but probably not for the reason brought up by Mike Karp.
A point that few people appreciate is that the ever larger drives maintain or improve their MTBF over previous, smaller capacity models. The result is that for a given SAN you need fewer drives than previously, and since fewer drives means fewer failures, there are fewer rebuilds per year. So while a rebuild takes longer for larger drives, the number of rebuild events decreases and the net result is that the number of hours spent rebuilding is independent of the size of your drives. Of course, larger SANs have more rebuilds, but then it is a diminishing fraction of all LUNs that get the performance hit. With RAID50, the slowdown is diluted further.
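The cancellation can be checked with a small sketch. It assumes MTBF does not depend on capacity and that rebuild time scales linearly with capacity (a constant rebuild rate); all parameter names and values are illustrative.

```python
HOURS_PER_YEAR = 8760


def rebuild_hours_per_year(total_tb: float,
                           drive_tb: float,
                           mtbf_hours: float,
                           rebuild_hours_per_tb: float) -> float:
    """Expected hours per year spent rebuilding, for a fixed total capacity."""
    drives = total_tb / drive_tb                       # bigger drives, fewer of them
    failures_per_year = drives * HOURS_PER_YEAR / mtbf_hours
    hours_per_rebuild = drive_tb * rebuild_hours_per_tb  # bigger drives, longer rebuilds
    return failures_per_year * hours_per_rebuild


if __name__ == "__main__":
    for drive_tb in (0.25, 0.5, 1.0):
        hours = rebuild_hours_per_year(100, drive_tb, 1_000_000, 10)
        print(f"{drive_tb:4.2f} TB drives: {hours:.2f} rebuild hours/year")
```

The drive size drops out of the product, so the result depends only on total capacity, MTBF, and rebuild rate. If rebuild throughput fails to keep pace with capacity, the linear-time assumption breaks and larger drives come out worse.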
Much more worrying is the increased chance of hitting an unrecoverable read error during the rebuild. For lower-cost drives with a 10^-14 bit error rate, that chance becomes roughly 32% when you have five 1 TB disks in a RAID5 set. (You have to read 4 disks, 4 TB = 4×10^12 bytes or 32×10^12 bits.) Basically, every three rebuilds you would lose one sector of data. Even with ‘enterprise class’ drives with a BER of 10^-15, the chance to lose a sector is 3.2%, way too large for comfort.
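The arithmetic above can be reproduced in a few lines. This sketch assumes four surviving drives of 10^12 bytes each and treats bit errors as independent, which is a simplification. Note that bits × BER is strictly the expected number of errors; the chance of at least one error is slightly lower (about 27% rather than 32% for the 10^-14 case), which does not change the conclusion.

```python
import math


def ure_during_rebuild(drives_read: int, drive_bytes: float, ber: float):
    """Return (expected unrecoverable read errors, P(at least one error))
    when reading every surviving drive end to end, assuming independent bit errors."""
    bits = drives_read * drive_bytes * 8
    expected_errors = bits * ber
    # 1 - (1 - ber)**bits, computed in a numerically stable way
    p_at_least_one = -math.expm1(bits * math.log1p(-ber))
    return expected_errors, p_at_least_one


if __name__ == "__main__":
    for label, ber in (("lower-cost", 1e-14), ("enterprise", 1e-15)):
        expected, p = ure_during_rebuild(4, 1e12, ber)
        print(f"{label}: expected errors {expected:.3f}, P(at least one) {p:.1%}")
```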
Maybe actual drives are much more reliable than what the datasheets say - or are we indeed living dangerously?
RAID6 seems to be the answer.
August 8th, 2006 at 8:43 am
Ernst, excellent points.
Assuming that the user wants to have a specific final, usable capacity, and since drive MTBF remains constant and independent of drive capacity, then larger drives will allow the array to have fewer drives and therefore a higher MTTDL. Great point. Hopefully this will reduce the number of rebuilds.
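As a rough illustration, the classic RAID-5 approximation MTTDL ≈ MTBF² / (N × (N−1) × MTTR) shows the effect. It assumes independent, exponentially distributed drive failures and ignores read errors entirely. The drive counts, the 1,000,000 hour MTBF, and the rebuild times below are made-up values for comparison only.

```python
def raid5_mttdl_hours(drives: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate mean time to data loss for a single RAID-5 set:
    a second drive must fail while the first is still being rebuilt."""
    return mtbf_hours ** 2 / (drives * (drives - 1) * mttr_hours)


# Same 2 TB usable capacity, built two ways (illustrative numbers):
many_small = raid5_mttdl_hours(drives=9, mtbf_hours=1_000_000, mttr_hours=6)   # 8+1 x 250 GB
few_large = raid5_mttdl_hours(drives=3, mtbf_hours=1_000_000, mttr_hours=24)   # 2+1 x 1 TB

if __name__ == "__main__":
    print(f"9 small drives: {many_small:.3e} hours")
    print(f"3 large drives: {few_large:.3e} hours")
```

Even with a rebuild that takes four times as long, the three-drive set comes out ahead in this model because N × (N−1) shrinks faster than MTTR grows.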
And as you also point out, most of the problem is due to the BER, a topic that I’ve excessively blogged on. Perhaps you’ve seen those other posts here, here, and here. And as you summarize, RAID-6 is the answer for BERs.
Thanks for reading.
August 9th, 2006 at 1:26 pm
Tom,
With regard to your post “RAID-5: Don’t be a hater” I agree.
I will post more comments separately.
I would like to have continued the dialogue we started on DrunkenData. It looks like that topic has been closed to new posts. I tried posting this several times without success.
Concerning “Content Typing of Information in Cache”
Thanks for the excellent reply. That’s 100% more than I got from the NetApp guys. Or any of the other major Storage vendors I asked. Although, I have to admit, some of them did look very wise while not replying.
With regard to “This is probably too much to talk about in a comment section of a blog. Maybe we should stick to making fun of the French.”.
I’m not sure what to do about an ongoing, productive dialogue. This is Jon’s Blog and I’m sure he will take charge of it when we are out of hand. Many Bloggers seem to feel comments should be short. I’m sure that long comments are not read much. On the other hand I sort of like the idea of a Blook
(http://en.wikipedia.org/wiki/Blook)
From Jeff Jarvis at the BuzzMachine Blog.
[Begin long URL]
http://www.buzzmachine.com/index.php/2006/08/04/exploding-books-i-everybodys-an-author/
[End long URL]
“Next up is the blog slurper (with other slurpers after that). It will take your blog, grab the content, and let you edit it, publishing it as is (see Tony Pierce’s blook). I can also see using the blog as a writing and publishing tool for the express purpose of ending up with a book (something I’m thinking about doing with a book on books). And of course, see Tom Evslin’s Hackoff.com, written as both a blog and a book.”
http://en.wikipedia.org/wiki/Tony_Pierce
http://www.hackoff.com/
I believe Storage could benefit from some Information vision sharing. Information is all that matters. Storage is just the “Enabling Unit of Technology”.
Think of a car without a person to drive it. It could be a Yugo or a Mercedes. Without a person “Unit of Information” the car is just dead metal.
August 11th, 2006 at 8:43 pm
When I read Mike Karp’s RAID-5 article I had many of the same thoughts you did. Frequently, in the past, I recorded my immediate thoughts in an email and fired them off to Mike, hoping to correct what was, to me, the error of his ways. Over the years I have mellowed. Partly because Mike cheerfully ignored me.
Mike introduced me to the finest Star Trek site on the Internet. It is now a “pay only” site. He introduced me to Deep Space Nine and the Ferengi. I use the phrase “Ferengi acquire” frequently now.
Look quick! These sites won’t be around much longer:
http://www.geocities.com/Area51/Nebula/4156/index2.html
[Begin long URL]
http://www.geocities.com/Area51/Nebula/4156/infirmary/xeno/ferengi.html
[End long URL]
It is my conclusion Mike’s RAID-5 article was a “devil’s advocacy” for RAID-6. I don’t know why. RAID-6 is maybe an improvement over RAID-5?
We never used RAID-5 for the automatic rebuild reason. The boxes we ran it on were so slow they had to be taken offline to have any hope of ever rebuilding.
We used it because RAID-5 will not release the host until the Information to be written is committed by cache. In theory this means you should never lose Information on any part of the host, connection, Storage going down.
I’ve been a big fan of RAID-10 (1 + 0) ever since we discovered the error of Oracle’s ways. Oracle started all this RAID-01 (0 + 1) for performance reasons. Fixing the Storage was easier and quicker than fixing the Oracle database. RAID-01 did start us thinking about something that was almost as fast as RAID-01 but almost as safe as RAID-5. That turned out to be RAID-10.
If you have all databases then run RAID-01. If you have a lot of “unstructured” data in your ad hoc Information space run RAID-10.
It is not the RAID level that is important. It is the Strategy you employ for Information High Availability and Information Integrity.
These concepts apply to the Managed Units of Information. They do not apply to Units of Information.
If you have the right Strategy you could run JBOD or a “Roll your Own” array of inexpensive, independent disks.
I will make a separate post about the failure statistics. These only apply if you are managing at the “spindle” level. If you measure your Storage by the acre, as the Deutsche Telekom guy once bragged, you can’t manage at the spindle level.
August 14th, 2006 at 5:19 pm
RAID, RAID go away! Come again another day!
Or is it RAIN! RAIN!?
Hu Yoshida of HDS is talking about SAIN?
http://blogs.hds.com/hu/2006/06/a_saain_approac.html
The SAIN definition in the post is, “SAIN stands for SAN attached Array of Independent Nodes”.
I started doing something similar to this behind hybrid NAS/SAN “Roll your own” Storage. Best of both worlds.
When you run out of NAS bandwidth you can set another NAS head or NAS head cluster. Depends on your needs and the size of the IT wallet.
When you are out of SAN you are out of SAN and a quick fix. Unless you can implement a SAN cluster. That’s what we did. Except clusters by definition have some common software for communication and task sharing. Or at least shared “Rules to Live by!”. So I guess we were doing SAIN before we knew what to call it.
People always looked funny at “SAN cluster”. They always wanted to know what cluster software it ran. No imagination. Isn’t the failover in the SAN? No SPOFs, redundant everything!
I guess, in theory, if the disk capacity gets large enough without changing the drive mechanism, and we sure as hell are not going to change the drive electronics EVER!!!, the failure rate approaches zero so closely that for all intents and purposes it is zero.
That is exactly what Storage Virtualization does!
So why is everybody bad-mouthing it?
I started out to write this about my real RAID love. RAID-5 at the hardware level and software RAID-0 everywhere else.
We ran this on a big SGI Seismic processing system. It was fast.
August 17th, 2006 at 5:50 am
Good article Tom. Enjoyed it greatly.