Picking the right stripe size
Posted in Storage Interconnects & RAID, Advisor - Tom Treadway by Tom Treadway
Question to the Storage Advisors, from Anonymous: Been reading your blogs most of the day (just stumbled on this site today)…WOW! Tons of excellent information/suggestions for best practice! I’ll be adding this site to my daily tech read file… One quick question: Is there a rule of thumb concerning RAID-5 Block Stripe Size to file size? Is there any direct performance correlation between Block Stripe Size and NTFS cluster size? Thanks!
Thanks, man. I feel the love.
Regarding RAID-5 stripe size, in general the larger the better. In rare cases a smaller stripe size might help, but if you make it too small then performance can plummet. And of course the answer depends on the pattern of access – large or small, sequential or random, read or write, OS queue depth, …
Let’s look at an example of 4KB random reads to an 8-drive RAID-5 with an OS queue depth of 64 concurrent commands. And let’s say that each drive was capable of 100 IOs Per Second (IOPS).
Now let’s say that the stripe size on each drive was “very large” – large enough that each host command fell entirely within a stripe. That would mean that each drive would be servicing host requests at 100 IOPS, for a total of 800 host IOPS for the array. That’s as good as it gets.
But let’s say that the stripe size was 8KB. And let’s say that the 4KB requests from the host are randomly aligned, meaning that they could start on any block boundary. The following picture shows two adjacent 8KB stripes (therefore two adjacent drives) with a 4KB host request placed in all the possible random positions.

Notice that the host command can be placed at 16 different offsets, but 7 of those offsets cause the command to fall on two drives. If every host IO tied up two drives then the IOPS rate would drop from 800 to 400. But since on average only 7 of every 16 host IOs tie up two drives while the other 9 tie up just one drive, the resultant rate would be 625 IOPS. (Simple math left as an exercise for the reader.)
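For anyone who would rather not do the exercise by hand, here is a minimal Python sketch of that arithmetic. It assumes 512-byte blocks (which is what gives 16 offsets in an 8KB stripe), and the names are made up for illustration. The 625 figure is the weighted average of the 800 and 400 IOPS rates. A stricter model, which divides the array's 800 drive IOs per second by the average number of drives each host command ties up, lands somewhat lower, so treat both numbers as estimates.

```python
BLOCK = 512            # assumed block size in bytes
STRIPE = 8 * 1024      # per-drive stripe size from the example
REQUEST = 4 * 1024     # host request size
DRIVES = 8
DRIVE_IOPS = 100

def crossing_offsets(stripe, request, block=BLOCK):
    """Count start offsets within one stripe where the request spills onto the next drive."""
    offsets = stripe // block
    crossing = sum(1 for start in range(offsets)
                   if start * block + request > stripe)
    return crossing, offsets

crossing, total = crossing_offsets(STRIPE, REQUEST)
single = total - crossing

best = DRIVES * DRIVE_IOPS   # every command stays on one drive
worst = best / 2             # every command ties up two drives

# The post's estimate: weight the two rates by how often each case occurs.
weighted_average = (single * best + crossing * worst) / total

# Alternative estimate: total drive IOs per second divided by drives used per command.
drives_per_io = (single * 1 + crossing * 2) / total
drive_time_model = best / drives_per_io

print(crossing, total)               # 7 16
print(weighted_average)              # 625.0
print(round(drive_time_model, 1))    # 556.5
```

Either way the conclusion is the same: an 8KB stripe costs a big chunk of the 800 IOPS the array could otherwise deliver.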
At this point someone might wonder why, in the two drive case, we saw no benefit from the two drives loading the data in parallel. Good question. It’s true that two drives will be able to transfer that data twice as fast as one drive. But remember that this is a random IO access pattern. That means that both drives have to seek and rotate to get to that data. That access time is about 15ms on average, depending on the drive. In contrast, it will take significantly less than 1ms to actually transfer the data. The time saved in transferring the data from two drives is lost in the noise of how long it takes to get to that data.
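To put rough numbers on "lost in the noise", here is a small Python sketch. The 15ms access time comes from the paragraph above; the 60MB/s media transfer rate is purely an assumption for illustration, and the function name is hypothetical.

```python
ACCESS_MS = 15.0           # average seek + rotate time quoted above
TRANSFER_MB_PER_S = 60.0   # assumed sustained media rate; not a figure from the post
REQUEST_KB = 4

def service_time_ms(request_kb, drives):
    """Access time plus the time for each drive to transfer its share of the request."""
    per_drive_kb = request_kb / drives
    transfer_ms = per_drive_kb / 1024 / TRANSFER_MB_PER_S * 1000
    return ACCESS_MS + transfer_ms

one = service_time_ms(REQUEST_KB, 1)
two = service_time_ms(REQUEST_KB, 2)
print(f"one drive: {one:.3f} ms, two drives: {two:.3f} ms")
print(f"saved: {100 * (one - two) / one:.2f}% of the service time")
```

With any plausible transfer rate the saving is a small fraction of a percent, while the second drive is tied up for a full access time.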
So back to the comment about a “very large” stripe size: What I meant was that the stripe was large enough to cause very few of the IOs to fall across two drives. For example, if the stripe size was 256KB, then only 7 out of 512 host commands would degrade performance – a negligible amount.
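The same counting can be repeated for other stripe sizes. This is a minimal sketch, again assuming 512-byte blocks and randomly aligned 4KB requests; the function name is hypothetical.

```python
BLOCK = 512  # assumed block size in bytes

def crossing_fraction(stripe_kb, request_kb=4):
    """Return (crossing offsets, total offsets) for a request placed at any block offset."""
    offsets = stripe_kb * 1024 // BLOCK
    request_blocks = request_kb * 1024 // BLOCK
    # A request of N blocks crosses the boundary from N - 1 of the start offsets,
    # unless the stripe is so small that every offset crosses.
    crossing = min(request_blocks - 1, offsets)
    return crossing, offsets

for stripe_kb in (8, 16, 64, 256, 1024):
    crossing, offsets = crossing_fraction(stripe_kb)
    print(f"{stripe_kb:>5}KB stripe: {crossing} of {offsets} offsets cross "
          f"({100 * crossing / offsets:.1f}%)")
```

The number of bad offsets stays at 7 while the number of good ones keeps growing, which is why bigger stripes keep helping this workload, just by less and less.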
Now let’s look at writes. If the writes are short and random, then pretty much everything said about reads will apply to writes – with the exception that the IO rate is cut to ¼, or 200 IOPS. I won’t get into the details in this post, but it has to do with a RAID-5 technique called Read/Modify/Write where each host command is converted to two reads and two writes, or four IOs total. Trust me; it’s a RAID-5 thing. But the concept of trying to avoid having host commands cross drive boundaries still applies.
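For the curious, the four IOs can be sketched in a few lines of Python. This is a toy model, not any controller's actual code: the chunks are tiny byte strings, the "drives" are a Python list, and the function names are hypothetical. It relies only on the standard RAID-5 property that parity is the XOR of the data chunks in a stripe.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(chunks, parity, index, new_data):
    """Update one data chunk and the parity chunk; return the new parity and the IOs issued."""
    ios = []
    old_data = chunks[index]
    ios.append("read old data")
    old_parity = parity
    ios.append("read old parity")
    # Remove the old data's contribution from parity, then add the new data's.
    new_parity = xor(xor(old_parity, old_data), new_data)
    chunks[index] = new_data
    ios.append("write new data")
    ios.append("write new parity")
    return new_parity, ios

# Hypothetical 4-drive stripe: three tiny data chunks plus parity.
chunks = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = reduce(xor, chunks)

parity, ios = read_modify_write(chunks, parity, 1, b"\xaa\xbb")
print(ios)                            # four drive IOs for one short host write
print(parity == reduce(xor, chunks))  # parity still matches the data
```

Note that the other data drives in the stripe are never touched, which is the whole point of doing it this way for short writes.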
But what if those writes were longer and sequential? In that case, the RMW technique would be replaced with a Full Stripe Write technique, and the extra IOs would be eliminated. (Again, just trust me that that is a good thing.) And how do you make long, sequential writes? An obvious way is to have the host write long, sequential commands. An alternative, which is common with RAID controllers, is to use the controller cache and write-back, or lazy writes, to permit short IOs to hopefully coalesce into longer IOs.
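Here is a rough Python sketch of why the full stripe write is cheaper, using the 8-drive array from the read example. It is a simplified model with hypothetical function names: parity is just the XOR of the new data chunks, and the IO counts ignore caching effects and anything controller-specific.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_write(new_chunks):
    """Parity comes straight from the new data in cache, so nothing is read from disk."""
    parity = reduce(xor, new_chunks)
    return list(new_chunks) + [parity]

def drive_ios_read_modify_write(data_chunks):
    return 4 * data_chunks   # two reads and two writes per short write

def drive_ios_full_stripe(data_chunks):
    return data_chunks + 1   # one write per data chunk, plus one parity write

DATA_DRIVES = 7  # 8-drive RAID-5: seven data chunks and one parity chunk per stripe
writes = full_stripe_write([bytes([i]) * 4 for i in range(DATA_DRIVES)])

print(len(writes))                                # 8 chunks written, 0 read
print(drive_ios_read_modify_write(DATA_DRIVES))   # 28 drive IOs the slow way
print(drive_ios_full_stripe(DATA_DRIVES))         # 8 drive IOs the fast way
```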
So if you have a RAID cache then you should use a stripe size that is no larger than the cache’s typical write-burst size. And how do you know what the burst size is? You have absolutely no way of knowing. All you can do is hope that your RAID card is tuned to have the cache and stripe size coordinated. That “should” be a good assumption.
I suppose after all that long-winded prose I should get back to your original questions.
Is there a rule of thumb concerning RAID-5 Block Stripe Size to file size? Is there any direct performance correlation between Block Stripe Size and NTFS cluster size?
Nope. Sorry, I guess I could have said that first, but it would have made this post less interesting.
How an OS accesses a file is somewhat unrelated to the file size. For example, a multi-GB database file is still accessed in small, e.g., 4KB, chunks. In this case the NTFS filesystem isn’t even used. And when the NTFS filesystem is used, the NTFS cluster size tends to define the minimum access size but not the maximum.
Anyway, I hope this helps explain how stripe size affects performance. The bottom line is typically that the default stripe size is best, and the default is usually a big number, such as 256KB. If you want to play around with reducing stripe size, make sure you do plenty of real-world performance testing.
Enjoy,
TT
June 19th, 2006 at 6:45 pm
There’s some confusion in the field that stripe size is akin to cluster size, in that “smaller stripe sizes conserve drive space.”
http://64.233.167.104/search?q=cache:WJxrDzSAVdoJ:www.computerpoweruser.com/editorial/article.asp%3Farticle%3Darticles/archive/c0607/31c07/31c07.asp+RAID+%22stripe+size%22+wasted+space&hl=en&gl=us&ct=clnk&cd=1
But based on your diagram, it appears that a RAID controller distributes bits of a file (literally) wherever it has to in a stripe set regardless of cluster size. For example, in your alignment example, you show one “block” of a 4KB file residing on one stripe of a 2nd drive. Does that prevent another file from being written to the stripe in the 2nd drive or no?
Thank you.
June 28th, 2006 at 9:14 am
Mr. Palmer,
That’s correct. Smaller stripe sizes will definitely NOT affect drive space. Heck, the OS doesn’t even know that striping is occurring (assuming it’s happening on a hardware RAID card) so there’s no way that it could affect drive space. I suppose that someone like Microsoft could write RAID code that was tightly integrated into the filesystem, but as far as I know that hasn’t happened.
So, to your question, no, having a 4KB file (for example) residing on part of one stripe does not preclude a different file from residing on the remaining part of the stripe.
TT
June 28th, 2006 at 1:09 pm
Thanks for your response. It’s as I thought and I think your transparency argument says it well. (That the OS doesn’t even know a drive is striped for hardware RAID. New thread: What about Microsoft’s soft RAID 0 in NT, Win2K and XP? Any cluster alignment with stripe there?)
But just to be sure, the question was really about “too large” a stripe size. (You talked about “smaller stripe sizes.”) IOW, if you had 512KB stripes but only wrote 4KB files in a 4KB cluster file system, would you suffer a storage penalty? (I’m not suggesting that’s a good setup, simply a hypothetical.)
Again, I think the answer is “No.” But there sure is a lot of conflicting info out there. Do a FIND for “waste” at
http://www.datarecovery.com.sg/data_recovery/types_of_raid_configurations.htm
You’d think these people would know, as they’re in the data recovery business. They’re treating stripes like clusters. But I trust Adaptec more.
Perhaps an article on how RAID works at the hardware level will help clear up the misinformation. How does the controller keep track of stripes and what kind of translation goes on during I/O, etc.?
Thanks.
June 28th, 2006 at 1:23 pm
Wow. I checked out the link. Here’s the text that you referred to:
Firstly, if the stripe size is too small, writing a big file would cause the file to be broken down into many segments across the drives hence utilizing more disk resources. A stripe size too big may exceed the size of the data to be stored and result in space wastage. That is to say if you configure 100K as your stripe size, you’ll waste 30K of space if you are to write a 70K sized data.
That’s just flat out wrong for hardware RAID, which I would guess is 95% of all RAID shipped. I’m also pretty confident that Windows RAID doesn’t link stripe size and file size, but I’ll look into that and make a new post if I find anything.
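To see why no space is wasted, it may help to sketch how a striped volume maps a byte offset to a drive. The Python below is a deliberately simplified, RAID-0-style layout with a hypothetical function name; it ignores RAID-5 parity rotation and is not how any particular controller is implemented. The point is only that the mapping is plain arithmetic on the volume's byte offsets, and the controller has no idea where one file ends and the next begins.

```python
KB = 1024

def locate(byte_offset, stripe_bytes, data_drives):
    """Map a volume byte offset to (drive index, stripe row, offset inside that stripe)."""
    stripe_number, offset_in_stripe = divmod(byte_offset, stripe_bytes)
    row, drive = divmod(stripe_number, data_drives)
    return drive, row, offset_in_stripe

STRIPE = 100 * KB   # the 100K stripe from the quoted text
DRIVES = 4          # assumed number of data drives

# A 70K file occupying volume bytes 0 .. 70K-1 ends inside the first stripe.
print(locate(70 * KB - 1, STRIPE, DRIVES))   # last byte of the 70K file
# Whatever the filesystem stores next starts at byte 70K: same drive, same stripe.
print(locate(70 * KB, STRIPE, DRIVES))
# Only at byte 100K does the data move on to the next drive.
print(locate(100 * KB, STRIPE, DRIVES))
```

The remaining 30K of that stripe is ordinary usable space; the filesystem will put whatever it likes there.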
Regarding the stripe size, you’re right. You said “too large” and I gave you an example of “too small”. But as you surmised, it just doesn’t matter.
I’ll add the suggestion for an article to my things2do list. Thanks for the idea.
TT
January 9th, 2007 at 2:45 am
Tom,
That poor guy does not have a future in storage. What I thought he might say was if the stripe was sooo large that your entire database fitted inside it, it would all reside on one spindle and so performance would be poor.
Joe
January 9th, 2007 at 1:20 pm
Am I right in thinking that if you have an array which spends most of its time serving up one big file at a time - say on an A/V workstation, or a machine doing heavy-duty image processing - a small stripe size would be best?
In that situation, assuming the OS has laid the files out contiguously, there won’t be a lot of seeking, just reading, so seek time will no longer be a bottleneck. Conversely, if the stripes were really big, then for this application, performance would go down - the system would spend time reading a stripe’s worth of blocks from one spindle, then go to the next one, then the next, etc, effectively reading at the speed of a single disk.
Of course, that breaks down if readahead is happening - the OS can be reading a stripe from each spindle at the same time, thus getting full throughput.
I suppose the sweet spot, for effectively single-threaded systems, anyway, is to have stripe width * stripe size being about the size of the files you’re reading (or your OS’s readahead buffer size); that way, you spread the files out over the whole array, getting maximum throughput, while minimising seeks. The fact that files vary in size makes this purely academic, though!
– tom
January 12th, 2007 at 8:24 am
Hey, Tom.
It really depends on what the access pattern looks like from the OS. The file may be large, but the accesses are probably 64KB or less. So if you have a 9 drive RAID-5, then you’d need an 8KB stripe size. At that point the RAID stack will start generating a ton of additional overhead just to create those commands, which defeats the purpose.
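The 8KB figure is just the access size divided by the number of data drives. A quick Python sketch, with a hypothetical function name:

```python
def stripe_kb_to_span_all_drives(access_kb, total_drives):
    """Per-drive stripe size at which one access touches every data drive of a RAID-5."""
    data_drives = total_drives - 1   # one drive's worth of each major stripe holds parity
    return access_kb / data_drives

print(stripe_kb_to_span_all_drives(64, 9))   # 8.0
```

So every 64KB access from the OS turns into eight separate 8KB drive commands, which is where the extra overhead comes from.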
However if you’re doing a lot of writes into a write-through cache in front of a RAID-5 array, then making sure that each OS write covers a full stripe will give you a BIG increase in performance. But making sure the OS writes align on stripes is near impossible.
And I agree with your point about large stripe sizes. It’s safe to assume that a drive will support read-ahead, so you “should” be getting media reads from all those drives in parallel. Going to a smaller stripe size would only help if the array was degraded and you had to rebuild data on each read. You probably don’t want to slow down optimal performance just to make degraded a little faster.
Like you said - it’s pretty academic.
Unless you can control the size of files and make them align perfectly with stripes and control how the OS accesses that file, and you’re not worried about strange write-thru or degraded cases, then there’s not much point in worrying about stripe size. Large is good 99% of the time.
Thanks for reading, Tom.
TT
January 18th, 2007 at 5:47 pm
What is the best stripe-size for an email system?
2,500,000 files but all the files are very small?
thanks,
sam
January 19th, 2007 at 5:39 am
Sam, what’s the read/write ratio on the mail server? I assume the accesses are pretty random.
If it’s mostly reads, then RAID-5 with any large stripe size will be fine - just as long as the stripe size is much greater than the file size. The mail system will probably use a high command queue depth and therefore you want each command to access exactly one drive, rarely falling across a stripe boundary and accessing two drives.
If it’s mostly writes, then RAID-10 is what you want. All those short writes are probably quite random, and the RAID-5 parity update penalty will kill performance. And if you pick RAID-10, then stripe size doesn’t have much effect on performance, assuming it’s not abnormally small.
And of course if the load level on the mail server is low, then it probably doesn’t matter what you use.
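Here is a back-of-the-envelope Python sketch of why the read/write ratio matters so much. It reuses the 8-drive, 100 IOPS per drive example from the post and its four drive IOs per short RAID-5 write. The figure of two drive IOs per RAID-10 write (one per mirror half) is my assumption, not something stated above, and the model ignores caching and stripe crossings.

```python
def host_iops(drives, drive_iops, read_fraction, write_penalty):
    """Estimate host IOPS when each read costs 1 drive IO and each write costs write_penalty."""
    backend_iops = drives * drive_iops
    cost_per_host_io = read_fraction * 1 + (1 - read_fraction) * write_penalty
    return backend_iops / cost_per_host_io

RAID5_PENALTY = 4    # two reads + two writes, per the post
RAID10_PENALTY = 2   # assumed: one write to each side of the mirror

for read_fraction in (1.0, 0.7, 0.3, 0.0):
    r5 = host_iops(8, 100, read_fraction, RAID5_PENALTY)
    r10 = host_iops(8, 100, read_fraction, RAID10_PENALTY)
    print(f"{int(read_fraction * 100):>3}% reads: RAID-5 ~{r5:.0f} IOPS, RAID-10 ~{r10:.0f} IOPS")
```

At 100% reads the two levels look the same in this model; the gap opens up as the write share grows.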
TT
January 22nd, 2007 at 8:00 am
Just to revisit that media PC question. My RAID5 controller supports stripe sizes of up to 4096KB, and the files I’m hosting are averaging over 200GB in size which makes that statistic kinda redundant (from what you said earlier!).
Given that I plan on serving at most, six clients (well 6 possible files being accessed at once forgetting about trivial OS stuff) doing sequential reads (at most 4 files) and sequential writes (at most 2 files), would I still be advised ‘big is best’? It’s just 4MB seems HUGE compared to what people usually talk about.
January 24th, 2007 at 6:42 am
Phil, yeah, with multiple clients bigger stripes are definitely better.
But I agree that 4MB is damn big. I don’t know how many drives you have, but let’s say you have five. Also, I’ll assume that you have the write-back cache turned on - hopefully with a battery. To do a full stripe write the RAID controller would have to buffer 16MB of data, and have space for another 4MB of parity. So will your RAID card buffer and write in 20MB chunks? Unfortunately there is no way of knowing how well it works without reconfiguring the array, letting it complete the build, and then running a performance test like IOMeter. This is pretty tedious work and will take days.
If I were you, I would set the stripe size to 1MB. That seems like a more manageable size, and should deliver good read and write performance. Also, what’s the default stripe size? That’s “probably” a good value to use.
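The buffer arithmetic is easy to redo for other configurations. A minimal Python sketch, with a hypothetical function name and the same five-drive assumption as above:

```python
def full_stripe_buffer_mb(total_drives, stripe_mb):
    """Return (data MB, parity MB, total MB) needed to assemble one RAID-5 full stripe write."""
    data_mb = (total_drives - 1) * stripe_mb
    parity_mb = stripe_mb
    return data_mb, parity_mb, data_mb + parity_mb

print(full_stripe_buffer_mb(5, 4))   # (16, 4, 20) with 4MB stripes
print(full_stripe_buffer_mb(5, 1))   # (4, 1, 5) with 1MB stripes
```

Dropping to 1MB stripes cuts the amount the cache has to gather per full stripe write from 20MB to 5MB.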
I hope that helps.
TT
February 17th, 2007 at 3:46 am
Opinions?
I have a 2 TB RAID-5 with 8 disks. I get really crappy read speeds from my directories with really small files. This data is mostly for access once it is there, so reading is the most common, as well as backup. There are about 15 million files, all very small each, like 100k or less. I am thinking that a smaller stripe size than the current 64k will significantly increase my read performance. Tell me if I’m wrong.
I think that if I had a stripe size of say 16k, then 16 * 7 = 112 and I’d be getting a parallel read over 7 spindles. With 64k stripe size, then it seems to me I get only 2 spindles.
Bad logic? Good?
February 26th, 2007 at 6:34 am
Kazooless, I think moving to a smaller stripe size will hurt your performance. If you moved to a smaller stripe size and if your transfer sizes are large enough to cross multiple stripes, then you would get multiple disks reading in parallel (that’s good), but you’d also suffer from the worst-case access time (that’s bad). Before getting any of your data, all the disks will have to seek and rotate. With a single disk, this access time is just the average, for example half a rotation. But with multiple disks, it’s the worst-case access time that you’ll see - longer seeks and more than half a rotation.
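A simple probability model shows how quickly this gets worse. Assume (and this is only an assumption) that each drive's rotational delay is independent and uniformly distributed between zero and one full rotation. Then the expected wait for the slowest of n drives is n/(n+1) of a rotation. The sketch below ignores seek time, and the 7200 rpm figure and function name are hypothetical.

```python
from fractions import Fraction

def expected_wait_rotations(drives):
    """Expected rotational delay of the slowest of `drives` independent, uniform delays."""
    return Fraction(drives, drives + 1)

ROTATION_MS = 60_000 / 7200   # assumed 7200 rpm drive, about 8.3 ms per rotation

for n in (1, 2, 4, 7):
    frac = expected_wait_rotations(n)
    print(f"{n} drive(s): {frac} of a rotation, about {float(frac) * ROTATION_MS:.1f} ms")
```

One drive waits half a rotation on average, but a read that needs all seven data drives waits nearly a full rotation, and seeks behave in a similar way.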
About the only time where short stripes might help is for long sequential reads, but honestly with the drives doing good read-ahead it’s still probably safe to stick with large stripes.
TT
March 14th, 2007 at 7:52 am
Thanks for the very informative explanation.
My question relates to our student lab RAID setup, 7 500GB 300Mb/s drives with RAID5. The Write policy options available are default, write through and write back. The stripe size goes from 16K, 32K, 64K, 128K default, 256K, 512K AND 1024K. The files that will be hosted on it vary from small simple text files KB in size to DV video files of ~15GB, at least 560GB of DV files. About 10 people will have access to the drive whenever they need to and are using Windows XP. I’ve been asked to get it up and running, which is fairly straightforward, but I’d also like to have optimal settings.
We are also going to try and get some sort of NFS working as one of my colleagues uses OSX and thinks it works much better for reading single frames from videos. Any advice would be greatly appreciated,
Cheers Daire
March 14th, 2007 at 8:01 am
Daire, assuming that the controller has a battery on the cache, you’ll definitely want to run in write-back mode. As far as the stripe size, 128KB is probably best (especially since it is the default). If you were doing only single-threaded, large sequential writes then I would recommend moving up to 1MB, but 128KB is a better “average” stripe size for highly-queued, random transfer sizes.
Good luck,
TT
April 6th, 2007 at 10:02 am
I’m very appreciative of this article; this is very eye-opening for me.
We are an A&E firm of 80-100 employees with a file server hosting 750K files & 60K folders on a 4-HDD U320 SCSI HDD RAID5 array. Stripe size is set to 16K.
I need to enlarge the array and optimize it. I personally don’t see this file server getting taxed very hard - no PERFMONs show me any data that makes me feel like it EVER gets nailed. However, I am concerned it is not a great solution and can (and should) be optimized.
Typical files are your average Word Docs and Excel spreadsheets (30KB-800KB) and AutoCAD Drawings (800KB-8MB).
I am embarrassed to be another “Hey, help me for free” post here, but this arena simply isn’t my forté and any assistance is very much appreciated. Thank you for all you do.
RLR
April 6th, 2007 at 11:06 am
Hi, R.
No need to apologize for the call for free help. That’s what we’re here for - to make the world a better place via the noble art of storage blogging. And if we didn’t like your question, we would have just deleted it and had a prolonged giggle. We’re simple folk…
So how was a 16KB stripe size chosen to begin with? That’s pretty small. I’m worried that it may have been the controller default, and higher stripe sizes may not actually work very well. Older RAID controllers often didn’t allocate memory correctly and used small page sizes, making large stripes inefficient and sometimes reducing the maximum queue depth to the drives. You may want to see if there is a newer firmware version that may have larger defaults. A larger stripe size “should” improve performance.
But with all that said, if this server is really seeing such a low work load I’m not sure you should screw around with upgrading firmware. I’d hate to hear that the newer code didn’t understand the drive metadata and couldn’t import the array. Or you encountered some new bug that was introduced into the code. My gut feel is to just leave the controller alone. And if the controller does have a higher default stripe size, then I would certainly change it during the array expansion.
BTW, I assume you have a backup “just in case”?
TT
April 18th, 2007 at 3:26 am
Tom,
Recommendations above apply to RAID 5 configurations. Am I right that the same is true for RAID 1+0 (striped set of mirrors)? E.g. I have a total of 10 disks (i.e. 5 disks in a stripe set). The RAID hosts huge DB files with random read/write access and also with relatively big sequential writes (while adding bulks of new records to the db).
Thanks!
April 18th, 2007 at 5:55 am
Andre, when running a database with small, random IO on any RAID level using striping (RAID-0, 10, 1E, 5, 50, 5EE, 6, or 60) you want to make sure that the minor stripe size is large enough to avoid too many stripe crossings, i.e., host commands that span two drives. As the number of spanned host commands approaches 100%, the performance will be cut in half because it takes twice as many drives to service each host command. With these small random IOs you can’t really have a stripe size that’s too big.
You also mention that your database occasionally writes large sequential files. With RAID-5, as discussed earlier in this post, there are a lot of issues to keep in mind to ensure that the write-back cache can use the optimal full-stripe-write algorithm rather than the less-optimal read/modify/write algorithm. However this is all moot with RAID-10. There is only one way to write data, and it never involves reading old data as with RAID-5. So having a large stripe size won’t hurt RAID-10 writes.
The bottom-line is that you should be able to use the largest stripe size supported by the RAID controller.
TT
June 7th, 2007 at 9:23 pm
Hi,
Great site. I am looking for a recommendation on our setup.
We are using two Fibre RAIDs. Each of the two enclosures has 16 drives. In each enclosure we create two RAID5 arrays (2×8-drive RAID 5 arrays). There’s only one controller with 1GB RAM cache, but there are two 2GB ports so we map each array volume to a different port, and then plug them into our switch. We end up with 4 x 8-drive RAID5 LUNs of 2.5TB each, which we then stripe RAID0 across in Windows 2003 Server.
We’re working almost exclusively with video/film images that require realtime playback of 24 x 9MB or 12MB video files per second per client, or else multi-GB QuickTime files that are still in the same range of 150-250MBps per client. We have two clients on the server that get block data over fibre fabric.
The choices I have to make are stripe size and block size. Controllers on the RAIDs allow up to 1MB stripe size (remember each controller handles a pair of RAID5 arrays consisting of 8×400GB drives). NTFS has built-in support for block size up to 64KB, but I’ve also seen 3rd-party NTFS formatting utilities offering block sizes as big as 1MB. With two users on large sequential reads like this, what should I do?
Thanks so much for any help!
June 8th, 2007 at 7:18 am
Hi, Tyler.
Just to make sure I understand your particulars:
You want to concurrently read 24 streams of video, with each stream transferring at 9-12MBytes/sec (not Mbits/sec) for a total of 216-288MB/s? And are those 24 streams spread across the four arrays, with each array seeing 6 streams? Or do you want 24 streams per array, for a total of 96 streams?
And in the case of the QuickTime files, the throughput is 150-250MBytes/sec? But it sounds like now you’re talking about 150-250MB/s per client rather than stream. If so, since there are two clients, I guess each client reads 12 streams concurrently? In other words, each stream would need to transfer at 12.5-21MB/s?
I just want to make sure I got these details right before making suggestions.
TT
June 8th, 2007 at 8:17 am
There are always two clients, so I think it would really be just two streams. Each client is reading 150-250MBps. The only difference is the form that the data is stored. Sometimes it is as very large QuickTime files, other times it is as a series of sequential still images instead. In the latter case, these would be 9-12MBps per image file, and the system reads 24 of those image files every second to play the motion picture (24 images = 24 frames per second in film).
The data is on one large volume that is striped across the four RAID5 LUNs. So those four arrays together support the two clients (not two clients per array, but two clients for the four arrays striped together).
June 8th, 2007 at 12:43 pm
Ah, I see. You keep saying that the image file is 9-12MBps, but I think you mean 9-12MB. Adding “per second” doesn’t make sense since you talk about sending 24 images per second. So if the client is reading 24 images a second and each image is 9-12MB each client will read at 216-288MB/s.
Wow. Over 200MB/s for an image stream is REALLY high bandwidth. I believe that decent quality HDTV signals using MPEG2 are just 2-3MB/s. Your images are 100X higher bandwidth.
Does that make sense?
TT
June 9th, 2007 at 1:31 pm
You are correct, I messed up saying “Mbps”. It is MB per image, so your figures of 216-288MBps are correct. We actually have an image format we use that is 4X that big, but it requires DAS. We are trying to accomplish these things with off-the-shelf hardware rather than proprietary solutions.
June 11th, 2007 at 6:09 am
216MBytes/sec for a video stream is a really, really big number. I think you mentioned that you’re using FC. I assume you’re using 4Gb FC? Because 2Gb FC wouldn’t be fast enough - unless you’re spreading the IO across two channels.
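A quick sanity check of those numbers in Python. The per-link figures are the commonly quoted nominal one-way rates for Fibre Channel (roughly 200MB/s for 2Gb and 400MB/s for 4Gb); they are my assumption here rather than measurements, and real-world throughput will be lower.

```python
FRAMES_PER_SECOND = 24
IMAGE_SIZES_MB = (9, 12)
NOMINAL_MB_PER_S = {"2Gb FC": 200, "4Gb FC": 400}   # assumed nominal link rates

def stream_rate_mb_per_s(image_mb, fps=FRAMES_PER_SECOND):
    """Bandwidth one client needs to play image_mb-sized frames at fps frames per second."""
    return image_mb * fps

for image_mb in IMAGE_SIZES_MB:
    need = stream_rate_mb_per_s(image_mb)
    for link, capacity in NOMINAL_MB_PER_S.items():
        verdict = "fits" if need <= capacity else "does not fit"
        print(f"{image_mb}MB frames need {need}MB/s: {verdict} on one {link} link")
```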
To get back to your original question, you really do want to use stripe and block sizes as large as possible. I’m not familiar with the 1MB NTFS modification, but it could be worth trying. Can you send a link or a product name? I’m curious.
BTW, are you currently able to hit these transfer rates? If not, what are you at now?
TT
June 14th, 2007 at 11:39 am
I have two 2×4Gb cards and 1 2×2Gb card. The storage has 2×2GB ports x 2 enclosures, so 4×2GB ports all together.
We’re getting 310-40MBps sequential read, a little more than half that for 100% random. Our SAN software (MetaSAN by Tiger Technology) has a utility included for altering the stripe size after you create the dynamic volumes.
June 25th, 2007 at 6:02 pm
I’m a bit confused on how the advice here seems to conflict with IBM’s recommendations on their ServeRAID series. Specifically I have taken recent possession of several ServeRAID7k x346 boxes. In their docs IBM says:
“You can set the stripe-unit size to 8 KB, 16 KB, 32 KB, or 64 KB. You can maximize the performance of your ServeRAID controller by setting the stripe-unit size to a value that is close to the size of the system I/O requests. For example, performance in transaction-based environments, which typically involve large blocks of data, might be optimal when the stripe-unit size is set to 32KB or 64KB. However, performance in file and print environments, which typically involve multiple small blocks of data, might be optimal when the stripe-unit size is set to 8KB or 16KB.”
Based on the advice here it seems I should obviously go for the largest available; 64KB. However if some of the boxes are file and print servers the above seems to be contradictory. Is there perhaps some differences between RAID5 implementations that would contraindicate going for the largest stripe size available (their default is 8KB!)? For the record there are a couple of Windows 2003 SQL servers and a couple of RHEL file, print, and email servers.
Any clarification greatly welcome. Thanks!
June 26th, 2007 at 5:57 am
Daryle, the IBM advice makes absolutely no sense - especially with the current version of Adaptec RAID controllers. But I know the folks in Raleigh that manage the ServeRAID controllers and they’re far from stupid.
Let me look into this and get back to you. I can’t remember which RAID stack is used in the ServeRAID7k. It might be an old stack with some crazy behavior where large stripe sizes had a weird side effect - which I’ll go out on a limb and call a bug.
Check back in a few days and I hope to have more info.
TT
June 26th, 2007 at 1:08 pm
Daryle, I checked into this. Apparently the ServeRAID7k uses an older RAID stack that Adaptec has put into a maintenance mode, so it’s a little difficult to get real-world performance numbers. I really, really doubt that a smaller stripe size helped performance - UNLESS this RAID stack had a bizarre bug or architecture that made it perform poorly on certain stripe sizes.
At this point it’s hard to say why they recommended smaller block sizes, but I’m willing to bet that the statement is just flat-out wrong.
Sorry that I can’t be more definitive.
TT
June 26th, 2007 at 1:19 pm
Thanks for the help. I guess when I’m redoing an array I will just set it to 32KB as a safe compromise. Probably a lot better than the 8KB it is at now and it isn’t the max so it should be fine. Pity about ServeRAID 7k being obsolete already. I thought it was recent enough to be a good implementation of RAID 5.
June 26th, 2007 at 1:25 pm
Yeah, 32KB should be a safe bet. 8KB is just crazy.
BTW, IBM would probably shoot me if they knew I implied that ServeRAID 7k was obsolete. I certainly didn’t mean to do that. That RAID stack was certainly solid when it was released and IBM may even still be shipping it, but I think the last product using that stack was released about two years ago.
Good luck.
TT
July 30th, 2007 at 1:27 am
I have one of those rare “academic” cases when all my files are about the same size (~300KB). Most of the time these files are written to the RAID5 (4 or 7 disks), and occasionally are being read - always the whole file at once. I’m trying to optimize the writes (without compromising the reads); I’m currently using 64KB stripes, with WriteBack and ReadAhead, but I’m not sure this is an optimal setting for my scenario. Any suggestion for better performance will be much appreciated.
Thanks,
Shlomo.
July 31st, 2007 at 12:45 pm
Hi, Shlomo.
It sounds like you understand RAID-5 writes well enough to try to cause the file to fall on an entire major stripe, avoiding the read/modify/write penalty of shorter writes.
Unfortunately it’s very difficult to make the files line up exactly on stripe boundaries. The OS isn’t aware of these boundaries, so if they don’t line up the RAID controller ends up seeing lots of major stripe misalignment, causing lots of read/modify/writes. The only way to really ensure mostly full stripe writes is to make the major stripe a LOT smaller than the file size. But if you do that you’re going to get killed in controller and RAID overhead.
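To make that concrete for 300KB files and 64KB stripes, here is a rough Python sketch. It counts how much of a single write lands on complete major stripes (the part that could be a full stripe write) and how much is left over as partial stripes. The function name and start offsets are hypothetical, and it ignores whatever coalescing the write-back cache manages to do.

```python
def split_write(start_kb, length_kb, major_stripe_kb):
    """Return (number of complete major stripes covered, KB left over in partial stripes)."""
    end_kb = start_kb + length_kb
    first_full = -(-start_kb // major_stripe_kb)   # first stripe boundary at or after start
    last_full = end_kb // major_stripe_kb          # last stripe boundary at or before end
    full_stripes = max(0, last_full - first_full)
    return full_stripes, length_kb - full_stripes * major_stripe_kb

STRIPE_KB = 64
for disks in (4, 7):
    major = (disks - 1) * STRIPE_KB   # data portion of one RAID-5 major stripe
    for start in (0, 100):
        full, partial = split_write(start, 300, major)
        print(f"{disks} disks ({major}KB major stripe), file at {start}KB: "
              f"{full} full stripe(s), {partial}KB in partial stripes")
```

With 7 disks the 384KB major stripe is bigger than the file, so no single 300KB write can ever cover a full stripe on its own; with 4 disks about a third of each file still ends up in partial stripes.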
Also, there’s the problem with files becoming fragmented, but I suppose you can somewhat avoid that by frequent defrags.
Fortunately, the write-back cache can solve a lot of those problems by combining writes to multiple files - if contiguous. Which leads me to my questions: Are you writing these files sequentially or randomly? Also, are you writing one file at a time, or multiple files?
I realize that you’re reading the files only occasionally, but having too small of a stripe size with lots of concurrent read requests can really kill performance.
But I’ll wait for your answers before digging into that.
TT
August 21st, 2007 at 2:40 pm
Hi,
I am setting up a new Exchange 2007 mailbox server and am a tad confused on what stripe size I should set my raid 10 hard drives to. The set of drives will only be used for the database and not the OS, etc.
I understand that MS recommends the cluster size to be 64KB now. (http://technet.microsoft.com/en-us/library/bb124518.aspx) I also read that the database now does 8KB i/o’s. Does it still make sense to have a larger stripe size (i.e. 128KB)? My original plan was to have a 32KB stripe size.
Also, to add to my confusion, which relates to the IBM ServeRAID question that was asked a while back: here is an article from HP stating “the stripe size should equal the primary i/o size for the application” (http://docs.hp.com/en/B2355-90672/ch08s06.html?jumpid=reg_R1002_USEN) Can you help clarify? If that were true, it seems like I should have 8KB stripes…
Thanks so much for any help!
Mike K
August 23rd, 2007 at 7:08 am
Hi, Mike.
It sounds like your IO will be mostly random. So your only goal will be to make sure that the stripe size isn’t small enough to cause many of your IOs to cross stripe boundaries (and therefore double the number of IOs). In other words, having a small stripe size could hurt performance, but once that stripe size is big enough, having a larger stripe size will not help performance.
About the only time you need to worry about keeping the stripe size small is if you need to have the host write IOs build up in the cache in order to generate efficient RAID-5 full stripe writes rather than inefficient read/modify/writes. But as you point out, you’re using RAID-10 so this isn’t an issue for you.
Now, with all that said, I took a look at the HP link that you included. That link is talking about (I think) software striping in Linux. Are you using hardware RAID-1 or software RAID-1? If the former, then I’ll stand by my comments. But if the latter, well, I admit I’m just not that sure. There may be an interaction between the OS page size and the stripe size, and the HP link may be correct.
Sorry I can’t be of more help. Good luck.
TT
August 28th, 2007 at 8:40 pm
Tom, I read this post with interest. I’m confused about why Microsoft, EMC, etc. are advising to ‘align’ the data with the stripe boundaries. It seems futile to me. Sure, you can use diskpart to offset the alignment initially and have a perfect alignment. Say your stripe size is 64k (the amount of data written per disk at a time). If you write a 64k file, great, you’re still only utilizing one spindle. Now on the next write you get 4k worth of data… guess what: even if your next write is 64k, you’re out of alignment, because you’ll write 60k to the second disk and 4k to the third disk. Arrays with cache, such as EMC’s, can minimize how often this happens by using write-back cache, but once you’re out of alignment, who knows what random i/o pattern you’ll need to get back on track.
Microsoft has issued a document, and EMC has written whitepapers, on setting the alignment offset to 64k using diskpart.
http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx
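A small sketch of what the alignment advice is measuring, under two assumptions that are not stated in the comment: IOs start on multiples of their own size counted from the start of the partition (as cluster-sized file system IOs would), and the unaligned case uses a 63-sector (31.5KB) partition offset. The helper names are made up for illustration.

```python
SECTOR = 512
KB = 1024

def crosses_stripe(partition_offset, io_offset, io_bytes, stripe_bytes):
    """True if the IO's first and last byte land in different stripes."""
    start = partition_offset + io_offset
    end = start + io_bytes - 1
    return start // stripe_bytes != end // stripe_bytes

def crossing_count(partition_offset, io_bytes, stripe_bytes, num_ios=1024):
    """Count IOs, laid back to back from the partition start, that span two stripes."""
    return sum(
        crosses_stripe(partition_offset, i * io_bytes, io_bytes, stripe_bytes)
        for i in range(num_ios)
    )

stripe = 64 * KB
for label, offset in (("63-sector offset", 63 * SECTOR), ("64KB offset", 64 * KB)):
    for io in (4 * KB, 64 * KB):
        n = crossing_count(offset, io, stripe)
        print(f"{label}, {io // KB}KB IOs: {n} of 1024 cross a stripe boundary")
```

Under these assumptions a 4k write does not shift where later IOs land, because placement is fixed relative to the partition start rather than to the previous write. Whether a given workload actually behaves this way is something to measure rather than take from the sketch.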
September 22nd, 2007 at 12:17 pm
Hi folks,
I’ve read this whole thread with great interest… I’ve been scouring the web for an answer to the following question - looks like I’ve found the right place to ask…
I am trying to get the best performance out of the following raid set up:
Adaptec SATA 2410SA raid controller
4x WD SATAII 500GB drives
Windows Vista (32 bit, but will be moving to 64 bit soon)
Raid 5 Stripe size is 512k
I have formatted the drive under Vista with a 2048k allocation unit size…
I have been getting write speeds of 26-30 MB/s - I’m sure I could improve this with a better combination of stripe size/a.u. size…
I am a photographer, so most of my data comprises 8MB+ image files - I’d obviously like the best combo for this type of data transfer.
Any advice much appreciated.
October 14th, 2007 at 11:20 am
What about for games and multimedia? I primarily use my computer as a media server (movies approx: 700megs, music: 3-4megs) and for gaming, CS:Source, CnC, WoW, etc. What Stripe size would work best for this scenario?
October 27th, 2007 at 8:17 am
WOW! You guys are awesome….
I have a Sun SE6130 and a Sun ST5320 (NAS Gateway). We will be setting up the SE6130 to serve disks to the NAS gateway and then provide NFS export/CIFS shares to all the users. The CIFS shares will primarily be used for home directories and office size files (30K-800k) and the NFS shares will primarily be accessed by Unix systems with 1M-1G files. My plan is:
The SE6130 has 2 controllers (1 controller tray with 10 146GB FC drives and an expansion tray with 10 146GB FC drives, 20 drives total).
My Plan (Please feel free to chastise me, as I am not married to it)
I will create 4 RAID-5 vdisks (each 4+1) with a stripe size of 512k
On each vdisk I will create five equally sized “volumes” (~100GB each). I will assign each volume in alternating sequence to a controller allowing access to the ST5320 (which is direct fibre attached). For example vdisk1vol1 will be assigned to controller A and vdisk1vol2 will be assigned to controller B. I will do this until all 20 ~100GB volumes are assigned to the ST5320. (The assignment to a controller is somewhat arbitrary as both controllers have access to all volumes, but use the assignment as priority (I think)).
From the ST5320 I will create my initial “base volume” for each export/share starting with a different vdisk for each share. For example share 1 will use vdisk1vol1, share2 will use vdisk2vol1, share3 will use vdisk3vol1, and share4 will use vdisk4vol1, share5 will start again at vdisk1vol2, etc…. I will continue this until I have all of the shares created (about 12 shares)
Obviously not all shares are going to be the same size. The way the 5320 works, as I understand it, is the base volume is shared, then I grow that base volume(share/export) by simply adding LUNS as extents to the share. This is done on the fly without interruption to the end users accessing the shares. As each share/export needs additional space I will simply add space using the same “round-robin” fashion as I created them.
Sorry so long… My question is what do you think? very simple.