Bottlenecks: From disk to backbone
Posted in Storage Interconnects & RAID, Advisor - Tom by Tom Treadway
Question to the Storage Advisors, from Darwin: I just wonder, what kind of storage solution that can saturate 10Gb Ethernet backbone. If I assume I have a file server (of course equipped with 10GbE NIC on PCI-E 8x), can an external SAS/SATA-II JBOD box (using 4xwide SAS link) produce enough throughput (taking into account the hard disk speed and the link bandwidth) to saturate the network? If not, where will be the bottleneck? Thanks.
Darwin, let’s break down the bottlenecks in this picture.
First, each drive is capable of about 100MB/s from the media. Some drives are faster and some are slower, but this is a good average number.
Next, each 3Gb/s SAS link carries 300MB/s of data once you account for 8b/10b encoding. After protocol overhead, you’re down around 270MB/s - let’s say 250MB/s just to be safe. So with a 4x link, you can get close to 1000MB/s.
Now let’s look at your PCIe 8x bus. Each 2.5GT/s PCIe lane can do 250MB/s in each direction. I’ve seen efficiency estimates as low as 70%, so let’s say that each lane can do 175MB/s. Therefore an 8x bus can do 1400MB/s. In fact PCIe is full duplex, so the bus can transfer in both directions at once for a total of 2800MB/s. But this is all moot. The SAS 4x link is limited to 1000MB/s, so the PCIe 8x bus is overkill.
Lastly, let’s look at your 10GbE link. That link can probably burst at around 1GB/s, or 1000MB/s. Coincidentally, that almost exactly matches the SAS 4x throughput that we calculated earlier.
So it would seem that your system is nicely balanced with 10 drives, at about 100MB/s each, feeding a 10GbE link. But of course it’s never that easy.
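As a rough sketch, the arithmetic so far can be written down in a few lines of Python. The numbers are the ballpark estimates from this post, not measurements, and the function and constant names are just made up for illustration.

```python
# Back-of-the-envelope sequential throughput per stage, in MB/s.
# Every figure here is a rough estimate from the post, not a measurement.

DRIVE_MBPS = 100         # per-drive media rate
SAS_LANE_MBPS = 250      # one 3Gb/s SAS lane after overhead, rounded down
PCIE_LANE_MBPS = 250     # one 2.5GT/s PCIe lane, one direction
PCIE_EFFICIENCY = 0.70   # pessimistic efficiency estimate
TEN_GBE_MBPS = 1000      # approximate usable 10GbE rate


def stage_throughput(num_drives, sas_lanes=4, pcie_lanes=8):
    """Estimate the throughput of each stage between the media and the network."""
    return {
        "drives": num_drives * DRIVE_MBPS,
        "sas": sas_lanes * SAS_LANE_MBPS,
        "pcie": round(pcie_lanes * PCIE_LANE_MBPS * PCIE_EFFICIENCY),
        "10gbe": TEN_GBE_MBPS,
    }


def bottleneck(stages):
    """Return (stage name, MB/s) for the slowest stage; ties go to the earliest stage."""
    name = min(stages, key=stages.get)
    return name, stages[name]


if __name__ == "__main__":
    for drives in (6, 10, 20):
        stages = stage_throughput(drives)
        print(drives, "drives:", stages, "-> bottleneck:", bottleneck(stages))
```

With fewer than 10 drives the drives themselves are the limit. Beyond 10, the SAS 4x link and the 10GbE link cap the system at about the same point.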
First, can your disk controller hit the 1000MB/s mark? That answer will depend a lot on whether you’re using RAID for protection against drive failure, and furthermore what type of RAID and what access pattern you’re using. For example, if you’re doing random IO to the disks, then each drive is spending the majority of its time waiting for the head to seek or the media to rotate. Using the 71 IOPS from the previous post on SATA IOPS measurement, and assuming a 4KB transfer size, each drive is supplying less than 300KB/s. Therefore it would take over 3000 drives to hit the magic 1000MB/s mark! That’s not going to happen.
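The random IO case works out the same way. This is only a sketch: it takes the 71 IOPS and 4KB figures above at face value, treats 1MB as 1,000,000 bytes, and the helper names are hypothetical.

```python
import math

# Figures quoted in the post; treat them as rough estimates.
IOPS_PER_DRIVE = 71          # random IOPS for one SATA drive
TRANSFER_BYTES = 4 * 1024    # 4KB per IO
TARGET_MBPS = 1000           # what the SAS 4x / 10GbE links can carry


def random_io_mbps(iops, transfer_bytes):
    """Throughput of one drive doing random IO, in MB/s (1MB = 1,000,000 bytes)."""
    return iops * transfer_bytes / 1_000_000


def drives_needed(target_mbps, per_drive_mbps):
    """Smallest whole number of drives whose combined rate reaches the target."""
    return math.ceil(target_mbps / per_drive_mbps)


if __name__ == "__main__":
    per_drive = random_io_mbps(IOPS_PER_DRIVE, TRANSFER_BYTES)
    print(f"per drive: {per_drive * 1000:.0f} KB/s")
    print("drives needed:", drives_needed(TARGET_MBPS, per_drive))
```

Compare that with the sequential case, where 10 drives are enough. The access pattern moves the answer by more than two orders of magnitude.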
Next, what is the overhead of the drivers and firmware sitting between the OS and the drives? This number is probably around 10%, so it’s not too interesting. We can move on.
The next biggest problem, after drive access pattern, is “what is driving the 10GbE link?” Is it iSCSI with a TOE to offload the TCP/IP processing? You mention that this is a file server, so I assume that you’re just running NAS. If so, you’ve got a ton of overhead in processing filesystem metadata. You’re probably lucky to get 100-200MB/s.
So you can probably see that you asked a very complex question with lots of unknown variables. But hopefully all the simple factoids that I presented will help you find the bottlenecks in your system and tune the performance accordingly.
Good luck.
TT
March 20th, 2007 at 11:09 am
Darwin, Tom,
I’m very curious about the question behind the question: Would you need a disk system to be fast enough to fill the 10 GE link? Just one 10GE link? Probably not. Are you afraid that 10 GE is “overkill”, and in what sense? If your requirements can be met with a few 1GE links, and as long as that is cheaper than two 10 GE links, nobody is expecting anybody to buy 10 GE, right?
Today the biggest impact of 10 GE on storage decisions is: it’s there if we ever need it, the ceiling is way up.
Every bottle has a neck. I think we just want it to be the most expensive part of the system, not a 5% cost item. For storage, that should be the drives, although some people try to make the software licensing the crown jewel.