Buffer, buffer, who’s got the buffer
Posted in Storage Interconnects & RAID, Advisor - Tom Treadway by Tom Treadway
Question to the Storage Advisors, from Olaf: Why is such a large drive buffer/cache needed, considering controllers and the host have much larger caches anyway? And/or, why is read-ahead data not forwarded to the host immediately?
Olaf, the question of where to put the buffers has been around for years and years. As you point out, you have three general areas: the OS, the controller and the drive. We could actually break the OS into several areas, such as the application, the filesystem and the driver, but let’s just leave it simple for now and treat the OS as one unit.
First, let’s look at reads. Clearly the shortest latency for a cache hit on reads would be from the OS cache. Getting the same cache hit on the controller or drive would only add latency. So for this reason, controllers and drives typically don’t concentrate on caching read data. The OS cache is bigger and more efficient.
Next is the issue of read-aheads. You asked why we don’t just forward those to the OS immediately. Well, we could only do that if the OS asked for the data. And speculatively moving data that may or may not be used all the way from the drive to the OS is a risky game. It turns out that the least risky place to perform read-ahead is on the drive. While the media is rotating to reach the requested data, the read-ahead data may simply pass underneath the drive head. There is no overhead to reading this data, other than finding a place in the drive’s buffer to store it. And after the read completes, if the drive doesn’t have another command to process, it might as well read in more data, especially if it detects that the last few commands have been sequential.
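To make the drive-side decision concrete, here is a minimal sketch of how firmware might decide when to prefetch. It is only an illustration under stated assumptions: the class name `ReadAheadDetector`, the three-command history, and the 64-block prefetch size are all hypothetical, and real drive firmware is certainly more involved.

```python
from collections import deque


class ReadAheadDetector:
    """Hypothetical sketch of a drive deciding whether to prefetch."""

    def __init__(self, history=3):
        # Assumption: looking at the last 3 commands is enough to call
        # a workload "sequential".
        self.history = history
        self.recent = deque(maxlen=history)  # (start_lba, length) pairs

    def record(self, lba, length):
        """Remember a read command the drive has just serviced."""
        self.recent.append((lba, length))

    def is_sequential(self):
        """True if each recent read starts where the previous one ended."""
        if len(self.recent) < self.history:
            return False
        cmds = list(self.recent)
        return all(prev[0] + prev[1] == cur[0]
                   for prev, cur in zip(cmds, cmds[1:]))

    def prefetch_range(self, queue_empty, prefetch_blocks=64):
        """Return (start_lba, length) to prefetch, or None.

        Prefetch only when the drive is idle and the recent reads look
        sequential, so speculative work never delays a real command.
        """
        if not queue_empty or not self.is_sequential():
            return None
        last_lba, last_len = self.recent[-1]
        return (last_lba + last_len, prefetch_blocks)
```

The prefetched blocks stay in the drive’s own buffer. Nothing is pushed up to the controller or the OS until one of them actually asks for it.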
Last is the write buffer. Allowing dirty writes to “build up” in a buffer serves two purposes. One is that multiple writes to the same blocks will be reduced to only one disk write, saving disk seeks and rotations. The other is that short writes may be combined into highly efficient long writes, again saving disk seeks and rotations. It would seem that putting this buffer in host memory would make the most sense because it reduces latency, just as with reads. But that data is at risk. If you lose power, you lose your data. You have to make sure that the write-buffer memory has a battery in case of power failure. And you typically want the battery to last 24 hours, or maybe more if you want to make it through the weekend. [A UPS will only last a few minutes.] Since it’s impossible to put a battery on OS memory or drive memory, the only logical place to put it is on the controller.
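The two purposes of the write buffer can be sketched in a few lines. This is a toy model, not how any particular controller works: the `WriteBuffer` class and its block-per-entry dictionary are assumptions made for illustration, and battery backing, stripe alignment and flush policy are ignored entirely.

```python
class WriteBuffer:
    """Hypothetical sketch of a write-back buffer keyed by block address."""

    def __init__(self):
        # lba -> block payload. A later write to the same LBA replaces
        # the earlier one, so repeated writes cost one disk write.
        self.dirty = {}

    def write(self, lba, blocks):
        """Buffer a write of consecutive blocks starting at lba."""
        for offset, block in enumerate(blocks):
            self.dirty[lba + offset] = block

    def flush(self):
        """Return the disk writes to issue, as (start_lba, blocks) runs.

        Adjacent dirty blocks are merged into a single long write.
        """
        runs = []
        for lba in sorted(self.dirty):
            if runs and runs[-1][0] + len(runs[-1][1]) == lba:
                # This block directly follows the current run: extend it.
                runs[-1][1].append(self.dirty[lba])
            else:
                # Gap in the address range: start a new run.
                runs.append((lba, [self.dirty[lba]]))
        self.dirty.clear()
        return runs


buf = WriteBuffer()
buf.write(10, ["a", "b"])   # short write
buf.write(12, ["c"])        # adjacent short write
buf.write(10, ["A"])        # overwrite of block 10
buf.write(50, ["z"])        # unrelated block
disk_writes = buf.flush()   # four host writes become two disk writes
```

Everything held in `dirty` is exactly the data that disappears on a power failure, which is why the location of this buffer matters so much.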
In summary, you need a buffer in all three locations for different purposes. You can start combining them, but you will lose either performance or reliability.
TT
December 15th, 2006 at 7:46 am
But the UPS only has to last a few minutes, to allow the diesel-powered generators to start up and stabilize. When that’s done, operations can continue as normal. If for some reason there are no backup generators, the servers will start to shut down in a controlled fashion, flushing all I/O, and then the disk controllers can shut down, equally in an ordered fashion. This of course implies that there’s a sufficiently intelligent signaling mechanism between the UPS, the host and the controllers to trigger the shutdown.
The way I see it, if you have a large battery-backed disk cache, you’ve already committed to the idea that the consistency of your data is protected by a battery. The question is, do you want to use an expensive cache memory board backed up by an unknown battery (you never get to specify or choose the battery in the controller, it’s just there), or would you rather plug in more of the considerably cheaper (in relative terms) host RAM and get your performance boost there, while at the same time having total control over the UPS and backup generator systems?
January 2nd, 2007 at 6:35 am
Charles, I agree with you. If you’ve got a UPS and backup generator in place, then you’re already protected from power failures. You’re not protected from OS hangs, but in theory those shouldn’t happen if you’ve fleshed out the design and done enough testing.
Unfortunately, I think there are folks who chose OS RAID over hardware RAID to save money, and many of them probably don’t have a generator. They think that a few minutes of UPS protection is enough. The result is lost data and, even worse, data corruption due to RAID-5 write holes, etc.
Happy New Year!
TT
January 8th, 2007 at 5:39 pm
OSes do read-aheads, so you do get read-ahead data! Of course the drive doesn’t know it’s read-ahead data; it looks just like any ordinary read. A few years ago the OS read-ahead was a bit thick and would submit a read-ahead on the next few LBAs, just like the drives do, which is totally useless in a fragmented filesystem when you hit the end of a fragment. Nowadays the read-aheads have moved above the filesystem (in XP, for instance), so they will read ahead the next chunk of the file, which will not be the next LBA at the end of a fragment.
Joe
January 9th, 2007 at 5:42 am
Joe, yes, I agree that the OS does read-ahead. My point above was that the controller couldn’t do read-ahead on its own and then just surprise the OS by giving it extra data. The OS has to ask for it.
Plus, as you point out, the filesystem is fragmented, so it’s tough for a block-level device (like the controller or drive) to figure out where the next chunk of the file resides.
TT