Desktop drives on a RAID controller - not good

Posted in Storage Interconnects & RAID, Advisor - Tom Treadway by Tom Treadway

Question to the Storage Advisors, from Don: I am running the SATA 2410 RAID controller in a departmental server. Last weekend, after two drive failures at the same time, I replaced all four drives with WD Caviar SE16 SATA 250 GB drives. (It was an emergency.) I now read that there are specialized RAID hard disks that have “Time Limited Error Recovery”, which my disks do not have. Will my RAID controller compensate for hard disk based error recovery that takes many minutes?

Don, sorry for the delay in responding to your question. I recently made a post on the topic of nearline drives, so I figured your question was pretty pertinent.

Somewhere in that long post is a section on error recovery timeouts, quoted here for your reading convenience:

    Error recovery is probably the most well known feature of nearline drives. Typically a desktop drive is used in a solitary, non-RAID environment. Often this drive contains the only copies of your mother-in-law’s Orlando vacation photos, and you better not lose them! So a desktop drive will do whatever it takes to recover data. This can mean 30-60 seconds of retries. For a home user, that’s no big deal – just sit there and wait.

    But if these drives are used behind RAID controllers, the controller will probably give up long before the drive does, typically after something closer to the 10-15 seconds that are common on enterprise drives. Since nearline drives are often used in large disk farms that contain critical data, it’s common to use them with RAID controllers. Therefore nearline drives have error recovery timeouts similar to enterprise drives.
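To make the mismatch concrete, here is a minimal sketch in Python. The function name and the specific numbers are illustrative assumptions, not values from any drive or controller datasheet: 45 seconds sits inside the 30-60 second desktop range quoted above, and 15 seconds is the upper end of the controller range.

```python
def controller_view(drive_recovery_s, controller_timeout_s):
    """Classify how one hard-to-read sector looks from the controller's side.

    Both arguments are in seconds. This is a toy model: real firmware has
    more states than "answered" and "timed out".
    """
    if drive_recovery_s <= controller_timeout_s:
        # The drive returns data or an error while the controller is still listening.
        return "answered in time"
    # The drive is still retrying when the controller gives up on the command.
    return "timed out"


CONTROLLER_TIMEOUT_S = 15  # assumed: upper end of the 10-15 s range above

# Assumed worst-case recovery times in seconds, for illustration only.
drives = {"desktop": 45, "nearline": 10}

for name, recovery_s in drives.items():
    print(f"{name}: {controller_view(recovery_s, CONTROLLER_TIMEOUT_S)}")
```

With these assumed numbers the desktop drive is the only one that times out, which is the case the quoted section warns about.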

So, as you noticed, WD does have both Enterprise and Desktop drives, and the Enterprise drives have Time Limited Error Recovery. There is no mention of the Desktop timeout, but there is plenty of text that describes how thorough the retries are, which clearly indicates a long timeout.

I suppose it would be kind of cool to have a configurable timeout in the RAID controller, allowing timeouts to be set to a minute or so on non-Enterprise drives, but I’d worry what your OS or application would do. I’m sure it would time out, dump its write cache, and do all sorts of other nasty things long before the controller’s timeout expired.

Sorry, but I’m not feeling good about your config. You may experience a drive failure within a few days. If so, you can assume you’ll have another one a few days later. And then people will start yelling at you. It won’t be pretty.

TT

8 Responses to “Desktop drives on a RAID controller - not good”

  1. Ernst Lopes Cardozo Says:

    (I’m a bit late, catching up after a week off-line.) What would be really cool is drives that support both time-limited and leave-nothing-untried error recovery. In a RAID5 system, the controller would set the drives to limited error recovery. Then, when a drive has failed and the system is rebuilding by reading all the remaining drives from rim to hub, the controller would use the extensive error recovery method to increase the chance that it could rebuild successfully.
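The policy described in this comment can be sketched in a few lines of Python. Everything here is hypothetical: the state names, the function, and the two limits are made up for illustration and do not correspond to any real controller or drive interface.

```python
from enum import Enum


class ArrayState(Enum):
    OPTIMAL = "optimal"        # all members present, redundancy intact
    DEGRADED = "degraded"      # a member has failed, no rebuild running yet
    REBUILDING = "rebuilding"  # reconstructing onto a replacement drive


# Hypothetical limits in seconds; not taken from any drive specification.
SHORT_LIMIT_S = 7
LONG_LIMIT_S = 60


def recovery_limit_s(state):
    """Pick the per-command error recovery limit the controller would request."""
    if state is ArrayState.OPTIMAL:
        # Redundancy can repair a bad sector, so ask the drive to fail fast.
        return SHORT_LIMIT_S
    # No redundancy left: an unreadable sector is lost data,
    # so let the drive try everything it has.
    return LONG_LIMIT_S


for state in ArrayState:
    print(f"{state.value}: {recovery_limit_s(state)} s")
```

In a scheme like this the controller's own command timeout would presumably have to grow along with the drive's limit, or the long recovery would just look like a drive that stopped responding.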

  2. Tom Says:

    Ernst, I agree 100% that making this a tunable parameter would make more sense. Unfortunately the drive guys really want to make sure they differentiate their high-end, money-making drives from their low-end, $1 profit drives. Making a cheap drive do expensive things is too difficult to charge for. Pity.

    On a separate topic, I also think that SATA-only drives are silly. I’d rather see dirt-cheap, single-ported SAS drives that support the SATA interface (which they typically do). Again, it goes back to how the drive industry wants to charge for drives, and the motivation of the companies involved during the creation of the specs. (I admit I was involved in both efforts - mostly SATA.)

  3. Joe Fagan Says:

    I fought this battle and lost. Here’s why. The reason the drive guys don’t allow you to configure timeouts (Selectable Command Timeout) is that their recovery algorithms don’t degrade with time, up to about 20 seconds. Intuitively, you’d assume that the vast majority of retries would be successful on the first few passes, and that after 10 seconds (that’s 1666 passes on a 10K drive!) the return would diminish. That’s just not the case. The statistical distribution of recovery vs. retries is not a falling exponential. It’s got peaks as more sophisticated techniques are tried, and it falls sharply only when the algorithms have exhausted their techniques. The lengths they go to are staggering. [In addition to reading left and right of centre and patching bits of the track together, they will then switch to the filter configuration. On the read channel there are 16 8-bit fields to configure the filters. The values are optimised at manufacture time to give the best result for the head/media combination. During retries they will tweak these registers to meander around the vector space in n-dimensional ‘spheres’ centred on the configured optimum, chasing the best signal.] After about 20 seconds they still have more to try, but recovery does then start to diminish.
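The 1666 figure is easy to check, assuming one retry pass per platter revolution (the assumption that seems to be behind the number). The helper below is a hypothetical back-of-the-envelope calculation, nothing more.

```python
def passes(rpm, seconds):
    """Whole platter revolutions completed in the given time."""
    # rpm / 60 is revolutions per second; integer maths keeps the result exact.
    return rpm * seconds // 60


print(passes(10_000, 10))  # 1666
print(passes(7_200, 10))   # 1200
```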

    Unrecoverable errors add to the G-list (grown defect list), and when that hits a specified number you get a SMART error and a drive failure. So dropping the timeout has a very significant impact on return rates and field failure statistics (by about a factor of 2 going from 30 to 15 seconds, and the same again from 15 to 10).

    Joe

  4. Tom Says:

    I remember back in the day of ESDI, SMD and other manly drive interfaces, errors would cause us to move the read head offtrack in three positions in or out (both directions up hill, of course!), as well as shift the read clock three positions forward or back (sometimes while it was snowing!). If we couldn’t read data then we just recreated it - or made something up. It’s what we did, and we liked it that way!

    1666 retries?! Young kids today have too much time on their hands.

    ;-)

    TT

  5. Ernst Lopes Cardozo Says:

    Interesting stuff this. And it can be turned around too: putting a 15k drive in a PC for ultimate speed makes you vulnerable to losing those Orlando vacation photos because of the limited read retries.

  6. Tom Says:

    Ernst, that’s an interesting way to look at it: SAS drives are actually LESS reliable than SATA when it comes to reading marginal data. Hmm. I admit I didn’t see that coming.

    Joe, I think you were implying that after 20 seconds the chance of recovering bad data starts to decrease. So do you have any idea what the odds are of finding your data after 20 seconds? Let’s say 25% of all bad sectors are recovered only after 20 seconds. I think that would mean that SATA drives are 25% more likely to recover your data (of course ignoring all the other little things that make SAS and SATA drives different). Is that thinking correct?

    And I think I know what your answer is: The SATA media has 10x the Bit Error Rate (BER) of SAS, so it has 10 times as many bad sectors to recover. Sure, the SATA drive may try harder to recover, but it HAS to because it’s got so many dang errors.

    Joe, do I win anything? :-)

    TT

  7. Henry Black Says:

    According to Areca’s knowledge base: TLER can’t prevent the disk dropping; actually, it will increase the chance of dropping. The TLER will report a timeout error to avoid long error recovery time. Such action can improve the controller command process, but it will also increase the chance of disk dropping caused by timeout error.

  8. Tom Says:

    Henry, the Areca knowledge base is wrong.

    A drive returning a medium error should NEVER cause it to be dropped out of the array. Medium errors are an indication that RAID recovery needs to be attempted - such as rebuilding the data from the other drives, reassigning the bad block, and re-writing the rebuilt data. A drive should only be dropped out of an array if it stops responding - which is what a drive looks like when the error recovery takes too long.
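As a rough illustration of the "rebuilding the data from the other drives" step, here is a minimal Python sketch. It assumes a RAID 5-style stripe with single XOR parity, and the function names and sample blocks are invented for the example; it is not how any particular controller's firmware is written. Reassigning the bad block and re-writing the result are only noted in a comment.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, bad_index):
    """Reconstruct the unreadable block from every other block in the stripe.

    `stripe` holds the data blocks plus the parity block; the entry at
    `bad_index` is the one whose drive reported a medium error.
    """
    survivors = [block for i, block in enumerate(stripe) if i != bad_index]
    return xor_blocks(survivors)


# Three made-up data blocks and their parity, one block per drive.
data = [b"ABCD", b"EFGH", b"IJKL"]
stripe = data + [xor_blocks(data)]

# Drive 1 reports a medium error for its block: the drive stays in the array.
stripe[1] = None
rebuilt = rebuild_block(stripe, 1)
print(rebuilt)  # b'EFGH'

# A real controller would now have the drive reassign the bad block
# and write `rebuilt` back to it.
```

The point of the sketch is that a prompt medium error gives the controller something to act on, while a drive that is still retrying gives it nothing but silence.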

    So I’ll hold to my original comment that limited recovery attempts and shorter timeouts will REDUCE the chance of a drive dropping out of the array.
