

Surviving component failure …

Posted in General, Storage Applications, Platforms, Storage Interconnects & RAID, Storage Management, Application Environments, Advisor - Neil by Neil

While it’s good to have good backups, most system admins would like to know exactly what component failure means for their uptime/downtime/workload scenario. In other words, how big a job will it be to get the server back online if something falls over?

So let’s look at how the storage subsystem should behave in the event of catastrophic component failure.

The easiest item on the list is the hard drives. If one fails you should hear a screaming noise from your RAID card, you should get an email from your management software (Adaptec Storage Manager) and, in a hot-swap chassis, you should see a red light on the failed drive. Pretty simple really. Just replace the drive. The RAID card will (depending on settings in the BIOS) either automatically rebuild the array, or sit there until you make the new drive a hot spare, at which point the card will rebuild the array.

We’ll come back to the subject of drive failures towards the end of this because there are issues with drive failures and different RAID types.

Backplane failure? Pretty rare, but it should just be a case of replacing the backplane, reconnecting everything and all will be good. The real problem here is that the RAID card is now more than a little annoyed at having had all its drives removed and will have to work out what was on the drives. This is simple if the drives all dropped off the card at once. However, if they went down in sequence then the card will, at some point, have marked the array as failed and you’ll need to talk to tech support about your options (too long to list here but you are not totally without hope).

Card failure? Just replace the card. Adaptec store their RAID data on all the drives plus on the card itself. The first thing the new card will do is read the metadata from the drives, load the array information into the NVRAM on the card and you’re away. Note that you may be prompted to accept the finding of a new array, which confuses people. It’s not new to you, but it is new to the new card.

Motherboard or total system failure. Just replace the components and your storage will be fine. You can even take the RAID card and array (disks) to another system, plug them in and the card will still know the array and present the data … it’s up to you to work out any OS issues that this causes but it’s generally not the end of the world.

Going back to RAID failures … of course there are different RAID levels which have different redundancy capabilities. If you are nuts enough to use RAID 0 you have no insurance - one drive failure will kill everything - don’t bother ringing us, we can’t do anything for you.

RAID 1 can survive 1 drive failure (after all, there are only 2 drives in there). RAID 5 can survive one drive failure - the real problem here is that you can sometimes be caught with a second drive failure during the rebuild after the first drive failure. If the rebuild is not complete then this is regarded as a two drive failure, and you’ve had it. (Note I’ll do a separate article about the dangers of building too large a RAID 5 array later.)

RAID 6 can survive two simultaneous drive failures, so it’s safer than RAID 5: if one drive fails, you replace it and start the rebuild, and then another drive fails before the rebuild is complete, the system will survive. You’ll be very annoyed, but you’ll still have your data. RAID 10, 50 and 60 can survive varying numbers of drive failures, but you have to be lucky which drives fail. In general you don’t want to count on being able to survive multiple drive failures in these configs, but most of the time you can.
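
As a rough illustration of the last few paragraphs, here is a minimal Python sketch. It is not anything the controller firmware actually runs. The table holds the number of drive failures each level is guaranteed to survive no matter which drives die. The names `GUARANTEED`, `raid10_survives` and `raid10_two_failure_odds` are made up for this example, and the RAID 10 layout is assumed to be simple two-drive mirror pairs striped together.

```python
from itertools import combinations

# Drive failures each level survives regardless of WHICH drives fail.
# RAID 1 here means the two-drive mirror described above.
GUARANTEED = {
    "RAID 0": 0,
    "RAID 1": 1,
    "RAID 5": 1,
    "RAID 6": 2,
    "RAID 10": 1,  # more only if the failures land in different mirrors
    "RAID 50": 1,  # more only if the failures land in different RAID 5 legs
    "RAID 60": 2,  # more only if no RAID 6 leg loses three drives
}


def raid10_survives(failed, mirror_pairs):
    """Return True if no mirror pair has lost both of its drives."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_pairs)


def raid10_two_failure_odds(mirror_pairs):
    """Count how many two-drive failure combinations the array survives."""
    drives = [d for pair in mirror_pairs for d in pair]
    combos = list(combinations(drives, 2))
    ok = sum(raid10_survives(c, mirror_pairs) for c in combos)
    return ok, len(combos)


# Hypothetical four-drive RAID 10: drives 0+1 mirror each other, as do 2+3.
pairs = [(0, 1), (2, 3)]
print(raid10_two_failure_odds(pairs))  # (4, 6)
```

For that four-drive RAID 10, four of the six possible two-drive failures leave the array alive, which is the “most of the time you can” case. The other two combinations take out both halves of one mirror, and that is the bad luck you don’t want to count on avoiding.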

Of course we recommend you have a hot spare in your system. This is just a drive sitting there watching all other drives (but doing nothing itself). When one drive dies, the card will initiate a rebuild onto the hot spare minimising your downtime. While the system is rebuilding you can trot off to the shop to get the failed drive replaced etc. When you get the new drive you replace the failed drive then make the new drive the hot spare. Don’t forget to move the nice neat printed label you put on the front of the system indicating which drive is the hot spare. Do not, under any circumstances, try to re-arrange the drives so that the hot spare is back where you originally had it, either physically or per drive ID. We can handle a bit of randomness - it’s humans that just have to be neat.

So as you can see, you can survive a fair amount of damage happening on your system without your world falling to pieces, but that does not, ever, mean you don’t need good backups. My mate Murphy was an optimist … if you have good backups then nothing much ever goes wrong. It seems to me that if you don’t have backups then fate kicks you at the worst possible time.

Ciao
Neil

6 Responses to “Surviving component failure …”

  1. Clive Says:

    Hi.

    I found this blog when searching for info to help me with sorting out an apparent drive failure in a RAID 10 on an Adaptec 2410SA controller. I should start by saying I’m standing in for someone else in terms of supporting this system - a Windows 2003 Server in a small network at the local primary school - and that I’m really much more of a software person than hardware. So, basically I thought I was “volunteering” to help them boot the server about once a term and that would be about it, but suddenly I find I’m up to my ears in firmware PANICS, degraded arrays, screwdrivers, and nasty sounding questions about whether I want to continue and possibly lose all data!

    To begin with the system crashed (for some reason), and then the controller wouldn’t boot up - during POST it failed with error messages about an unknown firmware error, error EF, kernel PANIC and the like. Me PANIC also…!! Totally removing power, letting things rest for a while and trying again made no difference. So after much googling and ringing around a few people I decided to try to reflash the firmware, but when I went back with my disks in hand a few hours later I found the controller was now willing and able to boot up “normally”, albeit also reporting one drive “missing” and degraded arrays. (I’d left it powered up after booting from a Linux based “Rescue CD” earlier - that had also reported the mysterious firmware error EF from the controller but had at least continued to a command line prompt after that, so perhaps that few hours with power on had helped in some way?)

    So now the system would boot up, but various restarts after that plus checking the RAID cables and card were firmly seated, etc., made no difference to the “missing member” message - missing after rescanning drives, missing everywhere and all the time. So I decided the drive must be truly dead, and ordered a new one. After scaring myself silly a few times (and by the way why are the ports apparently numbered in the reverse order from what the documentation says?) I got the new drive installed. The controller recognized it and allowed me to “initialize it” and to set it as the global hot-spare. I rebooted the system and all looked good. However another problem on this system is that Adaptec Storage Manager (ASM) consistently and mysteriously reports “no controllers found” so the only way I have right now to know what is really going on with the array is to reboot the machine and press Ctrl-A to get into the firmware configuration tool during POST. So, two days after installing the new drive, I do this (also motivated by having just installed the latest version of ASM in an attempt to fix that particular issue, and being told I needed to reboot before it could be used). So naturally I do the Ctrl-A dance again and find the drive on port 1 is still reported as missing from the array but also there was something about the rebuild being 80 something percent done! (And the new drive is visible in the list of individual drives elsewhere and still appears as the global hot-spare.) Hmm. I read somewhere else earlier that rebooting before a rebuild is complete also means it all starts again from scratch! I guess I better wait more than two days before daring to look next time! (These are not huge drives, originally 4 X 120GB Seagate SATA, but the replacement had to be a 160 GB SATA II drive - with jumper set to force SATA 1 speeds.)

    In the meantime, the updated ASM still doesn’t see the controller (and there’s no firewall running). However the “Adaptec Storage Agent” service (possibly a different name in the latest version just installed) is no longer hanging or failing in some way during the system boot which it was doing earlier so perhaps that is a small step in the right direction at least. I think I’m right in assuming the agent handles the direct communication with the controller, and then passes info to and from ASM as required?

    My real beef is that it doesn’t seem easy to find a clear and straightforward description of what to expect when a drive is replaced, how long things might take, etc. No doubt this all seems very basic when you’re dealing with this kind of stuff regularly but I feel a bit like a blind man walking around a room full of sharp things and with holes in the floor!

    I would have expected the new drive to no longer be showing as the hot-spare if the rebuild has started, and for the “missing member” message in the relevant array info page to also be replaced by something more meaningful. However I am pretty certain some kind of rebuild has been happening because the full daily backup took twice as long as it usually does in the hours immediately after the new drive was installed, although it was back to normal duration on the following night which was another reason I thought the rebuild was probably complete by then - this is also all happening over a weekend with essentially no other activity on the network so it’s just the server “idling” away the time.

    Okay, this post has grown far too big but I really wanted to get this story off my chest. Perhaps someone can point me at the relevant documentation that clarifies the meaning of what I am seeing, and what could be stopping ASM from seeing the controller. I have downloaded everything that seemed relevant from adaptec.com as I had no access to whatever came with the controller originally, but perhaps I’ve also just missed the important bits when digging through all that.

    Thanks, and I’ve enjoyed reading a few of the other posts also!

    Clive

  2. Neil Says:

    Clive,

    Wow … someone who can speak English, type well and knows how to spell punctuation! Enough platitudes … down to tech support:

    You have the wrong driver loaded in the system. You are using the Microsoft driver. Go to Adaptec’s website, download and install the Adaptec driver. That will at least let you see the card and array in ASM.

    As for documentation to tell you what you are doing? A good place to start would be the documentation that comes with our newest controllers. The manuals are a darned sight better than they were in the 2410 days.
    http://download.adaptec.com/pdfs/user_guides/Adaptec_RAID_Controller_IUG_6_2009.pdf

    It appears you’ve had a drive failure which has confused the daylights out of the card. This is not completely unheard of. It also appears that you have successfully rebuilt the array, but you’ll see that when you get the correct driver in the system and the card shows up in ASM.

    If you need further technical assistance we have an online help system “http://ask.adaptec.com” or you can call tech support in your region (depending where you are of course … which I can’t determine from your email address).

    Whichever way you contact us our tech support team will be glad to assist.

    Ciao
    Neil

  3. Clive Says:

    Thanks Neil! ( I like your writing style also. :)

    I’d come to more or less the same conclusion about the driver myself and in fact grabbed the (hopefully) correct version from the Adaptec site about 30 minutes ago, almost immediately before I came back to find your response here. (If you’re interested, the driver installed now is aacmgt.sys, version 5.2.0.10237 and is identified as coming from Adaptec. It’s also not signed. And then there are a couple of other Microsoft files displayed along with that also whose details I didn’t note down.)

    Anyway I’m so “excited” about being able to see what the darn thing is actually doing that I think I’ll jump the fence and update the driver right after I’ve posted this!

    I’ve also grabbed the documentation you pointed to. I thought I already had that file but it turned out to just be something with a similar name.

    Thanks very much for your response and help.

    Clive

  4. Neil Says:

    Clive,

    You will probably find one anomaly that I always find strange, but understandable, in our software. The hot spare drive will probably still be marked as a hot spare. It will have a similar icon to the rest of the drives in the array (indicating that some or all of the disk is in use as part of an array), and you will have an option to “delete” the hot spare.

    Why? At first I used to think … why doesn’t it just get rid of the hot spare moniker itself? Basically because then you would have a hard time understanding what had happened in your system. We leave the hot spare indicator there so that you can tell that the drive was a hot spare, but is now part of the array (and hence there is something rotten in the State of Denmark).

    I’d suggest you have a fair chunk of reading to do, and will (because you obviously enjoy typing and are reasonably good at it) come back with some questions.

    Looking forward to it.

    Ciao
    Neil

  5. Clive Says:

    I did jump the fence after my previous post, all fired up and ready to replace the Microsoft driver for the RAID controller with one from Adaptec but just when I was about to start doing that I got a sudden and severe dose of cold feet. That arose from a thought or two about how I was about to swap the driver for the boot device, and I wasn’t quite sure how that process would (or should) be handled. So, as I also really didn’t want to end up with the server being unable to boot just because I wanted the “excitement” of seeing ASM working properly, I just fired off some additional backups instead and came home to do a bit more reading and thinking. Then it turned out the server had to be shut down a day later anyway for an upgrade to the building’s power supply so I took that opportunity to get back into ACU and found that at long last everything was showing up as “Optimal”. And there was much rejoicing…

    After all that I only got around to installing the correct Adaptec driver today and now ASM is working properly. Very exciting indeed! Thanks also for mentioning that “anomaly” in ASM as with that information onboard I barely gave any potential consequences a second thought before merrily “deleting” the hotspare.

    And all was well in the State of Denmark! (And New Zealand also for that matter.)

    Hey, where did that crack about me needing to get back onto my meds go? I enjoyed that!

    Cheers
    Clive

  6. Neil Says:

    Clive,

    Did I mention to make a backup before you touch anything in your system? How remiss of me to forget this little statement. It’s pretty much the standard caveat … “she’ll be sweet” (build the confidence), “but make a good backup before doing this” (sow the seeds of fear), then “just call us back if you have any problems” (completely destroy what little confidence the customer had left) … then hang up - I can almost see the cold sweats I leave people in after talking them through something complex … :-)

    However … it appears that things are plodding along nicely in the land of the long white cloud and that there is someone else out there who knows a smattering of Shakespeare (which is all I know).

    As for the crack about meds (and that’s a pun in its own right) … my editor gets very, very nervous about my sense of humour … you vill comply or else! (I’ll let you guess the nationality I’m thinking about here).

    Ciao
    Neil
