Wait, you can’t plan for a disaster, can you? Disasters just happen when you least expect them, so all you can do is be ready to react and recover, right? While that is partially true, there are plenty of tasks you can do regularly, beyond daily backups, that will speed up your recovery. We can’t tell you everything you need to do for your specific situation, but following several of these steps will help you work out those specifics for yourself over time.
For most teams that have never written a disaster plan, the whole concept seems like a massive amount of work that no one has time for. That can very often be true, but breaking the goal of complete recovery down into smaller pieces is easy to do. It is unrealistic for any company to think that a team can complete a plan with every detail it needs in the first, second, or in most cases even the third year. The only time I have seen it done that quickly was around Y2K, and at that point every big company had a dedicated project team, and even then most had been working on it for years. So until you have made a first attempt at evaluating the systems that need to be recovered, do not allow yourself to set a goal date. As staff members and technical leaders on this type of project, we can tell you that this restraint will be appreciated by your team and will in turn make them more productive during your drills.
Getting Started
So how do you begin? IT teams cannot begin until their partners in the business put on paper the systems they use. The business partners then need to prioritize those systems by criticality. Finally, they need to identify how long they can survive without each system while relying on manual methods such as faxing and postal mail instead of e-mail. This last step seems hard to calculate, and in certain situations the answer may be forever; in other cases it may be as little as 24 hours. For instance, Gmail being down for more than a few hours might cost Google millions of dollars in lost revenue and public goodwill. The goodwill problem would probably result in an ad campaign designed to remind people how awesome Google is and that you should trust them with even more of your data. At the opposite end of the spectrum, the servers used for regression testing the latest build of Android might be able to stay down for a week at little or no cost. Arming your team with that list lets them focus and prioritize where they need to begin, and it starts to set the boundaries on how much to spend on disaster recovery or business continuity plans.
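To make this concrete, here is a minimal sketch in Python of the kind of inventory the business might hand back. The system names, rankings, and downtime numbers are invented for illustration; substitute whatever your business partners actually tell you.

```python
# A minimal sketch of the list the business needs to produce.
# All systems, rankings, and downtime tolerances below are hypothetical.
from dataclasses import dataclass

@dataclass
class BusinessSystem:
    name: str
    criticality: int          # 1 = most critical
    max_downtime_hours: int   # how long the business can survive without it
    manual_workaround: str    # fax, postal mail, "none", etc.

inventory = [
    BusinessSystem("E-Mail", 1, 24, "fax and postal mail"),
    BusinessSystem("E-Commerce site", 1, 4, "none"),
    BusinessSystem("Regression-test servers", 3, 168, "defer testing"),
]

# Hand IT a list ordered by criticality, then by tolerance for downtime.
for system in sorted(inventory, key=lambda s: (s.criticality, s.max_downtime_hours)):
    print(f"{system.name}: recover within {system.max_downtime_hours}h "
          f"(workaround: {system.manual_workaround})")
```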
Let the Planning Begin
Now that the business has outlined its requirements, you can concentrate on yours. The next step is to develop a list of dependencies for each item on the list. Depending on how large your list is, it may also need to be approached in phases. Once a phase and a list have been identified, you can start to talk through the requirements for each of those parts, and keep repeating the cycle until you have reached the OS and hardware (virtual or physical) level at the bottom of each path. Sounds like a lot, doesn’t it? It is really much less than it appears to be. Why? Each area of expertise, center of excellence, or however you break down your teams, really does know these things like the back of their hands. For instance, most of the people on my team can tell you exactly what version of the OS, Java application server, monitoring tools, and management tools are on each of our systems, all from memory. The thing that is always most impressive about good technical teams is how many of them can do this for every type of system they support. Also make sure to note which human resources are needed to complete each task. There is nothing worse than realizing you are the only person who can recover a key piece of a system, because that means you can never take a day off.
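A dependency list does not need fancy tooling to start with; something as simple as the hypothetical map below, captured in Python (or a wiki page, or a spreadsheet), is enough to walk each system down to the OS and hardware level. The components named here are examples only.

```python
# A hypothetical dependency map for one business system, worked down to the
# OS and hardware (virtual or physical) level. Components are examples only;
# each team fills in what they already know from memory.
dependencies = {
    "E-Mail": ["mail application", "directory service"],
    "mail application": ["Java application server", "monitoring agent"],
    "Java application server": ["OS image"],
    "directory service": ["OS image"],
    "monitoring agent": ["OS image"],
    "OS image": ["virtual or physical hardware"],
    "virtual or physical hardware": [],
}

# People are a dependency too: record who can actually perform each recovery.
recovery_owners = {
    "mail application": ["primary admin", "backup admin"],  # never just one name
}
```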
Let the Reviews Begin
Before we continue, let’s take a minute to remind the managers and leaders reading this of a few key points about the whole process. Teams we have worked on in the past very often feel like they are missing documentation, which leads to fears about missing steps or dependencies. You will win big points with them by positively reinforcing how much better that documentation will get as this process moves forward. Also reinforce that failure is a positive thing in this exercise. When you fail during a planned session or a test of the plan, it is an opportunity to grow, to learn tricks you can use on a daily basis, and to make the company stronger. For instance, once you can restore a system in less than your original estimate, that can lead to finding ways to make the recovery even faster. One team I worked with started off taking five days to completely recover a particular system. By the third recovery drill we had reduced that time to three days. By the fifth drill, we had it down to twenty-four hours. At the same time, our documentation and scripting got better, cleaner, and easier to read with each drill. The end result was that anyone, not just the “subject matter expert” for the system, could recover it in the twenty-four-hour time frame.
Once you know the dependencies you have, you can compile the list of requirements you need to recover the systems, and the order in which they need to be recovered. The next step I have normally taken is to work through what needs to be done for each dependency and how long it is believed to take. Estimates are all you have to go on at this stage; close counts, and yes, close will be good enough. Try to be as realistic as possible. Each IT area should then walk through its plans as a larger group to look for what might be missing at this stage. This is generally the first of what will be many reviews of the plan you are creating.
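If you have captured the dependencies in a structured form, turning them into a recovery order and a rough serial time estimate can be automated. The sketch below uses Python’s standard-library graphlib; the systems and hour figures are placeholders, which is appropriate, since estimates are all you have at this point.

```python
# Sketch: derive a recovery order and a rough serial time estimate from the
# dependency map. The hours are guesses, which is all you have at this stage.
from graphlib import TopologicalSorter

# "system": {things it depends on}
depends_on = {
    "E-Mail": {"Java application server"},
    "Java application server": {"OS image"},
    "OS image": {"virtual or physical hardware"},
    "virtual or physical hardware": set(),
}

estimated_hours = {
    "virtual or physical hardware": 8,
    "OS image": 4,
    "Java application server": 3,
    "E-Mail": 6,
}

# TopologicalSorter yields dependencies before the things that need them,
# which is exactly the order you have to recover in.
recovery_order = list(TopologicalSorter(depends_on).static_order())
print("Recover in this order:", recovery_order)
print("Worst-case serial estimate:",
      sum(estimated_hours[s] for s in recovery_order), "hours")
```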
The next step is to get representatives from each IT area in your organization together and do a walk-through of the combined plans. In this review, you are looking for conflicts and, again, anything that might have been missed. Until you have done an actual simulation of the recovery plan, you won’t know that you have everything. Again, managers and leaders: take a minute to reinforce that finding gaps is expected; this helps manage the stress of the exercise on your staff. Repeat this step until you stop finding gaps and oversights in the documentation. A common stumbling block at this point is an obsession with how long everyone estimates a step will take. We have spent more than a few hours either defending our own estimates or arguing against someone else’s. In almost every case the end result was that both sides were wrong, and it really did not matter what we had estimated, because some outside force was there to mess us up. His name is Murphy, and his law is amazingly accurate.
You should now be able to start estimating the cost of your recovery plans. There are a large number of companies ready to help you with this type of solution, and at this point you are finally ready to talk to them. While each of them will be slightly different, they will all be asking for things like, but not limited to, network needs and configurations, server counts, and how long you would stay at their facility while you recover. Each of these facets will affect your price. The closer you push toward restoring your business to a pre-disaster state, the higher the bill will be. To control costs, you may want to consider hosted solutions like Amazon’s EC2 or Google Apps. While you may want more control over your data and user experience on a normal day, in a disaster the rules tend to loosen. If the costs get too large, you may want to consider buying additional hardware and converting a space in a remote office. In most organizations you will be presenting these costs to upper management, so showing that you have investigated these options, even the ones you recommend against, will help you sell your plan and obtain the money it will take to implement it.
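When you compare the options, even a back-of-the-envelope model like the one below helps frame the conversation with upper management. Every number here is a placeholder; real quotes from recovery-site vendors and cloud providers will vary widely, and the point is only to show the shape of the comparison.

```python
# A hedged back-of-the-envelope comparison of recovery options.
# All setup and monthly figures are placeholders, not real quotes.
def yearly_cost(setup, monthly, months=12):
    return setup + monthly * months

options = {
    "commercial recovery site": yearly_cost(setup=50_000, monthly=8_000),
    "hosted/cloud standby": yearly_cost(setup=5_000, monthly=2_500),
    "hardware in a remote office": yearly_cost(setup=120_000, monthly=1_000),
}

for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name}: ~${cost:,} per year")
```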
At this point you should have a rough idea of how long it will take you to recover from a disaster, you should be fairly sure how much it will cost, and hopefully your management has signed off on it. Nothing is left to do now but give the plan a real test. The first thing everyone will want to do falls into one of two camps: 1) attempt to bring back each system independently, or 2) try to bring everything back at once. The first time you do an actual recovery, attempting everything at once is the surest way to waste time, money, and resources. The most successful processes we have participated in were evolutionary, not revolutionary. Refer back to what your business partners told you were the most critical systems and try to choose one that will exercise each IT area. For instance, recovering the company’s e-mail system, website (but not the e-commerce site), and everyone’s home directories might be a good first test. This assumes you have enough people to do the business-as-usual work as well as the test. You and your team will be the best judges of this, and listening at this point, again for the managers and leaders reading this, is critical to your success. Keep in mind that you really want to see at least a few things fail. Even massive failures are extremely useful. Getting the adrenaline pumping, fighting fires, and fixing issues can be galvanizing to a team. Don’t ever underestimate the power of “we made it through this together” and what that can do for morale and communication within a team, a department, or an entire organization.
Once you know what you are going to attempt, and how, there is really only one thing left to do: give it a try, of course. Once you have completed the drill or test of the plan, you will want to do a multi-phase review of the whole exercise. Individual teams should be expected to review their own test results, and those results should then be aggregated so everyone can offer suggestions and share what worked for their team.
Most companies try to do a drill at least once a year. The more you practice, the less effort each drill takes, but drills still consume time and resources, and very often project deliverables will slip to make room for the test. If the systems are critical enough, for instance the mainframe that runs 99% of the back-end processing for your e-commerce application, you may need or want to attempt a recovery several times a year. How often your company repeats the process is really up to your organization and is determined by how much you are trying to protect. There are no hard and fast rules.
In conclusion, remember that planning is the first step in any successful recovery effort. Do not allow yourself to plan for only one type of failure. The smoking hole that used to be your data center is always the first scenario that comes to mind, but the reality is that you are far more likely to lose something like power, network equipment, or your SAN or NAS. So be realistic in every step of this process, take all suggestions seriously, and do your best not to overspend on the insurance policy that is your disaster plan.