Please make us accountable for security.

This post is a direct plea to C-level managers and any other managers who control what IT people do.  We at LinuxInstall.net are tired of reading about companies like yours getting hacked.  We feel comfortable saying this because, while not every industry has seen a break-in yet, it's only a matter of time before they all have one to confess to.  So what can you do?

First, hold everyone accountable for their part of your IT security, even if they don't work in that department.  By this we mean that your developers should be held accountable for writing secure code.  Your administrators should be expected to follow strict hardening standards.  Everyone else in the company should use strong passwords and be smart about what they click on while browsing the web.

So how do you verify that this is all being done?  Start with an audit of everyone's processes, performed by outside security experts.  They will be able to accurately evaluate whether your standards are both secure enough and actually being enforced.  For internally developed websites, the cost of outside code reviews should save you from having to spend the money on fraud protection for all of your customers or users after a breach.  Read the audits and ask both the auditors and your internal staff enough questions to let them know you really did read them and want more detail or clarification.  It doesn't make you look stupid, and you will earn more respect this way.  This process will also very likely end with a request for more staff, so be prepared.

For everyone, including IT, mandatory training on good web browsing habits and passwords is a must for every company.  Regular checks by your security team for weak passwords inside your company will help scare people straight about the policy.
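
As a minimal sketch of what such a check might look like (the common-password list and length threshold below are invented examples, and real audits generally test password hashes with dedicated cracking tools rather than plaintext):

    # Minimal sketch of a password-policy check; the wordlist and minimum
    # length are invented examples for illustration only.
    COMMON_PASSWORDS = {"password", "123456", "qwerty", "letmein", "welcome1"}

    def is_weak(password: str, min_length: int = 12) -> bool:
        """Flag passwords that are too short or appear on a common-password list."""
        return len(password) < min_length or password.lower() in COMMON_PASSWORDS

    for candidate in ["welcome1", "correct horse battery staple"]:
        print(candidate, "is weak" if is_weak(candidate) else "looks ok")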

Finally, listen to your staff.  You pay them to be the experts on this, so let them be experts.  They are the ones who should be focused on reading about the latest trends and maintaining their skills.  To keep their skills up they need to go to both local and remote conferences on coding, security, and related topics, and that means you need to spend money on it.  The volume of knowledge across the entire security world is now beyond any one person, so your teams should include both generalists and specialists.  This will let them learn the basics and then focus on their area of interest.  When you get the request to spend money, review and approve it on the condition that they cross-train their teammates.

Remember, security is everyone's concern.  If the company is breached badly enough, going out of business is a real possibility.  With everyone working together and working securely, you can take a large step towards securing the net.

Automation – Can too much of a good thing be bad?

Senior systems administrators on any platform know that automation is the single fastest way to improve the effectiveness of their team.  Scripts provide stability and repeatability, and they reduce the time spent on frequently repeated tasks.  Done correctly, automation makes everything more stable and manageable.

However, scripts for managing systems can be a double-edged sword.  On one hand, they make a team highly efficient.  They can help junior admins perform far above their experience level and free senior admins to investigate more difficult problems.  On the other hand, they can lead to a loss of knowledge: the knowledge it took to create the scripts becomes locked inside them.  So what do you do to strike the proper balance?  How can you keep the knowledge fresh in everyone's mind while still automating?  What steps can be taken to avoid knowledge erosion and, worse, the brain drain or vacuum that is left when people leave?

The first thing to remember is that there is no single answer to these questions.  Here we will provide some tips and ideas we have found to be useful and effective.  This is a short list, and we hope it will inspire you to think about what might work for you and your company.

The first item is well-documented scripts and procedures.  Taking five minutes to write up what you were thinking when you wrote a script can save you days trying to figure it out later.  As object-oriented scripting languages like Python, Ruby, and Perl take hold, it becomes easier to break down complex scripts into smaller, more digestible chunks.  These smaller chunks, like the core ideas behind Linux, should do one thing and do it well.  The names of the functions should describe what they do; for instance, a function called createNewSSHKeys should probably create new SSH keys.  This, combined with an explanation inside the function of what you were trying to do, will help you and others maintain them.  When you get really good at this way of thinking, people should be able to take your function calls and write a manual procedure that could replace your automation.  If that is your goal, then it makes sense to start with a well-documented procedure to compare against when you're done scripting.  It is unlikely that every procedure step will match a function or series of function calls, but getting close still counts.
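
To make this concrete, here is a minimal sketch of such a function (the name echoes the example above; the key type, paths, and comment string are our own assumptions rather than anything from a real environment):

    import subprocess
    from pathlib import Path

    def create_new_ssh_keys(key_dir: Path, key_name: str = "id_ed25519") -> Path:
        """Generate a new SSH key pair in key_dir and return the private key path.

        Intent: this function only generates the pair; distributing the new
        key to servers is handled by a separate, equally small function.
        """
        key_dir.mkdir(parents=True, exist_ok=True)
        key_path = key_dir / key_name
        # ssh-keygen: ed25519 key, empty passphrase, comment marks it as automated.
        subprocess.run(
            ["ssh-keygen", "-t", "ed25519", "-N", "", "-C", "automation-rotated",
             "-f", str(key_path)],
            check=True,
        )
        return key_path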

As much as self-documenting scripts help, documenting the configuration files for your scripts also keeps things fresh in people's minds.  At the very least, if done correctly, it gives them a breadcrumb trail to follow to check that what they think is being set really is set.  We recently began testing Puppet, an automated way to manage server configuration files and other admin-related tasks, and its configuration files are a great example.  They let you use a combination of intelligent names and comments to tell the person reading the file what will be changed, and they can include a description of where to look to verify that the changes are being applied correctly.  This means I don't need to know Ruby, the language Puppet is written in, to figure out how it works or what it is going to do; the configuration file itself tells me everything I need to know.  When you write your own script, the time it takes to go this far may not be warranted.  So, at the very least, make sure you have comments that tell people what the settings in the file mean and where to look for the output they produce.
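
Puppet aside, the same idea works for a home-grown script.  As a rough illustration (the file name, settings, and paths are invented for the example), a commented configuration file read by a small Python wrapper might look like this:

    import configparser

    # Example contents of backup.ini -- the comments tell the reader what each
    # setting changes and where to look to verify the result.
    EXAMPLE_CONFIG = """
    [backup]
    # Directory the nightly job writes its tarballs to; check here first
    # if you think a backup did not run.
    target_dir = /srv/backups/nightly

    # Days of history to keep; older archives are removed by the same job.
    retention_days = 14
    """

    config = configparser.ConfigParser()
    config.read_string(EXAMPLE_CONFIG)

    target_dir = config.get("backup", "target_dir")
    retention_days = config.getint("backup", "retention_days")
    print(f"Backups land in {target_dir} and are kept for {retention_days} days")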

Try to keep everyone's skills sharp so they are ready to slice through problems as they arise.  This also means internal training.  One of the things we have participated in on a regular basis is a short, one-hour refresher put on by the subject matter expert (SME) for each of the technologies we use.  Doing this accomplishes a few different things at once.  It helps the SME keep their documentation current.  It gives the SME an opportunity to share changes they want to make, or have already made, in the environment.  And it gives everyone supporting the environment a chance to ask questions about the technology when there is no pressure.  Where possible, annual reviews of each area a team supports go a long way towards making the team as productive as possible.

While you can never completely prevent brain drain when a team member leaves, the steps above, if done correctly, can go a long way.  Having been the person transitioned to more than once, we can say that the better these steps are followed, the better we have felt about taking on the responsibility.  Another side effect of these approaches, and others along the same lines, is that they allow people to migrate from one SME area to another.  This helps people stay fresh and keeps them from becoming bored and complacent.  The more driven your team is to solve business problems, the more profitable you will be.

Planning Your Next Disaster…

Wait, you can't plan for a disaster, can you?  They just happen when you least expect them.  You just need to be ready to react and recover, right?  While that is partially true, there are plenty of tasks you can do regularly, beyond daily backups, that will speed up your recovery time.  We can't tell you everything you need to do for your specific situation, but following several of these steps will help you answer those questions for yourself over time.

For most teams that have never written a disaster plan, the whole concept seems like a massive amount of work that no one has time for.  While that is often true, breaking the goal of complete recovery down into smaller pieces is easy to do.  It is unrealistic for any company to think that a team can complete a plan with all of the details it needs in the first, second, or in most cases even the third year.  The only time I have seen it done that quickly was around Y2K, and at that point every big company had a dedicated project team; even then, most had been working on it for years.  So until you have made a first attempt at evaluating the systems that need to be recovered, do not allow yourself to set a goal date.  As staff members and technical leaders on this type of project, we can tell you that this will be appreciated by your team and will in turn make them more productive during your drills.

Getting Started

So how do you begin?  Well, IT teams cannot begin until their partners in the business put on paper the systems they use.  The business partners then need to prioritize those systems by criticality.  Finally, they need to identify how long they can survive without each system, perhaps by using manual methods like faxing and postal mail instead of e-mail.  This last step seems hard to calculate; in certain situations the answer may be forever, while in other cases it may be as little as 24 hours.  For instance, Gmail being down for more than a few hours might cost Google millions of dollars in lost revenue and public goodwill, and the goodwill issue would probably result in an ad campaign designed to remind people how awesome they are and that you should trust them with more of your data.  At the opposite end of the spectrum, the servers for regression testing the latest build of Android might be able to be down for a week with little or no cost.  Arming your team with that list allows them to focus and prioritize where they need to begin, and it starts to set the boundaries on how much to spend on disaster recovery or business continuity plans.
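
As a rough sketch of what that list might look like once it is written down (the systems, criticality rankings, and downtime numbers are all invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class BusinessSystem:
        name: str
        criticality: int           # 1 = most critical, as ranked by the business
        max_downtime_hours: float  # how long the business says it can survive without it
        manual_workaround: str     # fallback while the system is down, if any

    # Invented examples -- the real list has to come from the business, not IT.
    systems = [
        BusinessSystem("E-Mail", criticality=1, max_downtime_hours=24,
                       manual_workaround="fax and postal mail"),
        BusinessSystem("E-Commerce site", criticality=1, max_downtime_hours=4,
                       manual_workaround="phone orders"),
        BusinessSystem("Regression test servers", criticality=3,
                       max_downtime_hours=168, manual_workaround="none needed"),
    ]

    # Recover the most critical, least downtime-tolerant systems first.
    for s in sorted(systems, key=lambda s: (s.criticality, s.max_downtime_hours)):
        print(f"{s.name}: recover within {s.max_downtime_hours}h "
              f"(workaround: {s.manual_workaround})")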

Let the Planning Begin

Now that the business has outlined its requirements, you can concentrate on yours.  The next step is to develop a list of dependencies for each of the items on the list.  Depending on how large your list is, it may also need to be approached in phases.  Once a phase and a list have been identified, you can start to talk through the requirements for each of those parts, repeating the cycle until you have reached the OS and hardware (virtual or physical) level on each path.  Sounds like a lot, doesn't it?  It's really a lot less than it appears to be.  Why?  Each area of expertise, center of excellence, or however you break down your teams, really knows these things like the back of their hands.  For instance, most of the people on my team can tell you exactly which version of the OS, Java application server, monitoring tools, and management tools are on each of our systems, all from memory.  The thing that is always most impressive about good technical teams is how many of them can do this across the many different types of systems they support.  Also make sure to note which human resources are needed to complete the tasks.  There is nothing worse than realizing you are the only person who can recover a key piece of a system, because that means you can never take a day off.
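
Here is a minimal sketch of how that dependency walk can be captured and turned into a recovery order, assuming Python 3.9+ for the standard-library graphlib module (the systems and dependencies are invented examples):

    from graphlib import TopologicalSorter

    # Each system maps to the things it depends on; stop when you hit
    # hardware/OS-level building blocks.  All names are invented examples.
    dependencies = {
        "e-mail": {"directory service", "storage array"},
        "directory service": {"virtualization hosts"},
        "web site": {"database", "virtualization hosts"},
        "database": {"storage array", "virtualization hosts"},
        "storage array": set(),
        "virtualization hosts": set(),
    }

    # TopologicalSorter yields an item only after everything it depends on,
    # which is exactly the order you have to recover things in.
    recovery_order = list(TopologicalSorter(dependencies).static_order())
    print("Recover in this order:", " -> ".join(recovery_order))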

Let the Reviews Begin

Before we continue, let's take a minute to remind the managers and leaders reading this of a few key points to remember in the whole process.  Teams we have worked on in the past very often feel like they are missing documentation, which leads to fears about missing steps or dependencies.  You will win big points with them by positively reinforcing how much better that documentation will become as this process moves forward.  Also reinforce that failure is a positive thing in this exercise.  When you fail during a planned session or a test of the plan, it is an opportunity to grow; you will learn tricks you can use even on a daily basis, and you will make the company stronger.  For instance, once you can restore a system in less than your original estimate, that can lead to finding ways to make the recovery even faster.  One team I worked with started off taking five days to completely recover a particular system.  By the third recovery drill we had reduced that time to three days, and by the fifth drill we had it down to twenty-four hours.  At the same time, our documentation and scripting got better, cleaner, and easier to read with each drill.  The end result was that anyone, not just the "Subject Matter Expert" for the system, could recover it in the twenty-four-hour time frame.

Once you know what dependencies you have, you can compile the list of requirements needed to recover the systems and the order in which they need to be recovered.  The next step I have normally taken is to work through what needs to be done for each dependency and how long it is believed to take.  Estimates are all you have to go on at this stage, and yes, close will be good enough; just try to be as realistic as possible.  Each IT area should then walk through its plans as a larger group to look for what might be missing at this stage.  This is generally the first of what will be many reviews of the plan you are building.

The next step is to get representatives from each IT area in your organization together and do a walk-through of the combined plans.  With this review, you are looking for conflicts and, again, anything that might have been missed.  Until you have done an actual simulation of a recovery, you won't know that you have everything.  Again, to managers and leaders: take a minute to reinforce that finding gaps is expected, to help manage the stress of the exercise on your staff.  Repeat this step until you stop finding gaps and oversights in the documentation.  A common stumbling block at this point is an obsession with the time everyone estimates a step will take.  We have spent more than a few hours either defending or arguing against someone else's estimates.  The end result in almost every case was that both sides were wrong, and it really did not matter what we had estimated because some outside force was there to mess us up.  His name is Murphy, and his law is amazingly accurate.

You should now be able to start estimating the cost of your recovery plans.  There are a large number of companies ready to help you with this type of solution, and at this point you are finally ready to talk to them.  While each of them will be slightly different, they will all be looking for things like, but not limited to, network needs and configurations, server counts, and how long you would stay at their facility until you recover.  Each of these facets will affect your price: the closer you push towards restoring your business to its pre-disaster state, the higher the bill will be.  To control costs you may want to consider hosted solutions like Amazon's EC2 or Google Apps.  While you may want more control over your data and user experience on a regular basis, in a disaster the rules tend to loosen.  If the costs get too large, you may want to consider buying additional hardware and converting a space in a remote office.  In most organizations you will be presenting these costs to upper management, so showing that you have investigated these options, even when you recommend against them, will help you sell your plan and obtain the money it will take to implement it.
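
As a toy illustration of how those facets drive the price, with every rate and number below invented, a back-of-the-envelope model might look like this:

    def recovery_site_cost(servers: int, days_at_site: int,
                           server_day_rate: float = 40.0,
                           network_setup_fee: float = 5_000.0) -> float:
        """Rough cost of running a recovery site: a per-server daily rate plus
        a one-time network configuration fee.  All rates are invented."""
        return servers * days_at_site * server_day_rate + network_setup_fee

    # Restoring closer to the pre-disaster state (more servers, longer stay) costs more.
    print(recovery_site_cost(servers=20, days_at_site=14))   # partial recovery
    print(recovery_site_cost(servers=80, days_at_site=30))   # near-full recovery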

At this point you should have a rough idea of how long it will take you to recover from a disaster, you should be pretty sure how much it will cost, and hopefully your management has signed off on it.  So there is nothing left to do now but give the plan a real test.  The first thing everyone will want to do falls into one of two camps: 1) attempt to bring back each system independently, or 2) try to bring everything back at once.  The first time you do an actual recovery, attempting everything at once is the surest way to waste time, money, and resources.  The most successful processes we have participated in were evolutionary, not revolutionary.  Refer back to what your business partners told you were the most critical systems and try to choose ones that will exercise each IT area.  For instance, recovering the company's e-mail system, the website (but not the e-commerce site), and everyone's home directories might be a good first test.  This assumes that you have enough people to handle the business-as-usual work as well as the test.  You and your team will be the best judges of this, and listening at this point, again for the managers and leaders reading this, is critical to your success.  Keep in mind that you really want to see at least a few things fail; even massive failures are extremely useful.  Getting the adrenaline pumping, fighting fires, and fixing issues can be galvanizing to a team.  Don't ever underestimate the power of "we made it through this together" and what that can do for morale and communications within a team, a department, or an entire organization.

Once you know what you are going to attempt, and how, there is really only one thing left to do: give it a try.  Once you have completed the drill or test of the plan, you will want to do a multi-phase review of the whole drill.  Individual teams should review their own test results, and those results should then be aggregated so everyone can offer suggestions and share what worked for their team.

Most companies try to do a drill at least once a year.  The more practice you get, the less effort each drill takes, but drills still consume time and resources, and very often project deliverables will slip to make room for the test.  If the systems are critical enough, for instance the mainframe that runs 99% of the back-end processing for your e-commerce application, you may need or want to try a recovery multiple times a year.  How often your company repeats the process is really up to your organization, and it depends on how critical the systems are and how much you are planning to recover.  There are no hard and fast rules.

In conclusion, remember that planning is the first step of any successful recovery effort.  Do not allow yourself to plan for only one type of failure.  The smoking hole that was your data center is always the first scenario that comes to mind, but the reality is that you are far more likely to lose something like power, network equipment, or your SAN or NAS.  So be realistic at every step of this process.  Take all suggestions seriously, and do your best not to overspend on the insurance policy that is your disaster plan.