Please make us accountable for security.

This post is a direct plea to any C-level managers, or other managers, who control what IT people do.  We at LinuxInstall.net are tired of reading about companies like yours getting hacked.  We feel comfortable saying this because, while not every industry has seen a break-in yet, it's only a matter of time before they all have one to confess.  So what can you do?

First, you should hold everyone accountable for their part of your IT security, even if they don't work for that department.  By this we mean that your developers should be held accountable for writing secure code.  Your administrators should be expected to follow strict hardening standards.  Everyone else in the company should use strong passwords and be smart about what they click on while browsing the web.

So how do you verify that this is all being done?  Start with an audit of everyone's processes, performed by outside security experts.  They will be able to evaluate whether your standards are both secure enough and actually being enforced.  For internally developed websites, the cost of outside code reviews should save you from having to spend money on fraud protection for all of your customers or users.  Read the audits, and ask both the auditors and your internal staff enough questions to let them know you really did look at the results and want more detail or clarification.  It doesn't make you look stupid, and you will earn more respect this way.  This process will also very likely end with a request for more staff, so be prepared.

For everyone, including IT, mandatory training on good web-surfing habits and passwords is a must at every company.  Regular checks by your security team for weak passwords inside your company will help scare people straight about the policy.

Finally, listen to your staff.  You pay them to be the experts on this, so let them be experts.  They are the ones who should be focused on reading about the latest trends and maintaining their skills.  To keep their skills up they need to go to both local and remote conferences on coding, security, and related topics.  That means you need to spend money on it.  The volume of knowledge across the entire security world is now beyond any one person, so your teams should be both generalists and specialists: let them learn the basics and then focus on their area of interest.  When you get the request to spend money, review it and approve it on the condition that they cross-train their teammates.

Remember, security is everyone's concern.  If the company is breached badly enough, going out of business is a real possibility.  With everyone working together, and working securely, you can take a large step towards securing the net.

Should the Baracus Project be a member of your A-Team?

As stated in the Puppet article, we have been investigating alternative solutions to the closed source server build and management options on the market.  Novell has been sponsoring a project called Baracus.  Baracus is an open source project that is trying to become the next-generation system for booting, building, and managing the power of systems.  It seemed like something we should check out, so we did.  The project was announced to the public on November 19, 2010.  How good or bad could it really be?

For testing the system out, we chose to download the project's SUSE Studio-created VM.  Being a Novell project, it's based on openSUSE 11.2, with all of the setup steps in the documentation already done for you.  A few normal admin tasks to change passwords, set up my account and do the other usual system housekeeping got the system up.  To make everything work, there are a few other things you will need to do, like set up a DHCP server.  While not required, a DNS server will also make life easier, so keep that in mind when starting out.  With almost any distribution you can set both of these up pretty easily.  I suggest installing Webmin if you have never done it before; it will help you get them up and running, with clean configuration files at least.

The documentation from the project is already at a point where the setup instructions are amazingly complete.  I was able to log in to the web GUI without any issue and start roaming around.  The system is very well thought out.  The look, feel, and options included show that this system was built by system administrators for system administrators.  It's not the prettiest interface, but it works and things are grouped logically.  They give you the option of using the web GUI or the command line.  A few of the commands really have to be run from the command line at this point, but the project seems to be delivering improvements in a timely fashion.  There have been two updates since I downloaded it in November, both with noticeable improvements.

How does it work?
The system uses a three-step boot process.  The first boot interrogates the system hardware, builds a hardware profile, and uploads it to the Baracus server.  Once the profile is uploaded, or registered, you will see the server's MAC address in a list and be able to view which step of the process it is at.  The second boot will either bring the server to a halt, giving you time to choose the configuration options, or start building it if you already have that set up.  The third boot (fourth, if you paused to configure) sets the server to boot off either the local disk or a network boot location, depending on your configuration choice.  When we say "configuration options" here, we mean that you can set up almost anything you want to do to a server, from upgrades and patches to turning it into a net-booted system.

The first thing that impressed me about this system is that it's not just a SUSE build system.  As of the writing of this article you can build Debian, openSUSE, SUSE Linux Enterprise Desktop and Server, Fedora, Red Hat Enterprise Linux, Ubuntu, OpenSolaris, ESX 4.x, Windows 7 and Server 2008, and XenServer.  There are examples of silent-install configuration files available for most, if not all, of the systems listed.  Updating these files and adding them to the database used by Baracus is easy and took only a few minutes.

The virtual machines only come with openSUSE pre-installed, so I set off to figure out how to add Ubuntu.  It turns out it takes one command: "basource add --isos --distro ubuntu-10.10-x86_64".  That's it.  It goes out and downloads the ISO, puts it in the proper location, creates the needed mount points, and adds it to the database so that you can build servers from it.  If you want to do a silent install of a supported OS, all you have to do is make your modifications to the appropriately named file and issue another command to add that configuration to the database.  In just longer than the time it took the system to download the ISO, I was ready and testing my first build of Ubuntu 10.10 over a network connection with a silent install.  Having spent hours, and sometimes whole nights, in the past setting up systems to build and boot off the network, this was pretty impressive.  I have since set up Fedora and multiple openSUSE versions.

On our network we built an Ubuntu 10.10 VMware system in roughly 15 minutes.  We set up custom disk partitions, our users and groups, and additional software packages.  With a few more changes we had a script to update the repos and patch the system, and finally some scripts to automatically configure Puppet.  Now, in less than 20 minutes, we can take a raw VMware server and have it completely configured and up to date.  Having done all of this in the web GUI, I tried doing it from the command line.  It worked just as well and was actually a little faster.

It's still so green, so what's wrong with it?
Really, there are very few misses in the web GUI, documentation, and command line.  There are a few things we believe to be either documentation errata or bugs, but these did not show themselves until I tried to bend the system to what I wanted.  The problems with the web GUI are mostly that we would like to see better error messages in a few odd spots and more AJAX-like behavior.  Having the assigned machine names shown instead of MAC addresses would be really helpful, as would some other views of the systems.  The groups functionality seemed too hard to use and doesn't offer enough right now.  Most of the documentation seems complete, but the coverage of errors and what to do about them needs to be fleshed out.  Where we did have problems, though, it didn't take long to find and fix them.

Our Conclusion
Baracus is a great system that should become an amazing system with just a few cleanup and documentation fixes.  At this time we are not sure it's really ready for production use.  So we here at LinuxInstall.net say try it, but don't rely on it just yet.

Did we not answer your question?  Please ask it in the comments.

 

Managing systems like a Puppet Master instead of a Server Servant

Recently I began playing around with free and open source tools that could replace some of the closed source tools I have used to build and manage systems.  I specifically wanted to be able to build servers up to a base level so that I could test out tools like Hudson, several different source code management solutions, and different application server platforms.  I also run some systems on my home network that I find myself rebuilding often so I can be on the latest and greatest versions of my favorite distros.  In the past, this has meant building a new machine, installing and configuring the software, then swapping out the old machine for the new one.  That last step is almost always followed by several hours, if not days, of tweaking and trying to remember the settings I learned about last time that made thing 1 work better with thing 2.  I was able to improve efficiency at work with traditional vendor-model (closed source) software, but until recently I had not found the open source alternatives to be that compelling.  Considering that this is my hobby as well as my job, it has always just seemed faster and easier to set them up more or less by hand.  Now that is all about to change, and I am ready to be a Puppet Master.

Puppet, as described on the Puppet website, is:

“an open source data center automation and configuration management framework. Puppet provides system administrators with a simplified platform that allows for consistent, transparent, and flexible systems management.”

You may be wondering: why not use Launchpad, FAI, or Spacewalk?  I was looking for simple tools that do just what I require, and Puppet was also a much easier and faster setup.  The time it took me just to read the documentation when I was looking into those solutions was more than I have spent installing and configuring Puppet.  I am sure there are a few features I am missing, like the ability to install an OS instance, but I do not do that too often and was just looking to manage my systems with some tool for this round of changes.  I have started playing with Novell's open source Baracus (http://baracus-project.org/Site/Baracus.html) for doing OS installs and updates, but I will leave that discussion for an upcoming article.

Before you download and start playing with Puppet, I strongly suggest setting up a source code management (SCM) system like Git or Mercurial, assuming you don't already have one running.  I didn't know which one I wanted to use, so I downloaded the SCM appliance from the guys at Turnkey Linux.  There is nothing worse than accidentally deleting a configuration line you spent hours searching for to solve a major issue.  Except, that is, saving the errant file into your Puppet repository, logging off, and going home for the day expecting it to replicate while you sleep and solve a problem; instead, while you sleep the file replicates and services restart, putting you back where you began or making things even worse for the next morning.  If the file is backed up and/or version controlled, you can get back to where you thought you were when you left for home.

The last piece is a tool like etckeeper.  Etckeeper is a great little tool the folks at Puppet turned me on to.  I have made several attempts in the past to put my /etc/ directory under version control, but if you don't keep up with it, or forget to commit your changes before a software update or upgrade, you can easily lose your most current configurations.  Etckeeper has hooks into apt, yum, and pacman-g2 that allow it to use your favorite SCM tool (as long as it isn't svn or cvs) to check in any changed files under /etc before the package that contains them is installed or removed.  I have only been using it for a couple of weeks and have already tested its ability to correct my mistakes twice.
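
For anyone who wants to try the same setup, a minimal sketch of getting etckeeper going on a Debian/Ubuntu box with git might look like the following; package names and defaults vary by release, so treat it as a starting point rather than gospel:

# install etckeeper and a supported SCM (git here); package names may differ
sudo apt-get install git-core etckeeper
# put /etc under version control and take the first snapshot
sudo etckeeper init
sudo etckeeper commit "initial import of /etc"
# from here on, the apt hooks commit changes to /etc around package operations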

I use Ubuntu servers at home and SUSE servers at work.  I chose Ubuntu at home so I can keep up my ability to flip between Debian-based and RPM-based systems.  One of the things I like most about Debian systems is the apt-get/dpkg package management system.  With only a few commands I had Puppet installed on the server and a client ready to receive files.  Once installed, another fifteen to thirty minutes and I had them talking, with my first few files under Puppet management.  I have since set up a few RPM-based machines using YUM instead of plain RPM.  That gave me a very similar experience, and the Puppet software just worked, which shows the level of polish and unity the Puppet project brings to the things it considers complete.
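
To give a feel for how few commands that really is, here is a rough sketch of the Debian/Ubuntu side of it; the package names, the master's hostname, and the exact agent flags are assumptions that may differ with your Puppet version:

# on the Puppet master
sudo apt-get install puppetmaster
# on each client
sudo apt-get install puppet
# have the client contact the master (assumed here to be puppet.example.com)
# and wait for its certificate to be signed
sudo puppetd --server puppet.example.com --waitforcert 60 --test
# back on the master, sign the waiting certificate request
sudo puppetca --sign client1.example.com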

Having had such good luck with the initial setup, I decided to try setting up the Puppet Dashboard.  It was at this point that things came to a grinding halt.  The software installs well enough; it's figuring out how to configure it that seems to be impossible based on the documentation provided.  It's all written in Ruby, and while I could probably read the code and figure out what I am doing wrong, how many other people are going to do that?  In the corporate world, probably no one will attempt it during their day job.  The almost complete lack of documentation on this part of the tool should have been a warning to me.  Instead I spent several hours trying to figure out what I had done wrong. (At the time of this writing I still haven't.)  The good news is that this is a relatively new part of the package, and the Dashboard not running isn't a showstopper.  Configuring several additional servers took about three steps per server.  No rebooting, and only the services with related configurations needed to be restarted.  After a few more hours of work I had the most common files I update replicating across the network.  I then imported my DNS and DHCP related files and set them up to be managed with Puppet.  The great part of this addition is that when I change these files, Puppet pushes the updates and automatically restarts the appropriate services.
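
As an illustration of the kind of resource that makes those automatic restarts happen, here is a hedged, simplified sketch; the paths, file server mount point, and service name are placeholders, not the exact manifest in use here:

# append a file-plus-service pair to the site manifest on the master
cat >> /etc/puppet/manifests/site.pp <<'EOF'
file { "/etc/dhcp3/dhcpd.conf":
  source => "puppet:///files/dhcp/dhcpd.conf",   # copy this file from the master
  notify => Service["dhcp3-server"],             # restart the service when it changes
}
service { "dhcp3-server":
  ensure => running,
  enable => true,
}
EOF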

Puppet is extremely flexible about how it is configured and where you can put the related files, so I decided to place them into a "special" directory.  I then wrote scripts to help me remember the SCM commands and check the files in periodically.  This now gives me exactly what I have wanted for years: etckeeper puts copies of my files into the SCM in a directory for each server, Puppet then updates the files, and with each update etckeeper backs up the current file to the SCM.
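
The helper scripts are nothing fancy; something along these lines does the job (the directory is whatever "special" place you chose, and git is assumed as the SCM):

#!/bin/sh
# Hypothetical helper: snapshot the Puppet files into the SCM so a bad change
# can always be rolled back. Adjust the path to your own "special" directory.
cd /srv/puppet-files || exit 1
git add -A
git commit -m "puppet config snapshot $(date +%F_%H%M)"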

Here is a drawing of a basic network similar to what I have in my home network.

The Puppet software lets you design systems with a base configuration, then group servers that are similar in purpose and apply additional configurations to them.  So in the picture above, the dark blue lines represent the base configuration, and the red arrow between the Puppet server and the mail server represents the mail-server-specific settings.  If you have multiple servers that perform a basic function, like the DNS/DHCP servers above, you can even use templates to change the files, or a set of the files Puppet copies, based on server-specific configuration.  Puppet can then change the server-specific parts of the configuration you tell it to, on the server it's being applied to.  There are no real limits to what you can control and push out with Puppet, as long as it's in a file that goes in a standard place.  It could easily be used to manage things like web sites, but it probably isn't the best solution for replicating your file server's data, even though it could do that too.
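
In manifest form, the grouping in the drawing boils down to something like the hedged sketch below; the class and node names are invented for illustration, and node inheritance was the idiom in the Puppet versions tested here:

# append node definitions to the site manifest on the master
cat >> /etc/puppet/manifests/site.pp <<'EOF'
node basenode {
  include base            # the dark blue lines: settings every server gets
}
node 'mail.example.com' inherits basenode {
  include mailserver      # the red arrow: mail-server-specific settings
}
EOF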

As you set Puppet up, one of the most interesting steps is that you are required to create certificates between the client servers and the Puppet master server.  If managed properly, this gives you a relatively secure and reliable way to know that the server you are configuring is meant to be that type of server, with that type of configuration.  The documentation repeats in several places that while you can mis-configure the Puppet master to accept any computer's certificate on request, you should not.  If you ignore this advice, anyone could request a key and receive any server's configuration.  For example, I could stand up a copy of the company website, pull down its configuration, and then forge a whole new site.
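
In practice the day-to-day certificate handling is a couple of commands on the master; the command names below are from the Puppet versions current as this is written and may change in later releases:

# list the outstanding certificate signing requests
sudo puppetca --list
# sign only the host you actually expect to be adding
sudo puppetca --sign web01.example.com
# the "mis-configuration" the docs warn about is autosigning everything,
# i.e. putting a bare '*' in /etc/puppet/autosign.conf; do not do this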

Another nice feature is that client pulls are staggered, to keep every machine from trying to fetch the same file or files the moment you update them.  For most new admins this seems like overkill, and if you are on a gigabit network it probably is.  If, however, you are on a mixed network with servers both local and remote, you may not have the bandwidth for a remote site to pull all of the files associated with changes to every server at that location at once.  Once configured, Puppet will also track the progress of the changes.
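
If you are curious where the stagger comes from, it maps to agent settings in puppet.conf; the snippet below is a hedged sketch, and the section and option names may differ between Puppet releases:

# on each client, add a random delay so they do not all check in at once
cat >> /etc/puppet/puppet.conf <<'EOF'
[agent]
runinterval = 1800   # check with the master every 30 minutes
splay = true         # wait a random amount of time before each run
EOF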

If you are looking for an easy way to manage the configurations of 5 or 50,000 servers, this tool makes it simple.  The configuration of the tool itself is simple, the template language and format are great, and their future plans are only going to make it better.  All in all, we here at LinuxInstall.net have to give this a "Go Install."

Automation – Can too much of a good thing be bad?

Senior systems administrators on any platform know that automation is the single fastest way to improve the effectiveness of their team.  Scripts provide stability and repeatability, and reduce the time spent on often-repeated tasks.  If done correctly, automation will make everything more stable and manageable.

However, scripts for managing systems can be a double-edged sword.  On one hand, they make a team highly efficient.  They can help junior admins perform far above their experience level and free senior admins up to investigate more difficult problems.  On the other hand, they can lead to a loss of knowledge: the knowledge it took to create the scripts becomes locked inside them.  So what do you do to strike the proper balance?  How can you keep the knowledge fresh in everyone's mind while still automating?  What steps can be taken to avoid knowledge erosion and, worse, the brain drain or vacuum that is left when people leave?

The first thing to remember is that there is no one thing that can be done to answer these questions.  Here we will provide you with some tips and ideas we have found to be useful and effective.  This is a short list, and we hope that it will inspire you to think about what might work for you and your company.

The first item is well-documented scripts and procedures.  Taking five minutes to write up what you were thinking when you wrote a script can save you days trying to figure it out later.  As more object-oriented scripting languages like Python, Ruby, and Perl take hold, it becomes easier to break complex scripts down into much more digestible chunks.  These smaller chunks, like the core ideas behind Linux, should do one thing and do it well.  The names of the functions should describe what they do: a function called createNewSSHKeys, for instance, should probably create new SSH keys.  This, combined with an explanation inside the function of what you were trying to do, will help you and others maintain them.  When you get really good at this way of thinking, people should be able to take your function calls and write a manual procedure that could replace your automation.  If that is your goal, then it only makes sense to start with a well-documented procedure to compare against when you are done scripting.  It is unlikely that every procedure step will match a function or series of function calls, but getting everything close does count.
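
To make that concrete, here is a small, invented shell example of the style we mean: a single-purpose, descriptively named, commented function that a reader could turn straight back into a manual procedure.

#!/bin/sh
# createNewSSHKeys USER
# Generate a fresh 2048-bit RSA key pair for the given user, with no
# passphrase, so unattended jobs can use it. One task, named after what it does.
createNewSSHKeys() {
    user="$1"
    keydir="/home/${user}/.ssh"
    mkdir -p "$keydir"
    ssh-keygen -t rsa -b 2048 -N "" -f "${keydir}/id_rsa"
    chown -R "${user}:${user}" "$keydir"
    chmod 700 "$keydir"
}

createNewSSHKeys alice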

As much as self-documenting scripts help, documenting the configuration files for your scripts is what keeps things fresh in people's minds.  At the very least, if done correctly, it gives them a breadcrumb trail to follow to see whether what they think is being set really is set.  We recently began testing Puppet, an automated way to manage server configuration files and other admin-related tasks, and its configuration files are a great example.  They let you use a combination of intelligent names and comments to tell the person reading the file what will be changed, and they include a description of where to look to verify that the changes are being made correctly.  This means that I don't need to know Ruby, the language Puppet is written in, to figure out how it works or what it's going to do; the configuration file itself tells me everything I need to know.  When you write your own script, the time it takes to go that far may not be warranted.  So, at the very least, make sure you have comments that tell people where to look for the output of these configurations, or what the configurations in the file mean.
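
Even for a home-grown script, a configuration file written in this spirit can be as simple as a few well-named, well-commented variables; the example below is entirely made up:

# nightly-backup.conf: read by the (hypothetical) nightly-backup script
# Directories the nightly job archives.
BACKUP_DIRS="/etc /home /var/www"
# Where the archives land; look here first if a backup seems to be missing.
BACKUP_TARGET="/srv/backups"
# Days of archives to keep before the cleanup step deletes them.
RETENTION_DAYS=14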

Try to keep everyone's skills sharp so they are ready to slice through problems as they arise.  This also means internal training.  One of the things we have participated in on a regular basis is a short, one-hour refresher put on by the subject matter expert (SME) for each of the technologies we use.  Doing this accomplishes a few different things at once.  It helps the SME keep their documentation current.  It gives the SME an opportunity to share changes they want to make, or have made, in the environment.  And it gives everyone supporting the environment a chance to ask questions about the technology when there is no pressure.  When possible, an annual review of each area that a team supports goes a long way towards elevating the team's ability to be as productive as possible.

While you can never completely prevent brain drain when a team member leaves, the steps above, if done correctly, can go a long way.  Having been the person transitioned to more than once, the better these steps were followed, the better we have felt about taking on the responsibility.  Another side effect of these approaches, and others along the same line of thought, is that they allow people to migrate from one SME area to another.  This helps people stay fresh and keeps them from becoming bored and complacent.  The more driven your team is to solve business problems, the more profitable you will be.

Linux Security: A CTO's Guide

As a CTO, manager, or technical lead, what questions should you be asking when it comes to securing a Linux server?  Are Linux servers really as secure as everyone says?  What should be the focus of your team when securing servers for your company?  In this CTO Brief, we are going to try and answer some of these questions and possibly a little more.

Are Linux machines really more secure than other servers?

The answer to that question is: it depends.  No computer can ever be made completely secure.  Sometimes, no matter how secure you make a computer, an inexperienced employee may hand over information without even realizing what they are doing.  But that is a discussion for another Brief, so let's get back to the topic at hand.  Out of the box, Windows machines used to be far less secure than they are today, but no matter what, there has always been a need to secure them.  The issue was that you had to be a Windows guru to get it done and still have a stable machine.  Microsoft has worked hard to change this, but Linux started out with a sizable lead.  One of the things that Linux inherited from other forms of Unix is its powerful and mature security model.  That said, if you ignore the basics of security, like weak passwords and opening up insecure services, a Linux machine can easily become a very insecure machine.  It is easy to avoid most of the pitfalls if your team takes its time and thinks through the issues you have to solve.

What questions should you be asking when it comes to securing a Linux server?

Your first question should be: what is each of your Linux servers going to be used for?  Until you know and define what a particular machine is going to do, you cannot determine what needs to be running and which risks are justifiable.  There are no hard and fast rules about which services should go together with other services; each environment and situation has its own unique challenges.  For the rest of the examples here, we will focus on a web server on the Internet that also serves as a backup DNS server.  The only service ports on this mythical machine that should be visible to the Internet are ports 80 and 443 for the web, 53 for the DNS service, and 22 so that you can log in and manage the server.  All other services or daemons that expose or listen on an IP address reachable by people not on the machine should be shut off, or configured to listen only on 127.0.0.1, the server's local loopback (internal) address.
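
A hedged starting point for enforcing that on the example machine: first see what is actually listening, then let the firewall drop everything you did not intend to expose.  Interface names and the exact rules will depend on your environment, so treat this as a sketch:

# anything listening on an address other than 127.0.0.1 that is not
# port 22, 53, 80 or 443 deserves a hard look
netstat -tulpn
# minimal iptables policy for the web-plus-backup-DNS example
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j ACCEPT
iptables -A INPUT -p tcp --dport 53 -j ACCEPT
iptables -A INPUT -p udp --dport 53 -j ACCEPT
iptables -P INPUT DROP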

What if my team uses a web-based administrative tool, or any other type of remote administration tool, and we cannot live without it?  What should we do?

If your team is using a web-based solution to manage servers more effectively, the web server that the service runs on should never be visible from the outside world.  How can it still be used, then?  With SSH enabled, it is easy to map a local port on the admin's workstation to the remote port where the web-based administration software is running.  Once that mapping is established, the admin just points a web browser at the port on their workstation, and all requests are forwarded over a secured SSH connection.  For this to be effective, though, the management software must listen only on localhost (127.0.0.1), the machine's internal address.
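
For example, assuming the tool is something like Webmin listening only on 127.0.0.1:10000 on the server, the mapping is a single SSH command from the admin's workstation; the host name and port here are placeholders:

# forward local port 10000 to port 10000 on the server's loopback interface
ssh -L 10000:127.0.0.1:10000 admin@server.example.com
# then browse to https://localhost:10000 on the workstation;
# every request travels inside the encrypted SSH session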

How do we audit what we have done so far and make sure it doesn't change?

Depending on the level of security you need, there is a wide range of both closed and open source solutions.  Before spending any money on a closed source tool, I strongly recommend investigating the open source tools available.  We have found tools like Puppet and Nessus, both open source, to be some of the best in the industry, paid or unpaid.  Both projects offer contract support and enterprise feature sets.  Puppet gives you the ability to check and update a system's local configuration.  Nessus, on the other hand, makes sure that only the services, ports, and applications you want are accessible from outside the machine.  This two-pronged approach, along with service monitoring from a tool like GWOS or Zenoss and a rigorous software update process and schedule, will give you a terrific base to start from.

So what about anti-virus software for Linux?  I heard you don't need it.  Is that true?

Linux, or any Unix for that matter, is not immune to viruses.  There are just considerably fewer viruses that can exploit Linux and the services that run on it.  Part of what makes Linux so hard to write a virus for is that not every person using the machine can do damage; on most systems, only the root user can run the most damaging commands.  Still, if the machine is serving files, say to Windows machines on the network, Linux can pass an infected file around your network if you aren't using anti-virus software.  ClamAV is the most widely known anti-virus tool for Linux.  The tool is open source and lets you protect both Linux and Windows files from viruses.  Other options do exist, and you should do some level of comparison and decide how your company should move forward.
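
On the file server example, getting a first scan out of ClamAV takes only a few commands.  This sketch assumes a Debian/Ubuntu box, and package names and share paths vary by distribution and site:

# install the scanner and the signature updater
sudo apt-get install clamav clamav-freshclam
# pull down the current virus signatures
sudo freshclam
# recursively scan the shared files and print only the infected ones
clamscan -r --infected /srv/samba/shares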

So what’s next?

The next step after this is to keep closing any holes identified by Nessus.  Then start moving down through the OS, locking down file and directory permissions.  If you still need more security, start using tools like AppArmor, which places a shield around your exposed applications to protect the operating system.  From here, the goal of what you are attempting to do will lead you to the next set of tools.
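
Two hedged starting points for those next steps: a quick hunt for loose file permissions, and a look at what AppArmor is already confining (the apparmor-utils package provides the status tool on most distributions):

# list world-writable regular files on the root filesystem
find / -xdev -type f -perm -0002 -ls
# show which AppArmor profiles are loaded and in enforce mode
sudo aa-status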

The items we mentioned here are meant to be a base for you to build on.  By completing these steps, your machines will be more secure than a default Linux machine.  What you need to do from here is up to your security officers, customers, and the other businesses your company interacts with.

Are you monitoring your servers and network?

In the last CTO Brief we discussed building and managing a large number of servers.  The general response we received on reddit, LinkedIn, Twitter, and in e-mail was that the article was informative but overlooked monitoring.  Let me assure you that we did not leave monitoring out by accident; we thought it was too large a topic for one article.  Everyone who criticized us was absolutely right that once you build it, you then have to monitor it.  The reasons you need to monitor are pretty simple.  Here is our list of the top five reasons for monitoring:

  1. Keeping Customers Happy – You cannot fix what you do not know is broken.  Unless you are monitoring, you will have to rely on customers to tell you when something is down.  When you do have an outage, being able to tell your customers that you are already aware of the problem and working on it builds their confidence in your ability to administer the systems.
  2. Proving that you are an AWESOME administrator and/or Administration Team – I have had more than one Director of Operations tell me that we need to "tell the story" of how good we are.  Unless you can demonstrate with data and confidence that you are meeting the service level expectations of your customers, there really is no story to tell.
  3. Getting a restful night's sleep after a major release or update to your systems – If you are monitoring, and trust those systems to do their jobs, then sleeping is easy the night of a big deployment or upgrade.
  4. Performance Management – Knowing when to buy the next system, or when to shut down a server or two, is best shown with data.  Getting new machines approved is far easier when you can show managers a graph of how the use of a system is growing and needs to be scaled to the next level.  If your plans include a migration to a virtual infrastructure, monitoring lets you easily pick the first candidates for virtualization: the machines with the least-used CPUs and memory are the ones to set your sights on.
  5. Troubleshooting Application Issues – Both performance work and troubleshooting benefit from being able to see what was going on when a problem occurred.  A set of clear graphs can save hours of digging through logs for errors, or running down the wrong path, on the way to a speedy resolution.

So now we know why to monitor; next we need to know what to monitor.  To do that, we need to know what our goals and priorities for monitoring are.  The goals do not tell you much about which tools to use, but they do tell you how far you need to go.  For instance, if all you want to know is whether a server is up and functional, your monitoring needs are far smaller than if you want to monitor down to the application level.

The open source options in the area of monitoring generally gather information in one of two ways.  The first is the Simple Network Management Protocol, or SNMP.  The second is a software agent, which is usually proprietary to the monitoring software.  The more advanced systems can take a hybrid approach and use both.  There are advantages to each.  SNMP consumes very few resources and is supported by nearly every network device and operating system, but if not configured properly it can be extremely insecure.  Where security is concerned, agents are not guaranteed to be any better; what they do offer is tighter integration between the monitoring system and the hosts.  One drawback to an agent, though, can be the additional system resources it consumes, but this depends on the agent in question.
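
To see how light SNMP is, and how easy it is to leave insecure, the net-snmp command line tools are enough; "public" below is the well-known default community string that should never survive into production:

# walk the standard system table on a host over SNMP v2c
# (use the numeric OID 1.3.6.1.2.1.1 instead of "system" if MIBs are not installed)
snmpwalk -v2c -c public server.example.com system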

In our next article we will delve deeper into one monitoring project, Nagios, which is the base of several other pieces of monitoring software.  Nagios is a wonderful open source project that is amazingly feature complete.  Among its most useful features are system templating, hours of operation for alerting, outage windows, escalation paths, and reporting.  The big complaint with it, though, is how painful it is to configure.  It is not overly complicated, but setup can be very tedious.  To address this, several different projects have created web-based user interfaces that abstract the configuration into an easy-to-use system of templates and other tools to make life with Nagios as close to perfect as possible.  These tools generally incorporate other tools with painful configuration files, like MRTG and Cacti, for performance and usage graphing.  Both of those reporting packages are awesome projects we have used on numerous occasions to show off all kinds of facts about system performance and usage.
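
For a sense of where the tedium comes from, every monitored host and service in Nagios is a text block along these lines.  The sketch below assumes the Debian nagios3 layout and its stock templates, so the paths and template names are assumptions:

# define one host and one HTTP check for it
cat >> /etc/nagios3/conf.d/web01.cfg <<'EOF'
define host {
  use        generic-host        ; inherit a template of sane defaults
  host_name  web01
  alias      Web server 01
  address    192.168.1.20
}
define service {
  use                  generic-service
  host_name            web01
  service_description  HTTP
  check_command        check_http
}
EOF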

In the future we plan to review Zenoss, GroundWork, and Hyperion HQ.  We know this is not a complete list, but we think it is a pretty good start.  Is there one you think we are crazy to leave off?  If so please let us know in the Comments.

Managing Large Numbers of Linux Systems

So you have seen the power and stability of Linux and are ready to get your feet wet with the little penguin.  Your management is sold, and they have started buying more and more Linux servers.  How do you manage and control this growth?  Where should you focus your efforts first when trying to manage all of this?  Do you focus on building servers fast, or is managing your configurations the most important task at hand?

In our opinion, your end goal should be to build, manage, and monitor all of your servers with an automated process driven by a series of scripts and applications.  To determine what order to do this in, you need to determine why you are growing.  If you are growing because development efforts on Linux are in full force, you will probably want to focus on building servers fast.  If you are growing because your production servers are getting large amounts of traffic, then you should probably focus on both building servers and managing your configuration first.

How do you build a server really fast?
On the free side of things, we recommend the Red Hat-created system called Anaconda.  Anaconda allows you to create a text file, a Kickstart file, that describes almost everything about a system.  When invoked, the Anaconda process will create a complete system with all of the packages you want installed and configured.  Both Ubuntu (with its Debian-based package system) and every RPM-based system I know of, like Fedora, openSUSE, and Mandriva, have support for Anaconda. (More detailed Anaconda information can be found here.)  If you have a system you want to clone or use as a base, Anaconda can profile that system and create the Kickstart configuration file for you; most installers leave an Anaconda-created Kickstart file for the system in the root user's home directory, normally called anaconda-ks.cfg.  If you take this file and change the machine-specific information, like the host name and IP address, you can create a new system from it.  Combine that with either a PXE boot system or command line arguments to the installer pointing at your configuration file's location, and the new machine will be set up for you.  Normally you will set up a few templates for key system types, for instance one Kickstart file for web servers, one for database servers, and one for desktops.
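
To give a feel for the format, here is a heavily trimmed, hypothetical Kickstart template and the boot argument that points an installer at it; every value below is a placeholder to be replaced with your own, and directive details vary between distribution releases:

# a minimal, made-up Kickstart template for a web server class of machine
cat > web-server-ks.cfg <<'EOF'
install
url --url=http://mirror.example.com/fedora/os/
lang en_US.UTF-8
keyboard us
rootpw --iscrypted PUT_YOUR_PASSWORD_HASH_HERE
network --bootproto=static --ip=192.168.1.50 --netmask=255.255.255.0 --hostname=web01.example.com
clearpart --all --initlabel
autopart
%packages
@core
httpd
%end
EOF
# at the installer boot prompt (or in your PXE config), point at the file:
#   linux ks=http://mirror.example.com/ks/web-server-ks.cfg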

If you prefer to use disk images, similar to the old Ghost program from Norton (Symantec), then take a look at the Clonezilla project.  This project started in the educational arena and is used by a fair number of K-12 schools and colleges.  It has the advantage of being able to manage both Linux and Windows images.  The speed of an install is similar to Anaconda, and Clonezilla also has OS plug-ins that allow you to configure a system with its unique information.  If you happen to be using VMware, it has built-in cloning and templating that does something very similar, with the same limitations.  The main downside to this and any other disk-cloning system is that to update a piece of software you must rebuild and then re-clone the entire image.  By contrast, with Anaconda, as long as the packages in the package repository are up to date, the system will be built with them, which means no additional steps are required to bring the system up to the latest patches after building.

On the paid side of the equation, the one that seems to be leading the pack is Novell's ZENworks product.  It can either use snapshots (images) or do an Anaconda-derived install.  It will allow you to manage the packages and configurations on both SUSE and Red Hat Linux machines.  The software includes the ability to set up and manage DHCP and PXE boot servers.  These two server types combine to let you place a system on your network, assign the new machine a template type and grouping, and, when it boots, create the server from scratch without any assistance from a person after switching on the power.  The software works well and is easy to configure and use.  There is an agent that runs to give you access to manage the configuration after the install.  This agent can be configured to alert on most common system problems, like low disk space and high CPU load.  In this role, it works best as a feeder into a more robust logging and alerting system.

How do I keep all of my servers' configurations complete and consistent?
On the paid side, I believe the best choice is again the Novell ZENworks product.  Several others exist, but the cost per machine is much steeper and they generally do not offer any additional features.  Several companies have gone so far as to simply package one of the two configuration tools mentioned on the free side and reproduce it as their own.

On the free side, the two leaders for configuration file management are CFEngine and Puppet.  Both offer a framework of files, the flexibility to automate nearly any task, and agents for the systems that audit and verify everything stays consistent after the initial install.  If they are so similar, then what is the difference?  The main difference is the syntax of the input, or configuration, files.  Having played with both files and formats, the Puppet team's software was much easier to work with and got us to the point of configuring systems faster.  Both have tutorials and seem to work once they are configured.  Also, both pieces of software can be configured to observe, validate, and then correct, if needed, what the configuration should look like from a remote server, centralizing your configuration.  Once you have the software set up, it will quickly become both your auditor's dream and your savior.  Being able to show the auditors that just because someone changed a file, it does not mean it will stay that way will make even the grumpiest of them at least a little happier.  This type of system builds a tremendous level of confidence within your development and management ranks.

How long does it take to set all this up?

That really depends on the choices you make and your knowledge of the tools.  People new to systems like this will generally take a day or two to get the software installed and a first attempt at building a server going.  Getting to a state of complete management of all systems takes time and will depend on where you are in the system life cycle.  Spending the time when you are starting out, and thinking it through, will pay itself back in weeks or months, depending on the rate at which you are building.  Keeping it current after that is generally simple.

Conclusion
Managing your systems with these tools and some simple scripts reduces the staffing you need in the long run while simultaneously increasing stability and consistency.  The bulk of the cost of these systems will be in the initial setup and configuration.  Once the majority of your servers are incorporated into the system, the number of changes will drop tremendously.  Even a server count as low as ten is more than enough to get a fast return on investment.

Give us your feedback in the comments by answering any of the following questions:
So what’s your favorite system management tool?
Why do you prefer it?
What did we miss?