In the last CTO-Brief we discussed building and managing a large number of servers. The general response we received on reddit, LinkedIn, Twitter, and by email was that the article was informative but overlooked monitoring. Let me assure you that we did not leave monitoring out by accident. We thought it was too large a topic for one article. Everyone who criticized us was absolutely right that once you build it, you then have to monitor it. The reasons you need to monitor are pretty simple. Here is our list of the top five reasons for monitoring:
- Keeping Customers Happy - You cannot fix what you do not know is broken. Unless you are monitoring, you will have to rely on customers to tell you when something is down. When you do have an outage, being able to tell your customers that you are already aware of the problem and working on it builds their confidence in your ability to administer the systems.
- Proving that you are an AWESOME administrator and/or Administration Team - I have had more than one Director of Operations tell me that we need to "tell the story" of how good we are. Unless you can demonstrate with data and confidence that you are meeting the Service Level expectations of your customers, there really is no story to tell.
- Getting a restful night's sleep after a major release or update to your systems - If you are monitoring, and you trust those systems to do their jobs, then sleeping is easy the night of a big deployment or upgrade.
- Performance Management - Knowing when to buy that next system, or when to shut down a server or two, is better shown with data than without. Getting new machines approved is far easier when you can show managers a graph of how the use of a system is growing and needs to be scaled to the next level. If your plans include a migration to a Virtual Infrastructure, monitoring lets you easily pick off the first candidates for virtualization. The machines with the least used CPUs and memory can be the ones to set your sights on.
- Troubleshooting Application Issues - Both performance tuning and troubleshooting benefit from being able to see what was going on when the problems occurred. Looking at a set of pretty graphs can save hours of digging through logs for errors and keep you from running down the wrong path on the way to a speedy resolution.
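The virtualization point in the Performance Management item can be sketched in a few lines. This is only an illustration: the `HostUsage` record, the host names, the usage figures, and the 20% CPU and 30% memory thresholds are all made-up assumptions, and in practice the averages would come from whatever your monitoring system has collected.

```python
from dataclasses import dataclass


@dataclass
class HostUsage:
    """Hypothetical record of long-run average usage for one server."""
    name: str
    avg_cpu_percent: float
    avg_mem_percent: float


def virtualization_candidates(hosts, cpu_limit=20.0, mem_limit=30.0):
    """Return lightly used hosts, least busy first.

    The thresholds are illustrative defaults, not recommendations.
    """
    idle = [
        h for h in hosts
        if h.avg_cpu_percent <= cpu_limit and h.avg_mem_percent <= mem_limit
    ]
    # Rank by combined CPU and memory usage so the quietest hosts come first.
    return sorted(idle, key=lambda h: h.avg_cpu_percent + h.avg_mem_percent)


# Made-up sample data standing in for real monitoring output.
sample_hosts = [
    HostUsage("db01", 72.0, 81.0),
    HostUsage("intranet01", 4.0, 18.0),
    HostUsage("build01", 15.0, 25.0),
    HostUsage("mail01", 12.0, 55.0),
]

candidates = virtualization_candidates(sample_hosts)
```

With the sample data above, only `intranet01` and `build01` fall under both thresholds, so they would be the first machines to look at.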
So now we know why to monitor. Next we need to know what to monitor. To do that, we need to know what our goals and priorities are for monitoring. The goals for monitoring do not tell you much about which tools to use, but they do tell you how far you need to go. For instance, if all you want to know is whether a server is up and functional, your monitoring needs are far smaller than if you want to monitor down to the application level.
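To make the difference between those two levels concrete, here is a minimal sketch using only the Python standard library. The function names, timeouts, and the idea of looking for an expected string in the response are our own assumptions for illustration, not how any particular monitoring product does it.

```python
import http.client
import socket
import urllib.error
import urllib.request


def host_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Host-level check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def app_is_healthy(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Application-level check: does the service return the content we expect?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection failures, HTTP error codes, and malformed URLs all count as unhealthy.
        return False
```

A server can pass the first check while failing the second, for example when the web server is listening but the application behind it is returning an error page. That gap is why application-level monitoring takes more work to set up.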
Open source monitoring tools generally gather information in one of two ways. The first is the Simple Network Management Protocol, or SNMP. The second is a software agent, which is usually specific to the monitoring software. More advanced systems sometimes take a hybrid approach and use both. Each approach has its advantages. SNMP consumes very few resources and is supported by nearly every network device and operating system. If it is not configured properly, though, it can be extremely insecure. Where security is concerned, agents are not guaranteed to be any better. What they do offer is tighter integration between the monitored hosts and the monitoring server. One drawback of an agent, though, can be the additional system resources it consumes, but this depends on the agent in question.
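For a feel of what the agent approach involves, here is a rough sketch of the collection half of an agent, written with only the Python standard library. It is not modeled on any real product's agent: the metric names and the JSON payload format are assumptions, and a real agent would also need a way to ship the payload to the monitoring server securely.

```python
import json
import os
import shutil
import socket
import time


def collect_metrics(path: str = "/") -> dict:
    """Gather a few local readings the way a lightweight agent might."""
    disk = shutil.disk_usage(path)
    used_percent = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    metrics = {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "disk_used_percent": used_percent,
    }
    # os.getloadavg() exists only on Unix-like systems and can fail there too.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_1min"] = os.getloadavg()[0]
        except OSError:
            pass
    return metrics


def to_payload(metrics: dict) -> bytes:
    """Serialize the readings into bytes an agent could send to its server."""
    return json.dumps(metrics, sort_keys=True).encode("utf-8")
```

Running `collect_metrics()` on a schedule is where the extra resource use of an agent comes from: every additional reading is a little more work on the monitored machine, which is the trade-off against SNMP described above.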
In our next article we will delve deeper into one monitoring project called Nagios, which is the base of several other pieces of monitoring software. Nagios is a wonderful open source project that is amazingly feature complete. Among its most useful features are System Templating, Hours of Operation for alerting, Outage Windows, Escalation Paths, and Reporting. The big complaint with it, though, is how painful it is to configure. It is not overly complicated, but setup can be very tedious. To address this, several different projects have created web-based user interfaces that abstract the configuration into an easy-to-use system of templates and other tools to make life with Nagios as close to perfection as possible. These tools generally incorporate other tools with painful configuration files, like MRTG and Cacti, for performance and usage graphing. Both of these graphing packages are awesome projects that we have used many times to show off all kinds of facts about system performance and usage.
In the future we plan to review Zenoss, GroundWork, and Hyperic HQ. We know this is not a complete list, but we think it is a pretty good start. Is there one you think we are crazy to leave off? If so, please let us know in the comments.