What we have learned about Linux and Virtualization…

To virtualize or not to virtualize, that is the question.  A simple question, but one with a difficult answer.  The process for determining whether any environment should be virtualized is not always as simple as it seems.  In this brief, we will discuss the steps we use to determine whether or not something is a good fit for virtualization.  We go over some general rules of thumb for making this type of decision.  We will also discuss which types of tools you should use to complete the assessment.

It seems like every time I enter into a discussion with any of my peers these days, the topic of virtualization is going to come up.  A lot of the reason is that I work in an environment that has gone from almost no virtualization five years ago to nearly ninety percent (90%) virtualization today.  We aren’t alone though; most of the large companies I have worked with in the past have done pretty much the same thing.  Large inexpensive blade servers and the ability to put a large number of CPU cores in an Xseries box these days make it almost impossible to pass up.  What my peers and I have discovered, as we have been asked to race to the finish line of a 100% virtualized environment, is that it’s not always the right move.  There are a lot of situations where a completely virtualized environment makes sense.  However, it’s the few edge cases that will keep you from reaching that goal.

Software licensing is one of the easiest ways to increase the ROI of any virtualization project.  In the beginning, that savings alone could pay for the whole project.  Our vendors are businesses themselves though, and they have started to catch up and figure out how to keep their profits up while the virtualization wave keeps rolling in.  In some cases, these changes have removed some, if not all, of the savings we once would have gotten.  The fact is that the ever faster race to better performance described by Moore’s Law keeps forcing prices down while increasing the CPU cores per machine.  To give you an idea of how things have changed, let’s take two old school licensing models and show how they might have been charged.  Any resemblance to how a specific company charges for software is purely coincidental; this is just an example.  Prior to the virtualization race, Software A was sold on a per CPU basis and Software B was sold per physical machine.  In some cases, the companies selling Software A had to redefine what a CPU meant.  Did it mean virtual CPU or physical CPU?  The best model we have seen is a lowest count model: if the number of virtual machine cores is less than the total number of physical CPU cores in the machine, you pay for the lower number of CPUs.  Other companies use a more confusing model where they convert CPU cores to a mathematically derived number, which is then used to set the price.  This method can get confusing quickly and may or may not save you any money.  Vendors that licensed per machine, like Software B, have largely left that model in place.  They tend to license at the virtual machine level instead of coming up with a new formula.  With either method though, you need to read your contracts ahead of time and most likely ask a sales representative to stop over and explain them to you.

Licensing is most often a large part of the decision process I use when advising on whether it will be worth virtualizing any piece of closed source software.  Even some open source support models get into the act, so read all of your contracts carefully and ask your sales representatives how they handle licensing in a virtualized environment.
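
To make the two models concrete, here is a minimal Python sketch of the arithmetic.  The function names and the prices are made up for illustration and do not come from any vendor’s price list; real contracts will have their own definitions, so treat this as a way to play with the numbers and not as a pricing tool.

```python
def billable_cores_lowest_count(vm_vcpus, host_physical_cores):
    """Lowest count model: pay for whichever is smaller, the virtual
    cores assigned to the VMs running the software or the physical
    cores in the host."""
    return min(sum(vm_vcpus), host_physical_cores)


def per_cpu_cost(vm_vcpus, host_physical_cores, price_per_core):
    """Cost for a 'Software A' style per CPU license (hypothetical prices)."""
    return billable_cores_lowest_count(vm_vcpus, host_physical_cores) * price_per_core


def per_machine_cost(vm_count, price_per_machine):
    """Cost for a 'Software B' style license charged per virtual machine."""
    return vm_count * price_per_machine


# Three small VMs on a 16 core host: 8 virtual cores are billed.
print(per_cpu_cost([2, 2, 4], host_physical_cores=16, price_per_core=1000))
# Three large VMs on the same host: billing is capped at the 16 physical cores.
print(per_cpu_cost([8, 8, 8], host_physical_cores=16, price_per_core=1000))
# The same three VMs under a per machine license.
print(per_machine_cost(3, price_per_machine=2500))
```

Running the same workload through both functions before you talk to the sales representative gives you a rough idea of which model favors you.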

Disk I/O heavy applications, those that do a lot of reading and writing to the disk, are a questionable fit.  The reason is that you only have so much throughput to work with going to the disk.  New disk technologies are coming out all the time that will help to eliminate this issue in the future.  Until then we only have the option of solutions like direct attached storage dedicated to the virtual machine.  This can still suffer from the same bandwidth issues, but for now it is the best option, even with a higher cost of implementation for the cards and other equipment needed.  These direct attached options also have other problems: they can limit things like automated migrations and complicate backing up these same disks.

Where virtualization shines is in CPU and memory intensive applications.  So while a syslog server might not be a great choice for virtualization because of all the disk writes it will need to do, completing analytics on those logs might work wonderfully.  Mixing applications on the virtualization host is also a great way to exploit more of the potential of the hardware that you purchase.  So if you have a web application that your customer care representatives use to support customers from 8AM to 5PM, you will be able to share that capacity with your billing processing application from 5PM to 8AM.  At the same time, a large group of low CPU but high memory consuming apps can very effectively be combined with high CPU/low memory applications.
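
The customer care and billing pairing can be sketched in a few lines of Python.  The Workload class, the core counts and the memory sizes below are all invented for illustration; the only idea taken from the text is that two applications busy at different hours need far less CPU together than the sum of their peaks, while their memory still has to fit side by side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    busy_hours: frozenset  # hours of the day (0-23) when the app is busy
    busy_cores: float      # cores needed while busy (assumed figure)
    idle_cores: float      # cores needed the rest of the day (assumed figure)
    memory_gb: float


def cpu_demand_at(workloads, hour):
    """Total cores wanted by all workloads at a given hour of the day."""
    return sum(w.busy_cores if hour in w.busy_hours else w.idle_cores
               for w in workloads)


def peak_cpu_demand(workloads):
    """Highest combined CPU demand across the 24 hours of the day."""
    return max(cpu_demand_at(workloads, hour) for hour in range(24))


def fits_on_host(workloads, host_cores, host_memory_gb):
    """CPU is checked at the combined peak; memory is simply summed."""
    total_memory = sum(w.memory_gb for w in workloads)
    return peak_cpu_demand(workloads) <= host_cores and total_memory <= host_memory_gb


customer_care = Workload("customer-care-web", frozenset(range(8, 17)), 6, 1, 8)
billing = Workload("billing-batch",
                   frozenset(range(17, 24)) | frozenset(range(0, 8)), 6, 1, 8)

print(peak_cpu_demand([customer_care, billing]))
print(fits_on_host([customer_care, billing], host_cores=8, host_memory_gb=16))
```

Each application peaks at six cores on its own, but because their busy hours never overlap the pair never asks for more than seven at once.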

The hardest of all of the values of virtualization to show in an ROI is also one of the most amazing: the ability to build, tear down, and restore to a snapshot.  One of the scariest parts of doing an in place upgrade of any software, from the OS to the web browser, is the possibility of complete corruption.  When you put this software on a virtual machine you can take a snapshot of the machine before you attempt the upgrade.  Then no matter what, within seconds, you can revert to that snapshot and reboot.  Everything is back to the state it was in prior to you starting.  The only time based restriction is when you clicked the button to take the snapshot; that restore point could be from yesterday or from last year.  Here is an example of what we mean:

A restaurant has a point of sale system that could be run on a virtual machine.  The vendor that supplies this software sends the restaurant a new upgrade.  The upgrade has to be done to the existing system in place on the server because of database updates.  The restaurant can take a snapshot of the virtual machine first to save a copy of what it looked like before they begin.  Once that is complete, they can then run the vendor’s update software.  If at any point they decide that the update either isn’t working as expected or just isn’t working at all, they just tell the virtualization software to go back to the way it was before they began.  They can also make copies of the virtual machine and do test runs of the upgrade on the copy.  Then if the update doesn’t go well they can keep trying it until they work out the complete and final process, with no interruption to the other functions.  They come out of testing with a proven process and can be much more confident with the release of the updates.

Having participated in several releases of new software like this, I can honestly say doing things on physical hardware is far more stressful and demanding.  When you don’t have to worry about making a mistake you tend not to miss as many steps or cause yourself other issues.  This by no means guarantees that issues you didn’t see in your tests won’t happen.  But with this type of process you can back out and try again until you are successful.

The next big feature of virtual machines is the ability to migrate them from one piece of hardware to another.  In a recent FLOSS Weekly episode the guys from VirtualBox discussed a test they do by pushing a virtual machine around the office.  The test isn’t successful until they have pushed the machine across all of the major operating systems in a big circle.  This is a silly example, but if you had a server in one data center that needed to be shut down for maintenance you could easily push the virtual machine either across the data center or around the world.  This technology makes it easy to get the machine onto a safely running system with no issues.

So we have the list of reasons showing when virtualization makes sense.  Here are the steps and tools we use to make the final decision.
1) Collect as much data as makes reasonable sense.  If you have a monitoring solution like Zenoss or GWOS, create some reports there.  If you don’t, then on Linux there is a tool called sar, part of the sysstat package.  sar can be set up to run and collect data about how the system is performing.  You can then use tools like kSar to display the output and create pretty graphs.  (Pretty graphs always help tell the story.  A picture really is worth a thousand words in this situation.)

2) Determine the servers that seem most likely to be able to co-exist on the same hardware.  Try to come up with a scheme that can be easily explained to others.  Focus on your balance between CPU and Memory.  Shy away from things that consume a lot of Disk I/O.

3) Determine the hardware needs of your organization based on the data gathered and your costs.  For instance, do not spend big money on memory if the applications that will use it won’t migrate for a year, by which time the prices will have dropped.

4) Verify your design with any internal developers and other support team members.  Does everyone agree with your assessment and plan?  How do they think you did?  What builds do they have for you?
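
Steps 1 and 2 above can be roughed out in Python once you have the numbers in hand.  This sketch assumes you have already exported your samples (from sar, Zenoss, GWOS or anything else) into simple dictionaries; the field names and the thresholds are our own placeholders and should be tuned to your environment.

```python
from statistics import mean

METRICS = ("cpu_pct", "mem_pct", "disk_iops")


def summarize(samples):
    """Average and peak for each metric across a server's samples."""
    return {
        metric: {
            "avg": mean(s[metric] for s in samples),
            "peak": max(s[metric] for s in samples),
        }
        for metric in METRICS
    }


def classify(summary, cpu_high=60, mem_high=60, iops_high=500):
    """Label a server by its peaks. The thresholds are assumptions."""
    if summary["disk_iops"]["peak"] >= iops_high:
        return "disk-heavy"  # shy away from virtualizing these
    cpu_heavy = summary["cpu_pct"]["peak"] >= cpu_high
    mem_heavy = summary["mem_pct"]["peak"] >= mem_high
    if cpu_heavy and mem_heavy:
        return "cpu-and-memory-heavy"
    if cpu_heavy:
        return "cpu-heavy"
    if mem_heavy:
        return "memory-heavy"
    return "light"


analytics_samples = [
    {"cpu_pct": 80, "mem_pct": 20, "disk_iops": 50},
    {"cpu_pct": 70, "mem_pct": 30, "disk_iops": 40},
]
print(classify(summarize(analytics_samples)))
```

Labels like these make the scheme easy to explain to others: pair the CPU heavy servers with the memory heavy ones and keep the disk heavy ones off the shared hosts.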

We know this seems to be very simplistic, and it is, because keeping it simple and straightforward is what we have found works best.  Do not overthink the decisions; just give it a try and see what works.  Some of our best plans based on the most data have blown up in our faces because things just acted differently in a virtual world than we predicted.  The biggest problems we have faced have been around putting too many of a certain workload type together.  While allocating more virtual CPUs than you have physical ones generally works, doing the same with memory does not.  Remember that you can almost always migrate the virtual machines to other hosts to balance them out if you make a grave mistake in this area, as long as you have machines to do it with.
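
The CPU versus memory rule of thumb is easy to check before you place anything.  Here is a small Python sketch; the 4:1 virtual to physical CPU ceiling is purely an assumed default for illustration, not a number we are recommending, while memory is treated as a hard limit.

```python
def overcommit_report(host_cores, host_memory_gb, vms, max_vcpu_ratio=4.0):
    """Check a planned set of VMs against one host.

    vms is a list of dicts with 'vcpus' and 'memory_gb' keys.
    max_vcpu_ratio is an assumed ceiling on virtual to physical CPUs.
    """
    total_vcpus = sum(vm["vcpus"] for vm in vms)
    total_memory = sum(vm["memory_gb"] for vm in vms)
    vcpu_ratio = total_vcpus / host_cores
    return {
        "vcpu_ratio": vcpu_ratio,
        "vcpu_ok": vcpu_ratio <= max_vcpu_ratio,     # overcommit tolerated
        "memory_ok": total_memory <= host_memory_gb,  # no overcommit allowed
    }


plan = [{"vcpus": 4, "memory_gb": 8}] * 5
print(overcommit_report(host_cores=8, host_memory_gb=32, vms=plan))
```

In this made up plan the CPUs are overcommitted 2.5 to 1 and pass, but the 40 GB of guest memory does not fit in a 32 GB host, which is exactly the kind of mistake that has bitten us.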

We couldn’t live without virtualization at this point.  The cost savings are smaller than we had hoped but the productivity gains have been massive.  The confidence level of our admins is also rising thanks to the snapshot abilities and quick cloning of a machine.  If you haven’t started yet, don’t wait.  It’s simple and easy, and it can cost your company very little to get started.

GroundWork Open Source, the little monitoring engine that could…

So you want to monitor your network, you don’t have a lot of time to learn how to set up Nagios, and you have no budget for either consultants or off the shelf software.  What do you do?  One of your options is to use GroundWork Open Source, or GWOS.  GWOS is a group of scripts that wraps the open source Nagios monitoring engine, Cacti, MRTG and some other tools they have developed on their own in a pretty simple GUI.  As a super jump start, I set up and highly recommend the VM that you can get from their website.  This VM makes quick work of the setup portion of getting the software up and running.  If you are planning on monitoring more than a few hundred devices, this VM solution will likely not work optimally.  That isn’t a reflection on virtualization technologies or GroundWork but a reality that current storage and hardware technologies have difficulty writing large quantities of small data points to disk efficiently.  I set this up with about twenty hosts, seven of them over a WAN/VPN connection to simulate a remote office.  Here are Joe’s and my impressions of the whole process.

One of the things we looked for in a solution like this is what it shows in regards to a map of the network, as well as how it informs us of what should be monitored.  The first thing that impressed us about GroundWork was their use of the Nmap program (nmap.org) to identify what OS, ports and applications were in use on the machines on the network.  The auto discovery found all devices, and was able to identify all but one machine’s OS.  It then went that next step and configured the machines that had SSH but not SNMP so that we could use SSH to monitor them.  The machine it did not identify the OS of, or set up tests/monitors for, was my 24 port network switch, which does not appear to look like any OS on its web interface.  This was the only miss in this part of the tool, and it is really minor.  The discovery did not map anything out, though.  All of the devices initially looked like they were directly connected to the monitoring server.  This was easy to fix by setting up some associations that identify parent and child devices.  A parent device is something that one or more other devices depend on.  Our network switch is a parent device to a VMware server, which is in turn a parent to the VMs it hosts.  The interface makes what is a tedious process in Nagios faster and more efficient.  Once you set all of the associations the maps draw themselves into a clear and easy to understand drawing of where your dependencies are.  These associations do more than create informative maps; they also tell the alerting parts of GroundWork when to ignore false alarms caused by events like downed switches or internet links.
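
To show why the parent and child associations cut down on false alarms, here is a small Python sketch of the idea.  This is our own illustration, not how GroundWork or Nagios implements it, and the device names are made up: a device that is down only raises an alert if nothing above it in the chain is also down.

```python
def alerts_to_send(parents, down):
    """Return the down devices that deserve an alert.

    parents maps each device to its parent (None for the top of the tree).
    down is the set of devices currently failing their checks.
    Assumes the parent relationships contain no loops.
    """
    def has_down_ancestor(device):
        parent = parents.get(device)
        while parent is not None:
            if parent in down:
                return True
            parent = parents.get(parent)
        return False

    return sorted(d for d in down if not has_down_ancestor(d))


parents = {
    "switch": None,
    "vmware-server": "switch",
    "vm-1": "vmware-server",
    "vm-2": "vmware-server",
}

# The switch dies and takes everything behind it along.
print(alerts_to_send(parents, {"switch", "vmware-server", "vm-1", "vm-2"}))
# Only one VM is down, so that VM is what you hear about.
print(alerts_to_send(parents, {"vm-1"}))
```

With the switch down you get one alert for the switch, not four, which is the behavior you want at three in the morning.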

So once we got the basics working we started trying to get the alerting working.  Unfortunately things like my Droid and iPod regularly go on and off my network.  So the first day or two of working with the alerting was painful, but that was more my own fault than the software’s.  Once that was all sorted out things started to hum.  The dependency checking worked as expected.  When I dropped the VPN link between Joe and me, the only alert we received was for the firewall that handled the link.  None of Joe’s devices alerted.  Once the link was restored the checks restarted and everything was updated.

All in all it acted and reacted as expected.  The only real issues were related to the UI and a need for better testing.  Joe attempted to name a device “Joe’s Desktop” through the interface.  GroundWork accepted the illegal ’ character until he tried to save the device.  At that point all of the information about the device disappeared.  We attempted to delete the device, so we could re-add it, in multiple places in the UI and none of them seemed to work.  While looking for something else I found a 3rd or 4th place to delete devices, which actually worked and let us save.  There are some minor user interface glitches that, while annoying, are not show stopping: things like tabs that work when clicked on some screens but not on others.  These are just minor annoyances and not major issues.

If you are looking for a nice tool that is easy to use and free, unless you want to purchase support from them, this should be on your short list of systems to review.  I would probably suggest purchasing support for any of our recommended and reviewed systems if available, for at least the first year, to get these issues corrected.  The virtual machine they offer is perfect for setting up a quick proof of concept.  New users of the system should expect at least 8-16 hours of effort to get the machines to the level where they are presenting useful alerts and data.  If you plan on measuring a large number of devices and software products with this tool, using a system on bare metal would be my recommendation.  The problem with any monitoring solution is the amount of data being written to or read from the database.  So go out, download it and give it a try.  The faster you get monitoring, the faster you and your admins will be able to get a good night’s sleep.

How to Set Up and Manage a Web Development Environment Properly…

So what does it take to reliably build Web Applications? The answer really depends on two factors: the size of your company and, more importantly, the size of your development and testing teams. Over building and under building your development environment can be equally expensive to your company’s bottom line. If you under build it, you can impact testing teams or end up running too few tests and miss critical bugs. Over build the environment, and you will end up with a lot of servers wasting electricity, cooling, licensing and hardware resources as well as additional payroll costs to administer the servers.

Even if you are a boutique shop that has fewer than a dozen developers you should have at least two environments: one for developers and one for testers. The developers will use the development environment to merge their code and test it as a complete unit with the code of all the other developers. The testing environment will be used to do functional and performance testing. So why do we define what the environments should be used for? The answer is simple: more often than not, companies allow testers to start spending more time in the developers’ environment than they do in the testing environment. Why is that bad? It keeps the developers from checking code as they should, because their space for doing that testing is now tied up with people doing “functional” testing. The development environment is also the place where your system administrators should be making the first attempts at updates of critical infrastructure. So having rules around the use of these environments and enforcing them is as critical as having them.

As your company’s Web Development team grows you will need to assess and adapt to the growth. As you start dedicating resources to performance testing you will need an environment for those resources to use. As you grow still further you will likely add a pre-production environment that mimics your production environment, which we generally refer to as staging. A staging environment is really for the operations teams to test their final deployment procedures. A staging environment should not be sized the same as production. For instance, if you have four (4) web servers, six (6) application servers, and three (3) LDAP servers in your production environment, your staging environment might have just two (2) web servers, two (2) application servers and a pair of LDAP servers. The point of this environment is just to make sure that you have a close approximation of your production environment that will let you test for single server failures, verify that your processes work across multiple servers, and give you an environment to test production problems in until the next deployment is prepared. This builds the operations teams’ confidence and assures them and others that they have all of the steps needed in the complex environments that today’s applications run in.

How large is too large a development environment? The best way to determine this is by looking at the usage of each of the environments. Simple monitoring can be done with log analyzing tools like Webalizer. This type of tool will show you how much the systems are getting used while consuming only a small amount of resources. I have recently used this type of data to convince several development teams that we could and should reduce our development server count by almost twenty (20) percent. Remember that not every development effort needs to be on its own environment or servers. Use things like web server virtual hosts and virtualization technology like VMware and VirtualBox to condense the relatively low utilization servers. By doing this, a small team of developers can share a larger piece of hardware that will be more fully utilized. The easiest way to figure out if your environments are being underused is through monitoring. If you have any environments that are getting zero, or nearly zero, usage at some point in the development cycle, you probably want to look at why. The situation of too many environments generally happens when development pulls back. The recent economic downturn and recession have placed many companies in this situation. In some instances I have seen companies, especially those using virtualization, end up with one server per project team for development, test, or both. This leads to massive overbuilding. No web technology exists today that needs a one to one relationship between project teams and servers. If you do find one, start looking for something else.
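
If your log analyzer can give you request counts per day, spotting the idle environments takes only a few lines of Python. This is a minimal sketch under our own assumptions: the environment names are made up, and both the "nearly zero" threshold and the number of quiet days that should raise a flag are placeholders to adjust for your own development cycle.

```python
def longest_quiet_streak(daily_counts, threshold):
    """Longest run of consecutive days at or below the threshold."""
    longest = current = 0
    for count in daily_counts:
        current = current + 1 if count <= threshold else 0
        longest = max(longest, current)
    return longest


def underused_environments(daily_requests, threshold=5, min_quiet_days=7):
    """Environments with a long enough stretch of near zero usage.

    daily_requests maps an environment name to its requests per day.
    threshold and min_quiet_days are assumed defaults.
    """
    return sorted(
        env for env, counts in daily_requests.items()
        if longest_quiet_streak(counts, threshold) >= min_quiet_days
    )


usage = {
    "dev-a": [120, 90, 0, 0, 0, 0, 0, 0, 0, 80],
    "dev-b": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50],
}
print(underused_environments(usage))
```

An environment that shows up in this list is not automatically a candidate for removal, but it is one where you want to go ask why.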

So what about technology like IBM’s Virtual Enterprise, VMware’s Lab Manager product, or something similar? If you aren’t familiar with this class of application, these are the products built around automatic provisioning of systems. They are sold with the promise of better utilization and control over your environment. For instance, if you have a critical application that suddenly gets hit with a Slashdot or Digg style event, something that draws an unusual amount of traffic to your site, the systems could build and bring up additional instances to help with the load. When the crisis passes, the servers could be destroyed and the resources that had been allocated returned to the pool of available resources. This sounds like exactly what you need for your development environment, right? It is beneficial only as long as you have strong rules around how it can be used, rules which assure that systems will at some point get destroyed and their resources returned. Be careful with this idea though, and make sure you have someone who is ready to be the enforcer, or get ready to pay big bucks buying the resources needed to support an environment gone wild.

When building a development environment, web or otherwise, keeping it under control is the key. Uncontrolled growth, with or without a good tool, will eventually crush the support teams. At the other extreme, too small and rigid an environment will cost developers time and leave testers and admins impacted while they wait for changes to get to them. Tools that automate building and removing environments sound like a great time saving solution, but they have hidden costs in the area of setup and maintenance that may well wipe out any savings. So keep that all in mind. Let me know what your latest Web App is in the comments.

Questions of the week:

How many environments do you have at your company for how many developers?

What are your biggest challenges in your development environment?