What we have learned about Linux and Virtualization....

To virtualize or not to virtualize, that is the question.  A simple question, but one with a difficult answer.  The process for determining whether an environment should be virtualized is not always as simple as it seems.  In this brief, we will discuss the steps we use to determine whether something is a good fit for virtualization, go over some general rules of thumb for making this type of decision, and cover the types of tools you should use to complete the assessment.

It seems like every time I enter into a discussion with my peers these days, the topic of virtualization is sure to come up.  A lot of the reason for this is that I work in an environment that has gone from almost no virtualization five years ago to nearly ninety percent (90%) virtualization today.  We aren’t alone, though; most of the large companies I have worked with have done much the same thing.  Large, inexpensive blade servers and the ability to pack a large number of CPU cores into an xSeries box these days make virtualization almost impossible to pass up.  What my peers and I have discovered as we have been asked to race toward a 100% virtualized finish line is that it’s not always the right move.  There are a lot of situations where a completely virtualized environment makes sense.  However, it’s the few edge cases that will keep you from reaching that goal.

Software licensing is one of the easiest ways to increase the ROI of any virtualization project.  In the early days, that savings alone could pay for the whole project.  Being businesses themselves, though, our vendors have started to catch up and figure out how to keep their profits up while the virtualization wave keeps rolling in.  In some cases, these changes have removed some, if not all, of the savings we once would have gotten.  Meanwhile, the Moore’s Law race to better performance keeps forcing prices down while increasing the number of CPU cores per machine.  To give you an idea of how things have changed, let’s take two old-school models and show how they might have been licensed.  Any resemblance to how a specific company charges for software is pure coincidence; this is just an example.  Software A is sold on a per-CPU basis, as was common prior to the virtualization race.  Software B is sold per physical machine.  In some cases, the companies selling Software A had to redefine what a CPU meant.  Did it mean a virtual CPU or a physical CPU?  The best model we have seen is a lowest-count model: if the number of virtual CPU cores assigned is less than the total number of physical CPU cores in the machine, you pay for the lower number of CPUs.  Other companies use a more confusing model where they convert CPU cores into a mathematically derived number and price against that.  This method gets confusing quickly and may or may not save you any money.  Companies licensing under the Software B per-machine model have largely left that model in place, choosing to license at the virtual machine level instead of coming up with a new formula.  With either method, you need to read your contracts ahead of time and most likely ask a sales representative to stop by and explain them to you.
This is most often a large part of the decision process I use when advising whether it will be worth virtualizing any piece of closed source software.  Even some open source support models get into the act, so read all of your contracts carefully and ask your sales representatives how they handle licensing in a virtualized environment.
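The two licensing models above boil down to simple arithmetic, so here is a minimal sketch comparing them.  All the prices and core counts are hypothetical, purely to illustrate how the lowest-count model caps what you pay:

```python
def per_cpu_cost(vm_vcpu_counts, physical_cores, price_per_cpu):
    """Software A, lowest-count model: pay for the lesser of the total
    vCPUs assigned to guests or the physical cores in the host."""
    total_vcpus = sum(vm_vcpu_counts)
    return min(total_vcpus, physical_cores) * price_per_cpu

def per_vm_cost(vm_count, price_per_machine):
    """Software B, per-machine model applied at the virtual machine level."""
    return vm_count * price_per_machine

# Hypothetical host: 16 physical cores, five guests with 4 vCPUs each
# (20 vCPUs total, so the CPU count is overcommitted).
vms = [4, 4, 4, 4, 4]
print(per_cpu_cost(vms, 16, 1000))   # capped at 16 cores -> 16000
print(per_vm_cost(len(vms), 2500))   # 5 virtual machines -> 12500
```

Running the numbers like this before signing is exactly the "read your contracts ahead of time" step: the better deal flips depending on how densely you pack guests onto each host.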

Applications that do a lot of reading from and writing to disk are a questionable fit.  The reason is that you only have so much throughput to the disk to work with.  New disk technologies are arriving all the time that will help eliminate this issue in the future.  Until then, the main option is something like direct attached storage dedicated to the virtual machine.  This can still suffer from the same bandwidth issues, but for now it is the best option, even with the higher cost of implementation for the cards and other equipment needed.  These direct attached options also bring other problems, such as limiting automated migrations and complicating backups of those same disks.
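The throughput ceiling is easy to reason about with back-of-the-envelope math.  A rough sketch, with entirely hypothetical bandwidth numbers, of the check we do before stacking write-heavy guests on a shared storage path:

```python
def storage_saturated(host_mb_per_s, guest_demands_mb_per_s):
    """True if the guests' combined disk demand exceeds what the
    shared storage path can actually deliver."""
    return sum(guest_demands_mb_per_s) > host_mb_per_s

# Hypothetical storage path good for ~400 MB/s.
guests = [120, 150, 90]                         # three write-heavy guests
print(storage_saturated(400, guests))           # 360 MB/s fits -> False
print(storage_saturated(400, guests + [80]))    # 440 MB/s does not -> True
```

Real contention is burstier than a simple sum, of course, but if even the averages oversubscribe the path, the candidate is a poor fit.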

Where virtualization shines is in CPU- and memory-intensive applications.  A syslog server might not be a great choice for virtualization because of all the disk writes it needs to do, but running analytics on those logs might work wonderfully.  Mixing applications on the virtualization host is also a great way to exploit more of the potential of the hardware you purchase.  If you have a web application that your customer care representatives use to support customers from 8 AM to 5 PM, you can share that capacity with your billing processing application from 5 PM to 8 AM.  Likewise, a large group of low-CPU but high-memory applications can very effectively be combined with high-CPU, low-memory applications.
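That last pairing idea can be sketched as a tiny matching heuristic.  This is not any particular scheduler's algorithm, just an illustration with made-up app names and demand figures: sort candidates by CPU appetite and pair the hungriest with the lightest, so each pair balances out:

```python
def pair_complements(apps):
    """Pair the highest-CPU app with the lowest-CPU app, and so on
    inward.  Since CPU-heavy apps here tend to be memory-light, each
    pair keeps a roughly even CPU and memory footprint per host."""
    ordered = sorted(apps, key=lambda a: a["cpu"])
    pairs = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        pairs.append((ordered[lo]["name"], ordered[hi]["name"]))
        lo += 1
        hi -= 1
    if lo == hi:                       # odd one out gets its own host
        pairs.append((ordered[lo]["name"],))
    return pairs

apps = [
    {"name": "analytics", "cpu": 14, "mem": 8},   # high CPU, low memory
    {"name": "cache",     "cpu": 2,  "mem": 48},  # low CPU, high memory
    {"name": "billing",   "cpu": 10, "mem": 12},
    {"name": "web",       "cpu": 4,  "mem": 24},
]
print(pair_complements(apps))
# -> [('cache', 'analytics'), ('web', 'billing')]
```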

The hardest value of virtualization to show in an ROI calculation is the ability to build, tear down, and restore to a snapshot.  One of the scariest parts of doing an in-place upgrade of any software, from the OS to the web browser, is the possibility of complete corruption.  When the software runs on a virtual machine, you can take a snapshot of the machine before you attempt the upgrade.  Then, no matter what happens, within seconds you can revert to that snapshot and reboot, and everything is back to the state it was in before you started.  The only time-based restriction is when you clicked the button to take the snapshot; that restore point could be from yesterday or from last year.  Here is an example of what we mean:

A restaurant has a point of sale system that could run on a virtual machine.  The vendor that supplies this software sends the restaurant a new upgrade.  The upgrade has to be applied to the existing system in place on the server because of database updates.  The restaurant can take a snapshot of the virtual machine first to save a copy of what it looked like before they begin.  Once that is complete, they can run the vendor’s update software.  If at any point they decide that the update either isn’t working as expected or just isn’t working at all, they simply tell the virtualization software to go back to the way things were before they began.  They can also make copies of the virtual machine and do test runs of the upgrade on a copy.  If the update doesn’t go well, they can keep trying until they work out the complete and final process, with no interruption to other functions.  They come out of testing with a proven process and can be much more confident when releasing the update.

Having participated in several releases of new software like this, I can honestly say that doing things on physical hardware is far more stressful and demanding.  When you don’t have to worry about making a mistake, you tend not to miss as many steps or cause yourself other issues.  This by no means guarantees that issues you didn’t see in your tests won’t happen, but with this type of process you can back out and try again until you are successful.

The next big feature of virtual machines is the ability to migrate them from one piece of hardware to another.  In a recent FLOSS Weekly episode, the folks from VirtualBox discussed a test they do by pushing a virtual machine around the office; the test isn’t successful until the machine has been pushed through all of the major operating systems in a big circle.  It’s a silly example, but if you had a server in one data center that needed to be shut down for maintenance, you could just as easily push the virtual machine across the data center or around the world.  This technology makes it easy to get the machine onto a safely running system with no issues.

So here is our list of reasons for when to virtualize, along with the steps and tools we use to make the final decision.
1) Collect as much data as makes reasonable sense.  If you have a monitoring solution like Zenoss or GWOS, create some reports there.  If you don’t, Linux has a tool called sar (part of the sysstat package) that can be set up to collect data about how the system is performing.  You can then use tools like kSar to display the output and create pretty graphs.  (Pretty graphs always help tell the story; a picture really is worth a thousand words in this situation.)

2) Determine the servers that seem most likely to be able to co-exist on the same hardware.  Try to come up with a scheme that can be easily explained to others.  Focus on balancing CPU and memory, and shy away from things that consume a lot of disk I/O.

3) Determine the hardware needs of your organization based on the data gathered and your costs.  For instance, do not spend big money on memory if the applications that will use it won’t migrate for a year, by which time prices will have dropped.

4) Verify your design with any internal developers and other support team members.  Does everyone agree with your assessment and plan?  How do they think you did?  What feedback do they have for you?
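Step 1 above mentions feeding sar output into a tool like kSar; even without kSar, summarizing the samples takes only a few lines.  Here is a rough sketch that averages CPU utilization from `sar -u`-style lines.  The sample text is made up, and real sar output varies in layout by version and locale, so treat the column positions as an assumption to adjust:

```python
# Hypothetical excerpt in the shape of `sar -u` output:
# timestamp, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle
SAMPLE = """\
12:10:01 AM all 4.21 0.00 1.02 0.35 0.00 94.42
12:20:01 AM all 12.80 0.00 2.10 1.15 0.00 83.95
12:30:01 AM all 35.40 0.00 4.75 2.20 0.00 57.65
"""

def average_busy(sar_text):
    """Average CPU utilization (100 - %idle) across the 'all' CPU rows."""
    busy = []
    for line in sar_text.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[2] == "all":
            busy.append(100.0 - float(fields[-1]))   # last column is %idle
    return sum(busy) / len(busy)

print(round(average_busy(SAMPLE), 2))   # -> 21.33
```

Numbers like this, graphed over a few weeks, are what feed step 2: a server averaging 20% busy is a much better consolidation candidate than one pinned at 90%.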

We know this seems very simplistic, and it is, because keeping it simple and straightforward is what we have found works best.  Do not overthink the decisions; just give it a try and see what works.  Some of our best plans, based on the most data, have blown up in our faces because things just acted differently in a virtual world than we predicted.  The biggest problems we have faced have been around putting too many of a certain workload type together.  While allocating more virtual CPUs than you have physical ones generally works, doing the same with memory does not.  Remember that you can almost always migrate guests to other hosts to balance them out if you make a grave mistake in this area, as long as you have the machines to do it with.
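The CPU-versus-memory asymmetry above can be turned into a quick sanity check before placing guests on a host.  The 4:1 CPU overcommit ratio here is a hypothetical example, not a universal rule; the point is that memory gets no such slack:

```python
def placement_ok(host_cores, host_mem_gb, guests, cpu_overcommit=4.0):
    """CPU can be overcommitted (here up to an assumed 4:1 ratio),
    but the guests' memory must still fit within physical RAM."""
    total_vcpus = sum(g["vcpu"] for g in guests)
    total_mem = sum(g["mem_gb"] for g in guests)
    return (total_vcpus <= host_cores * cpu_overcommit
            and total_mem <= host_mem_gb)

guests = [{"vcpu": 8, "mem_gb": 32}, {"vcpu": 8, "mem_gb": 48}]
print(placement_ok(16, 96, guests))   # 16 vCPUs, 80 GB RAM -> True
print(placement_ok(16, 64, guests))   # 80 GB > 64 GB RAM   -> False
```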

We couldn’t live without virtualization at this point.  The cost savings are smaller than we had hoped, but the productivity gains have been massive.  The confidence level of our admins keeps rising, too, thanks to snapshots and the quick cloning of machines.  If you haven’t started yet, don’t wait; it’s simple, easy, and can cost your company very little to get started.