In "Refresh Now!" I used cash flow (for the finance folks) and GAAP analysis (for the accountants) to prove that you could make a business case for replacing physical servers with VMs, even when those servers are not yet "due" for a refresh.
Sometimes, there is a disconnect between IT, finance, and accounting when it comes to spending money. To make it easier on the IT managers, the CFO gives them a budget to work within. Usually the budget is split between capital and operating buckets, which helps them manage balance sheets and cash flows better than a single, large bucket of money.
So, when talking to people working within a capital budget and an operating budget, we can't use common financial measures such as NPV, ROI, payback period, or (my favorite) EVA. They need to justify their spending within the boundaries they are given, and it takes a higher-level (e.g. CFO) decision to adjust those boundaries.
So, let's assume an IT manager at Acme, Inc. is operating under the following constraints:
If we assume that all other elements of the capital and operating budgets stay the same, we can focus on the elements of the budget that are affected by server virtualization:
Working within these constraints, I can do the following:
On to the operating budget…
So, in this simple case, I can refresh twice as many servers as I normally would with the same capital budget, and have a huge impact on the operating budget. Again, this is a simple case, but the assumptions are quite conservative.
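To make the capital-budget arithmetic concrete, here is a minimal Python sketch. The dollar figures are placeholders I made up for illustration (they are not Acme's numbers), and `servers_refreshable` is a hypothetical helper name. The only thing it shows is that if the all-in capital cost per VM is half the cost of a physical server, the same budget covers twice as many refreshes.

```python
def servers_refreshable(capital_budget, cost_per_server):
    """Whole number of servers a fixed capital budget can cover."""
    return int(capital_budget // cost_per_server)

# All figures below are hypothetical placeholders.
budget = 100_000           # annual capital budget for server refresh
physical_cost = 5_000      # capital cost per physical server
vm_cost = 2_500            # capital cost per VM (share of host, storage, licenses)

physical_count = servers_refreshable(budget, physical_cost)
vm_count = servers_refreshable(budget, vm_cost)

print(f"Physical refreshes: {physical_count}, VM refreshes: {vm_count}")
```

Plug in your own per-server and per-VM costs; the ratio between them is what drives the result.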
In my article ("Virtualization Adoption Lifecycle") I introduced the concept of "Transient VMs".
At around the same time, I started hearing other people at VMware talking about "Transient VMs" – so, I'm not going to take credit for coining the term (unless our product management organization regularly reads vmMBA.com).
When I discussed this with a co-worker (who recently moved from the San Francisco Bay area), he said "yes, we really need to do something about those transients". (Probably funnier when you're sitting through day-long product update presentations like we were). No, I'm not talking about homeless people.
Consider two types of virtual machines:
Traditional VMs can be managed and priced in a way that is somewhat similar to physical servers. Although there are efficiencies gained through portability, replication, snapshots, and various automation touch points, a traditional VM is a server (or desktop) that must be managed on a day-to-day basis, and consumes resources even when it is idle.

Transient VMs give us a lot more flexibility in management and allow us to optimize resource utilization. We can take advantage of transient VMs in several ways today, and there are several technologies that are on their way that will make transient VMs even more prevalent.
Here are some examples:
Example 1: Legacy Application Servers
In my days in outsourcing, I came across a lot of servers that had long outlived their usefulness, but administrators were afraid to turn them off. Development servers for applications that had reached a stable state are often not used for years. Production servers for applications that had been sunset often have regulatory requirements to remain accessible.
Legacy servers are scary. They typically run on unsupported operating systems, which do not even have driver support for new servers. They are left in place, because no one even knows how to rebuild them. If we convert them to VMs, we immediately improve availability if they are left running (VMs are hardware-independent, run on newer hardware, and have HA built-in for VI3 Enterprise). Or, we can archive them, knowing we can recover the full state (configuration, OS, application, and data – not just data). We could even leave them powered off and keep them up to date on patches with Update Manager, with minimal human intervention.
Example 2: VMware Lab Manager
Traditionally targeted to the myriad of test lab sandboxes, Lab Manager gives a subset of control over to the development, test, and application management staff, allowing them to build entire application stacks from templates and collaborate on bug fixes (while IT still maintains templates and performs system and security management). Lab Manager is also evolving into a general-purpose tool for managing Transient VMs as customers find new and unique ways to use it (for example, in training labs).
Example 3: VMware Stage Manager
Announced at VMworld Europe, Stage Manager will allow application administrators to march an application through phases such as Unit Testing, Integration Testing, QA, Staging, and User Acceptance testing, while following prescribed change management processes, approvals, and archival, as applicable. Test systems can also be created as copies of production. The intent is to minimize configuration drift while also minimizing management costs (higher quality and lower costs, wow!) – and Transient VMs are a big part of it.
Example 4: VMware Lifecycle Manager
With the introduction of Lifecycle Manager last quarter, VMware fundamentally changed the way VMs can be requested and provisioned, and gave customers an easy option for an "expiry date" for VMs. Although the automated provisioning features are very helpful, the fact that VMs can have defined approvals, ownership, costing, and set expiry dates means ongoing infrastructure and management costs can be reduced significantly.
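The idea of an expiry date is easy to illustrate. The Python sketch below is not the Lifecycle Manager API; the inventory structure, VM names, owners, and the `expired_vms` helper are all hypothetical. It simply flags VMs whose expiry date has passed, so they become candidates for archival or decommissioning.

```python
from datetime import date

def expired_vms(vms, today):
    """Return names of VMs whose expiry date is before 'today'.

    VMs with no expiry date (None) are treated as permanent.
    """
    return [
        vm["name"]
        for vm in vms
        if vm["expires"] is not None and vm["expires"] < today
    ]

# Hypothetical inventory: names, owners, and dates are made up.
inventory = [
    {"name": "test-patch-01", "owner": "it-ops", "expires": date(2008, 3, 1)},
    {"name": "train-lab-07", "owner": "training", "expires": date(2008, 6, 30)},
    {"name": "erp-prod-01", "owner": "finance", "expires": None},
]

candidates = expired_vms(inventory, today=date(2008, 4, 15))
print(candidates)
```

Every VM that shows up in this list is one that no longer needs to be patched, monitored, or allocated storage and memory.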
Example 5: Instant Test Servers
Most of the servers deemed "test" are used very infrequently. Let me differentiate "test" from "staging", "user acceptance testing" or "QA" servers: in this case, "test" servers are used by IT staff to test configuration changes, patches, or upgrades.
Whereas production servers are used constantly, and development/QA/staging/UAT servers are used in bursts of activity, test servers are typically used only before making a change on a production server. When we use traditional VMs (or physical servers), we "manage" the server (monitor it, keep it up to date on patches, and troubleshoot when certain things go wrong), even though it sits idle most of the time.
Conversely, we could create a test environment from a clone of production (or an image-level backup) when needed. We could even create multiple test environments to test multiple scenarios – thus improving our test quality. The overall result should be a lower management cost and higher quality of service.
Example 6: VDI
Changing gears: in many virtual desktop scenarios, user sessions can be defined as non-persistent. If user data and profiles are stored outside of the VMs themselves, and users do not self-install applications, "permanent" VMs aren't required. In these cases, the only image that is patched and managed is the master image – the non-persistent VMs can be destroyed at logoff (and re-created from the master for the next login).
Transient VMs require a new way of looking at the way we build architectures for the data center. When faced with some key events such as migrations, refresh projects, or new implementations, we should evaluate whether 100% of migration candidates are actually needed. Some typical candidates for transient VMs include:
When we go through a server list to decide upon a migration plan, build plan, or refresh plan, we should be thinking about how transient VMs could be used to reduce costs and/or improve flexibility for IT operations, application developers, or the business units themselves.
This is the challenging part. It's complex enough to determine the fixed, variable, and semi-variable costs, along with the shared and dedicated components when all of our VMs are powered on and running 100% of the time. Transient VMs have some unique features that affect the cost model:
At some point in the future, I'd like to build a cost model that addresses transient VMs. If anyone has any that they can share with me, please forward.
It may seem counter-intuitive that VMware is finding ways to reduce the number of VMs in a customer's environment. The assumption is that the prospect of Transient VMs will do much more than previous waves of virtualization to transform the way infrastructure is designed, built, and managed, and thus move more servers (and desktops) over to a virtual world than is possible with standard processes.
Why would you replace a server that runs fine, meets SLAs, and is not fully depreciated?
This is a common obstacle to replacing a physical server with a VM when it is not yet "ready" to be refreshed. Historically, people have had a hard time producing a business case for these servers. Here, I present a framework that can help justify a "refresh now" plan, and it is based upon a few key requirements:
When you take into account some of the above factors, it's often quite easy to produce a strong business case for virtualizing servers that are not yet due for a refresh.
The following spreadsheet shows a per-VM analysis that determines the breakeven point, in months, for virtualizing a server that still has useful life remaining. Some things to keep in mind:
Feel free to make comments or suggestions on this spreadsheet – it is a work in progress. You may find that some of the numbers are conservative, others may seem aggressive, so your mileage may vary. The full version is here.
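For readers who want the gist without opening the spreadsheet, a per-VM breakeven calculation can be sketched in a few lines of Python. This is a simplified stand-in, not the spreadsheet's actual model: the function name, the inputs, and the example figures are my assumptions for illustration. It divides the one-time cost of virtualizing a server by the monthly savings from no longer running it on physical hardware.

```python
import math

def breakeven_months(upfront_cost, physical_monthly_cost, vm_monthly_cost):
    """Months until cumulative savings cover the one-time cost of virtualizing.

    Returns None when the VM is not cheaper to run each month,
    because breakeven is then never reached.
    """
    monthly_savings = physical_monthly_cost - vm_monthly_cost
    if monthly_savings <= 0:
        return None
    return math.ceil(upfront_cost / monthly_savings)

# Hypothetical per-server figures.
months = breakeven_months(
    upfront_cost=3_000,          # migration effort plus the VM's share of host and licenses
    physical_monthly_cost=400,   # power, cooling, space, maintenance, admin time
    vm_monthly_cost=150,         # the VM's share of the same cost categories
)
print(f"Breakeven after {months} months")
```

If the breakeven point falls before the server's scheduled refresh date, the case for refreshing now is worth making.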
Technology adoption almost always follows a bell curve, as in Everett Rogers' Technology Adoption Lifecycle model:
By some measures, virtualization is in the "Early Majority" or even "Late Majority" phase: for example, VMware has 100% of the Fortune 100 as customers, which says that large enterprises are using virtualization. At the same time, however, various studies pinpoint the number of virtual servers as something less than 10% of the addressable population (putting us in the "Early Adopter" phase - not really applicable for a 6th generation product).
This tells me that the technology adoption lifecycle isn't the best framework for categorizing where an organization is in the adoption of virtualization.
Maybe we could look at it as an S-Curve:
...that's better. One could fit a 5-stage framework into that curve. However, that only looks at the percentage of addressable servers that are virtual. It doesn't take into account the way virtualization is actually used (i.e., is the customer using VMotion for maintenance activities, or snapshots for quick roll-backs).
So, I came up with a 5-level framework that could probably fit into some kind of 2x2 matrix, but we'll defer the visuals (for now), and look at five stages, and the corresponding timeframes at which customers reached these stages.
In this phase, an organization uses physical (non-virtual) infrastructure for all new builds and refreshes of existing assets. Virtualization is only used in pilot, proof-of-concept, or limited development deployments.
Most organizations are already beyond this level.
As organizations became a little more comfortable with virtualization, they began to use it to replace actual physical servers -- these servers were normally development, test, or non-critical production servers. Some of the common themes in Level 2 include:
Here, the financial effects of server virtualization are mostly limited to capital costs (mainly server hardware) and operational costs associated with power, cooling, and floor space. It is very difficult to take advantage of the flexibility and operational efficiencies afforded by virtualization when virtual servers are treated as second class citizens.
Somewhere in the past two years, IT shops began to see server virtualization as strategic. For a time, this was my definition of "strategic virtualization" - in other words, if a virtual server is the default target for a new deployment or refresh of an existing asset, then it passes the test for whether virtualization was considered "strategic" to an organization. I now realize that the "strategic" bar should be set higher (see Level 4).
The "Virtualize First" policy means that at the time a decision point is made on a server (for example, a new deployment, refresh, migration, or event caused by power/cooling constraints), the default target is a VM, unless a logical counter-case can be made.
Examples of logical counter-cases might include:
There are still a number of poor counter-cases in organizations' decision trees, for a variety of reasons. For example, VMware ESX Server 2.x was limited to 3.6 GB of RAM per VM - whereas a 3.6 GB limit was applicable in 2005, the ESX 3.0.x limit is 16 GB, and the current (ESX 3.5) limit is 64 GB. A counter-case written against the old limit may no longer hold. At the same time, processing power (mainly due to multi-core processors) and I/O bandwidth capabilities have kept pace.
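To show how a counter-case goes stale, here is a small Python sketch that uses only the per-VM RAM limits quoted above. The function names and the 8 GB example workload are hypothetical, and a real decision tree would check far more than memory.

```python
# Per-VM RAM limits in GB, as quoted in the text above.
MAX_VM_RAM_GB = {
    "ESX 2.x": 3.6,
    "ESX 3.0.x": 16,
    "ESX 3.5": 64,
}

def ram_counter_case_holds(required_ram_gb, esx_version):
    """True if the workload needs more RAM than one VM can have on this version."""
    return required_ram_gb > MAX_VM_RAM_GB[esx_version]

def default_target(required_ram_gb, esx_version):
    """'Virtualize First': the target is a VM unless the counter-case holds."""
    if ram_counter_case_holds(required_ram_gb, esx_version):
        return "physical"
    return "VM"

# A hypothetical 8 GB workload, evaluated against each version.
for version in MAX_VM_RAM_GB:
    print(version, "->", default_target(8, version))
```

The same workload that had to stay physical under ESX 2.x becomes a VM candidate under 3.0.x and 3.5, which is why counter-cases should be re-tested with each release.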
Capacity Planner studies, time and again, show that 80-90% of Wintel workloads are virtualization candidates in organizations large and small. Finally, whereas application vendors were loath to support their applications in a virtual environment in the past, much of that has gone away due to customer pressure and general industry maturity.
As promising as the "Virtualize First" policy is, it does not, on its own, provide much operational efficiency. Even if an organization moves 100% of its servers to VMs, it will not see much operational efficiency if it continues to manage its newly-virtualized servers as if they are the same physical assets that they replaced.
Somewhere between Levels 4 and 5 is where the term "strategic virtualization" could be applied.
Level 4 can only be achieved with some level of executive buy-in. Whereas it is relatively easy to migrate a large number of physical assets into virtual machines without changing an organization, it requires real executive sponsorship to drive a change in the way systems are managed.
Examples of the types of activities that can be transformed because of virtualization may include:
This is the year of Business Transformation through virtualization.
Some organizations got a head start on Business Transformation in 2006-2007 through the use of Virtual Desktop Infrastructure (VDI) and Virtual Software Lifecycle Automation (VSLA) tools like Lab Manager. These tools allowed them to change the way development infrastructure or desktop services were built and managed, without transforming their production server infrastructure. We're seeing increased adoption of those solutions, but my focus here is on the transformation of the way production servers are architected, built, managed, and commercialized.
Another key development is the concept of "Transient VMs". Traditional (physical) server architecture means static servers, built for a purpose, and those servers typically "stick around" for a long, long time. A server that is powered on and placed on the network is a server that must be patched and managed. Transient VMs are those that are created for a purpose, but then may be archived or destroyed as required.

A combination of organizational experience, experience in the systems integrator community, and product capability has given us the "perfect storm" in 2008 for business transformation.
[Reposted 2/11/08 - somehow this post disappeared from the blog]
Sometimes it amazes me when large organizations tell me that the "entry costs" for server virtualization are too high.
I shouldn't be amazed. There are two good reasons why even large enterprises may build out in small pieces.
In either case, the question of breakeven point comes up: what is the minimum number of servers required for financial breakeven of virtualization (relative to traditional, physical servers)?
So, of course, I built a little spreadsheet to find out.
Most of the assumptions are outlined in the corresponding spreadsheet, but here are the high-level assumptions:
It turns out that without High Availability, the breakeven point is two servers. With two hosts, some basic replication and manual restart of VMs, the breakeven point is 4 VMs. With two hosts, entry-level shared storage, and VMware HA, the breakeven point is 5 VMs.
The full analysis is here, and it includes a detailed Bill of Materials, energy costs, and cost-per-VM detail for each option.
One might argue that with only two hosts, you would only want to run up to 50% utilization - but 50% utilization could still be 20 VMs on two hosts, with redundancy. I could also extend the calculator to determine the breakeven point when a third host is added.
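The shape of the breakeven calculation is simple enough to sketch in Python. This is not the spreadsheet's model and it will not reproduce the breakeven points above; the function name and every cost figure are placeholders I chose for illustration. It finds the smallest number of servers at which a fixed platform cost plus a per-VM cost is no more than buying that many physical servers.

```python
def breakeven_servers(fixed_cost, cost_per_vm, cost_per_physical, max_servers=1000):
    """Smallest server count where the virtual option costs no more than physical.

    Returns None if breakeven is not reached within max_servers.
    """
    for n in range(1, max_servers + 1):
        virtual_total = fixed_cost + n * cost_per_vm
        physical_total = n * cost_per_physical
        if virtual_total <= physical_total:
            return n
    return None

# Hypothetical figures: hosts, storage, and licenses are lumped into fixed_cost.
n = breakeven_servers(fixed_cost=12_000, cost_per_vm=1_000, cost_per_physical=4_000)
print(f"Breakeven at {n} servers")
```

Adding a second host, shared storage, or HA licensing raises `fixed_cost`, which pushes the breakeven count up; that is the pattern the three options above show.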
Those of you who were at VMworld 2007 can view the session "IP29: Virtualization of Remote Sites", which explored two main ways that customers are handling remote offices. The first is to bring infrastructure into a centralized data center (leveraging things like VDI, WAN Acceleration (e.g. RiverBed), and cheap bandwidth). The second is to leave infrastructure at the remote sites, but use virtualization to reduce costs and simplify management.
Shared+Fixed: This is typically our initial build-out. In a VMware Infrastructure world, it includes the first Virtual Center server, the initial set of ESX Server hosts, and the initial set of shared storage. It also includes the administration and engineering costs for the core infrastructure.
