vmMBA.com
Exploring the Business Side of Virtualization

Refresh Now! Revisited

In "Refresh Now!" I used cash flow (for the finance folks) and GAAP analysis (for the accountants) to prove that you could make a business case for replacing physical servers with VMs, even when those servers are not yet "due" for a refresh.

Sometimes, there is a disconnect between IT, finance, and accounting when it comes to spending money. To make it easier on the IT managers, the CFO gives them a budget to work within. Usually the budget is split between capital and operating buckets, which helps them manage balance sheets and cash flows better than a single, large bucket of money.

So, when talking to people working within a capital budget and operating budget, we can't use common financial measures such as NPV, ROI, payback period, or (my favorite), EVA. They need to justify their spending within the boundaries they are given, and it takes a higher level (e.g. CFO) decision to adjust those boundaries.

So, let's assume an IT manager at Acme, Inc. is operating under the following constraints:

  • Capital budget is $2 million for 2008. $250,000 of that is meant to refresh 50 existing servers, and $125,000 is for 25 new servers (for new projects)
  • Operating budget is $3 million for 2008. Of that, $200,000 is for power and cooling of the data center, and $300,000 is salary costs to manage the servers (our total server population is 200)

If we assume that all other elements of the capital and operating budgets stay the same, we can focus on the elements of the budget that are affected by server virtualization:

  • We have a capital budget of $375,000 for the purpose of purchasing 75 new servers.
  • We have an operating budget of $500,000 to keep all of our servers running

     

Working within these constraints, I can do the following:

  • Instead of purchasing 75 servers, I am going to purchase 5 large (2-socket quad core) servers (with a VM-to-host ratio of roughly 15:1 – very, very conservative). Each costs $20,000, so my total server cost is $100,000 (from the capital budget)
  • I'll need to purchase 5 licenses of VMware Infrastructure 3 (Enterprise Edition), at $5,750 list price each: total cost of $28,750 (ignoring discounts). Let's assume we already have the management infrastructure in place (e.g. Virtual Center)
  • I'll also need some shared storage to really take advantage of virtualization. Let's say it's another $35,000 (about $15 per GB if we're using NAS or iSCSI)
  • I would normally need more network ports for those extra 25 servers – with consolidation, I don't need them. That saves me $10,000.
  • Therefore, my net capital cost is $100,000 + $28,750 + $35,000 - $10,000 = $153,750. That leaves $221,250 of the $375,000 in my capital budget to replace servers that aren't due for a refresh yet!
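
The capital arithmetic above can be checked with a few lines of Python. This is only a sketch of the sums in the bullets; the variable names are mine, and every figure comes from the Acme example.

```python
# Capital-budget arithmetic from the Acme example (all figures from the post).
capital_budget = 375_000        # budget for 75 servers (50 refresh + 25 new)
host_count = 5                  # large 2-socket quad-core ESX hosts
host_price = 20_000             # per host
license_price = 5_750           # VI3 Enterprise list price, one license per host
shared_storage = 35_000         # NAS or iSCSI
network_ports_avoided = 10_000  # ports no longer needed for the 25 new servers

net_capital_cost = (
    host_count * host_price
    + host_count * license_price
    + shared_storage
    - network_ports_avoided
)
remaining_budget = capital_budget - net_capital_cost

print(f"Net capital cost: ${net_capital_cost:,}")
print(f"Left in budget:   ${remaining_budget:,}")
```

Changing the host count or the storage figure is an easy way to see how sensitive the leftover budget is to each assumption.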

 

On to the operating budget…

  • VMware support and subscription is roughly 25% of the original purchase cost – so, roughly $7,100 per year
  • A normal server in the US costs $1,000 per year in power and cooling expenses. ESX hosts are usually larger, so let's say $1,500 per year each. So, for the 75 servers in this study, we are saving (75 x $1,000 - 5 x $1,500 =) $67,500 per year in energy costs
  • My IT department tells me that the current staff can manage 20% more VMs than they can physical servers (template-based provisioning, more standardization, the use of snapshots, consolidated backups, cloned copies of production for testing, etc). Since I am adding (25/200 = 12.5%, call it 12%) more servers, I am avoiding a 12% addition to the salary budget (or, 0.12 x $300,000 = $36,000)
  • So, even though I am adding $7,100 in software maintenance, I am saving ($36,000 + $67,500 =) $103,500 in operating costs, for a net saving of roughly $96,000 per year
  • Those savings are even higher if I pull in next year's servers into the refresh plan, since I can lower my energy costs and (possibly) my salary costs even more
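
Here is the same operating-budget arithmetic as a small Python sketch. The variable names are mine. The 12% salary figure follows the rounding used in the bullets (25/200 is 12.5% unrounded), and support is computed at exactly 25%, so it comes out slightly above the "roughly $7,100" quoted above.

```python
# Operating-budget arithmetic from the Acme example.
license_cost = 5 * 5_750                      # five VI3 Enterprise licenses
annual_support = license_cost * 0.25          # support and subscription at 25%

physical_servers = 75                         # servers we are no longer buying
esx_hosts = 5
energy_per_server = 1_000                     # dollars per year, power plus cooling
energy_per_host = 1_500                       # larger ESX hosts draw more
energy_savings = (physical_servers * energy_per_server
                  - esx_hosts * energy_per_host)

salary_budget = 300_000
salary_growth_avoided = 0.12                  # rounded from 25/200 = 12.5%
salary_avoided = round(salary_budget * salary_growth_avoided)

gross_savings = energy_savings + salary_avoided
net_savings = gross_savings - annual_support

print(f"Annual support:  ${annual_support:,.2f}")
print(f"Gross savings:   ${gross_savings:,}")
print(f"Net savings:     ${net_savings:,.2f}")
```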

 

So, in this simple case, I can refresh twice as many servers as I normally would with the same capital budget, and have a huge impact on the operating budget. Again, this is a simple case, but the assumptions are quite conservative.

 

 


More on Transient VMs

In my article ("Virtualization Adoption Lifecycle") I introduced the concept of "Transient VMs".

At around the same time, I started hearing other people at VMware talking about "Transient VMs" – so, I'm not going to take credit for coining the term (unless our product management organization regularly reads vmMBA.com).

When I discussed this with a co-worker (who recently moved from the San Francisco Bay area), he said "yes, we really need to do something about those transients". (Probably funnier when you're sitting through day-long product update presentations like we were). No, I'm not talking about homeless people.

 

What are Transient VMs?

 

Consider two types of virtual machines:

  1. Traditional VMs: deployed with the intent that it remain powered on and managed indefinitely
  2. Transient VMs: deployed for a specific purpose, with a non-permanent lifespan; may or may not remain powered on constantly during its lifespan

     

Traditional VMs can be managed and priced in a way that is somewhat similar to physical servers. Although there are efficiencies gained through portability, replication, snapshots, and various automation touch points, a traditional VM is a server (or desktop) that must be managed on a day-to-day basis, and consumes resources even when it is idle.

Transient VMs give us a lot more flexibility in management and allow us to optimize resource utilization. We can take advantage of transient VMs in several ways today, and there are several technologies that are on their way that will make transient VMs even more prevalent.

Here are some examples:

Example 1: Legacy Application Servers

In my days in outsourcing, I came across a lot of servers that had long outlived their usefulness, but administrators were afraid to turn them off. Development servers for applications that had reached a stable state are often not used for years. Production servers for applications that had been sunset often have regulatory requirements to remain accessible.

Legacy servers are scary. They typically run on unsupported operating systems, which do not even have driver support for new servers. They are left in place, because no one even knows how to rebuild them. If we convert them to VMs, we immediately improve availability if they are left running (VMs are hardware-independent, run on newer hardware, and have HA built-in for VI3 Enterprise). Or, we can archive them, knowing we can recover the full state (configuration, OS, application, and data – not just data). We could even leave them powered off and keep them up to date on patches with Update Manager, with minimal human intervention.

 

Example 2: VMware Lab Manager

Traditionally targeted to the myriad of test lab sandboxes, Lab Manager gives a subset of control over to the development, test, and application management staff, allowing them to build entire application stacks from templates and collaborate on bug fixes (while IT still maintains templates and performs system and security management). Lab Manager is also evolving into a general-purpose tool for managing Transient VMs as customers find new and unique ways to use it (for example, in training labs).

 

Example 3: VMware Stage Manager

Announced at VMworld Europe, Stage Manager will allow application administrators to march an application through phases such as Unit Testing, Integration Testing, QA, Staging, and User Acceptance testing, while following prescribed change management processes, approvals, and archival, as applicable. Test systems can also be created as copies of production. The intent is to minimize configuration drift while also minimizing management costs (higher quality and lower costs, wow!) – and Transient VMs are a big part of it.

 

Example 4: VMware Lifecycle Manager

With the introduction of Lifecycle Manager last quarter, VMware fundamentally changed the way VMs can be requested and provisioned, and gave customers an easy option for an "expiry date" for VMs. Although the automated provisioning features are very helpful, the fact that VMs can have defined approvals, ownership, costing, and set expiry dates means ongoing infrastructure and management costs can be reduced significantly.

 

Example 5: Instant Test Servers

Most of the servers deemed "test" are used very infrequently. Let me differentiate "test" from "staging", "user acceptance testing" or "QA" servers: in this case, "test" servers are used by IT staff to test configuration changes, patches, or upgrades.

Whereas production servers are used constantly, and development/QA/staging/UAT servers are used in bursts of activity, test servers are typically used only before making a change on a production server. When we use traditional VMs (or physical servers), we still "manage" the server all year round (monitor it, keep it up to date on patches, and troubleshoot when certain things go wrong).

Conversely, we could create a test environment from a clone of production (or an image-level backup) when needed. We could even create multiple test environments to test multiple scenarios – thus improving our test quality. The overall result should be a lower management cost and higher quality of service.

 

Example 6: VDI

Changing gears: in many virtual desktop scenarios, user sessions can be defined as non-persistent. If user data and profiles are stored outside of the VMs themselves, and users do not self-install applications, "permanent" VMs aren't required. In these cases, the only image that is patched and managed is the master image – the non-persistent VMs can be destroyed at logoff (and re-created from the master for the next login).

 

 

How do we design for Transient VMs?

 

Transient VMs require a new way of looking at the way we build architectures for the data center. When faced with some key events such as migrations, refresh projects, or new implementations, we should evaluate whether 100% of migration candidates are actually needed. Some typical candidates for transient VMs include:

  1. Test servers: could point-in-time test servers (clones of production) be more useful than dedicated test servers?
  2. Development servers: would developers be better served with a flexible self-service environment that facilitates better collaboration (and would this ease the burden on server operations staff)?
  3. Legacy applications: are there servers that have no active users, but are kept powered on to satisfy regulatory requirements or out of fear?
  4. Bursty workloads: are there applications that require a set of servers for brief periods in order to satisfy cyclical or intermittent workloads, such as tax season, end-of-year processing, or peak sales periods? Web and SOA applications that are componentized are usually good candidates for a more flexible approach with transient VMs. Additional web and application server VMs can be created as needed, added to the pool, and destroyed when no longer required.

When we go through a server list to decide upon a migration plan, build plan, or refresh plan, we should be thinking about how transient VMs could be used to reduce costs and/or improve flexibility for IT operations, application developers, or the business units themselves.

 

How do we account for Transient VMs?

 

This is the challenging part. It's complex enough to determine the fixed, variable, and semi-variable costs, along with the shared and dedicated components when all of our VMs are powered on and running 100% of the time. Transient VMs have some unique features that affect the cost model:

  1. Infrastructure costs are not "free" for powered-off VMs. There needs to be enough reserve capacity to handle the maximum number of transient VMs that are powered on at one time. When the use of transient VMs comes in bursts (several at a time – such as during enterprise application performance testing activities), it may be easier to assume that 100% of their capacity is required at all times
  2. Energy costs may be reduced if the number of powered-on hosts can be easily managed based upon changing workloads (i.e. using Distributed Power Management)
  3. One-time costs should be minimized: if system administrator intervention is required for each power-on and power-off (and/or archive) operation, it can hurt the value proposition for transient VMs. Automation and self-provisioning can help
  4. Storage costs can be better managed with an Information Lifecycle Management strategy for transient VMs to move them off to low-cost storage when not in use. This is made easier with Storage VMotion, or with storage virtualization technologies such as EMC Invista, Hitachi's USP, or IBM's SAN Volume Controller
  5. Monitoring costs can be lower, but there must be a streamlined way to update the monitoring tools when VMs are archived or removed (so that it does not trigger a false positive downtime event). In many cases (e.g. Lab Manager), the VMs themselves may be unmonitored
  6. Management costs should be lower for transient VMs. Automation helps. Lab Manager, for example, is a self-service environment, and the VMs themselves are often "unmanaged". With Lifecycle Manager, many of the typical provisioning and configuration workflows are automated. VMware Update Manager can patch VMs even when they are offline. Even without automation, a VM that is powered on only infrequently should be easier to manage than one that is continuously powered on (as long as server operations are streamlined for transient VMs). Even better are VMs that are truly temporary and created for a specific short-term purpose (e.g. a clone of production for a temporary test activity).
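
As a starting point for the first item in that list, here is a minimal Python sketch of sizing hosts for reserve capacity. It assumes every VM is the same size and that a single VM-per-host ratio applies. The function name and the example numbers are hypothetical, not taken from any real environment.

```python
import math

def hosts_required(traditional_vms: int, peak_transient_vms: int,
                   vms_per_host: int) -> int:
    """Hosts needed when capacity is reserved for the peak number of
    transient VMs that are powered on at the same time."""
    if vms_per_host <= 0:
        raise ValueError("vms_per_host must be positive")
    return math.ceil((traditional_vms + peak_transient_vms) / vms_per_host)

# Hypothetical environment: 60 always-on VMs and 40 transient VMs,
# of which at most 15 are ever powered on together, at 15 VMs per host.
sized_for_peak = hosts_required(60, 15, 15)   # reserve for observed peak only
sized_for_all = hosts_required(60, 40, 15)    # assume 100% are on at all times

print(sized_for_peak, sized_for_all)
```

The gap between the two results is the infrastructure that is only justified if the transient VMs really do arrive in bursts.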

At some point in the future, I'd like to build a cost model that addresses transient VMs. If anyone has any that they can share with me, please forward.

It may seem counter-intuitive that VMware is finding ways to reduce the number of VMs in a customer's environment. The assumption is that the prospect of Transient VMs will do much more than previous waves of virtualization to transform the way infrastructure is designed, built, and managed, and thus move more servers (and desktops) over to a virtual world than is possible with standard processes.

 

 


Refresh Now!

 

Why would you replace a server that runs fine, meets SLAs, and is not fully depreciated?

This is a common obstacle to replacing a physical server with a VM when it is not yet "ready" to be refreshed. Historically, people have had a hard time producing a business case for these servers. Here, I present a framework that can help justify a "refresh now" plan, and it is based upon a few key requirements:

  • Energy costs are real. We can ignore the capital costs of power and cooling infrastructure (e.g. UPS, PDU, generator, chillers, etc. - which do not go away as a result of virtualization, unless new infrastructure is required due to growth). What we can't ignore is the monthly electrical utility bill. A typical server in the US costs over $1000 in electricity per year to remain powered on and kept cool. That number jumps to $1400 in Western Europe and is $1500 in certain areas of the US such as New England
  • Shared Storage costs are decreasing. The use of NFS and iSCSI, as well as the typical annual decrease in cost per GB means that the entry cost for enterprise server virtualization is becoming lower
  • Multi-core processors have an impact in two areas: they increase the VM-per-server ratio (a server with 8 cores can easily handle over 20 typical VMs) and they reduce the VMware license cost per VM (VMware counts by CPU socket, not by core)

When you take into account some of the above factors, it's often quite easy to produce a strong business case for virtualizing servers that are not yet due for a refresh.
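
To make the multi-core point concrete, here is a rough Python sketch of license cost per VM. It assumes one $5,750 license covers a two-socket host, which is the list-price figure used in the Acme example elsewhere on this blog. The function name is mine.

```python
def license_cost_per_vm(license_per_host: float, vms_per_host: int) -> float:
    """Spread the per-host license cost across the VMs that host can run."""
    if vms_per_host <= 0:
        raise ValueError("vms_per_host must be positive")
    return license_per_host / vms_per_host

LICENSE_PER_HOST = 5_750  # assumed: one VI3 Enterprise license per 2-socket host

# The socket count stays the same, so the license cost stays the same;
# more cores per socket only raises the number of VMs sharing it.
for ratio in (10, 15, 20):
    cost = license_cost_per_vm(LICENSE_PER_HOST, ratio)
    print(f"{ratio} VMs per host: ${cost:,.2f} per VM")
```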

The following spreadsheet shows a per-VM analysis that determines the breakeven point, in months, for virtualizing a server that still has useful life remaining. Some things to keep in mind:

  • Analysis includes both cash flow (for the Finance people) and GAAP views (for the accounting people). Cash flow includes a leasing option, which helps ease the cash flow hit with these kinds of projects
  • Cost savings of data center floor space and capital infrastructure (PDUs, UPSs, etc) are ignored
  • VMware software is calculated at list price (not discounted)
  • A certain percentage of existing servers may be re-used as VMware ESX hosts, but server depreciation expense continues until end of term for those that remain

Feel free to make comments or suggestions on this spreadsheet – it is a work in progress. You may find that some of the numbers are conservative, others may seem aggressive, so your mileage may vary. The full version is here.


Virtualization Adoption Lifecycle and Process Maturity

I've been talking to various people recently about my last entry, the "Virtualization Adoption Lifecycle".

I'd like to differentiate my framework a little bit from the various operational readiness / process maturity models out there.  The framework I proposed is about strategy: in other words, how embedded is virtualization into an organization's overall IT strategy.  It affects the way design decisions, infrastructure choices, cost models, and chargeback frameworks are made.  For example, an organization that chooses to use "transient VMs" over the traditional server model is making a strategic decision independent of processes.

There is a whole other body of knowledge out there about process optimization.  One of the more common frameworks is the one from Carnegie Mellon's Software Engineering Institute - the Capability Maturity Model Integration (CMMI) - the "I" has been added rather recently.  It, and others like it, look at an organization's maturity in a series of levels, from the worst (chaotic) to the best (optimized - with repeatable processes and a mature operational framework).  VMware's services organization has built an Operational Readiness practice to apply those principles to virtualization, and several Systems Integrators are building similar practices.

If you're really interested in the subject of process optimization, read The Goal.  It's a textbook (poorly) disguised as a novel, and is required reading for any Industrial Engineering student.

So, don't view my lifecycle as a way to optimize processes around virtualization.  It's a way to look at how strategic virtualization is to IT as a whole.





The Virtualization Adoption Lifecycle

The technology adoption lifecycle is very important in the high-tech industry.  An entire genre of books, such as Crossing the Chasm (and its sequels) and The Innovator's Dilemma (and solution), have focused on the adoption of new technologies, disruptive innovation, and maturity/obsolescence. 

Technology adoption almost always follows a bell curve, as in Everett Rogers' Technology Adoption Lifecycle model:

[Figure: Diffusion of Innovation bell curve]

By some measures, virtualization is in the "Early Majority" or even "Late Majority" phase: for example, VMware has 100% of the Fortune 100 as customers, which says that large enterprises are using virtualization.  At the same time, however, various studies pinpoint the number of virtual servers as something less than 10% of the addressable population (putting us in the "Early Adopter" phase - not really applicable for a 6th generation product).

This tells me that the technology adoption lifecycle isn't the best framework for categorizing where an organization is in the adoption of virtualization.

Maybe we could look at it as an S-Curve:

[Figure: S-Curve]

...that's better.  One could fit a 5-stage framework into that curve.  However, that only looks at the percentage of addressable servers that are virtual.  It doesn't take into account the way virtualization is actually used (i.e., is the customer using VMotion for maintenance activities, or snapshots for quick roll-backs).

So, I came up with a 5-level framework that could probably fit into some kind of 2x2 matrix, but we'll defer the visuals (for now), and look at five stages, and the corresponding timeframes at which customers reached these stages.

Level 1 - Experimental [2005-2006]

In this phase, an organization uses physical (non-virtual) infrastructure for all new builds and refreshes of existing assets.  Virtualization is only used in pilot, proof-of-concept, or limited development deployments.

Most organizations are already beyond this level. 

Level 2 - Limited Deployment [2006]

As organizations became a little more comfortable with virtualization, they began to use it to replace actual physical servers -- these servers were normally development, test, or non-critical production servers.  Some of the common themes in Level 2 include:

  • Use of broad rules of thumb instead of actual utilization data to determine virtualization candidates (e.g. "never put a database in a VM; avoid VMs for network-intensive applications")
  • Hesitance to use virtualization for critical applications (ignoring capacity or workload requirements)
  • Resistance to shared infrastructure from business units (normally a chargeback issue, if not purely a perception issue)

Here, the financial effects of server virtualization are mostly limited to capital costs (mainly server hardware) and operational costs associated with power, cooling, and floor space.  It is very difficult to take advantage of the flexibility and operational efficiencies afforded by virtualization when virtual servers are treated as second class citizens.

 

Level 3 - Virtualize First [2006-2007]

Somewhere in the past two years, IT shops began to see server virtualization as strategic.  For a time, this was my definition of "strategic virtualization" - in other words, if a virtual server is the default target for a new deployment or refresh of an existing asset, then it passes the test for whether virtualization was considered "strategic" to an organization.  I now realize that the "strategic" bar should be set higher (see Level 4).

The "Virtualize First" policy means that at the time a decision point is made on a server (for example, a new deployment, refresh, migration, or event caused by  power/cooling constraints), the default target is a VM, unless a logical counter-case can be made.

Examples of logical counter-cases might include:

  • Server requires direct access to hardware (e.g. fax server, USB dongles), and no cost-effective workaround is available
  • An inordinate amount of capacity is required, e.g. 24 GHz [ xxx SPEC] of CPU cycles (keeping in mind differences in throughput between legacy and state-of-the-art technology)
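
The "Virtualize First" rule and its two counter-cases can be written down as a simple decision function. This is just a sketch. The field names are hypothetical, and the 24 GHz ceiling is the example figure from the bullet above rather than a hard limit.

```python
from dataclasses import dataclass

CPU_CEILING_GHZ = 24.0  # example threshold from the counter-case list

@dataclass
class ServerRequest:
    name: str
    needs_direct_hardware: bool = False       # e.g. fax board, USB dongle
    hardware_workaround_available: bool = False
    cpu_demand_ghz: float = 0.0

def default_target(request: ServerRequest) -> str:
    """Return 'vm' unless one of the logical counter-cases applies."""
    if request.needs_direct_hardware and not request.hardware_workaround_available:
        return "physical"
    if request.cpu_demand_ghz >= CPU_CEILING_GHZ:
        return "physical"
    return "vm"

print(default_target(ServerRequest("web01", cpu_demand_ghz=2.0)))
print(default_target(ServerRequest("fax01", needs_direct_hardware=True)))
```

The point of encoding it this way is that the burden of proof sits with the exception: anything not explicitly matched falls through to a VM.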

There are still a number of poor counter-cases in organizations' decision trees, for a variety of reasons.  For example, VMware ESX Server 2.x was limited to 3.6 GB of RAM per VM - whereas a 3.6 GB limit was applicable in 2005, the ESX 3.0.x limit is 16 GB, and the current (ESX 3.5) limit is 64 GB.  At the same time, processing power (mainly due to multi-core processors) and I/O bandwidth capabilities have kept pace.

Capacity Planner studies, time and again, show that 80-90% of Wintel workloads are virtualization candidates in organizations large and small.  Finally, whereas application vendors were loath to support their applications in a virtual environment in the past, much of that has gone away due to customer pressure and general industry maturity.

As promising as the "Virtualize First" policy is, it does not, on its own, provide much operational efficiency.  Even if an organization moves 100% of its servers to VMs, it will not see much operational efficiency if it continues to manage its newly-virtualized servers as if they are the same physical assets that they replaced. 

Level 4 - Operational Transformation [2007]

Somewhere between Levels 4 and 5 is where the term "strategic virtualization" could be applied.

Level 4 can only be achieved with some level of executive buy-in.  Whereas it is relatively easy to migrate a large number of physical assets into virtual machines without changing an organization, it requires real executive sponsorship to drive a change in the way systems are managed.

Examples of the types of activities that can be transformed because of virtualization may include:

  • Provisioning: building new servers is made easier with hardware-independent template-based provisioning.  This can include not only the OS and core tools (the traditional "image-based" deployment approach), but also entire application stacks
  • Standardization: template-based provisioning drives a higher level of standardization, and higher standardization is the number one driver of a high server-to-admin ratio
  • Configuration changes: if a VM snapshot is taken before a change, it allows for a quick, low-impact rollback to a known trusted state when necessary
  • Server maintenance: using VMotion, a server can be taken offline for upgrades, part-swap, or replacement without impacting the applications running on it - during business hours, and without requiring off-hours work and overtime pay
  • Patching: even before VMware Update Manager, organizations were able to use clones to test patches or snapshots to roll back from bad patches - now, patching of hosts, guest OSs (online or offline) and applications is automated

Level 5 - Business Transformation [2008]

This is the year of Business Transformation through virtualization. 

Some organizations got a head start on Business Transformation in 2006-2007 through the uses of Virtual Desktop Infrastructure (VDI) and Virtual Software Lifecycle Automation (VSLA) tools like Lab Manager.  These tools allowed them to change the way development infrastructure or desktop services were built and managed, without transforming their production server infrastructure.  We're seeing an increased adoption of those solutions, but my focus here is on the transformation of the way production servers are architected, built, managed, and commercialized.

Another key development is the concept of "Transient VMs".  Traditional (physical) server architecture means static servers, built for a purpose, and those servers typically "stick around" for a long, long time.  A server that is powered on and placed on the network is a server that must be patched and managed.  Transient VMs are those that are created for a purpose, but then may be archived or destroyed as required. 

Examples may include test VMs (clones of production, perhaps), development environments in Lab Manager, or legacy applications in VMs that can be powered off and archived for compliance reasons.  Plus, the great thing about powered-off VMs these days is that they can be patched with VMware Update Manager (in other words, the best of both worlds: a patchable VM that doesn't need to be managed on a day-to-day basis).

A third development is the use of advanced automation with virtualization.  VMs, because of their portability and flexibility, are a much better object for automation than physical machines.  This has given rise to solutions such as Dunes VS-O (now part of VMware), which automates hundreds of workflows that normally would be performed manually.

A fourth development is the fact that organizations (both outsourcers and internal IT shops) are offering new services that use virtualization at their core.  These include Disaster Recovery (using internal assets instead of third parties), on-demand computing, or hosted virtual desktop.  For outsourcers in particular, it means that they can get new revenue streams without increasing their customers' overall IT spend (usually, the outsourcer gets more revenue, and the customer reduces costs).

A combination of organizational experience, experience in the systems integrator community, and product capability has given us the "perfect storm" in 2008 for business transformation.



The Breakeven Point

[Reposted 2/11/08 - somehow this post disappeared from the blog]


Sometimes it amazes me when large organizations tell me that the "entry costs" for server virtualization are too high.

I shouldn't be amazed. There are two good reasons why even large enterprises may build out in small pieces.

  • Branch or remote offices that do not benefit from centralization of infrastructure (sites like this number in the hundreds for some organizations)
  • Small, tactical, project-based deployments with self-contained budgets

In either case, the question of breakeven point comes up: what is the minimum number of servers required for financial breakeven of virtualization (relative to traditional, physical servers)?

So, of course, I built a little spreadsheet to find out.

Most of the assumptions are outlined in the corresponding spreadsheet, but here are the high-level assumptions:

  • Costs are a 3-year TCO, including server hardware, shared storage (where applicable), VMware ESX host software, and respective 3-year 24x7 support & subscription
  • No discounts applied to hardware or software (discounts are generally better for software, which would improve the breakeven point for virtualization)
  • The average cost per kWh of input power is $0.10 (power) plus $0.12 (cooling), but can be adjusted based upon location (the model contains data for all 50 US states, plus various European countries)
  • Costs are calculated with and without energy cost savings
  • "Soft" dollar savings in the form of management and refresh costs are excluded
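
For the energy assumption, a quick Python sketch shows how the per-kWh rates turn into an annual cost per server. The 500 W average draw is my own assumption for illustration. The rates are the defaults from the list above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_energy_cost(avg_watts: float,
                       power_rate: float = 0.10,
                       cooling_rate: float = 0.12) -> float:
    """Yearly cost in dollars of powering and cooling one server."""
    kwh_per_year = avg_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * (power_rate + cooling_rate)

# Assumed 500 W average draw, running around the clock.
print(f"${annual_energy_cost(500):,.2f} per year")
```

At 500 W this lands a little under $1,000 per year, which is in the same range as the per-server energy figure used elsewhere on this blog.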

It turns out that without High Availability, the breakeven point is two servers. With two hosts, some basic replication and manual restart of VMs, the breakeven point is 4 VMs. With two hosts, entry-level shared storage, and VMware HA, the breakeven point is 5 VMs.
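
The spreadsheet does the real work, but the breakeven idea itself fits in a few lines of Python. This sketch assumes a simple cost shape: a fixed virtualization cost (hosts, storage, licenses, support) plus an optional per-VM cost, compared against a flat cost per physical server. The dollar figures in the example are placeholders and are not taken from the spreadsheet.

```python
from typing import Optional

def breakeven_vms(physical_cost_per_server: float,
                  virtual_fixed_cost: float,
                  virtual_cost_per_vm: float = 0.0,
                  max_vms: int = 1000) -> Optional[int]:
    """Smallest number of workloads at which the virtual option costs no
    more than the same number of physical servers (None if never)."""
    for n in range(1, max_vms + 1):
        virtual_total = virtual_fixed_cost + n * virtual_cost_per_vm
        physical_total = n * physical_cost_per_server
        if virtual_total <= physical_total:
            return n
    return None

# Placeholder 3-year figures, purely for illustration.
print(breakeven_vms(physical_cost_per_server=7_000,
                    virtual_fixed_cost=45_000,
                    virtual_cost_per_vm=500))
```

Adding High Availability or shared storage raises the fixed cost, which is why the breakeven point moves up as each option is added.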

The full analysis is here, and it includes a detailed Bill of Materials, energy costs, and cost-per-VM detail for each option.



One might argue that with only two hosts, you would only want to run up to 50% utilization - but 50% utilization could still be 20 VMs on two hosts, with redundancy. I could also extend the calculator to determine the breakeven point when a third host is added.

Those of you who were at VMworld 2007 can view the session "IP29: Virtualization of Remote Sites", which explored two main ways that customers are handling remote offices. The first is to bring infrastructure into a centralized data center (leveraging things like VDI, WAN acceleration (e.g. Riverbed), and cheap bandwidth). The second is to leave infrastructure at the remote sites, but use virtualization to reduce costs and simplify management.


About oversubscription


In the airline industry, they have this concept of "load management" - and besides figuring out how to charge us last minute travelers at quadruple the rate of leisure travelers, they have also gotten pretty good at oversubscribing seats.

Certain features of server (and storage) virtualization allow us not only to oversubscribe our resources, but to do it without offering a $300 travel voucher when we're oversold.  What we can do is analogous to having two or more people sit in the same seat at the same time (comfortably), or forcing one person to give up part of his/her body that isn't being used, to make room for someone else.

Oversubscription, basically, is when the sum of all allocated resources is greater than what is actually available.  In the case of memory, for example, it means that you may have 20 VMs, each with 1 GB of allocated memory (for a total of 20 GB), but consume only 10 GB of physical memory. 

Oversubscribing Memory

Memory oversubscription (or overcommit) in a hypervisor can come from four main sources:

  1. Powered-off VMs - many of our VMs may be "transient" and not always powered on.  VMware Lab Manager is one example that makes heavy use of transient VMs.
  2. Transparent Page Sharing - this is unique to VMware, and is a low-overhead way of oversubscribing memory.  Common pages (or zero-pages) in VMs are stored in physical memory only once.
  3. Balloon Driver - another VMware technology, built into VMware Tools.  The balloon driver "tricks" a VM into giving up memory that it doesn't actually need.
  4. Swap - data is taken out of physical memory, and sent to disk storage.  Swap isn't necessarily a bad thing: for example, certain parts of a VM are used once, and never accessed after boot time.

It is normal for VMware customers to achieve 2:1 (or greater) overcommit ratios for memory - mainly due to Transparent Page Sharing and Balloon Driver.  Without those technologies, a hypervisor needs to dedicate the full amount of RAM for each VM.  Since CPU horsepower is "cheap" compared to memory (e.g. due to dual- and quad-core processors), memory is the first resource to hit a constraint.

So, let's get to the financials:



The above spreadsheet shows the cost-per-VM difference between VMware ESX Server 3.x (with the console OS), 3i (without the console OS), and a hypervisor without memory overcommit capabilities (e.g. Xen-based or Microsoft Hyper-V).  It turns out that the difference in software license costs is more than outweighed by the memory requirement per VM: the cost per VM on a host with memory overcommit is 60% of the cost per VM on a non-oversubscribed host.
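To make the mechanics concrete, here is a minimal sketch of how a cost-per-VM comparison like this can be computed when memory is the constraint. The function names and every dollar figure below are hypothetical placeholders chosen for illustration; they are not the numbers from the spreadsheet.

```python
import math


def vms_per_host(physical_gb, gb_per_vm, overcommit_ratio=1.0):
    """VMs a memory-constrained host can hold at a given overcommit ratio."""
    return math.floor(physical_gb * overcommit_ratio / gb_per_vm)


def cost_per_vm(host_cost, license_cost, physical_gb, gb_per_vm,
                overcommit_ratio=1.0):
    """Spread hardware plus hypervisor license cost across the VMs on a host."""
    vms = vms_per_host(physical_gb, gb_per_vm, overcommit_ratio)
    if vms == 0:
        raise ValueError("host cannot hold even one VM of this size")
    return (host_cost + license_cost) / vms


# Hypothetical inputs: a $10,000 host with 32 GB of RAM and 1 GB per VM.
# Assumed license prices: $5,000 with overcommit, $1,000 without.
with_overcommit = cost_per_vm(10_000, 5_000, 32, 1, overcommit_ratio=2.0)
without_overcommit = cost_per_vm(10_000, 1_000, 32, 1)

print(with_overcommit)     # 234.375
print(without_overcommit)  # 343.75
```

Even with a more expensive license in this made-up case, doubling the VM count per host lowers the cost per VM. Plug in your own hardware and license prices to see where the crossover sits.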

To validate this against your own VI3 environment, go to the "Hosts and Clusters" view in Virtual Center, select a host, and go to the Performance tab.  Click "Change Chart Options...", and pick "Memory...Real Time...".  The "Memory Granted" (sum of all memory that the VMs "think" they have) divided by the "Memory Consumed" (actual physical memory used by host) gives you a rough idea of the memory overcommit rate.
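Once you have read the two values off the chart, the arithmetic is a single division. A minimal sketch (the function name is my own, and the sample values reuse the 20 GB allocated / 10 GB consumed example from earlier in this post):

```python
def memory_overcommit_ratio(granted_gb, consumed_gb):
    """Rough overcommit rate: memory the VMs think they have / physical memory used."""
    if consumed_gb <= 0:
        raise ValueError("consumed memory must be positive")
    return granted_gb / consumed_gb


# 20 VMs with 1 GB each granted, 10 GB of physical memory actually consumed.
print(memory_overcommit_ratio(20, 10))  # 2.0
```

A result above 1.0 means the host is oversubscribed on memory; 2.0 corresponds to the 2:1 ratio mentioned above.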

Two white papers from VMware and Kingston, here, and here, give some more detail on memory overcommit.


Oversubscribing Storage

Storage is another resource that can be oversubscribed.  There are three main technologies that can accomplish storage oversubscription:

  1. Linked clones
    • This feature is available in VMware Lab Manager and VMware Workstation at the virtual disk level.  When a linked clone is used, the new VM uses pointers to the original VM for all common data.
    • The additional advantage of linked clones is that whitespace is not stored - for example if an empty data disk is part of a clone operation, the new disk will act as a "thin" disk and only consume the storage that it really requires for data
    • Linked clones can also be accomplished at the datastore level using technologies such as NetApp FlexClone (useful when cloning many VMs at once)
    • Keep in mind: linked clones pay a performance penalty on write operations (using copy-on-write), and put added stress on the source disks on read operations
  2. Thin Disks
    • Thin-provisioned disks are virtual disks that "appear" to the VM as one size, but only consume up to the amount of data that is required by that disk.  So, a 10 GB drive that is 50% utilized will only store 5 GB on disk (a traditional "thick" virtual disk would consume the entire 10 GB on disk)
    • Thin disks are options in VMware Workstation, and are the default disk type when using NFS storage in VMware ESX Server - however, VMs cloned from templates are always thick
    • Storage vendors such as Hitachi and NetApp have LUN-level thin provisioning, but that would only apply to VMware if using RDMs
  3. Deduplication
    • Deduplication is a technology similar to memory page sharing (above), where common data is stored only once.  It is done "after the fact" (ex post), meaning deduplication opportunities are found by a background process that scans the stored data
    • Deduplication is primarily used for backups (e.g. Symantec PureDisk, EMC Avamar, or Quantum DXi-Series), but can also be used on the filesystem itself (today, using NetApp Deduplication, formerly A-SIS)
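The storage arithmetic behind thick disks, thin disks, and linked clones can be sketched in a few lines. This is a simplified model under my own assumptions: the class and function names are hypothetical, a linked clone is modeled as one shared base image plus a per-clone delta, and metadata overhead is ignored.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    provisioned_gb: float  # size the VM "sees"
    used_gb: float         # data actually written


def thick_consumption(disks):
    """Thick disks consume their full provisioned size."""
    return sum(d.provisioned_gb for d in disks)


def thin_consumption(disks):
    """Thin disks consume only the data actually stored."""
    return sum(d.used_gb for d in disks)


def linked_clone_consumption(base_used_gb, clone_delta_gbs):
    """Linked clones share one base image; each clone stores only its changes."""
    return base_used_gb + sum(clone_delta_gbs)


# Ten 10 GB drives, each 50% utilized (the example from the thin disk bullet).
disks = [VirtualDisk(provisioned_gb=10, used_gb=5) for _ in range(10)]
print(thick_consumption(disks))  # 100
print(thin_consumption(disks))   # 50

# Ten linked clones of one 5 GB base, each with an assumed 0.5 GB of changes.
print(linked_clone_consumption(5, [0.5] * 10))  # 10.0
```

The delta size per clone is the number to watch: it grows over time as clones diverge from the base, so the savings are largest for short-lived or lightly modified VMs.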

The following table summarizes some of the cost savings available with storage oversubscription.  I have ignored tape backup savings due to de-duplication, and have only focused on online disk storage (NFS, iSCSI, or Fibre Channel).




One of the biggest obstacles to low-cost VDI deployments these days is the cost of storage, because traditional "thick" storage requires each virtual desktop to consume its full complement of storage.  These technologies go a long way in solving that problem.


Summary

We've looked at two main oversubscription opportunities (memory and storage), and shown how the use of common technologies for sharing and/or thin provisioning of those resources can reduce the unit cost per VM.

Other resources, such as CPU, bandwidth, and people, can be oversubscribed as well.  We don't have the same de-duplication or thin provisioning options with those resources, but we can still use the airline-like approach of load management (in other words, make intelligent assumptions about how many applications will be busy at the same time).




Introduction

Welcome to my blog!

It seems that a week doesn't go by in which I don't talk to a customer or partner about cost allocation for virtual machines.  Cost allocation, in essence, is placing costs in their respective "buckets".  In fact, there is an entire subset of accounting (cost accounting) that differentiates itself from financial accounting (balance sheets, income statements, cash flow statements, etc).

The IT industry is actually a little behind the manufacturing industry in how we allocate costs.  The manufacturing industry has had to deal with things like shared vs. dedicated components and fixed vs. variable costs for many years now.  Typically, they use an approach called Activity Based Costing (ABC), which has been refined over the years since 1987 and is most often associated with Harvard Business School professor Robert S. Kaplan.

Whether we use ABC or not, we are still faced with the challenge of properly allocating the costs of our infrastructure (and its management), and, hopefully, finding a way of properly recovering those costs.  Virtualization gives us a platform to carve up our resources in a much more flexible manner than before, but now we need to recover the costs of those resources in a way that maps to how they are "carved up".

There are a number of tools that can report on resource utilization in a virtual server environment.  Examples include software from VKernel, VAlign, Tivoli, VizionCore, SatoriTech, and Evident.  Those are valuable tools, but they all depend on one basic assumption: we know what our unit costs are.  We will probably never see a tool that can work out those unit costs for us.

Each virtual infrastructure environment has a set of costs that can be put into one of four quadrants, as in the diagram below:


Shared+Fixed: This is typically our initial build-out.  In a VMware Infrastructure world, it includes the first Virtual Center server, the initial set of ESX Server hosts, and the initial set of shared storage.  It also includes the administration and engineering costs for the core infrastructure.

Shared+Variable: Once the initial platform is built, most of the cost benefits are here.  Every subsequent ESX Server host added to the farm is a shared, variable cost.  Additional storage capacity on a modular or expandable array is shared/variable.  

Dedicated+Variable:
Depending on how standardized the management and administration procedures are, some element of dedicated work is associated with an individual VM.  This element is becoming more "shared" as we move to a more standardized, template-based infrastructure, and take advantage of tools like Update Manager.  Also included in dedicated/variable are certain software licenses that are instance-based.

Dedicated+Fixed:
I like to avoid this category as much as possible with virtual infrastructure. 

Once the costs are carved up into these quadrants, we need to do the following:
  1. Decide upon one or more core units of measurement.  This may be a VM "slice" (if all VMs are considered "equal"), a CPU unit (typically GHz), a memory unit (typically GB), and/or a storage unit (typically GB).  I like to focus on two measures - memory (because it is typically the constraint, more than CPU), and storage (which has a weak correlation with memory and CPU and can grow independently).
  2. For all shared infrastructure, calculate the total number of the above units available
  3. Subtract any applicable overhead
  4. Subtract any applicable excess capacity required for growth or spikes
  5. Multiply the total available capacity by the oversubscription ratio (this is an important topic best left for another blog entry - suffice it to say that oversubscription of memory, disk, and CPU can usually more than make up for overhead)
  6. Divide the total cost of shared infrastructure by the above number, and the result is your per-unit cost of shared infrastructure.
  7. Add in the costs of dedicated infrastructure.
As it is getting to be tax time, this approach may look a little familiar, and just as complicated.  Here is the basic formula:



Note that I have not split out fixed vs. variable costs.  That, again, is a subject for a later blog entry.  For now, let's include fixed and variable costs in the same bucket, and focus on how to carve up shared costs. 
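For readers who prefer code to formulas, the seven steps above can be sketched roughly as follows. This is my own minimal interpretation of the steps, with hypothetical function names and made-up numbers; fixed and variable costs are lumped together, as noted.

```python
def per_unit_shared_cost(shared_cost, total_units, overhead_units,
                         reserve_units, oversubscription_ratio):
    """Steps 2-6: cost per allocatable unit (e.g. per GB of memory)."""
    usable = total_units - overhead_units - reserve_units   # steps 3 and 4
    if usable <= 0:
        raise ValueError("no usable capacity left after overhead and reserve")
    allocatable = usable * oversubscription_ratio           # step 5
    return shared_cost / allocatable                        # step 6


def vm_cost(units_allocated, unit_cost, dedicated_cost=0.0):
    """Step 7: shared cost for the VM's allocation plus its dedicated costs."""
    return units_allocated * unit_cost + dedicated_cost


# Hypothetical farm: $100,000 of shared cost, 256 GB of RAM in total,
# 16 GB overhead, 40 GB held back for growth and spikes, 2:1 overcommit.
unit_cost = per_unit_shared_cost(100_000, 256, 16, 40, 2.0)
print(unit_cost)                                      # 250.0 per GB
print(vm_cost(2, unit_cost, dedicated_cost=100))      # 600.0 for a 2 GB VM
```

If you track more than one unit of measurement (say memory and storage), run the first function once per unit, each with its own share of the shared cost, and sum the results per VM.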

The challenge with fixed costs comes in when our environment grows: the fixed portion becomes a smaller and smaller component, and thus our total per-unit cost gets smaller as we move to a more variable model.  To keep the model simple, let's assume a steady-state size of the environment, and build all cost estimates based on that.
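A quick sketch shows why the fixed portion matters less as the environment grows. The function name and the dollar figures here are hypothetical, and the model assumes a constant variable cost per unit:

```python
def per_unit_cost(fixed_cost, variable_cost_per_unit, units):
    """Per-unit cost when a fixed cost is spread across a growing number of units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return fixed_cost / units + variable_cost_per_unit


# Assumed: $50,000 of fixed cost and $200 of variable cost per unit.
for units in (100, 500, 1000):
    print(units, per_unit_cost(50_000, 200, units))
# 100 700.0
# 500 300.0
# 1000 250.0
```

The per-unit cost approaches the variable cost as the unit count rises, which is why picking a single steady-state size keeps the estimate stable.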

I am working on an Excel spreadsheet using this methodology, so stay tuned.  For now, you may want to check out VKernel's methodology and calculator.

[Edited 1/21: fixed VKernel link]








