Leveraging Virtualization to Optimize High Availability System Configurations
D. Beyer, P. F. Chan, E. M. Dow, F. Lefevre, S. Loveland
Abstract
Leveraging redundant resources is a common means of addressing availability
requirements, but it often implies redundant costs as well. At the same time,
virtualization technologies are bursting into data centers with the promise of
reducing costs through resource consolidation. Can virtualization and High
Availability (HA) technologies be combined to optimize availability while
minimizing costs? Yes, but merging them properly introduces new challenges.
Drawing on practical experiences from an IBM development test lab, this paper
looks at how virtualization technologies, methodologies, and techniques can
augment and amplify traditional HA approaches while avoiding potential pitfalls.
Special attention is paid to resource sharing, applying conceptual models (such as
Active/Active and Active/Passive) to virtualized environments, stretching virtual
environments across physical machine boundaries, and avoiding hazards that await
the unwary.
A computer system must be up and operational to be useful. The more available it is, the
more value it can provide to its users. So why isn’t every configuration highly available?
The main inhibitors are typically cost and complexity. Achieving tolerance for planned
and unplanned outages often involves configuring redundant physical resources. Those
resources increase system costs, leading vendors, consultants, and system administrators
to ask, “How much availability are you willing to pay for?” There is also inherent
complexity in setting up and managing high availability (HA) solutions. Redundant
resources imply redundant configuration work. The software configuration activity
required to assemble and coordinate all elements of the HA solution into a single, well-
functioning unit can be daunting as well. In addition, the task of establishing and
maintaining a monitoring infrastructure to track the availability and flow of work across
many discrete physical assets can be challenging. All of this complexity further inhibits
the adoption of HA solutions.
Virtualization technology offers a reduction in cost and complexity through the
abstraction or emulation of physical resources into logical, shareable entities.
Virtualization has existed on mainframes for decades,1,2,3 and more recently has been
adopted by other classes of servers as well. But virtualization is not a homogeneous
concept with uniform characteristics; it encompasses a range of technologies with
different attributes, resource sharing opportunities, and associated risks. These
techniques include system virtualization and resource virtualization.
System virtualization divides a single physical computer into multiple logical computers,
each of which executes its own guest operating system instance. A thin hypervisor layer
sits between the physical hardware resources and the operating systems. It exposes a set
of virtual platform interfaces that form a logical partition or virtual machine. The
hypervisor multiplexes and arbitrates access to the host platform’s resources so that they
can be shared across multiple virtual machines while enforcing a level of isolation that
keeps each guest independent of the others. This isolation is an important aspect of
virtual machine high availability, as it ensures a failure within one virtual machine can’t
compromise the availability of others.4 Hypervisors can be implemented in server
firmware or in software. Firmware hypervisors are implemented directly in the server
hardware’s microcode, and the virtual machines that are created are often called logical
partitions, or LPARs. Examples include the IBM* System z* Processor Resource/System
Manager* (PR/SM*) hypervisor, and the IBM POWER Hypervisor* firmware. Software-
based hypervisors, sometimes called virtual machine monitors (VMMs), are either booted
natively on host hardware or run on top of a host operating system. Examples include the
IBM z/VM* hypervisor, the VMware ESX-Server hypervisor, the Microsoft** Virtual
Server hypervisor, the XenSource Xen hypervisor, and the Linux** Kernel Virtual
Machine, or KVM, support. Multiple layers of virtualization are also possible, where one
hypervisor runs within the virtual machine created by another, adding even more
flexibility. For example, it’s common to divide an IBM System z mainframe into multiple
LPARs, and run a different instance of the z/VM hypervisor within each LPAR, as
depicted in Figure 1.
Figure 1 System Virtualization Layers
Resource virtualization, on the other hand, operates at a lower level than that of
hypervisors. It consists of technologies independent of (but complementary to)
hypervisors that exist specifically to virtualize individual resources, such as network
adapters, host bus adapters, or even heterogeneous disk farms.
HA + Virtualization: Opportunities and Challenges
At first glance, there may appear to
be a dichotomy between HA and virtualization. As shown in Figure 2, traditional
multiple-machine deployments rely heavily upon redundant yet discrete physical
hardware to achieve HA.5,6 This focus on resource redundancy at all costs seems at odds
with virtualization’s emphasis on resource sharing. We aim to show that in fact the
opposite is true: virtualization offers significant opportunities to augment and amplify
HA implementations.
Figure 2 Physical HA Configuration
However, virtual HA configurations also introduce new challenges and risks that must be
addressed for an implementation to succeed. Many of these challenges are subtle and may
not be apparent during initial, small-scale pilot deployments. Only when a pilot project
has been deemed successful and is being scaled to full production levels do many of these
issues rise to the surface and risk derailing it. These challenges will also be discussed,
along with approaches for surmounting them.
System Component Virtualization and HA
Many system components can be virtualized, including processors, memory,
cryptographic co-processors, networks, and storage. Let’s examine system and resource
virtualization approaches for these components in the context of how they can be
leveraged to support high availability deployments.
Virtualized Processors, Memory, and HA
To achieve true continuous availability, a server cluster must not only have multiple
nodes; it must also contain sufficient capacity so that when at least one node fails, the
surviving nodes are able to absorb its work without any loss of throughput. For traditional
physically discrete server clusters, this implies that each node must be over-configured
with processor and memory resources. This could be accomplished with two large servers
that normally run at a peak of 50% or less of their capacity. However, it’s probably more
common to find larger clusters of smaller servers, each somewhat over-configured with
resources to provide failover headroom that is spread across the entire cluster (this
proliferation of smaller nodes may result from the incremental expansion of throughput
demands, concerns about underutilization, or other factors). There are downsides to this
approach. Over-configuration obviously increases asset costs, but that’s not all.
Ironically, larger cluster sizes can also negatively affect availability. Why? Each node in
a cluster represents a potential point of failure – not a single point of failure, but a point
of failure nonetheless. This is important because while all clustering technologies aim to
minimize the window in which the outage of a node is visible to end users, many do not
eliminate that window altogether. TCP/IP timeout thresholds, cluster
resynchronization activity, and (in the worst case) node restart techniques can all lead to
outage time that is visible to end users. Lab testing of various cluster technologies has
shown outage times on busy environments ranging from seconds to minutes. Clusters
with larger numbers of nodes have more potential points of failure, so as cluster size
grows, so do the odds that on any given day, end users will experience some sort of
outage. Greater cluster size can mean greater system management costs as well.
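To make the over-configuration cost concrete, the following sketch (illustrative only; the cluster sizes are hypothetical) computes the peak utilization each node of a physically discrete cluster can sustain if the surviving nodes must absorb one failed node's work with no loss of throughput.

```python
# Illustrative headroom calculation for a physically discrete cluster that must
# absorb the failure of one node with no loss of throughput (hypothetical sizes).

def required_peak_utilization(nodes: int) -> float:
    """Maximum fraction of each node's capacity usable in normal operation if
    the surviving nodes must absorb one failed node's work."""
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes")
    return (nodes - 1) / nodes

for n in (2, 4, 10):
    pct = required_peak_utilization(n) * 100
    print(f"{n}-node cluster: each node must stay below {pct:.0f}% utilization")
# 2-node cluster: each node must stay below 50% utilization
# 4-node cluster: each node must stay below 75% utilization
# 10-node cluster: each node must stay below 90% utilization
```

The 50% figure for the two-node case matches the "two large servers" example above; larger clusters spread the headroom more thinly but multiply the potential points of failure.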
System virtualization can address both of these downsides. When a virtual server fails, its
processing capacity and often its real memory are immediately made available to the
surviving virtual servers in the cluster. Hence when one virtual cluster node fails, the
surviving virtual nodes not only absorb the failing node’s workload but its resources as
well, so there is no drop in available cluster capacity. This ability of the hypervisor to
dynamically and automatically flow resources among virtual machines largely eliminates
the need to over-configure individual cluster nodes, thereby reducing some of the costs
associated with a highly available cluster.
Likewise, in a system virtualization environment there’s often little need for large
clusters. As just noted, over-configured, underutilized servers are less of an issue. Also,
incremental capacity growth requirements can often be met by increasing the share of
physical resources available to virtual machines, adding physical resources to the
underlying host server, or both. Indeed, for many HA implementations a two-node virtual
cluster is sufficient to protect against unplanned node outages. However, for managing
planned outages, a three-node cluster is often better as it allows one node to be brought
down for service while the two remaining nodes continue to back each other up. In any
case, the smaller cluster sizes that system virtualization encourages tend to be easier to
manage and have fewer potential points of failure.
Challenges
As powerful as system virtualization can be in optimizing resource allocations and
reducing cluster sizes, it has limits. A particular middleware implementation may impose
its own software-oriented resource limits that cannot be addressed by the hypervisor’s
ability to redirect resources. For instance, application servers may become constricted by
Java** Virtual Machine (JVM) heap size restrictions, or by the number of allowed socket
connections in a connection pool.7 In such cases, additional virtual cluster nodes may be
needed to overcome the constraint.
Physical resource consumption across virtual machines is cumulative, which has a
number of implications. For instance, an operating system such as Linux may
automatically fill up any unused memory available to it with an I/O cache. Such a
technique is reasonable in a standalone server environment, but in a virtual one it can
needlessly consume resources that could better be used by other virtual cluster nodes.
Hence, “right-sizing” virtual machine memory capacity becomes an important
configuration consideration. Likewise, availability management (and other system
management) tools often depend on small management agents running on each node to
gather and report status back to a management server. These agents will typically
consume some portion of a CPU even when the system they are monitoring is idle. On
discrete systems, this is negligible. But even if one agent on each of 100 virtual machines
is using only 1% of a physical processor, cumulatively the agents consume a full
processor at idle.
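To illustrate how these seemingly negligible per-guest overheads accumulate, here is a toy calculation (the numbers are hypothetical, not measurements from the paper):

```python
# Toy illustration of how small per-guest overheads accumulate across many
# virtual machines sharing one physical server (all numbers hypothetical).

def cumulative_agent_cpu(guests: int, cpu_fraction_per_agent: float) -> float:
    """Physical processors consumed by monitoring agents while every guest is idle."""
    return guests * cpu_fraction_per_agent

def cumulative_cache_memory(guests: int, cache_mib_per_guest: int) -> int:
    """Real memory (MiB) tied up if each guest's OS fills spare memory with I/O cache."""
    return guests * cache_mib_per_guest

guests = 100
print(f"{cumulative_agent_cpu(guests, 0.01):.1f} physical processor(s) consumed by idle agents")
print(f"{cumulative_cache_memory(guests, 512)} MiB of real memory absorbed as guest I/O cache")
```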
One solution is the use of management tools that operate at the hypervisor level rather
than the individual guest level. An example is the IBM Operations Manager for z/VM, an
automation and monitoring tool that can periodically execute health-checking routines
and trigger user-defined actions. Another example is the IBM z/VM Performance
Toolkit, which allows users to monitor both system-wide and guest-by-guest resource
consumption (among other things).
A second possible solution for systems management is to use a tool whose agents operate
with extremely low overhead (or only run if a particular resource has been consumed),
such that even the accumulation of their expended processor usage will be tolerable. In
the absence of such tools, if processor consumption becomes an issue then agents can be
selectively deployed to only those virtual machines deemed the most critical.
Virtualized Cryptographic Processing and HA
Cryptographic acceleration hardware offers another opportunity for virtualization to
complement HA. This hardware can not only support and optimize other HA
technologies, it can itself be made highly available through virtualization.
One example is the IBM System z cryptographic coprocessor (accelerator) devices
accessed from Linux guests executing under the z/VM hypervisor. Such coprocessors are
often referred to as adjunct processors (APs). Offloading cryptographic functions to APs
reduces the general-purpose computational burden on the processors used by the virtual
guests, which would otherwise have to perform those functions in software. This
can result in significant performance improvements for Secure Sockets Layer (SSL)
transactions, which makes it desirable in its own right.8,9,10 The benefits include securely
encrypting the transmission of state data and other cluster-synchronization traffic
between virtual nodes when more than one host is used.
The expense of those few APs is greatly offset by the saved processor time on discrete
hosts. These savings are even more important in virtualized environments, where
physical processors are expected to have higher utilization. Plus, the benefits are
cumulative when executing many virtual machines within a single physical machine.
Beyond pure cost savings derived from resource sharing, virtualization also can enable
transparent AP failover. For example, consider how z/VM virtualizes hardware
cryptographic devices. The virtualization layer enables any guest to use any underlying
cryptographic accelerator at any time.11,12 When a single cryptographic accelerator fails or
is brought offline, guests using that device will transparently move to another without
user intervention. The z/VM hypervisor also balances the load across the cryptographic
cards of a given type. In this example, virtualization enables the distribution of the cost of
crypto acceleration across many guests, securely encrypting HA-related network traffic
among virtual machines on disparate hosts, and transparently handling failover in the
event of a cryptographic hardware outage.
Challenges
When the cryptographic accelerator is virtualized, it is shared among numerous guests.
But there are times when this is not appropriate, such as when using secure key
cryptography for Linux on System z. When using secure key cryptography with these
APs, the keys are stored inside the coprocessor card, which is tamper-proof. When using
secure key cryptography, one must dedicate the APs to the guest13 and so cannot reap the
benefits explained earlier. In fact, losing one of these dedicated secure key crypto cards is
a single point of failure if no additional user configuration is performed since the keys
will be lost.12 But there is an HA approach that may be taken for virtual machines
requiring secure key cryptography. In this situation it is possible to dedicate two distinct
APs and instrument your software to store the secure keys on each of the cards.
Virtualized Networks and HA
Networking is a fundamental part of many HA architectures, enabling virtual machines to
stay in contact with one another so they can make decisions regarding availability.
Hypervisors often support the creation of virtual Local Area Network (LAN) segments in
which physical network connectivity is emulated through memory-to-memory constructs
(not to be confused with VLANs – virtual LANs as specified in IEEE 802.1Q, a logical
network construct often implemented in networking switches). This virtual networking
infrastructure consumes some amount of memory and processor cycles to implement
message passing systems. To the virtual machine, these message passing systems appear
like the familiar networking protocol stacks, such as TCP/IP, found in discrete-system
networking.14,15,16
Virtual LANs have definite advantages when compared to traditional LANs. These
benefits include simplicity of operation, robustness, and security. In addition, they can
often add HA mechanisms with very little additional cost.
Traditional physical networks often become quite complex. Ad hoc expansion can lead to
a networking infrastructure that is difficult to manage, and this complexity can lead to
administrative errors that become a source of downtime. In fact, such complexity can
even lead to an increase in planned outages (in support of hardware life cycles,
configuration changes, and software upgrades), decreasing system availability.17 Virtual
LANs, on the other hand, eliminate much of the traditional interconnect infrastructure
(switches, hubs, cabling, etc.). It follows that this vastly simpler infrastructure can be
managed more easily, maintaining a higher degree of availability. In terms of robustness,
less physical infrastructure means fewer physical points of failure. At a minimum, in an
entirely virtualized network the virtual machines will never suffer the outage of a
physical Network Interface Card (NIC).
However, some classes of virtual network devices do not lend themselves to traditional
HA solutions based on redundant resources (virtual or otherwise). Guest LANs and
Hipersockets* found on IBM System z mainframes are examples. Both are “in-the-box”
networking solutions, in that they provide guest-to-guest networking within the bounds of
a particular hypervisor but don’t directly connect to an external network. Although Guest
LANs (managed by the z/VM hypervisor) do not provide a failover solution per se, they
do not suffer from any of the faults real networking hardware can introduce. Because
guests using a Guest LAN need to be authorized to use the LAN, it provides a high-
speed, secure TCP/IP transport within z/VM for sensitive applications such as the inter-
node communication required for HA architectures. Similarly, Hipersockets (managed by
the System z PR/SM hypervisor) do not have built-in failover capabilities, but they also
have no physical component that can fail -- they inherit and depend on the reliability of
System z microcode to provide extremely reliable high-speed communication to the
logical partitions (LPARs) they serve. In both cases, traffic never flows over a physical
wire where it could be sniffed, so the overhead associated with traffic encryption can
often be avoided. An example of inter-node communication that might use such a network
infrastructure is the exchange of state data between Oracle Real Application Cluster or
IBM WebSphere* Application Server cluster members.
Automatic HA with Virtual Networking
However, an entirely virtual network is a special case. More typically, virtual machines
contained within a given host communicate with each other over a virtual network, but
also must communicate with external systems through a traditional network. This
connection between the virtual and physical networks represents a potential failure point.
Even here, system virtualization offers availability benefits. Hypervisors often make use
of built-in capabilities to create redundant network interfaces and network controllers
which typically require much less investment in configuration than the equivalent
redundant network infrastructure needed in the physical world. For instance, on the z/VM
hypervisor, a VSWITCH virtual LAN can be bound to one or more physical NIC devices
(in parallel using IEEE 802.3ad Link Aggregation, if desired) to enable external
communication. By defining a VSWITCH with two real NIC devices and two VSWITCH
network controllers (the TCP/IP software stack that enables VSWITCH operation), the
loss of a single NIC or controller causes the VSWITCH to seamlessly and transparently
fail over to the surviving resource, providing guests with uninterrupted access to the
physical network. Providing this level of HA to all machines in a discrete environment
would add complexity and cost.
Other environments such as those provided with the IBM POWER5* and XenSource Xen
hypervisors also offer virtual network support. They both have the ability to bridge the
physical NIC to the virtualized network by utilizing a management node. The POWER5
hypervisor can provide high availability and virtualization similar in concept to the z/VM
VSWITCH by connecting two virtual I/O servers (the management nodes that bridge a
physical NIC to a virtual network) to a single virtual network.
Resource Virtualization and Network HA
A type of resource virtualization available with Linux systems is called channel
bonding.18 This approach uses a special device driver that joins multiple NICs together to
make a single, logically bonded interface. In the wake of a NIC failure, as long as one of
the underlying NICs remains active, the bonded interface will still be available. Bonding
is particularly useful for systems not operating under a hypervisor that supports a higher
level of NIC virtualization, such as the VSWITCH described earlier.
Bonding can be configured in an active-backup network path arrangement that is
transparent to the LAN infrastructure. There is no need to do any special router
configuration, or to use an interior gateway routing protocol to handle path failover.
However, in this mode only one NIC is being actively used -- the backup is not utilized
until the primary fails. It is also possible to configure load balancing across the bonded
adapters. This approach allows both NICs to contribute to throughput, but typically
requires some external router configuration to group the underlying NICs together and to
define a transmit policy.
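As an illustration, an administrator might confirm the bonded interface's health by reading the Linux bonding driver's status file. The sketch below assumes the /proc/net/bonding/<interface> layout provided by that driver; the interface name bond0 is an assumption.

```python
# Minimal sketch: report the bonding mode, the currently active slave, and the
# link (MII) status of each underlying NIC from /proc/net/bonding/<interface>.

def bond_status(interface: str = "bond0") -> None:
    with open(f"/proc/net/bonding/{interface}") as f:
        slave = None
        for line in f:
            line = line.strip()
            if line.startswith("Bonding Mode:") or line.startswith("Currently Active Slave:"):
                print(line)
            elif line.startswith("Slave Interface:"):
                slave = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and slave:
                print(f"{slave}: {line}")     # up/down for each underlying NIC
                slave = None

bond_status()
```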
Challenges
A common approach for insulating a traditional server from NIC failures is to define a
Virtual IP Address, or VIPA, that is not associated with any particular NIC. The VIPA is
reachable through multiple NICs attached to the operating system instance. Should one
NIC fail, traffic can be rerouted through a surviving NIC, keeping the VIPA available.
The catch in this scheme is that routers in the network must be informed of the change or
they will continue attempting to route messages through the failed NIC. One traditional
means of addressing this requirement is to use interior gateway protocols such as Open
Shortest Path First (OSPF), which relies upon agents on every VIPA-owning operating
system instance to exchange “heartbeat” messages with each other and with routers to
confirm accessibility, and to exchange routing information in the wake of a detected
disruption. Unfortunately, OSPF can be complex to configure and consumes CPU
resources on each server instance. In a virtual environment, the frequent heartbeating may
make idle virtual machines appear busy, limiting the hypervisor’s ability to efficiently
manage resources. In addition, the cumulative processing resource consumed by
numerous OSPF daemons can become noticeable. This can be mitigated by deploying
virtual network failover approaches that do not require dynamic routing daemons to
function. For example, in a z/VM environment the guests could eliminate the use of a
VIPA and each have only a single virtual NIC that is attached to a VSWITCH. The
VSWITCH can then manage NIC failover as described earlier.
Another issue that can arise with virtual networking is problem determination difficulty.
Some virtual networks operate at layer 3 of the Open System Interconnection (OSI)
reference model, but many commonly used network tracing tools do not work with layer
3. In addition, some virtual LANs do not allow the common practice of enabling a guest
to operate in promiscuous mode to inspect network traffic. Vendor enhancements are
required to address these weaknesses.
Virtualized Storage and HA
Storage infrastructure is another critical area for high availability implementations where
virtualization can play a useful role. There are at least three areas where availability
considerations must be factored into the storage picture: host attachment, the disk
subsystem itself, and the storage network in between. Virtualization offers opportunities
for improving efficiency or enhancing availability in all three areas, as well as presenting some
challenges.
Host attachments
Hosts communicate with the storage network through I/O adapters. As with other
components, resiliency is typically provided through redundant physical adapters and
appropriate failover/failback support. While fine for the physically discrete server
environment, in a virtualized environment where dozens or even hundreds of virtual
systems are contained within one physical server, the number of redundant physical
adapters required could become prohibitive. Fortunately, system virtualization
implementations usually provide adapter sharing capabilities that can dramatically reduce
the number of physical adapters required.
On IBM System z mainframes, a single Fiber Connection (FICON*) or Fibre Channel
Protocol (FCP) channel adapter can be directly, and transparently, virtualized and shared
among multiple virtual machines. This capability leverages adapter capacity to the fullest
and dramatically reduces the number of physical adapters needed. One physical FCP
adapter, for instance, can be abstracted into hundreds of virtual adapters. With this model,
a mere two physical adapters can offer the redundancy needed to provide a highly
available interface for hundreds of virtual systems.19
Disk subsystems
A common means of resource virtualization for disk volumes is with a Redundant Array
of Independent Disks (RAID). RAID technology abstracts multiple physical disks into a
single virtual disk in a way that improves its performance, availability, or both. The
techniques employed include striping (spreading chunks of a file across multiple disks)
for a performance boost, and mirroring (maintaining synchronized copies of a disk) for
increased availability, as well as the use of error correction techniques (parity) to add
further fault tolerance. Various combinations of these techniques can be used to achieve
the desired balance of performance, availability, and cost. RAID can be implemented in
the storage subsystem itself or in host software. Any RAID devices providing storage for
virtual machines should be configured for availability, and not simply performance.
Virtual machines should also not depend on any single path to the RAID device.
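The trade-offs among striping, mirroring, and parity can be summarized with a simplified comparison. The sketch below is illustrative only (hypothetical disk counts; real subsystems add hot spares, rebuild windows, and other factors):

```python
# Simplified comparison of usable capacity versus tolerated disk failures for
# common RAID layouts (illustrative; ignores spares and rebuild behavior).

def raid_summary(level: str, disks: int, disk_tb: float) -> tuple[float, int]:
    if level == "RAID 0":            # striping only: performance, no redundancy
        return disks * disk_tb, 0
    if level == "RAID 1":            # mirrored pairs: half the raw capacity
        return (disks // 2) * disk_tb, 1
    if level == "RAID 5":            # striping with single parity
        return (disks - 1) * disk_tb, 1
    if level == "RAID 6":            # striping with double parity
        return (disks - 2) * disk_tb, 2
    raise ValueError(level)

for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 6"):
    usable, tolerated = raid_summary(level, disks=8, disk_tb=1.0)
    print(f"{level}: {usable:.0f} TB usable of 8 TB raw, survives at least {tolerated} disk failure(s)")
```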
Storage Networks
Sitting between host adapters and disk subsystems is the storage area network (SAN), and
it too is a potent area for injecting resource virtualization that can benefit HA
implementations. An abstraction layer decouples the physical disks from the hosts’
logical operations against them. This abstraction can either be done in-band (also known
as block aggregation), or out-of-band (also known as file aggregation), and as we will
see, this decoupling via virtualization can lead to increased availability.
In-band virtualization. With the in-band approach, all data and control traffic flow
through an appliance that abstracts heterogeneous physical storage subsystems into
virtual disks that it makes available to hosts. In effect, this shifts storage management
intelligence from individual disk controllers into the network. From an availability
perspective, storage area network virtualization layers help address planned outages. The
administrator can add or change the underlying physical disks while maintaining
availability of the virtual disks to users and applications. As an example, consider the
IBM SAN Volume Controller (SVC). SVC is a deployable appliance consisting of pairs
of IBM System x* servers with high-bandwidth host bus adapters (HBAs), gigabytes of
mirrored read/write cache memory, an uninterruptible power supply, and special-purpose
software. Hosts connect only to the SVC, and have no direct knowledge of the physical
disk subsystems behind it. In this way, virtual machines are insulated from changes to the
underlying physical hardware.20
In order to eliminate the in-band appliance as a single point of failure, it must be able to
fail over and recover gracefully. SVC combines its servers into an HA cluster for that
purpose, and has concurrent maintenance and upgrade capabilities. It also provides
multiple paths to a given virtual disk. However, SVC does nothing to protect against the
failure of individual disks; for that, it relies upon the RAID capabilities of the back-end
disk controllers.
Out-of-band virtualization. With this approach, data flow is separated from control
flow. It typically involves file sharing across a SAN, where hosts consult metadata (data
about data) and employ the locking services of a file system or Network-Attached
Storage (NAS) controller, but then access the data directly. In terms of availability, out-
of-band virtualization can offer abilities similar to those of in-band appliances for
administering the SAN without interrupting access to data as seen by the host.21 Also
similar to the in-band approach, the out-of-band control servers can typically be clustered
to eliminate a single point of failure. Indeed, the control function can even be distributed
across the host file-serving nodes rather than requiring a separate metadata server.22
Examples include the IBM Global Parallel File System (GPFS*) and the Veritas Cluster
File System.
Challenges
While virtualization offers benefits in boosting the availability of storage infrastructure, it
also presents some pitfalls to avoid. A virtualization layer normally obscures underlying
physical components, which can lead to unintended single points of failure. For instance,
dozens or hundreds of virtual HBAs might be virtualized from a single physical HBA. It
would be quite easy to accidentally define redundant attachments from the host to two
virtual adapters that map to the same physical adapter – effectively providing no
redundancy at all. Care must be taken in tracking virtual-to-physical mappings to avoid
configuration oversights.
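A simple configuration audit can catch that oversight. The sketch below is illustrative only; the virtual-to-physical mapping would in practice come from the hypervisor's or storage manager's inventory data, and the adapter names are hypothetical.

```python
# Audit sketch: confirm that adapters chosen for redundancy really span
# distinct physical adapters (mapping data and names are hypothetical).

virtual_to_physical = {
    "vhba0": "phys_hba_A",
    "vhba1": "phys_hba_A",   # maps to the same physical HBA as vhba0
    "vhba2": "phys_hba_B",
}

def truly_redundant(virtual_adapters: list[str]) -> bool:
    """True only if the chosen virtual adapters map to distinct physical adapters."""
    physical = {virtual_to_physical[v] for v in virtual_adapters}
    return len(physical) == len(virtual_adapters)

print(truly_redundant(["vhba0", "vhba1"]))   # False: no real redundancy
print(truly_redundant(["vhba0", "vhba2"]))   # True: spans two physical HBAs
```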
The sharing of physical resources enabled by virtualization can also raise security
concerns. When multiple operating system instances execute on a single hypervisor and
share a common FCP adapter, by default they will all have access to each other’s storage.
This is not always permissible. For example, consider a hosting environment where
systems are owned by different users that must be isolated from one another. Recently a
new standard called N_Port Identifier Virtualization (NPIV) was introduced to address
the security issue, and implementations are beginning to appear.23 Although NPIV solves
the access control problem, it creates an explosion in World Wide Port Names (WWPNs)
that can itself be problematic to manage.23
At the physical disk layer, hardware-based RAID protects against disk failure within a
storage subsystem but not against failures of the storage subsystem itself. The
combination of hardware-based RAID and a host-based virtualization approach for
mirroring across disk subsystems, such as the Logical Volume Manager (LVM) or
software-based RAID, can fill this hole. As with the virtual HBA case, it can be easy to
overlook the fact that all guests under a single hypervisor are actually mapping their
virtual disks to the same physical disk. Care must be taken to ensure that host-based
software is mirroring disks across physically discrete storage subsystems.
Within the storage network virtualization layer, in-band appliance nodes should be spread
across redundant SAN fabrics for maximum availability. With block aggregation, hosts
are unaware of the physical location of storage they are using. Host-based system
management tools may have difficulty piercing the virtualization layer to keep track of
virtual-to-physical mappings. Addressing this issue may require cooperative
enhancements to both storage network virtualization and systems management products.
Virtualization and Conceptual Approaches to HA Architectures
When talking about high availability architectures there are multiple conceptual
approaches to consider. In this section, we will discuss Active/Cold Standby,
Active/Passive, and Active/Active. Since we are interested in the subject matter from a
virtualization point of view, we provide a working definition for each, which may differ
from common literature that describes the physically discrete server case. We will also
point out the optimal situations in which each technique should be employed.
Active/Cold Standby
One approach to an HA architecture is known as Active/Cold Standby. With this
solution, the target service is only allowed to run on one node at a time (the active node).
It is installed and configured on the cold standby system, but is not running. In some
implementations the cold standby server is actually powered off; in others the server and
its operating system are up but the service itself is not. If the active system fails, the cold
standby system is then brought up (if needed), the service is started, and traffic is
redirected to it. Automation products can be used to monitor the active system for an
outage and failover to the cold standby. One example implementation of an Active/Cold
Standby HA solution is IBM Tivoli* System Automation for Multiplatforms (SA). With
SA deployed in a virtualized cluster environment, each member of the cluster is
monitored by SA’s Reliable Scalable Cluster Technology (RSCT) service, which
manages a service IP on the active node. If RSCT detects an active node outage, it moves
the service IP from the failed node to the cold standby node and then starts the target
service or application.
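For illustration, the basic monitor-and-takeover loop of an Active/Cold Standby arrangement might look like the following sketch. This is not the SA/RSCT implementation; the service IP, interface, health-check URL, and service name are hypothetical, and a production solution adds quorum and fencing logic omitted here.

```python
# Conceptual Active/Cold Standby sketch, run on the cold standby node:
# watch the active node, and on sustained failure take over the service IP
# and start the target service (all names/addresses are hypothetical).

import subprocess
import time
import urllib.request

SERVICE_IP = "192.0.2.10/24"                      # hypothetical service IP
INTERFACE = "eth0"
HEALTH_URL = "http://active.example.com:8080/health"

def active_node_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def take_over() -> None:
    # Move the service IP to this node, then start the (hypothetical) service.
    subprocess.run(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE], check=True)
    subprocess.run(["systemctl", "start", "target-service"], check=True)

misses = 0
while True:
    misses = 0 if active_node_healthy() else misses + 1
    if misses >= 3:          # require consecutive misses before failing over
        take_over()
        break
    time.sleep(10)
```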
This type of HA architecture is ideally suited to virtualized environments. Because the
cold standby is largely inactive, it consumes very little resource until called upon. When a
failure occurs, the system resources that had been consumed by the active node are
instantly made available by the hypervisor to the standby. Thus, the resource cost of this
approach in a virtual environment is not significantly higher than the case where no
backup is provided at all.
Active/Passive
A second type of HA architecture is Active/Passive. Unlike Active/Cold Standby, in
Active/Passive environments the target service is up and running on both the active and
passive nodes in the cluster and may be exchanging state data. However, work is only
being dispatched to the active node, thus minimizing the amount of resources used by the
passive node. The health of the active node is determined either by partner nodes
monitoring each other, or by a trusted third party monitoring all nodes. In the event of a
failure, a passive node will be brought to the active state. Typically, a single active node
is paired with a single passive node. Examples of this approach include IBM WebSphere
Edge Components Load Balancer (when configured for HA), IBM DB2* HADR, and
Oracle Real Application Cluster Guard.
This solution is also well suited to a virtualized environment. Failover time is improved
over Active/Cold Standby, since the service does not need to be started during failover.
However, more resource is used by the idle server – the amount of which varies by
implementation. Only one server is processing a workload at a given time, which results
in low computation overhead during normal operation, though the degree of state data
exchange between nodes will affect the overhead on the passive node. Still, the passive
node is not typically as busy as the active one, so the hypervisor can flow a portion of its
system resources to the active node until the remaining one is needed for failover. Thus,
in a virtual environment, efficiencies are gained over a physically discrete server
environment.
Active/Active
Unlike Active/Cold Standby and Active/Passive, in Active/Active environments all
servers are running and processing transactions, minimizing outage time even further.
Often only in-flight transactions are lost during node failure in an Active/Active
configuration. The peer nodes typically are in constant communication with one another
to remain in sync and verify the overall status of the cluster. Should any peer be
discovered inoperable, the remainder of the cluster will continue processing as normal
without sending updates or work to the offending peer. Examples of products that exploit
this approach include IBM WebSphere Application Server, Oracle Real Application
Cluster, and IBM DB2 operating in a z/OS* Parallel Sysplex.*
At first glance, an Active/Active environment may not seem like an attractive
configuration in a virtualized environment. After all, there are no idle or low-usage
servers involved, so where is the opportunity for resource sharing? There is also
additional overhead in running the hypervisor, and the cluster-synchronization overhead
of the HA processing itself (and associated virtual networking resource consumption)
becomes cumulative. In addition, deployments such as these may cause simultaneous
peaks and valleys in virtual server usage for all members of the Active/Active cluster. In
a virtual environment one would ideally strive for asynchronous peaks and valleys, which
enables more simultaneous guests to execute using the same resource pool, increasing
efficiencies. Nonetheless, there can indeed be benefits to leveraging virtualization for
this type of architecture.
As described earlier in the discussion of virtualized processors and memory,
virtualization greatly reduces the need for active cluster members to be over-configured
with system resources in order to absorb the work of a failed node. This reduces total
system resources across the cluster when compared to the physically discrete server case.
Also, the previously noted HA benefits associated with small cluster sizes that
virtualization enables are particularly applicable here, since it is in Active/Active
architectures where the number of cluster nodes often grows quickly.
On the other hand, there’s the issue of load balancing. Active/Active clusters are typically
fronted by some form of load-balancing agent. In the discrete case, an Active/Active HA
configuration with a load balancer allows work to be spread across all servers, such that
each is contributing to overall system throughput. This benefit often evaporates in a
single virtualized environment since all virtual servers are drawing from the same pool of
underlying physical resources. However, a load-balancing approach can still increase
throughput in both physical and virtual environments in certain cases, such as when
multiple application instances are needed to overcome an application-based bottleneck.
There are certainly differences associated with running an Active/Active cluster in a
virtual or physically discrete environment. While an Active/Active architecture is not as
ideally suited to virtualization as are the Active/Cold Standby and Active/Passive
architectures, it still is able to deliver worthwhile benefits.
Single Versus Multiple Host Selection
What are the availability implications of consolidating virtual machines onto a single
physical host? Is this a recipe for disaster? Are there times when this approach makes
sense? Once again we seem faced with an apparent dichotomy between the philosophies
of virtualization and HA. So how does one make the physical hardware platform choice
for virtual machine high availability? Some are content to virtualize their HA
environment on a single physical machine. This is typically the case when the underlying
physical machine has a robust track record for uptime and resilience. Often those who opt
for a single host machine for critical infrastructure incur greater cost for the hardware and
networking platforms they select. These enterprise-class systems contain several levels of
physical level redundancy in a single chassis, along with multiple power sources and
numerous modes of connectivity.
Others opt to deploy multiple physical machines acting as the hosts for their virtualized
environments. This approach may be adequate in non-enterprise environments to
compensate for inferior hardware, yet in enterprise-class systems this implementation
methodology is beneficial as well.24,7
We will now take a moment to discuss some of the advantages and disadvantages of
single- and multiple-host environments.
Single-Host Approaches
A single physical machine HA environment can take full advantage of virtualization’s
resource-sharing capabilities across all guests. Any guest may have access to some or all
of the locally available resources. Systems such as this often have a reduced need to incur
the overhead associated with internal network traffic encryption because all the
communication is done internally on the same physical machine, rather than traveling
over a physical wire where it is susceptible to network-sniffing exploits. There is often a
reduction in network latency as well.
Additionally, such systems are cheaper than solutions requiring more instances of the
same hardware in terms of both initial cost and overall systems management expenses.
These types of deployments typically demonstrate lower physical and environmental
footprints.25,26 There are also other cost benefits that may not be so obvious. For example,
on some platforms such as System z, applications can be installed on each instance
without incurring additional software costs.7
Though a single-host deployment of virtual platforms introduces a single point of failure,
this can be partially mitigated by purchasing highly available hardware. However,
planned outages for host maintenance are largely inescapable in such situations. To
minimize single points of failure, the consolidation of physical servers into single-host,
virtualized HA configurations should include running multiple virtual server guest OS
instances, deploying redundant virtual networks, ensuring redundant connectivity to
external network infrastructure, and leveraging virtualized storage pools. On hosts that
support it, multiple software hypervisor environments can be deployed across multiple
logical partitions (LPARs) for further redundancy. This approach is relatively cost
effective when compared to multiple-host approaches.
Multiple-Host Approaches
Eliminating the potential planned and unplanned outages associated with a virtualized
single-host implementation requires expanding the virtual environment across multiple
hosts, and distributing the guests such that no cluster has all nodes on a single physical
host.
Figure 3 Virtualized Environment Spanning Multiple Hosts
Consider the configuration shown in Figure 3 that involves two physical hosts, each with
a hypervisor supporting two guests that execute some critical service deployed in
Active/Active mode. With four virtual machines running critical work distributed evenly
across two hypervisors on discrete physical systems, disabling an entire hypervisor or
host, along with all guests executing on it, is tolerated and still leaves the surviving guests
in an HA configuration (though now susceptible to hardware as a single point of failure).
Resource sharing within each hypervisor’s domain still provides considerable efficiencies
for the most-likely outage scenarios (resource, application, or virtual machine), so over-
configuration of virtual servers is minimized and cluster sizes can remain modest. Even
in the host outage case, a hypervisor can often be configured to automatically redirect
system resources away from guests running discretionary work over to the surviving
cluster nodes when their demand increases -- again reducing the need for over-
configuration. Still, the reduction in hardware resources is not as large as in a single-host
deployment.
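A small placement audit can verify that critical work remains spread across hosts as in Figure 3. The placement map in the sketch below is hypothetical; a real deployment would query each hypervisor's inventory.

```python
# Placement audit sketch: flag any HA cluster whose member virtual machines
# all reside on a single physical host (placement data is hypothetical).

placement = {                         # virtual machine -> physical host
    "web-a": "host1", "web-b": "host2",
    "db-a":  "host1", "db-b":  "host1",   # both database nodes on host1
}
clusters = {"web": ["web-a", "web-b"], "db": ["db-a", "db-b"]}

for name, members in clusters.items():
    hosts = {placement[vm] for vm in members}
    if len(hosts) < 2:
        print(f"cluster '{name}': all nodes on {hosts.pop()} -- a host outage takes down the cluster")
    else:
        print(f"cluster '{name}': spread across {sorted(hosts)}")
```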
Multiple-host configurations naturally incur additional costs for hardware and associated
physical and environmental footprints. There are also hidden costs such as increased
complexity of networking, cryptographic burden between the hosts, and increased
networking latency. There is also typically a need for load balancing to keep both
physical hosts busy while simultaneously ensuring the load on each is within tolerances.
One last interesting aspect of multi-host virtualized HA architectures involves the use of
virtual machine migration technologies such as VMWare VMotion, IBM z/VM Cross
System Extensions (CSE), and XenSource XenMotion. These technologies allow the
relocation of virtual machines from one host to another.27 Using this technology a virtual
machine may be transported -- in some cases without interruption in service -- to another
physical host. Because this functionality is relatively new, it is attracting considerable
interest, and ongoing research into HA solutions built on these technologies can be
expected. Though this particular capability can enable high
availability when outages are expected or hardware/physical infrastructure failures are
observed to be impending, it is not yet clear how this functionality will affect the choice
of deployment when selecting Active/Active or Active/Passive configurations, or if it is
even a relevant consideration.
Even though multiple-host deployments eliminate single points of failure, provide more
flexibility for planned outages, and enable easier incremental growth by adding another
server’s capacity, they are not without their own challenges. Virtual HA environments
that have been designed from the beginning to span multiple physical hosts will be
strategically positioned to surmount those challenges and incrementally add other hosts
and corresponding virtual machines as needed. The added cost in complexity and
hardware buys the ability to tolerate host failures and scheduled maintenance windows
without affecting system availability.
Conclusion
Virtualization offers significant opportunities to augment and amplify HA
implementations. Single points of failure can be eliminated through redundant virtual
systems and resources, rather than physical ones, keeping acquisition and ongoing costs
low. The flow of physical resources between virtual systems can keep cluster sizes small,
thereby reducing potential points of failure. Complexity associated with configuring and
maintaining fault-tolerant storage pools can be minimized by adding a virtualization
layer. Network infrastructure can be virtualized, eliminating it as a physical failure point
while simultaneously reducing management complexity. Availability monitoring can be
brought up a level to the virtualization layer, reducing overall management complexity.
Even testing of service and other changes prior to introducing them into production to
minimize potential disruptions can be simplified.28
However, virtual HA configurations are not identical to their physical counterparts. In
addition to opportunities, virtual HA architectures also introduce new challenges and
risks that must be addressed for an implementation to succeed. Over-consolidation is a
clear concern -- system virtualization implies that a single physical server’s failure will
now affect multiple server images, and the issue also extends to resource virtualization.
But other, more subtle risks lurk as well. Heartbeating, a common technique for
implementing and monitoring HA clusters, can be ill-advised in a virtual environment.
Commonly deployed dynamic routing algorithms such as Open Shortest Path First
(OSPF), useful for automating failover in conjunction with certain redundant network
adapter configurations, may be overly burdensome in a virtualized environment. The
cumulative processing cost of per-system health monitoring agents can be an issue when
employing system virtualization. Various solutions exist to overcome these and other
challenges; the first step is to recognize that they exist and account for them.
Virtualization technology continues to evolve and grow. IBM’s Blue Cloud initiative is
one example that promises to further spread the virtual umbrella across a broader array of
heterogeneous systems. Grid computing is another. The live migration of virtual
machines across physical host boundaries needs more exploration as well. Future work is
needed to understand how best to marry these expanded virtualization approaches with
the latest high availability technologies and techniques to maximize the value of both.
* Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Microsoft Corporation, Linus Torvalds, or Sun
Microsystems, Inc.
Cited References
1. L. Seawright, R. Mackinnon, “VM/370 – A study of multiplicity and usefulness,”
IBM Systems Journal, 18, No 1, 4-17 (1979).
2. R. Creasy, “The Origin of the VM/370 Time-Sharing System,” IBM J. Res. & Dev.,
25, No. 5, 483-490 (1981).
3. W. Fischofer, “VM/ESA: A single system for centralized and distributed computing,”
IBM Systems Journal, 30, No. 1, 4-13 (1991).
4. J. Matthews, W. Hu, et al., "Quantifying the Performance Isolation Properties of
Virtualization Systems," Proceedings of 2007 Workshop on Experimental Computer
Science, San Diego, California, ACM (June 2007).
5. B. Randell, P.A. Lee, P. C. Treleaven, "Reliability Issues in Computing System
Design," ACM Computing Surveys (CSUR) 10, Issue 2, 123–165 (June 1978).
6. P. J. Denning, "Fault tolerant operating systems," ACM Computing Surveys (CSUR)
8, Issue 4, 359–389 (December 1976).
7. S. Wehr, S. Loveland, et al, “High Availability Architectures For Linux on IBM
System z,” IBM Corporation, see http://www-
03.ibm.com/servers/eserver/zseries/library/whitepapers/pdf/HA_Architectures_for_Li
nux_on_System_z.pdf (March 2006).
8. “Linux Guest Crypto Support”, z/VM Performance Report, IBM Corporation, see
http://www.vm.ibm.com/perf/reports/zvm/html/crypto.html
9. “Linux on System z Tuning Hints and Tips: Cryptographic support,” IBM
Corporation, see
http://www.ibm.com/developerworks/linux/linux390/perf/tuning_res_security_crypto
.html
10. T. W. Arnold, A. Dames, et al, “Cryptographic system enhancements for the IBM
System z9,” IBM J. Res.& Dev., 51, No. 1/2, 87-102 (January/March 2007).
11. “Technical Security Advantages of IBM System z9 and Virtualization on z/VM,”
Sine Nomine Associates, page 11, see http://www-
03.ibm.com/systems/z/security/pdf/IBM3651z9SecurityTechPaper041107[1].pdf
(April 2007).
12. P. Bari, H. Almeida, et al, Security on z/VM, SG24-7471-00, IBM Corporation
(November 2007)
13. Secure Key Solution with the Common Cryptographic Architecture Application
Programmer's Guide, SC33-8294-00, IBM Corporation (May 2007)
14. A. Menon, J. R. Santos, et al, "Diagnosing Performance Overheads in the Xen Virtual
Machine Environment,” Proceedings of the 1st ACM/USENIX international
conference on Virtual execution environments, Chicago, IL, ACM, 13 – 23 (2005).
15. A. Menon, A. L. Cox, W. Zwaenepoel, “Optimizing Network Virtualization in Xen,”
Proceedings of the Annual Technical Conference on USENIX’06, Boston, MA,
USENIX Association (2006)
16. M. Ferrer, “Measuring Overhead Introduced by VMWare Workstation Hosted Virtual
Machine Monitor Network Subsystem,” Technical University of Catalonia, see
http://studies.ac.upc.edu/doctorat/ENGRAP/Miquel.pdf
17. T. S. E. Ng, A. L. Cox, “Towards an Operating Platform for Network Control
Management,” PRESTO: Workshop on Programmable Routers for the Extensive
Services of Tomorrow, Princeton University, Princeton NJ (May 2007)
18. T. Davis, W. Tarreau, et al, ”Linux Ethernet Bonding Driver HOWTO,” see
http://sourceforge.net/projects/bonding/ (April 2006)
19. I. Adlung, G. Banzhaf, et al, “FCP for the IBM eServer zSeries systems: Access to
distributed storage,” IBM J. Res & Dev., 46, No. 4/5, 487-502 (July/September 2002)
20. J. Tate, K. Geguhr, et al, IBM System Storage SAN Volume Controller, SG24-6423-
05, IBM Corporation (September 2007)
21. A. Barrett, “IBM unveils storage tank,” Storage Magazine (November 2003)
22. S. Fadden, “An Introduction to GPFS Version 3.2,” IBM Corporation, see
http://www-03.ibm.com/systems/clusters/software/whitepapers/gpfs_intro.html
(September 2007)
23. J. Srikrishnan, S. Amann, et al, “Sharing FCP adapters through virtualization,” IBM J.
Res. & Dev., 51, No. 1/2, 103-118 (January/March 2007)
24. z/OS V1R8.0 System z Platform Test Report for z/OS and Linux Virtual Servers, IBM
Corporation, section 13.4, see
http://publibz.boulder.ibm.com/epubs/pdf/e0z1p150.pdf (June 2007)
25. “IBM's Project Big Green Spurs Global Shift to Linux on Mainframe,” IBM
Corporation, see http://www-03.ibm.com/press/us/en/pressrelease/21945.wss (August
2007)
26. C. Ollinger, “A powerful strategy: HP grows green data centers,” Manufacturing
Business Technology, see http://www.mbtmag.com/article/CA6481103.html
(September 25, 2007)
27. J. Sugerman, G. Venkitachalam, and B-H. Lim, “Virtualizing I/O Devices on
VMware Workstation’s Hosted Virtual Machine Monitor,” Proceedings of the 2001
USENIX Annual Technical Conference, Boston, MA (June 2001)
28. S. Loveland, G. Miller, et al, ”Testing with a Virtual Computer,” Software Testing
Techniques: Finding the Defects that Matter, chapter 16, Charles River Media (2005)
Biographies
Duane Beyer IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New
York, 12601-5400 (electronic mail: [email protected]). Mr. Beyer is an Advisory
Software Engineer and member of the IBM Test and Integration Center for Linux. He
holds a B.S. degree in computer science from Marist College. He has 21 years of
experience with z/VM and IBM mainframe hardware.
Philip Chan IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New
York, 12601-5400 (electronic mail: [email protected]). Mr. Chan is a Staff Software
Engineer and member of the IBM Test and Integration Center for Linux. His areas of
expertise include runtime configuration and troubleshooting. He also has several years of
experience with WebSphere Application Server, WebSphere MQ, DB2, Tivoli Access
Manager and Federated Identity Manager.
Eli M. Dow IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New
York, 12601-5400 (electronic mail: [email protected]). Mr. Dow is a Software
Engineer and member of the IBM Test and Integration Center for Linux. He holds a B.S.
in computer science and psychology and a master's of computer science from Clarkson
University. He is an alumnus of the Clarkson Open Source Institute. He is a coauthor of
the IBM Redbooks publications "Linux for IBM System z9 and IBM zSeries" and
"Introduction to the New Mainframe: z/VM Basics".
Frank LeFevre IBM Systems and Technology Group, 2455 South Road, Poughkeepsie,
New York, 12601-5400 (electronic mail: [email protected]). Mr. LeFevre is a Senior
Software Engineer with over 29 years of experience with IBM mainframe hardware and
operating systems. He is currently the team lead for the Linux Virtual Server Platform
Evaluation Test group responsible for integration testing middleware products and
implementing high availability solutions.
Scott Loveland IBM Systems and Technology Group, 2455 South Road, Poughkeepsie,
New York, 12601-5400 (electronic mail: [email protected]). Mr. Loveland is a
Senior Technical Staff Member in the IBM Integrated Solutions Test laboratory. He
joined IBM in 1982 after receiving a B.A. in computer science from the University of
California, Berkeley. He has held various technical leadership positions in mainframe
operating system product testing spanning the z/OS and Linux platforms. Mr. Loveland’s
areas of specialization include reliability and high availability, disciplines upon which he
has lectured at universities and industry conferences. He is currently the Chief Test
Architect for the IBM Customer Test organization.