Leveraging Virtualization to Optimize High Availability System Configurations
D. Beyer, P. F. Chan, E. M. Dow, F. Lefevre, S. Loveland
Abstract
Leveraging redundant resources is a common means of addressing availability
requirements, but it often implies redundant costs as well. At the same time,
virtualization technologies are bursting into data centers with the promise of
reducing costs through resource consolidation. Can virtualization and High
Availability (HA) technologies be combined to optimize availability while
minimizing costs? Yes, but merging them properly introduces new challenges.
Drawing on practical experiences from an IBM development test lab, this paper
looks at how virtualization technologies, methodologies, and techniques can
augment and amplify traditional HA approaches while avoiding potential pitfalls.
Special attention is paid to resource sharing, applying conceptual models (such as
Active/Active and Active/Passive) to virtualized environments, stretching virtual
environments across physical machine boundaries, and avoiding hazards that await
the unwary.
A computer system must be up and operational to be useful. The more available it is, the
more value it can provide to its users. So why isn’t every configuration highly available?
The main inhibitors are typically cost and complexity. Achieving tolerance for planned
and unplanned outages often involves configuring redundant physical resources. Those
resources increase system costs, leading vendors, consultants, and system administrators
to ask, “How much availability are you willing to pay for?” There is also inherent
complexity in setting up and managing high availability (HA) solutions. Redundant
resources imply redundant configuration work. The software configuration activity
required to assemble and coordinate all elements of the HA solution into a single, well-
functioning unit can be daunting as well. In addition, the task of establishing and
maintaining a monitoring infrastructure to track the availability and flow of work across
many discrete physical assets can be challenging. All of this complexity further inhibits
the adoption of HA solutions.
Virtualization technology offers a reduction in cost and complexity through the
abstraction or emulation of physical resources into logical, shareable entities.
Virtualization has existed on mainframes for decades,1,2,3 and more recently has been
adopted by other classes of servers as well. But virtualization is not a homogeneous
concept with uniform characteristics; it encompasses a range of technologies with
different attributes, resource sharing opportunities, and associated risks. These
techniques include system virtualization and resource virtualization.
System virtualization divides a single physical computer into multiple logical computers,
each of which executes its own guest operating system instance. A thin hypervisor layer
sits between the physical hardware resources and the operating systems. It exposes a set
of virtual platform interfaces that form a logical partition or virtual machine. The
hypervisor multiplexes and arbitrates access to the host platform’s resources so that they
can be shared across multiple virtual machines while enforcing a level of isolation that
keeps each guest independent of the others. This isolation is an important aspect of
virtual machine high availability, as it ensures a failure within one virtual machine can’t
compromise the availability of others.4 Hypervisors can be implemented in server
firmware or in software. Firmware hypervisors are implemented directly in the server
hardware’s microcode, and the virtual machines that are created are often called logical
partitions, or LPARs. Examples include the IBM* System z* Processor Resource/System
Manager* (PR/SM*) hypervisor, and the IBM POWER Hypervisor* firmware. Software-
based hypervisors, sometimes called virtual machine monitors (VMMs), are either booted
natively on host hardware or run on top of a host operating system. Examples include the
IBM z/VM* hypervisor, the VMware ESX-Server hypervisor, the Microsoft** Virtual
Server hypervisor, the XenSource Xen hypervisor, and the Linux** Kernel Virtual
Machine, or KVM, support. Multiple layers of virtualization are also possible, where one
hypervisor runs within the virtual machine created by another, adding even more
flexibility. For example, it’s common to divide an IBM System z mainframe into multiple
LPARs, and run a different instance of the z/VM hypervisor within each LPAR, as
depicted in Figure 1.
Figure 1 System Virtualization Layers
Resource virtualization, on the other hand, operates at a lower level than that of
hypervisors. It consists of technologies independent of (but complementary to)
hypervisors that exist specifically to virtualize individual resources, such as network
adapters, host bus adapters, or even heterogeneous disk farms.
HA + Virtualization: Opportunities and Challenges
At first glance, there may appear to
be a dichotomy between HA and virtualization. As shown in Figure 2, traditional
multiple-machine deployments rely heavily upon redundant yet discrete physical
hardware to achieve HA.5,6 This focus on resource redundancy at all costs seems at odds
with virtualization’s emphasis on resource sharing. We aim to show that in fact the
opposite is true: virtualization offers significant opportunities to augment and amplify
HA implementations.
Figure 2 Physical HA Configuration
However, virtual HA configurations also introduce new challenges and risks that must be
addressed for an implementation to succeed. Many of these challenges are subtle and may
not be apparent during initial, small-scale pilot deployments. Only when a pilot project
has been deemed successful and is being scaled to full production levels do many of these
issues rise to the surface and risk derailing it. These challenges will also be discussed,
along with approaches for surmounting them.
System Component Virtualization and HA
Many system components can be virtualized, including processors, memory,
cryptographic co-processors, networks, and storage. Let’s examine system and resource
virtualization approaches for these components in the context of how they can be
leveraged to support high availability deployments.
Virtualized Processors, Memory, and HA
To achieve true continuous availability, a server cluster must not only have multiple
nodes; it must also contain sufficient capacity so that when at least one node fails, the
surviving nodes are able to absorb its work without any loss of throughput. For traditional
physically discrete server clusters, this implies that each node must be over-configured
with processor and memory resources. This could be accomplished with two large servers
that normally run at a peak of 50% or less of their capacity. However, it’s probably more
common to find larger clusters of smaller servers, each somewhat over-configured with
resources to provide failover headroom that is spread across the entire cluster (this
proliferation of smaller nodes may result from the incremental expansion of throughput
demands, concerns about underutilization, or other factors). There are downsides to this
approach. Over-configuration obviously increases asset costs, but that’s not all.
Ironically, larger cluster sizes can also negatively affect availability. Why? Each node in
a cluster represents a potential point of failure – not a single point of failure, but a point
of failure nonetheless. This is important because while all clustering technologies aim to
minimize the window in which the outage of a node is visible to end users, many do not
eliminate that window altogether. TCP/IP timeout thresholds, cluster
resynchronization activity, and (in the worst case) node restart techniques can all lead to
outage time that is visible to end users. Lab testing of various cluster technologies has
shown outage times on busy environments ranging from seconds to minutes. Clusters
with larger numbers of nodes have more potential points of failure, so as cluster size
grows, so do the odds that on any given day, end users will experience some sort of
outage. Greater cluster size can mean greater system management costs as well.
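To make the over-configuration cost concrete, the following sketch (illustrative only; the cluster sizes are hypothetical) computes the peak utilization each node of a physically discrete cluster can sustain if the surviving nodes must absorb one failed node's work with no loss of throughput.

```python
# Illustrative headroom calculation for a physically discrete cluster that must
# absorb the failure of one node with no loss of throughput (hypothetical sizes).

def required_peak_utilization(nodes: int) -> float:
    """Maximum fraction of each node's capacity usable in normal operation if
    the surviving nodes must absorb one failed node's work."""
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes")
    return (nodes - 1) / nodes

for n in (2, 4, 10):
    pct = required_peak_utilization(n) * 100
    print(f"{n}-node cluster: each node must stay below {pct:.0f}% utilization")
# 2-node cluster: each node must stay below 50% utilization
# 4-node cluster: each node must stay below 75% utilization
# 10-node cluster: each node must stay below 90% utilization
```

The 50% figure for the two-node case matches the "two large servers" example above; larger clusters spread the headroom more thinly but multiply the potential points of failure.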
System virtualization can address both of these downsides. When a virtual server fails, its
processing capacity and often its real memory are immediately made available to the
surviving virtual servers in the cluster. Hence when one virtual cluster node fails, the
surviving virtual nodes not only absorb the failing node’s workload but its resources as
well, so there is no drop in available cluster capacity. This ability of the hypervisor to
dynamically and automatically flow resources among virtual machines largely eliminates
the need to over-configure individual cluster nodes, thereby reducing some of the costs
associated with a highly available cluster.
Likewise, in a system virtualization environment there’s often little need for large
clusters. As just noted, over-configured, underutilized servers are less of an issue. Also,
incremental capacity growth requirements can often be met by increasing the share of
physical resources available to virtual machines, adding physical resources to the
underlying host server, or both. Indeed, for many HA implementations a two-node virtual
cluster is sufficient to protect against unplanned node outages. However, for managing
planned outages, a three-node cluster is often better as it allows one node to be brought
down for service while the two remaining nodes continue to back each other up. In any
case, the smaller cluster sizes that system virtualization encourages tend to be easier to
manage and have fewer potential points of failure.
Challenges
As powerful as system virtualization can be in optimizing resource allocations and
reducing cluster sizes, it has limits. A particular middleware implementation may impose
its own software-oriented resource limits that cannot be addressed by the hypervisor’s
ability to redirect resources. For instance, application servers may become constricted by
Java** Virtual Machine (JVM) heap size restrictions, or by the number of allowed socket
connections in a connection pool.7 In such cases, additional virtual cluster nodes may be
needed to overcome the constraint.
Physical resource consumption across virtual machines is cumulative, which has a
number of implications. For instance, an operating system such as Linux may
automatically fill up any unused memory available to it with an I/O cache. Such a
technique is reasonable in a standalone server environment, but in a virtual one it can
needlessly consume resources that could better be used by other virtual cluster nodes.
Hence, “right-sizing” virtual machine memory capacity becomes an important
configuration consideration. Likewise, availability management (and other system
management) tools often depend on small management agents running on each node to
gather and report status back to a management server. These agents will typically
consume some portion of a CPU even when the system they are monitoring is idle. On
discrete systems, this is negligible. But even if one agent on each of 100 virtual machines
is using only 1% of a physical processor, cumulatively the agents consume a full
processor at idle.
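To illustrate how these seemingly negligible per-guest overheads accumulate, here is a toy calculation (the numbers are hypothetical, not measurements from the paper):

```python
# Toy illustration of how small per-guest overheads accumulate across many
# virtual machines sharing one physical server (all numbers hypothetical).

def cumulative_agent_cpu(guests: int, cpu_fraction_per_agent: float) -> float:
    """Physical processors consumed by monitoring agents while every guest is idle."""
    return guests * cpu_fraction_per_agent

def cumulative_cache_memory(guests: int, cache_mib_per_guest: int) -> int:
    """Real memory (MiB) tied up if each guest's OS fills spare memory with I/O cache."""
    return guests * cache_mib_per_guest

guests = 100
print(f"{cumulative_agent_cpu(guests, 0.01):.1f} physical processor(s) consumed by idle agents")
print(f"{cumulative_cache_memory(guests, 512)} MiB of real memory absorbed as guest I/O cache")
```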
One solution is the use of management tools that operate at the hypervisor level rather
than the individual guest level. An example is the IBM Operations Manager for z/VM, an
automation and monitoring tool that can periodically execute health-checking routines
and trigger user-defined actions. Another example is the IBM z/VM Performance
Toolkit, which allows users to monitor both system-wide and guest-by-guest resource
consumption (among other things).
A second possible solution for systems management is to use a tool whose agents operate
with extremely low overhead (or only run if a particular resource has been consumed),
such that even the accumulation of their expended processor usage will be tolerable. In
the absence of such tools, if processor consumption becomes an issue then agents can be
selectively deployed to only those virtual machines deemed the most critical.
Virtualized Cryptographic Processing and HA
Cryptographic acceleration hardware offers another opportunity for virtualization to
complement HA. This hardware can not only support and optimize other HA
technologies, it can itself be made highly available through virtualization.
One example is the IBM System z cryptographic coprocessor (accelerator) devices
accessed from Linux guests executing under the z/VM hypervisor. Such coprocessors are
often referred to as adjunct processors (APs). Offloading cryptographic functions to APs
reduces the general-purpose computational burden on the processors used by the virtual
guests, which would otherwise have to perform those functions in software. This
can result in significant performance improvements for Secure Sockets Layer (SSL)
transactions, which makes it desirable in its own right.8,9,10 The benefits include securely
encrypting the transmission of state data and other cluster-synchronization traffic
between virtual nodes when more than one host is used.
The expense of those few APs is greatly offset by the saved processor time on discrete
hosts. These savings are even more important in virtualized environments, where
physical processors are expected to have higher utilization. Plus, the benefits are
cumulative when executing many virtual machines within a single physical machine.
Beyond pure cost savings derived from resource sharing, virtualization also can enable
transparent AP failover. For example, consider how z/VM virtualizes hardware
cryptographic devices. The virtualization layer enables any guest to use any underlying
cryptographic accelerator at any time.11,12 When a single cryptographic accelerator fails or
is brought offline, guests using that device will transparently move to another without
user intervention. The z/VM hypervisor also balances the load across the cryptographic
cards of a given type. In this example, virtualization enables the distribution of the cost of
crypto acceleration across many guests, securely encrypting HA-related network traffic
among virtual machines on disparate hosts, and transparently handling failover in the
event of a cryptographic hardware outage.
Challenges
When the cryptographic accelerator is virtualized, it is shared among numerous guests.
But there are times when this is not appropriate, such as when using secure key
cryptography for Linux on System z. When using secure key cryptography with these
APs, the keys are stored inside the coprocessor card, which is tamper-proof. When using
secure key cryptography, one must dedicate the APs to the guest13 and so cannot reap the
benefits explained earlier. In fact, losing one of these dedicated secure key crypto cards is
a single point of failure if no additional user configuration is performed since the keys
will be lost.12 But there is an HA approach that may be taken for virtual machines
requiring secure key cryptography. In this situation it is possible to dedicate two distinct
APs and instrument your software to store the secure keys on each of the cards.
Virtualized Networks and HA
Networking is a fundamental part of many HA architectures, enabling virtual machines to
stay in contact with one another so they can make decisions regarding availability.
Hypervisors often support the creation of virtual Local Area Network (LAN) segments in
which physical network connectivity is emulated through memory-to-memory constructs
(not to be confused with VLANs – virtual LANs as specified in IEEE 802.1Q, a logical
network construct often implemented in networking switches). This virtual networking
infrastructure consumes some amount of memory and processor cycles to implement
message passing systems. To the virtual machine, these message passing systems appear
like the familiar networking protocol stacks, such as TCP/IP, found in discrete-system
networking.14,15,16
Virtual LANs have definite advantages when compared to traditional LANs. These
benefits include simplicity of operation, robustness, and security. In addition, they can
often add HA mechanisms with very little additional cost.
Traditional physical networks often become quite complex. Ad hoc expansion can lead to
a networking infrastructure that is difficult to manage, and this complexity can lead to
administrative errors that become a source of downtime. In fact, such complexity can
even lead to an increase in planned outages (in support of hardware life cycles,
configuration changes, and software upgrades), decreasing system availability.17 Virtual
LANs, on the other hand, eliminate much of the traditional interconnect infrastructure
(switches, hubs, cabling, etc.). It follows that this vastly simpler infrastructure can be
managed more easily, maintaining a higher degree of availability. In terms of robustness,
less physical infrastructure means fewer physical points of failure. At a minimum, in an
entirely virtualized network the virtual machines will never suffer the outage of a
physical Network Interface Card (NIC).
However, some classes of virtual network devices do not lend themselves to traditional
HA solutions based on redundant resources (virtual or otherwise). Guest LANs and
Hipersockets* found on IBM System z mainframes are examples. Both are “in-the-box”
networking solutions, in that they provide guest-to-guest networking within the bounds of
a particular hypervisor but don’t directly connect to an external network. Although Guest
LANs (managed by the z/VM hypervisor) do not provide a failover solution per se, they
do not suffer from any of the faults real networking hardware can introduce. Because
guests using a Guest LAN need to be authorized to use the LAN, it provides a high-
speed, secure TCP/IP transport within z/VM for sensitive applications such as the inter-
node communication required for HA architectures. Similarly, Hipersockets (managed by
the System z PR/SM hypervisor) do not have built-in failover capabilities, but they also
have no physical component that can fail -- they inherit and depend on the reliability of
System z microcode to provide extremely reliable high-speed communication to the
logical partitions (LPARs) they serve. In both cases, traffic never flows over a physical
wire where it could be sniffed, so the overhead associated with traffic encryption can
often be avoided. An example of inter-node communication that might use such a network
infrastructure is the exchange of state data between Oracle Real Application Cluster or
IBM WebSphere* Application Server cluster members.
Automatic HA with Virtual Networking
However, an entirely virtual network is a special case. More typically, virtual machines
contained within a given host communicate with each other over a virtual network, but
also must communicate with external systems through a traditional network. This
connection between the virtual and physical networks represents a potential failure point.
Even here, system virtualization offers availability benefits. Hypervisors often make use
of built-in capabilities to create redundant network interfaces and network controllers
which typically require much less investment in configuration than the equivalent
redundant network infrastructure needed in the physical world. For instance, on the z/VM
hypervisor, a VSWITCH virtual LAN can be bound to one or more physical NIC devices
(in parallel using IEEE 802.3ad Link Aggregation, if desired) to enable external
communication. By defining a VSWITCH with two real NIC devices and two VSWITCH
network controllers (the TCP/IP software stack that enables VSWITCH operation), the
loss of a single NIC or controller causes the VSWITCH to seamlessly and transparently
fail over to the surviving resource, providing guests with uninterrupted access to the
physical network. Providing this level of HA to all machines in a discrete environment
would add complexity and cost.
Other environments such as those provided with the IBM POWER5* and XenSource Xen
hypervisors also offer virtual network support. They both have the ability to bridge the
physical NIC to the virtualized network by utilizing a management node. The POWER5
hypervisor can provide high availability and virtualization similar in concept to the z/VM
VSWITCH by connecting two virtual I/O servers (the management nodes that bridge a
physical NIC to a virtual network) to a single virtual network.
Resource Virtualization and Network HA
A type of resource virtualization available with Linux systems is called channel
bonding.18 This approach uses a special device driver that joins multiple NICs together to
make a single, logically bonded interface. In the wake of a NIC failure, as long as one of
the underlying NICs remains active, the bonded interface will still be available. Bonding
is particularly useful for systems not operating under a hypervisor that supports a higher
level of NIC virtualization, such as the VSWITCH described earlier.
Bonding can be configured in an active-backup network path arrangement that is
transparent to the LAN infrastructure. There is no need to do any special router
configuration, or to use an interior gateway routing protocol to handle path failover.
However, in this mode only one NIC is being actively used -- the backup is not utilized
until the primary fails. It is also possible to configure load balancing across the bonded
adapters. This approach allows both NICs to contribute to throughput, but typically
requires some external router configuration to group the underlying NICs together and to
define a transmit policy.
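As an illustration, an administrator might confirm the bonded interface's health by reading the Linux bonding driver's status file. The sketch below assumes the /proc/net/bonding/<interface> layout provided by that driver; the interface name bond0 is an assumption.

```python
# Minimal sketch: report the bonding mode, the currently active slave, and the
# link (MII) status of each underlying NIC from /proc/net/bonding/<interface>.

def bond_status(interface: str = "bond0") -> None:
    with open(f"/proc/net/bonding/{interface}") as f:
        slave = None
        for line in f:
            line = line.strip()
            if line.startswith("Bonding Mode:") or line.startswith("Currently Active Slave:"):
                print(line)
            elif line.startswith("Slave Interface:"):
                slave = line.split(":", 1)[1].strip()
            elif line.startswith("MII Status:") and slave:
                print(f"{slave}: {line}")     # up/down for each underlying NIC
                slave = None

bond_status()
```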
Challenges
A common approach for insulating a traditional server from NIC failures is to define a
Virtual IP Address, or VIPA, that is not associated with any particular NIC. The VIPA is
reachable through multiple NICs attached to the operating system instance. Should one
NIC fail, traffic can be rerouted through a surviving NIC, keeping the VIPA available.
The catch in this scheme is that routers in the network must be informed of the change or
they will continue attempting to route messages through the failed NIC. One traditional
means of addressing this requirement is to use interior gateway protocols such as Open
Shortest Path First (OSPF), which relies upon agents on every VIPA-owning operating
system instance to exchange “heartbeat” messages with each other and with routers to
confirm accessibility, and to exchange routing information in the wake of a detected
disruption. Unfortunately, OSPF can be complex to configure and consumes CPU
resources on each server instance. In a virtual environment, the frequent heartbeating may
make idle virtual machines appear busy, limiting the hypervisor’s ability to efficiently
manage resources. In addition, the cumulative processing resource consumed by
numerous OSPF daemons can become noticeable. This can be mitigated by deploying
virtual network failover approaches that do not require dynamic routing daemons to
function. For example, in a z/VM environment the guests could eliminate the use of a
VIPA and each have only a single virtual NIC that is attached to a VSWITCH. The
VSWITCH can then manage NIC failover as described earlier.
Another issue that can arise with virtual networking is problem determination difficulty.
Some virtual networks operate at layer 3 of the Open System Interconnection (OSI)
reference model, but many commonly used network tracing tools do not work with layer
3. In addition, some virtual LANs do not allow the common practice of enabling a guest
to operate in promiscuous mode to inspect network traffic. Vendor enhancements are
required to address these weaknesses.
Virtualized Storage and HA
Storage infrastructure is another critical area for high availability implementations where
virtualization can play a useful role. There are at least three areas where availability
considerations must be factored into the storage picture: host attachment, the disk
subsystem itself, and the storage network in between. Virtualization offers opportunities
for improving efficiency or enhancing availability in all three areas, as well as presenting some
challenges.
Host attachments
Hosts communicate with the storage network through I/O adapters. As with other
components, resiliency is typically provided through redundant physical adapters and
appropriate failover/failback support. While fine for the physically discrete server
environment, in a virtualized environment where dozens or even hundreds of virtual
systems are contained within one physical server, the number of redundant physical
adapters required could become prohibitive. Fortunately, system virtualization
implementations usually provide adapter sharing capabilities that can dramatically reduce
the number of physical adapters required.
On IBM System z mainframes, a single Fiber Connection (FICON*) or Fibre Channel
Protocol (FCP) channel adapter can be directly, and transparently, virtualized and shared
among multiple virtual machines. This capability leverages adapter capacity to the fullest
and dramatically reduces the number of physical adapters needed. One physical FCP
adapter, for instance, can be abstracted into hundreds of virtual adapters. With this model,
a mere two physical adapters can offer the redundancy needed to provide a highly
available interface for hundreds of virtual systems.19
Disk subsystems
A common means of resource virtualization for disk volumes is with a Redundant Array
of Independent Disks (RAID). RAID technology abstracts multiple physical disks into a
single virtual disk in a way that improves its performance, availability, or both. The
techniques employed include striping (spreading chunks of a file across multiple disks)
for a performance boost, and mirroring (maintaining synchronized copies of a disk) for
increased availability, as well as the use of error correction techniques (parity) to add
further fault tolerance. Various combinations of these techniques can be used to achieve
the desired balance of performance, availability, and cost. RAID can be implemented in
the storage subsystem itself or in host software. Any RAID devices providing storage for
virtual machines should be configured for availability, and not simply performance.
Virtual machines should also not depend on any single path to the RAID device.
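The trade-offs among striping, mirroring, and parity can be summarized with a simplified comparison. The sketch below is illustrative only (hypothetical disk counts; real subsystems add hot spares, rebuild windows, and other factors):

```python
# Simplified comparison of usable capacity versus tolerated disk failures for
# common RAID layouts (illustrative; ignores spares and rebuild behavior).

def raid_summary(level: str, disks: int, disk_tb: float) -> tuple[float, int]:
    if level == "RAID 0":            # striping only: performance, no redundancy
        return disks * disk_tb, 0
    if level == "RAID 1":            # mirrored pairs: half the raw capacity
        return (disks // 2) * disk_tb, 1
    if level == "RAID 5":            # striping with single parity
        return (disks - 1) * disk_tb, 1
    if level == "RAID 6":            # striping with double parity
        return (disks - 2) * disk_tb, 2
    raise ValueError(level)

for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 6"):
    usable, tolerated = raid_summary(level, disks=8, disk_tb=1.0)
    print(f"{level}: {usable:.0f} TB usable of 8 TB raw, survives at least {tolerated} disk failure(s)")
```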
Storage Networks
Sitting between host adapters and disk subsystems is the storage area network (SAN), and
it too is a potent area for injecting resource virtualization that can benefit HA
implementations. An abstraction layer decouples the physical disks from the hosts’
logical operations against them. This abstraction can either be done in-band (also known
as block aggregation), or out-of-band (also known as file aggregation), and as we will
see, this decoupling via virtualization can lead to increased availability.
In-band virtualization. With the in-band approach, all data and control traffic flow
through an appliance that abstracts heterogeneous physical storage subsystems into
virtual disks that it makes available to hosts. In effect, this shifts storage management
intelligence from individual disk controllers into the network. From an availability
perspective, storage area network virtualization layers help address planned outages. The
administrator can add or change the underlying physical disks while maintaining
availability of the virtual disks to users and applications. As an example, consider the
IBM SAN Volume Controller (SVC). SVC is a deployable appliance consisting of pairs
of IBM System x* servers with high-bandwidth host bus adapters (HBAs), gigabytes of
mirrored read/write cache memory, an uninterruptible power supply, and special-purpose
software. Hosts connect only to the SVC, and have no direct knowledge of the physical
disk subsystems behind it. In this way, virtual machines are insulated from changes to the
underlying physical hardware.20
In order to eliminate the in-band appliance as a single point of failure, it must be able to
fail over and recover gracefully. SVC combines its servers into an HA cluster for that
purpose, and has concurrent maintenance and upgrade capabilities. It also provides
multiple paths to a given virtual disk. However, SVC does nothing to protect against the
failure of individual disks; for that, it relies upon the RAID capabilities of the back-end
disk controllers.
Out-of-band virtualization. With this approach, data flow is separated from control
flow. It typically involves file sharing across a SAN, where hosts consult metadata (data
about data) and employ the locking services of a file system or Network-Attached
Storage (NAS) controller, but then access the data directly. In terms of availability, out-
of-band virtualization can offer abilities similar to those of in-band appliances for
administering the SAN without interrupting access to data as seen by the host.21 Also
similar to the in-band approach, the out-of-band control servers can typically be clustered
to eliminate a single point of failure. Indeed, the control function can even be distributed
across the host file-serving nodes rather than requiring a separate metadata server.22
Examples include the IBM Global Parallel File System (GPFS*) and the Veritas Cluster
File System.
Challenges
While virtualization offers benefits in boosting the availability of storage infrastructure, it
also presents some pitfalls to avoid. A virtualization layer normally obscures underlying
physical components, which can lead to unintended single points of failure. For instance,
dozens or hundreds of virtual HBAs might be virtualized from a single physical HBA. It
would be quite easy to accidentally define redundant attachments from the host to two
virtual adapters that map to the same physical adapter – effectively providing no
redundancy at all. Care must be taken in tracking virtual-to-physical mappings to avoid
configuration oversights.
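A simple configuration audit can catch that oversight. The sketch below is illustrative only; the virtual-to-physical mapping would in practice come from the hypervisor's or storage manager's inventory data, and the adapter names are hypothetical.

```python
# Audit sketch: confirm that adapters chosen for redundancy really span
# distinct physical adapters (mapping data and names are hypothetical).

virtual_to_physical = {
    "vhba0": "phys_hba_A",
    "vhba1": "phys_hba_A",   # maps to the same physical HBA as vhba0
    "vhba2": "phys_hba_B",
}

def truly_redundant(virtual_adapters: list[str]) -> bool:
    """True only if the chosen virtual adapters map to distinct physical adapters."""
    physical = {virtual_to_physical[v] for v in virtual_adapters}
    return len(physical) == len(virtual_adapters)

print(truly_redundant(["vhba0", "vhba1"]))   # False: no real redundancy
print(truly_redundant(["vhba0", "vhba2"]))   # True: spans two physical HBAs
```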
The sharing of physical resources enabled by virtualization can also raise security
concerns. When multiple operating system instances execute on a single hypervisor and
share a common FCP adapter, by default they will all have access to each other’s storage.
This is not always permissible. For example, consider a hosting environment where
systems are owned by different users that must be isolated from one another. Recently a
new standard called N_Port Identifier Virtualization (NPIV) was introduced to address
the security issue, and implementations are beginning to appear.23 Although NPIV solves
the access control problem, it creates an explosion in World Wide Port Names (WWPNs)
that can itself be problematic to manage.23
At the physical disk layer, hardware-based RAID protects against disk failure within a
storage subsystem but not against failures of the storage subsystem itself. The
combination of hardware-based RAID and a host-based virtualization approach for
mirroring across disk subsystems, such as the Logical Volume Manager (LVM) or
software-based RAID, can fill this hole. As with the virtual HBA case, it can be easy to
overlook the fact that all guests under a single hypervisor are actually mapping their
virtual disks to the same physical disk. Care must be taken to ensure that host-based
software is mirroring disks across physically discrete storage subsystems.
Within the storage network virtualization layer, in-band appliance nodes should be spread
across redundant SAN fabrics for maximum availability. With block aggregation, hosts
are unaware of the physical location of storage they are using. Host-based system
management tools may have difficulty piercing the virtualization layer to keep track of
virtual-to-physical mappings. Addressing this issue may require cooperative
enhancements to both storage network virtualization and systems management products.
Virtualization and Conceptual Approaches to HA Architectures
When talking about high availability architectures there are multiple conceptual
approaches to consider. In this section, we will discuss Active/Cold Standby,
Active/Passive, and Active/Active. Since we are interested in the subject matter from a
virtualization point of view, we provide a working definition for each, which may differ
from common literature that describes the physically discrete server case. We will also
point out the optimal situations in which each technique should be employed.
Active/Cold Standby
One approach to an HA architecture is known as Active/Cold Standby. With this
solution, the target service is only allowed to run on one node at a time (the active node).
It is installed and configured on the cold standby system, but is not running. In some
implementations the cold standby server is actually powered off; in others the server and
its operating system are up but the service itself is not. If the active system fails, the cold
standby system is then brought up (if needed), the service is started, and traffic is
redirected to it. Automation products can be used to monitor the active system for an
outage and failover to the cold standby. One example implementation of an Active/Cold
Standby HA solution is IBM Tivoli* System Automation for Multiplatforms (SA). With
SA deployed in a virtualized cluster environment, each member of the cluster is
monitored by SA’s Reliable Scalable Cluster Technology (RSCT) service, which
manages a service IP on the active node. If RSCT detects an active node outage, it moves
the service IP from the failed node to the cold standby node and then starts the target
service or application.
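For illustration, the basic monitor-and-takeover loop of an Active/Cold Standby arrangement might look like the following sketch. This is not the SA/RSCT implementation; the service IP, interface, health-check URL, and service name are hypothetical, and a production solution adds quorum and fencing logic omitted here.

```python
# Conceptual Active/Cold Standby sketch, run on the cold standby node:
# watch the active node, and on sustained failure take over the service IP
# and start the target service (all names/addresses are hypothetical).

import subprocess
import time
import urllib.request

SERVICE_IP = "192.0.2.10/24"                      # hypothetical service IP
INTERFACE = "eth0"
HEALTH_URL = "http://active.example.com:8080/health"

def active_node_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def take_over() -> None:
    # Move the service IP to this node, then start the (hypothetical) service.
    subprocess.run(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE], check=True)
    subprocess.run(["systemctl", "start", "target-service"], check=True)

misses = 0
while True:
    misses = 0 if active_node_healthy() else misses + 1
    if misses >= 3:          # require consecutive misses before failing over
        take_over()
        break
    time.sleep(10)
```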
This type of HA architecture is ideally suited to virtualized environments. Because the
cold standby is largely inactive, it consumes very little resource until called upon. When a
failure occurs, the system resources that had been consumed by the active node are
instantly made available by the hypervisor to the standby. Thus, the resource cost of this
approach in a virtual environment is not significantly higher than the case where no
backup is provided at all.
Active/Passive
A second type of HA architecture is Active/Passive. Unlike Active/Cold Standby, in
Active/Passive environments the target service is up and running on both the active and
passive nodes in the cluster and may be exchanging state data. However, work is only
being dispatched to the active node, thus minimizing the amount of resources used by the
passive node. The health of the active node is determined either by partner nodes
monitoring each other, or by a trusted third party monitoring all nodes. In the event of a
failure, a passive node will be brought to the active state. Typically, a single active node
is paired with a single passive node. Examples of this approach include IBM WebSphere
Edge Components Load Balancer (when configured for HA), IBM DB2* HADR, and
Oracle Real Application Cluster Guard.
This solution is also well suited to a virtualized environment. Failover time is improved
over Active/Cold Standby, since the service does not need to be started during failover.
However, more resource is used by the idle server – the amount of which varies by
implementation. Only one server is processing a workload at a given time, which results
in low computation overhead during normal operation, though the degree of state data
exchange between nodes will affect the overhead on the passive node. Still, the passive
node is not typically as busy as the active one, so the hypervisor can flow a portion of its
system resources to the active node until the remaining one is needed for failover. Thus,
in a virtual environment, efficiencies are gained over a physically discrete server
environment.
Active/Active
Unlike Active/Cold Standby and Active/Passive, in Active/Active environments all
servers are running and processing transactions, minimizing outage time even further.
Often only in-flight transactions are lost during node failure in an Active/Active
configuration. The peer nodes typically are in constant communication with one another
to remain in sync and verify the overall status of the cluster. Should any peer be
discovered inoperable, the remainder of the cluster will continue processing as normal
without sending updates or work to the offending peer. Examples of products that exploit
this approach include IBM WebSphere Application Server, Oracle Real Application
Cluster, and IBM DB2 operating in a z/OS* Parallel Sysplex.*
At first glance, an Active/Active environment may not seem like an attractive
configuration in a virtualized environment. After all, there are no idle or low-usage
servers involved, so where is the opportunity for resource sharing? There is also
additional overhead in running the hypervisor, and the cluster-synchronization overhead
of the HA processing itself (and associated virtual networking resource consumption)
becomes cumulative. In addition, deployments such as these may cause simultaneous
peaks and valleys in virtual server usage for all members of the Active/Active cluster. In
a virtual environment one would ideally strive for asynchronous peaks and valleys, which
enables more simultaneous guests to execute using the same resource pool, increasing
efficiencies. Nonetheless, there can indeed be benefits to leveraging virtualization for
this type of architecture.
As described earlier in the discussion of virtualized processors and memory,
virtualization greatly reduces the need for active cluster members to be over-configured
with system resources in order to absorb the work of a failed node. This reduces total
system resources across the cluster when compared to the physically discrete server case.
Also, the previously noted HA benefits associated with small cluster sizes that
virtualization enables are particularly applicable here, since it is in Active/Active
architectures where the number of cluster nodes often grows quickly.
On the other hand, there’s the issue of load balancing. Active/Active clusters are typically
fronted by some form of load-balancing agent. In the discrete case, an Active/Active HA
configuration with a load balancer allows work to be spread across all servers, such that
each is contributing to overall system throughput. This benefit often evaporates in a
single virtualized environment since all virtual servers are drawing from the same pool of
underlying physical resources. However, a load-balancing approach can still increase
throughput in both physical and virtual environments in certain cases, such as when
multiple application instances are needed to overcome an application-based bottleneck.
There are certainly differences associated with running an Active/Active cluster in a
virtual or physically discrete environment. While an Active/Active architecture is not as
ideally suited to virtualization as are the Active/Cold Standby and Active/Passive
architectures, it still is able to deliver worthwhile benefits.
Single Versus Multiple Host Selection
What are the availability implications of consolidating virtual machines onto a single
physical host? Is this a recipe for disaster? Are there times when this approach makes
sense? Once again we seem faced with an apparent dichotomy between the philosophies
of virtualization and HA. So how does one make the physical hardware platform choice
for virtual machine high availability? Some are content to virtualize their HA
environment on a single physical machine. This is typically the case when the underlying
physical machine has a robust track record for uptime and resilience. Often those who opt
for a single host machine for critical infrastructure incur greater cost for the hardware and
networking platforms they select. These enterprise-class systems contain several levels of
physical level redundancy in a single chassis, along with multiple power sources and
numerous modes of connectivity.
Others opt to deploy multiple physical machines acting as the hosts for their virtualized
environments. This approach may be adequate in non-enterprise environments to
compensate for inferior hardware, yet in enterprise-class systems this implementation
methodology is beneficial as well.24,7
We will now take a moment to discuss some of the advantages and disadvantages of
single- and multiple-host environments.
Single-Host Approaches
A single physical machine HA environment can take full advantage of virtualization’s
resource-sharing capabilities across all guests. Any guest may have access to some or all
of the locally available resources. Systems such as this often have a reduced need to incur
the overhead associated with internal network traffic encryption because all the
communication is done internally on the same physical machine, rather than traveling
over a physical wire where it is susceptible to network-sniffing exploits. There is often a
reduction in network latency as well.
Additionally, such systems are cheaper than solutions requiring more instances of the
same hardware in terms of both initial cost and overall systems management expenses.
These types of deployments typically demonstrate lower physical and environmental
footprints.25,26 There are also other cost benefits that may not be so obvious. For example,
on some platforms such as System z, applications can be installed on each instance
without incurring additional software costs.7
Though a single-host deployment of virtual platforms introduces a single point of failure,
this can be partially mitigated by purchasing highly available hardware. However,
planned outages for host maintenance are largely inescapable in such situations. To
minimize single points of failure, the consolidation of physical servers into single-host,
virtualized HA configurations should include running multiple virtual server guest OS
instances, deploying redundant virtual networks, ensuring redundant connectivity to
external network infrastructure, and leveraging virtualized storage pools. On hosts that
support it, multiple software hypervisor environments can be deployed across multiple
logical partitions (LPARs) for further redundancy. This approach is relatively cost
effective when compared to multiple-host approaches.
Multiple-Host Approaches
Eliminating the potential planned and unplanned outages associated with a virtualized
single-host implementation requires expanding the virtual environment across multiple
hosts, and distributing the guests such that no cluster has all nodes on a single physical
host.
Figure 3 Virtualized Environment Spanning Multiple Hosts
Consider the configuration shown in Figure 3 that involves two physical hosts, each with
a hypervisor supporting two guests that execute some critical service deployed in
Active/Active mode. With four virtual machines running critical work distributed evenly
across two hypervisors on discrete physical systems, disabling an entire hypervisor or
host, along with all guests executing on it, is tolerated and still leaves the surviving guests
in an HA configuration (though now susceptible to hardware as a single point of failure).
Resource sharing within each hypervisor’s domain still provides considerable efficiencies
for the most-likely outage scenarios (resource, application, or virtual machine), so over-
configuration of virtual servers is minimized and cluster sizes can remain modest. Even
in the host outage case, a hypervisor can often be configured to automatically redirect
system resources away from guests running discretionary work over to the surviving
cluster nodes when their demand increases -- again reducing the need for over-
configuration. Still, the reduction in hardware resources is not as large as in a single-host
deployment.
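A small placement audit can verify that critical work remains spread across hosts as in Figure 3. The placement map in the sketch below is hypothetical; a real deployment would query each hypervisor's inventory.

```python
# Placement audit sketch: flag any HA cluster whose member virtual machines
# all reside on a single physical host (placement data is hypothetical).

placement = {                         # virtual machine -> physical host
    "web-a": "host1", "web-b": "host2",
    "db-a":  "host1", "db-b":  "host1",   # both database nodes on host1
}
clusters = {"web": ["web-a", "web-b"], "db": ["db-a", "db-b"]}

for name, members in clusters.items():
    hosts = {placement[vm] for vm in members}
    if len(hosts) < 2:
        print(f"cluster '{name}': all nodes on {hosts.pop()} -- a host outage takes down the cluster")
    else:
        print(f"cluster '{name}': spread across {sorted(hosts)}")
```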
Multiple-host configurations naturally incur additional costs for hardware and associated
physical and environmental footprints. There are also hidden costs such as increased
complexity of networking, cryptographic burden between the hosts, and increased
networking latency. There is also typically a need for load balancing to keep both
physical hosts busy while simultaneously ensuring the load on each is within tolerances.
One last interesting aspect of multi-host virtualized HA architectures involves the use of
virtual machine migration technologies such as VMWare VMotion, IBM z/VM Cross
System Extensions (CSE), and XenSource XenMotion. These technologies allow the
relocation of virtual machines from one host to another.27 Using this technology a virtual
machine may be transported -- in some cases without interruption in service -- to another
physical host. Because this functionality is relatively new, it is attracting considerable
interest, and ongoing research into HA solutions built on these technologies can be
expected. Though this particular capability can enable high
availability when outages are expected or hardware/physical infrastructure failures are
observed to be impending, it is not yet clear how this functionality will affect the choice
of deployment when selecting Active/Active or Active/Passive configurations, or if it is
even a relevant consideration.
Even though multiple-host deployments eliminate single points of failure, provide more
flexibility for planned outages, and enable easier incremental growth by adding another
server’s capacity, they are not without their own challenges. Virtual HA environments
that have been designed from the beginning to span multiple physical hosts will be
strategically positioned to surmount those challenges and incrementally add other hosts
and corresponding virtual machines as needed. The added cost in complexity and
hardware buys the ability to tolerate host failures and scheduled maintenance windows
without affecting system availability.
Conclusion
Virtualization offers significant opportunities to augment and amplify HA
implementations. Single points of failure can be eliminated through redundant virtual
systems and resources, rather than physical ones, keeping acquisition and ongoing costs
low. The flow of physical resources between virtual systems can keep cluster sizes small,
thereby reducing potential points of failure. Complexity associated with configuring and
maintaining fault-tolerant storage pools can be minimized by adding a virtualization
layer. Network infrastructure can be virtualized, eliminating it as a physical failure point
while simultaneously reducing management complexity. Availability monitoring can be
brought up a level to the virtualization layer, reducing overall management complexity.
Even testing of service and other changes prior to introducing them into production to
minimize potential disruptions can be simplified.28
However, virtual HA configurations are not identical to their physical counterparts. In
addition to opportunities, virtual HA architectures also introduce new challenges and
risks that must be addressed for an implementation to succeed. Over-consolidation is a
clear concern -- system virtualization implies that a single physical server’s failure will
now affect multiple server images, and the issue also extends to resource virtualization.
But other, more subtle risks lurk as well. Heartbeating, a common technique for
implementing and monitoring HA clusters, can be ill-advised in a virtual environment.
Commonly deployed dynamic routing algorithms such as Open Shortest Path First
(OSPF), useful for automating failover in conjunction with certain redundant network
adapter configurations, may be overly burdensome in a virtualized environment. The
cumulative processing cost of per-system health monitoring agents can be an issue when
employing system virtualization. Various solutions exist to overcome these and other
challenges; the first step is to recognize that they exist and account for them.
Virtualization technology continues to evolve and grow. IBM’s Blue Cloud initiative is
one example that promises to further spread the virtual umbrella across a broader array of
heterogeneous systems. Grid computing is another. The live migration of virtual
machines across physical host boundaries needs more exploration as well. Future work is
needed to understand how best to marry these expanded virtualization approaches with
the latest high availability technologies and techniques to maximize the value of both.
* Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Microsoft Corporation, Linus Torvalds, or Sun
Microsystems, Inc.
Cited References
1. L. Seawright, R. Mackinnon, “VM/370 – A study of multiplicity and usefulness,”
IBM Systems Journal, 18, No 1, 4-17 (1979).
2. R. Creasy, “The Origin of the VM/370 Time-Sharing System,” IBM J. Res. & Dev.,
25, No. 5, 483-490 (1981).
3. W. Fischofer, “VM/ESA: A single system for centralized and distributed computing,”
IBM Systems Journal, 30, No. 1, 4-13 (1991).
4. J. Matthews, W. Hu, et al., "Quantifying the Performance Isolation Properties of
Virtualization Systems," Proceedings of 2007 Workshop on Experimental Computer
Science, San Diego, California, ACM (June 2007).
5. B. Randell, P.A. Lee, P. C. Treleaven, "Reliability Issues in Computing System
Design," ACM Computing Surveys (CSUR) 10, Issue 2, 123–165 (June 1978).
6. P. J. Denning, "Fault tolerant operating systems," ACM Computing Surveys (CSUR)
8, Issue 4, 359–389 (December 1976).
7. S. Wehr, S. Loveland, et al, “High Availability Architectures For Linux on IBM
System z,” IBM Corporation, see http://www-
03.ibm.com/servers/eserver/zseries/library/whitepapers/pdf/HA_Architectures_for_Li
nux_on_System_z.pdf (March 2006).
8. “Linux Guest Crypto Support”, z/VM Performance Report, IBM Corporation, see
http://www.vm.ibm.com/perf/reports/zvm/html/crypto.html
9. “Linux on System z Tuning Hints and Tips: Cryptographic support,” IBM
Corporation, see
http://www.ibm.com/developerworks/linux/linux390/perf/tuning_res_security_crypto
.html
10. T. W. Arnold, A. Dames, et al, “Cryptographic system enhancements for the IBM
System z9,” IBM J. Res.& Dev., 51, No. 1/2, 87-102 (January/March 2007).
11. “Technical Security Advantages of IBM System z9 and Virtualization on z/VM,”
Sine Nomine Associates, page 11, see http://www-
03.ibm.com/systems/z/security/pdf/IBM3651z9SecurityTechPaper041107[1].pdf
(April 2007).
12. P. Bari, H. Almeida, et al, Security on z/VM, SG24-7471-00, IBM Corporation
(November 2007)
13. Secure Key Solution with the Common Cryptographic Architecture Application
Programmer's Guide, SC33-8294-00, IBM Corporation (May 2007)
14. A. Menon, J. R. Santos, et al, "Diagnosing Performance Overheads in the Xen Virtual
Machine Environment,” Proceedings of the 1st ACM/USENIX international
conference on Virtual execution environments, Chicago, IL, ACM, 13 – 23 (2005).
15. A. Menon, A. L. Cox, W. Zwaenepoel, “Optimizing Network Virtualization in Xen,”
Proceedings of the Annual Technical Conference on USENIX’06, Boston, MA,
USENIX Association (2006)
16. M. Ferrer, “Measuring Overhead Introduced by VMWare Workstation Hosted Virtual
Machine Monitor Network Subsystem,” Technical University of Catalonia, see
http://studies.ac.upc.edu/doctorat/ENGRAP/Miquel.pdf
17. T. S. E. Ng, A. L. Cox, “Towards an Operating Platform for Network Control
Management,” PRESTO: Workshop on Programmable Routers for the Extensive
Services of Tomorrow, Princeton University, Princeton NJ (May 2007)
18. T. Davis, W. Tarreau, et al, ”Linux Ethernet Bonding Driver HOWTO,” see
http://sourceforge.net/projects/bonding/ (April 2006)
19. I. Adlung, G. Banzhaf, et al, “FCP for the IBM eServer zSeries systems: Access to
distributed storage,” IBM J. Res & Dev., 46, No. 4/5, 487-502 (July/September 2002)
20. J. Tate, K. Geguhr, et al, IBM System Storage SAN Volume Controller, SG24-6423-
05, IBM Corporation (September 2007)
21. A. Barrett, “IBM unveils storage tank,” Storage Magazine (November 2003)
22. S. Fadden, “An Introduction to GPFS Version 3.2,” IBM Corporation, see
http://www-03.ibm.com/systems/clusters/software/whitepapers/gpfs_intro.html
(September 2007)
23. J. Srikrishnan, S. Amann, et al, “Sharing FCP adapters through virtualization,” IBM J.
Res. & Dev., 51, No. 1/2, 103-118 (January/March 2007)
24. z/OS V1R8.0 System z Platform Test Report for z/OS and Linux Virtual Servers, IBM
Corporation, section 13.4, see
http://publibz.boulder.ibm.com/epubs/pdf/e0z1p150.pdf (June 2007)
25. “IBM's Project Big Green Spurs Global Shift to Linux on Mainframe,” IBM
Corporation, see http://www-03.ibm.com/press/us/en/pressrelease/21945.wss (August
2007)
26. C. Ollinger, “A powerful strategy: HP grows green data centers,” Manufacturing
Business Technology, see http://www.mbtmag.com/article/CA6481103.html
(September 25, 2007)
27. J. Sugerman, G. Venkitachalam, and B-H. Lim, “Virtualizing I/O Devices on
VMware Workstation’s Hosted Virtual Machine Monitor,” Proceedings of the 2001
USENIX Annual Technical Conference, Boston, MA (June 2001)
28. S. Loveland, G. Miller, et al, ”Testing with a Virtual Computer,” Software Testing
Techniques: Finding the Defects that Matter, chapter 16, Charles River Media (2005)
Biographies
Duane Beyer IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New
York, 12601-5400 (electronic mail: [email protected]). Mr. Beyer is an Advisory
Software Engineer and member of the IBM Test and Integration Center for Linux. He
holds a B.S. degree in computer science from Marist College. He has 21 years of
experience with z/VM and IBM mainframe hardware.
Philip Chan IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New
York, 12601-5400 (electronic mail: [email protected]). Mr. Chan is a Staff Software
Engineer and member of the IBM Test and Integration Center for Linux. His areas of
expertise include runtime configuration and troubleshooting. He also has several years of
experience with WebSphere Application Server, WebSphere MQ, DB2, Tivoli Access
Manager and Federated Identity Manager.
Eli M. Dow IBM Systems and Technology Group, 2455 South Road, Poughkeepsie, New
York, 12601-5400 (electronic mail: [email protected]). Mr. Dow is a Software
Engineer and member of the IBM Test and Integration Center for Linux. He holds a B.S.
in computer science and psychology and a master's of computer science from Clarkson
University. He is an alumnus of the Clarkson Open Source Institute. He is a coauthor of
the IBM Redbooks publications "Linux for IBM System z9 and IBM zSeries" and
"Introduction to the New Mainframe: z/VM Basics".
Frank LeFevre IBM Systems and Technology Group, 2455 South Road, Poughkeepsie,
New York, 12601-5400 (electronic mail: [email protected]). Mr. LeFevre is a Senior
Software Engineer with over 29 years of experience with IBM mainframe hardware and
operating systems. He is currently the team lead for the Linux Virtual Server Platform
Evaluation Test group responsible for integration testing middleware products and
implementing high availability solutions.
Scott Loveland IBM Systems and Technology Group, 2455 South Road, Poughkeepsie,
New York, 12601-5400 (electronic mail: [email protected]). Mr. Loveland is a
Senior Technical Staff Member in the IBM Integrated Solutions Test laboratory. He
joined IBM in 1982 after receiving a B.A. in computer science from the University of
California, Berkeley. He has held various technical leadership positions in mainframe
operating system product testing spanning the z/OS and Linux platforms. Mr. Loveland’s
areas of specialization include reliability and high availability, disciplines upon which he
has lectured at universities and industry conferences. He is currently the Chief Test
Architect for the IBM Customer Test organization.