Validating Heuristics for Virtual Machines Consolidation

Sangmin Lee∗ Rina Panigrahy Vijayan Prabhakaran Venugopalan Ramasubramanian Kunal Talwar Lincoln Uyeda† and Udi Wieder

Microsoft Research Silicon Valley, Mountain View, CA 94043

Abstract

This paper examines two fundamental issues pertaining to virtual machine (VM) consolidation. Current virtualization management tools, both commercial and academic, enable multiple virtual machines to be consolidated into fewer servers so that other servers can be turned off, saving power. These tools determine effective strategies for VM placement with the help of clever optimization algorithms, relying on two inputs: a model of the resource utilization vs. performance tradeoff when multiple VMs are hosted together, and estimates of the resource requirements of each VM in terms of CPU, network, and storage. This paper investigates the following key questions: What factors govern the performance model that drives VM placement, and how do competing resource demands in multiple dimensions affect VM consolidation? It establishes a few basic insights about these questions through a combination of experiments and empirical analysis. This experimental study points out potential pitfalls in the use of current VM management tools and identifies promising opportunities for more effective consolidation algorithms. In addition to providing valuable guidance to practitioners, we believe this paper will serve as a starting point for research into next-generation virtualization platforms and tools.

1 Introduction

Virtualization has become an important trend in data centers [11] and cloud computing [15, 16]. One of its attractive features is the ability to utilize compute power more efficiently. Specifically, virtualization provides an opportunity to consolidate multiple virtual machine (VM) instances running on under-utilized computers into fewer hosts, enabling many of the computers to be turned off and thereby resulting in substantial energy savings. In fact, commercial products such as the VMware vSphere Distributed Resource Scheduler [19] (DRS), Microsoft System Center Virtual Machine Manager [18] (VMM), and Citrix XenServer [20] offer VM consolidation as their chief functionality.

∗ Sangmin Lee is a graduate student at The University of Texas, Austin and was an intern at Microsoft Research Silicon Valley.
† Lincoln Uyeda is a member of Microsoft's VMM product group.

VM consolidation tools typically have optimization algorithms to decide which virtual machine should be placed on which physical host. Research on VM consolidation has generated several clever heuristics for effective VM consolidation [12, 13, 14]. These heuristics take as input the estimated resource requirements of the virtual machines and the capacities of the physical hosts, and decide which VM instance should be placed on which host, while trying to minimize the total number of physical hosts utilized.

In this paper, we take a step away from designing the most appropriate heuristic. Instead, we take a look at deeper, fundamental aspects that lie at the heart of VM consolidation. Specifically, we ask two questions about VM consolidation heuristics. First, what assumptions do these heuristics typically make about how virtual machines operate when hosted together, and to what extent do these assumptions hold in practice? Second, how sophisticated should the heuristics be in dealing with resource requirements that span different dimensions such as CPU, memory, and I/O bandwidth; in other words, to what extent can sophisticated heuristics improve the benefits of VM consolidation?

These questions and their answers have important implications for understanding how VM consolidation will work in practice. For instance, most heuristics assume that the virtualization platform provides perfect performance isolation; that is, the resource utilization of a VM, and correspondingly its performance, will remain the same irrespective of whether it is run in isolation or along with other VMs. However, when this assumption is violated, that is, when resource utilization drops because of adverse interference from other VMs, the performance of the applications hosted on that VM might be severely degraded. Even though a VM consolidation tool might be able to tolerate a small degree of performance degradation by holding small amounts of resources as a reserve [9], severe degradation in performance can be counterproductive to the goals of consolidation.

Similarly, VM consolidation heuristics vary in increasing sophistication depending on how they treat demands for multiple resource types. At one extreme, they might choose a single primary resource dimension to optimize for and ignore the others; for example, a simple heuristic might only consider CPU requirements and ignore memory and disk usage. At the other extreme, heuristics might be influenced by resource demands across every resource dimension, such as CPU, memory, and disk. The latter, more sophisticated heuristics have higher overhead, both in obtaining accurate estimates of the information they need and in the time they take to compute the results. It is essential to understand what workload characteristics will lead to benefits commensurate with the additional overhead.

We answer these questions in this paper through a combination of experiments and empirical analysis. To answer the first question, we run controlled experiments on machines running Microsoft's Hyper-V hypervisor, using micro-benchmarks to examine how each of the CPU, cache, network, and storage resources aggregates. The second question is answered by studying the effectiveness of multi-dimensional heuristics. We start with resource utilization data collected from a real compute cluster running Dryad jobs [8] and analyze the performance of five VM consolidation heuristics drawn from existing proposals.

This experimental study, even though specific to the selected workloads, heuristics, and virtualization environment, highlights many interesting properties of a virtualized system. We find that the performance of a virtualized environment depends heavily on many factors such as the resource type, the number of virtual machines, and the quality of the workloads. For example, for a random write workload, disk utilization decreases as more VMs are co-hosted, whereas utilization improves on a solid state drive for the same workload. We also find that sophisticated, multi-dimensional heuristics provide significant returns for specific scenarios where the VM workloads have a mixture of complementary resource demands, but simpler heuristics suffice for other scenarios.

This paper is organized as follows. It starts with a brief background on a few representative VM consolidation heuristics in the next section. The two fundamental questions are then taken up in Sections 3 and 4, respectively. Finally, Section 5 summarizes the implications of our findings and concludes.

Figure 1: Intuition for 2D bin packing. The outer rectangle is the capacity of a host (CPU and memory in this example); the smaller rectangles are VMs. A tight packing will get the VMs as close to the top right corner as possible; the figure contrasts a placement that leaves memory underutilized with a more balanced placement.

2 VM Consolidation Heuristics

The problem of VM consolidation maps to the classical optimization problem called vector bin packing, where the hosts are conceived as bins and the VMs as objects that need to be packed into the bins. Each host is characterized by a d-dimensional vector called the host's vector of capacities, H = (h_1, h_2, ..., h_d). Each dimension represents the host's capacity corresponding to a different resource such as CPU utilization, memory utilization, or disk bandwidth. Similarly, each VM is represented by its vector of demands, V = (v_1, v_2, ..., v_d). The goal is to place all the VMs in as few hosts as possible, ensuring that, along any dimension, the total demand of the VMs placed in a host does not exceed the capacity of the host.

Vector bin packing is an NP-hard problem. However, good heuristics with measurable performance are well known and appear in surveys and textbooks [5, 7]. We next outline a set of the common heuristics for the problem, describing them in increasing order of sophistication. The heuristics we describe, or their variants, are used by research prototypes as well as real VM management tools (summarized in Section 2.3). This description helps us understand the inherent assumptions that underlie VM consolidation heuristics.

First Fit Decreasing: A natural heuristic for one-dimensional bin packing is FFD (First Fit Decreasing). This heuristic orders the bins and the objects in decreasing order of size. Starting with the first bin, it iterates over the objects, placing any object it can into the first bin until no more objects can be placed into it. It then considers the first bin to be filled and proceeds to the second bin, repeating the same procedure. It is known that this algorithm gives an 11/9 worst-case approximation [5] if all bins are of equal size.
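To make the procedure concrete, the following Python sketch (ours, not the authors' implementation; the function name ffd is illustrative) packs one-dimensional objects bin by bin as described above, assuming every object fits in an empty bin.

def ffd(sizes, capacity):
    """Bin-centric First Fit Decreasing, as described above: sort objects in
    decreasing order, fill the current bin with every remaining object that
    fits, close it, and open the next bin. Assumes every object fits in an
    empty bin."""
    remaining = sorted(sizes, reverse=True)
    bins = []
    while remaining:
        space, contents, leftover = capacity, [], []
        for size in remaining:
            if size <= space:            # place the object in the open bin
                space -= size
                contents.append(size)
            else:
                leftover.append(size)    # retry in a later bin
        bins.append(contents)
        remaining = leftover
    return bins

print(ffd([7, 5, 4, 3, 2, 2], 10))       # -> [[7, 3], [5, 4], [2, 2]]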


2.1 FFD-based heuristics

If the placement problem is effectively constrained by a single resource, say all VMs are CPU bound, then the problem is in essence one-dimensional and FFD is the algorithm of choice. If the problem instance is constrained by more than one dimension, then we need some generalization of FFD to multiple dimensions. The typical approach is to map the vector of capacities and demands into a single scalar called the Volume of the vector, and then perform a one-dimensional FFD based on the calculated volumes. We call these types of algorithms FFD-based.

There are many ways one can compute the volume. One option is to set the volume to the product of the values, i.e., Volume(V) = \prod_i v_i. We call the FFD algorithm based on this function the FFDProd algorithm. An attractive feature of this algorithm is that the order in which the VMs are sorted is independent of the units used for measuring each dimension (as long as the same unit is used across all VMs). A second approach, which we call FFDSum, is to set the volume to be a weighted sum of the values, i.e., Volume(V) = \sum_i w_i v_i, where w_i is the weight assigned to dimension i. The weights reflect the scarcity of the resource. A suitable definition for the weight is the ratio between the total demand for resource i and the capacity of the host, w_i = \sum_{VM} v_i / h_i.
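As an illustration, a minimal sketch of the two volume functions follows (our code, with hypothetical helper names such as scarcity_weights); an FFD pass would then sort VMs by decreasing volume.

def volume_prod(vm):
    """FFDProd: the volume is the product of the demands across dimensions."""
    prod = 1.0
    for v in vm:
        prod *= v
    return prod

def scarcity_weights(vms, host):
    """w_i = (total demand for resource i across all VMs) / (host capacity h_i)."""
    return [sum(vm[i] for vm in vms) / host[i] for i in range(len(host))]

def volume_sum(vm, weights):
    """FFDSum: the volume is a weighted sum of the demands."""
    return sum(w * v for w, v in zip(weights, vm))

# Order VMs for an FFD pass by decreasing volume, e.g. under FFDSum:
vms = [(0.28, 0.38), (0.34, 0.01), (0.15, 0.03)]   # (CPU, memory) fractions
host = (1.0, 1.0)
w = scarcity_weights(vms, host)
ordered = sorted(vms, key=lambda vm: volume_sum(vm, w), reverse=True)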

The main drawback of FFD-based algorithms is that they do not take into account correlations across dimensions. Consider a case where there are two kinds of VMs: one memory intensive and one CPU intensive. An FFD-based algorithm would order the VMs such that all VMs of one type come first, followed by all the VMs of the second type. As a result, it would try to pack the maximum number of CPU intensive VMs into each host, possibly preventing the host from accepting any other VMs. A better solution would have been to pack memory intensive and CPU intensive VMs on the same host, keeping resource allocation across dimensions as balanced as possible.

2.2 Dimension-aware heuristics

The next set of heuristics, which we call dimension-aware, take advantage of such complementary requirements for different resources. Their key intuition can be understood from the two-dimensional example illustrated in Figure 1. The capacity of a host is represented using a rectangle, its sides representing the capacity across the two dimensions. Each VM is also a vector, and as a VM is added to a host, their corresponding demand vectors are added with the constraint that the point corresponding to the total load given by the sum of these vectors must lie within the rectangle. To best utilize the capacity of the host along both dimensions (resources), it is desirable that the final sum of the demand vectors end as close as possible to the top right corner of the rectangle; it is undesirable to end in a situation where the combined vector hits the upper edge or the side edge while wasting space in the other dimension. This implies that at any point it is desirable to choose the next VM in such a way that the point representing the current load moves towards the top right corner.

We present two heuristics that achieve this: the first picks the next vector by comparing its direction to the direction of the top right corner; the second picks the next vector by comparing its distance to the top right corner.

Dot-Product (DotProduct): When choosing the next VM to place into an open host, this heuristic takes into account not only the VM's total demand, but also how well its demands align with the remaining capacities. It does this by looking at the angle (dot product) between the vector of remaining capacities along the dimensions and the vector of demands of a VM. At time t, let H(t) denote the vector of remaining or residual capacities of the current open host, i.e., the host's capacity minus the total demand of all VMs currently assigned to it. The heuristic places the VM that maximizes the dot product \sum_i w_i v_i h_i(t) with the vector of remaining capacities without violating the capacity constraint. The weights w_i are calculated in the same manner as described in the FFDSum algorithm.

Norm-based Greedy (L2): This heuristic looks at the difference between the vectors v and h under a certain distance metric (norm), instead of the dot product. For example, for the ℓ2 norm distance metric, from all unassigned VMs it places the VM v that minimizes the quantity \sum_i w_i (v_i − h_i(t))^2, provided the assignment does not violate the capacity constraints.
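The following sketch (ours; the helper names are hypothetical) shows how the next VM could be chosen for the currently open host under each rule; a host is packed by repeatedly applying one of these selectors, subtracting the chosen VM's demands from the residual capacity, and opening a new host once nothing fits.

def fits(vm, residual):
    """True if the VM's demand fits within the residual capacity in every dimension."""
    return all(v <= h for v, h in zip(vm, residual))

def pick_dot_product(residual, candidates, weights):
    """DotProduct: among VMs that fit, pick the one maximizing sum_i w_i * v_i * h_i(t)."""
    feasible = [vm for vm in candidates if fits(vm, residual)]
    if not feasible:
        return None                      # nothing fits: open a new host
    return max(feasible, key=lambda vm: sum(w * v * h
                                            for w, v, h in zip(weights, vm, residual)))

def pick_l2(residual, candidates, weights):
    """Norm-based Greedy (L2): among VMs that fit, pick the one minimizing
    the weighted distance sum_i w_i * (v_i - h_i(t))^2."""
    feasible = [vm for vm in candidates if fits(vm, residual)]
    if not feasible:
        return None
    return min(feasible, key=lambda vm: sum(w * (v - h) ** 2
                                            for w, v, h in zip(weights, vm, residual)))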

2.3 Related heuristics from literature

There are several VM consolidation heuristics currently used in research prototypes and real VM management tools. We briefly outline how they are related to the heuristics described in this section. There are also other heuristics for server consolidation in non-virtualized environments [2, 4, 3, 10], which we omit from this study.

Sandpiper [14], a research system that enables live migration of VMs away from overloaded hosts, uses a heuristic corresponding to FFDProd, taking the product of the CPU, memory, and network loads. Another work from IBM Research [13], an application placement controller for data centers, performs application assignment and consolidation handling resource demands for CPU and memory. This work combines the two dimensions into a scalar by taking the ratio of the CPU demand to the memory demand. Even though this heuristic does not fit the FFDSum or FFDProd mold, it shares a similar characteristic of ignoring complementary distributions of resource demands across dimensions.

Microsoft's Virtual Machine Manager [18] internally uses the dimension-aware Dot-Product and Norm-based Greedy heuristics described above. A recent research work [12] proposes to use the Euclidean distance between resource demands and residual capacity as a metric for consolidation, a heuristic analogous to Norm-based Greedy.

3 Validating Assumptions

The heuristics we just described make critical assumptions about how resource utilization and performance of virtual machines aggregate when they are hosted together. As mentioned earlier, a basic assumption they make is perfect performance isolation: each VM's resource consumption, even when hosted with others, would be the same as estimated from an individual execution; or, in terms of performance, the throughput realized by each VM when hosted together is the same as the throughput it achieves when hosted individually. If this assumption holds, then the residual capacity in a host for a given resource can indeed be computed by subtracting the sum of the estimated resource requirements of each of the VMs assigned to that host from the host's total capacity.

In practice, however, the amount of resources a VM utilizes (and the performance it realizes) might not be preserved when co-hosted with other VMs. Performance degradation occurs for multiple reasons. 1. Virtualization overhead: There might be overhead or bottlenecks introduced by the hypervisor that consume some of the host's capacity. For instance, the scheduler and other background processes in the hypervisor and the host OS might consume CPU cycles and I/O bandwidth. 2. Cross-interference: Multiple VMs running together could interfere with each other, degrading their performance or consuming more of the resources than they would if hosted individually. For instance, contention for a shared L2 cache, or cache flushes that occur as a result of context switching, decrease the effectiveness of caching and adversely impact VM performance. 3. Resource over-subscription: Even in the absence of overhead and interference, a VM's resource utilization during actual execution might differ from the estimate due to errors in the estimation process or unpredictable changes in the VM's workload.

Since the above causes are often unavoidable, VM consolidation tools expect and even tolerate some amount of performance degradation for co-hosted workloads. The key concern then is the extent of performance degradation. The most common method to tolerate performance degradation, employed in Microsoft System Center VMM for example, is to provide some slack while allocating resources. For example, if the residual capacity of a host goes below a certain threshold, say 20% of the total capacity, then no more VMs are allocated to that host. This slack accommodates a certain amount of deviation in how resources are consumed and tolerates performance degradation to a limited extent. However, if the performance degradation due to cross-interference or resource oversubscription is too steep, the system could start thrashing, and the VMs would be unable to complete their tasks in time. A pathological scenario occurs, for example, when the co-hosted VMs take even longer to finish than the time taken to run the VMs sequentially in isolation, entirely defeating the goal of saving power through consolidation.
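A minimal sketch of this slack rule, assuming demands and capacities are expressed in the same units per dimension (our illustration, not the VMM implementation):

def host_accepts(vm_demand, residual, capacity, slack=0.20):
    """Accept a VM on a host only if, after placement, every dimension still
    retains at least `slack` (e.g. 20%) of the host's total capacity."""
    return all(h - v >= slack * c
               for v, h, c in zip(vm_demand, residual, capacity))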

Finally, it is also essential to understand whether resource allocation, especially when performance degradation kicks in, is fair across multiple VMs. For example, assume that a VM consumes an excessive amount of computation resources, say 50%, compared to its estimated value of, say, 20%. When computation resources are scarce, the hypervisor would meet this VM's requirement by stealing resources allocated to other VMs. If the overall resource allocation is not fair, an unfortunate VM could be starved of its resource cycles. In general, accommodating a non-uniform degree of performance degradation between VMs is hard, as simple solutions, such as reserving a slack, do not produce the desired results.¹

¹ A sophisticated hypervisor might handle this by reserving resources for each VM according to its estimated needs (perhaps with a slack) and ensuring that each VM gets its promised quantity of each resource. In that case, oversubscription would only affect the VM that has exceeded its resource requirement. Unfortunately, current hypervisors only support resource reservation for some resource types.

In summary, the above assumptions, critical for current VM consolidation heuristics to work well, translate to the following three questions: 1. Do resource utilization and performance aggregate additively, according to their estimates, when multiple VMs are hosted together? 2. When does performance degradation kick in, and is the degradation gradual? 3. Under high utilization, are resources shared fairly so that any performance degradation would affect all VMs equally? In the rest of this section, we examine these questions separately for four types of resources, namely CPU, cache, network, and storage.

Figure 2: Experimental setup. The host runs the Windows Hyper-V hypervisor with a Hyper-V-aware Windows Server 2008 host OS; cores C0-C1 and C2-C3 each share an L2 cache, backed by 8 GB DRAM and a network card. Guest VMs 1 through 6 each run Windows Server 2008, execute a workload, and log performance counters.

3.1 Experiment setup

We performed our experiments on a computer with the following hardware specification: a 2.83 GHz Intel Core 2 Quad processor, 8 GB RAM, a Broadcom NetXtreme Gigabit Ethernet card, and 4 spare SATA ports. The processor has four cores, each pair of which shares a 6 MB L2 cache. We added additional storage hardware to this computer, consisting of either four Seagate Momentus 7200 RPM SATA drives or three Intel X-25M MLC SSDs. We also repeated some experiments on an alternative setup, a computer with four 1.9 GHz AMD Opteron 6168 processors with 12 cores each (48 cores in total) and 128 GB of total RAM. We used the second setup to verify the observations of our experiments on a multi-processor machine with a much larger number of cores.

We set up the computers with a 64-bit Windows Server 2008 host operating system, which supports virtualization through Microsoft's Hyper-V platform. Hyper-V uses the native virtualization support (Intel VT and AMD-V) provided by the processor. The guest virtual machines were also configured with the 64-bit Windows Server 2008 OS. We statically allocated 1 GB of RAM to each guest virtual machine because Hyper-V does not yet support dynamic memory allocation. To avoid interference, we used a separate hard disk to store the root volumes of the host and guest virtual machines.

We used the Windows Logman tool to collect and manage performance counter values. We collected a number of performance counters, including the percentage of total processor run time, bytes sent and received over the network interface, and average disk usage in bytes per second, both from the host and from the virtual machines. Figure 2 illustrates the architecture of the machine and the experimental setup.

Figure 3: Normalized CPU throughput of each VM over time. A new VM workload was started every 10 minutes, for up to 6 VMs.

3.2 CPU

Workload. We first present results from the CPU experiments. To study CPU performance, we used a matrix multiplication program written in C as a micro-benchmark. This program repeatedly performs a matrix multiplication of the form A^T A on a matrix A of R rows and C columns. We chose this workload because it allows us to precisely control the tradeoff between memory/cache utilization and computation cycles. Choosing a small matrix, and executing many iterations of the multiplication, enables us to minimize the effects of cache contention. On the other hand, increasing the matrix size allows us to study the effects of cache contention in a controlled manner.
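For illustration, a Python sketch of the workload's structure follows (the authors' benchmark was written in C; the function name and the default 60-second duration are our own choices).

import time

def matmul_ata_benchmark(rows, cols, seconds=60):
    """Repeatedly compute A^T * A for an R x C matrix A and return the number
    of multiplications completed per second (the throughput metric used in the
    CPU experiments). The matrix size controls the pressure on the cache."""
    A = [[float((i * cols + j) % 7 + 1) for j in range(cols)] for i in range(rows)]
    iterations = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        # (A^T A)[i][j] = sum over k of A[k][i] * A[k][j]
        product = [[sum(A[k][i] * A[k][j] for k in range(rows))
                    for j in range(cols)] for i in range(cols)]
        iterations += 1
    return iterations / seconds

# Each guest VM ran the equivalent of matmul_ata_benchmark(128, 128) for 60 minutes.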

Procedure. We examined how computation resources aggregate by hosting up to six simultaneous VM instances on the four-core machine. Each instance ran the matrix multiplication workload on a 128x128 matrix for 60 minutes. However, we staggered the start time of the workload by 10 minutes for each VM. That is, at the start of the experiment, only one VM starts running the workload; the second VM starts 10 minutes later, and so on. The staggering allows us to examine resource aggregation for different numbers of active VMs.


Results. We measure and plot the throughput realized by each VM workload in terms of the number of matrix multiplications completed per second. The throughput of the different VMs is an indication of how resources are shared between them. If resources combine additively, then the throughput of the existing VMs remains the same as new VMs are added to the system, and the new VMs also realize the same throughput. However, if there are performance bottlenecks, then the throughput will decrease.

Figure 3 plots the throughput versus time as each of the six VMs runs its workload. We normalize the throughput with respect to the maximum throughput seen by a single VM. Since the machine is quad-core, we expect to achieve, in the best case, four times the maximum throughput achieved by a single VM. In fact, the figure shows this additive increase in throughput for the first two VMs: they both achieve close to 100% normalized throughput each, with minimal interference from other overheads.

However, as more VMs are added to the system, the throughput decreases for all the VMs. Throughput degradation is expected to occur when there are more than four VMs, because the system needs to share its four cores among 5 or 6 VMs. However, the throughput degradation for the 3- and 4-VM cases is somewhat surprising. Figure 3 shows that running 4 VMs only increases the aggregate throughput 3.2-fold compared to 1 VM. It appears that there are bottlenecks that decrease the effectiveness of sharing the 4 cores among 4 VMs. Subsequent experiments showed that this bottleneck is due to cache contention created by processes running on the host OS. When there are three VMs running on one core each, host OS processes run on the fourth core and contend for the L2 cache that the fourth core shares with another core allocated to one of the VMs.

Figure 4 plots the average normalized throughput seen by each VM. Each cluster of bars in the figure corresponds to a certain number of active VMs. The bars plot the normalized throughput realized by each VM in that cluster; the exact identification of each VM is not relevant. Figure 4 summarizes the trends captured in Figure 3. It also shows that where performance degradation occurs, it affects all VMs almost equally; that is, CPU allocation is fair. Note that Hyper-V does not pin VMs to specific cores. Consequently, the performance degradation affected all VMs equally over time, even though at specific moments different VMs saw different effects.

Figure 4: Normalized CPU throughput realized by each VM as the number of co-hosted VMs varies. Each clustered set of bars shows the variation in throughput across VMs; the exact identification of the VMs, and hence the color of the bars, is not relevant.

Multi-processor experiments. We also performed CPU experiments on the 4-processor, 48-core machine using the same procedure. In this case, we ran up to 20 simultaneous VM instances. Moreover, each VM instance internally ran four parallel instances of the matrix multiplication workload, resulting in 80 parallel workloads on the 48-core machine. We started the VM instances in the same staggered manner as earlier, but in groups of 4.

Figure 5: Normalized CPU throughput realized by each VM as the number of co-hosted VMs varies in the multi-processor experiments. As in Figure 4, the color of the bars is not relevant.

Figure 5 plots the average normalized throughput seen by each VM in the multi-core experiment, with each cluster of bars in the figure corresponding to a certain number of active VMs. As before, the exact identification of each VM (bar in the graph) is not relevant. Figure 5 predominantly confirms the trends captured in the previous experiments. There is no real performance degradation for up to 12 VMs, which corresponds to 48 parallel workloads, equal to the number of cores in the machine. A gradual but proportional degradation in throughput is incurred when more VMs are added: the per-VM throughput drops to about 75% on average in the 16-VM case and 60% on average in the 20-VM case.


Figure 6: Throughput degradation (decrease in throughput, %) experienced by the shared workload compared to the corresponding isolated workload, as a function of the number of rows in the matrix, for 128 and 256 columns.

However, the fairness in resource utilization and the corresponding uniformity in performance degradation seen in the small-core experiments is noticeably absent. Figure 5 shows that there is a discrete granularity at which resources are utilized, causing about 12 VMs (in the 20-VM case) to cluster around the 50% normalized throughput point while the other 8 VMs cluster around the 75% point, as if the first 12 VMs shared the cores of two processors while the remaining 8 VMs shared the other two processors, evenly in both cases. This indicates that Hyper-V's allocation of cores maintains fairness within each processor, but perhaps not as effectively across multiple processors.

3.3 Cache

Workload and procedure. We next examine the effects of cache contention on shared VM workloads. We performed this study using the same matrix multiplication workload as in the CPU experiments, but with varied matrix sizes. In each experiment, we ran up to four simultaneous VM instances, staggering the start time of the workload by 10 minutes as before. Since each pair of cores in our test machine has a shared 6 MB L2 cache, VM instances hosted on cores with a shared cache experience cache contention. We varied the matrix size, changing the number of rows as well as columns, to induce different degrees of cache contention.

Results. We measure and plot the degree of throughput degradation caused by cache contention. We define this degradation as the relative drop in the average throughput of the shared, four-VM workload compared to the average throughput of running the workload on a single isolated VM.

Figure 6 plots the percentage of throughput degradation for different sizes of the matrix. It shows that a small degree of performance degradation (less than 20%) occurs even for small matrices; these matrices fit into the L2 cache and should ideally not incur the penalty of cache misses. As we already observed in the CPU experiments, this degradation stems from sharing the CPU and cache with the root VM. However, a remarkably larger performance degradation, where the throughput drops to about a third of the throughput in the isolated case, can be noticed as the matrix size increases beyond a certain threshold. This experiment clearly shows that cache contention can degrade the performance of shared workloads precipitously.

A closer look at a specific point in Figure 6 helps to understand this cache-related behavior better. Figure 7 plots the normalized throughput produced by each VM running the matrix multiplication workload for a 2048 x 256 matrix, when different numbers of VMs are co-hosted. Corresponding to the respective point in the previous figure, the normalized throughput of each VM when four VMs are co-hosted can be seen to be remarkably low, about 30% of the peak throughput. Essentially, this brings out a pathological scenario: the shared 4-VM workload takes longer to complete than running two instances of the 2-VM workload sequentially.

Figure 7 also provides additional insights by showing the behavior when two or three VMs are co-hosted. In the two-VM case, one of the VMs experiences significant throughput degradation (due to cache contention from the root VM running on the other core that shares its cache) while the other does not. Similarly, the three-VM case also shows non-uniform severity in throughput degradation. Here, the VM instance that shares the cache with the root VM experiences less throughput degradation compared to the other two VMs, which contend for the shared cache with each other.

Figure 7: Normalized throughput of each VM for the 2048 x 256 matrix multiplication workload.

In summary, these experiments confirm that the hypervisor does not provide performance isolation from cache contention, and neither does it allocate cache resources fairly to ensure proportional resource utilization.

3.4 Network

Workload. To measure the performance of the network components, we modified the same matrix multiplication benchmark by interspersing it with network packet transfers. Specifically, the network benchmark transmits one UDP packet to a second computer after each iteration of the inner-most loop of the matrix multiplication. We created two versions of this network workload: a light workload, which sends a 4096-byte packet from a single thread and is not sufficient to saturate the network, and a multi-threaded heavy workload, which sends packets of 65000 bytes (the largest allowed) from 5 parallel threads.
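A minimal sketch of the light variant follows (our illustration; the peer address, matrix size, and the choice to send after each completed pass of the inner-most loop are assumptions, since the original benchmark is not published here).

import socket

def light_network_workload(peer, packet_bytes=4096, rows=128, cols=128):
    """Interleave matrix multiplication (A^T * A) with UDP sends: one packet to
    `peer` after each pass of the inner-most loop. The heavy variant would use
    65000-byte packets from 5 parallel threads instead."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * packet_bytes
    A = [[1.0] * cols for _ in range(rows)]
    product = [[0.0] * cols for _ in range(cols)]
    for i in range(cols):
        for j in range(cols):
            acc = 0.0
            for k in range(rows):        # inner-most loop of the multiplication
                acc += A[k][i] * A[k][j]
            product[i][j] = acc
            sock.sendto(payload, peer)   # one UDP packet per inner-loop pass
    return product

# e.g. light_network_workload(("192.0.2.10", 9000)) -- the address is hypothetical.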

Procedure. We examined the aggregation of the network resource by hosting 5 simultaneous VM instances. The workload duration for each instance was 25 minutes, and the start time of the workload on different VMs was staggered by 5 minutes, similar to the CPU and cache experiments.

Results. We again measure resource allocation in terms of the throughput realized by each VM. Figure 8 plots the network throughput seen by each VM during different phases of the experiment, where different numbers of VMs were hosted. The throughput plotted in Figure 8 is normalized with respect to the maximum throughput seen by a single VM, which was about 935 Mbps measured at the application level, quite close to the maximum capacity of the gigabit network card.

Figure 8 shows that throughput degrades as more VMs are added. This behavior is expected because the VMs are running the heavy workload, which saturates the network card. The throughput degradation is also close to optimal: with 5 VMs, each VM achieves about 20% of the total network throughput, a fair share.

Figure 8: Normalized throughput realized by each VM as the number of VMs varies, for the heavy network workload, which saturates the network card.

Figure 9, on the other hand, plots the throughput (not normalized) for the light workload. In this workload, each VM sends less than 120 Mbps of network traffic. Therefore, the network card has sufficient capacity to support the combined network requirements of all 5 VMs. Yet, there is a clear throughput degradation visible even when two VMs are active; the throughput drops from about 120 Mbps per VM to about 90 Mbps per VM. The throughput degradation continues as more VMs are added, albeit more gradually, settling at about 75 Mbps for the 4- and 5-VM cases.

Figure 9: Network throughput (in Mbps) realized by each VM as the number of co-hosted VMs varies, for the light network workload.

The above result for network throughput is surprising, especially because the previous heavy-workload experiment indicated fair sharing with little throughput loss. Closer examination of the anomalous throughput degradation for the light workload indicates the following reason: once a network send is issued by a process, there is a certain delay before the send returns control to the process.² This delay increases sharply as another VM is added to the system. Subsequent additions of VMs increase this delay further, albeit to a lesser extent. Overall, the delay varies from about 35 µs to 50 µs per packet. The precise cause of this behavior requires further examination. For the purposes of this study, this experiment indicates that network throughput does not aggregate perfectly but incurs a mild performance degradation.

² We observed this from the fact that we were unable to saturate the network using a single thread issuing network packets, even when sending packets continuously with no intermittent computation and at the maximum packet size.

3.5 Storage

Workload. Finally, we measure the utilization of storage devices under multiple VMs using simple sequential and random write workloads. The sequential write benchmark writes 10 MB chunks sequentially to a 20 GB file from start to end. The random write benchmark uses two threads to issue 4 KB writes to random locations in a 20 GB file. In order to ensure that the file system cache is cold, all the benchmarks are run after a clean mount of the volume on which the accessed file is stored. We run these workloads on disks and solid state drives (SSDs).
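A rough sketch of the two write patterns (ours; the iteration count for the random writer and the use of unbuffered Python file I/O are assumptions):

import random

FILE_SIZE = 20 * 1024 ** 3        # 20 GB target file
SEQ_CHUNK = 10 * 1024 * 1024      # 10 MB sequential writes

def sequential_write(path, file_size=FILE_SIZE, chunk=SEQ_CHUNK):
    """Sequential write benchmark: write chunk-sized blocks from the start of
    the file to the end."""
    buf = b"\0" * chunk
    with open(path, "wb", buffering=0) as f:
        written = 0
        while written < file_size:
            f.write(buf)
            written += chunk

def random_write(path, file_size=FILE_SIZE, block=4096, count=100_000):
    """One random-write thread: 4 KB writes at random offsets of a preallocated
    file (the benchmark runs two such threads against a 20 GB file)."""
    buf = b"\0" * block
    with open(path, "r+b", buffering=0) as f:
        for _ in range(count):
            f.seek(random.randrange(0, file_size - block))
            f.write(buf)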

Procedure. In all our experiments, virtual disks are exported as storage devices to the VMs. Virtual disks are simply files that are pre-allocated on the physical disk or SSD on the host. We evaluated the storage performance by running 1 to 4 VMs under different configurations: (a) all the virtual disks are stored on a simple volume on a single disk or SSD, (b) all the virtual disks are stored on a striped volume over a set of 2 to 4 disks or 2 to 3 SSDs, and (c) each virtual disk is stored on a separate physical disk. We ran each experiment 5 times and averaged the results.

Disk results. Figure 10 presents the results for disks, where the Y-axis plots the average aggregate disk bandwidth observed by the host. The X-axis presents 5 clusters of bars, one for each of our configurations; each bar in a cluster represents the aggregate host disk bandwidth when a specific number of virtual machines were running the same workload.

We make two observations from this figure. First, on configurations where the disks are shared, the disk bandwidth decreases with more VMs. For example, 2 striped disks give about 86 MB/s with a single VM, whereas the bandwidth drops to as low as 45 MB/s with 2 VMs. This happens because of the competing sequential streams of writes from the two VMs, which result in a random write workload on the underlying disk storage. We see a similar effect on all the configurations where the disks are shared. However, in our last configuration, where a separate disk is allocated to each VM, no I/O interference occurs and, therefore, the aggregate disk bandwidth increases.

Second, we notice that the disk-bandwidth drop from 1 VM to 2 VMs is much greater than the drop from 2 to 3 VMs or from 3 to 4 VMs. Specifically, for the striped disk configurations, the average bandwidth reduction from 1 to 2 VMs is 49%, whereas the drop from 2 to 3 VMs is 10-20%.

Figure 10: Sequential write disk bandwidth (aggregate host bandwidth in MB/s for 1-4 VMs on each disk configuration).

Figure 11 presents results for the random write workload. Surprisingly, even for the random workload, the overall disk bandwidth drops with more VMs. We analyzed the disk offsets to which the random access requests are issued and noticed that, on average, the seek distance between two random writes increased with more virtual machines, which could result in the decreased disk bandwidth. Overall, for the random write workload, we observe trends similar to those of the sequential write workload.

Figure 11: Random write disk bandwidth (aggregate host bandwidth in KB/s for 1-4 VMs on each disk configuration).

SSD results. Figures 12 and 13 present the sequential and random write results from similar experiments on SSDs. From Figure 12, we observe that the aggregate sequential bandwidth drops with multiple VMs even when running on SSDs, even though SSDs have better random access IOPS than disks; however, the percentage reduction in bandwidth from 1 VM to 2 VMs is smaller than that of the disks. Specifically, for striped SSDs, the average reduction in bandwidth from 1 VM to 2 VMs is about 20%, whereas it is 49% on disks. From Figure 13, we notice that the aggregate random write bandwidth increases with multiple VMs (as opposed to the decreasing trend on disks). This is because of the better random access performance of SSDs.

Figure 12: Sequential write SSD bandwidth (aggregate host bandwidth in MB/s).

Figure 13: Random write SSD bandwidth (aggregate host bandwidth in KB/s).

In summary, since hard disk drives are sensitive to the sequentiality of the workload, sharing disks among multiple VMs results in bandwidth reduction. Surprisingly, this applies to random workloads, and even to SSDs to a certain degree. In terms of fairness, the I/O bandwidth measured at each VM guest for each of the above configurations shows that the bandwidth allocation is proportional.

3.6 Implications

The above experiments analyzed resource aggregation, performance degradation, and fairness of resource allocation for four important resource types; Figure 14 summarizes the findings. The key conclusion we can draw from this study is that the specific behavior depends on the type of resource and the quality of the workload. For computation, resources are additive at low utilization levels, and performance degradation kicks in gradually when the number of co-hosted VMs or their CPU demand exceeds the available capacity. However, when the memory footprint of the workload is large and the pressure on the cache is higher, cache contention can lead to precipitous drops in performance. For network, a mild performance degradation occurs due to additional delay in network packet processing; however, VMs are able to utilize the entire available capacity even under high levels of resource contention. Finally, for storage, performance degradation is quite steep, especially for workloads with sequential I/O. The degradation is milder for random workloads or when solid state drives are used instead of conventional mechanical disks. Overall, VMs are unable to share storage and cache resources as aggressively as network and computation resources. Finally, resource allocation for most resources and workloads we studied was fair, except for cache and perhaps for multi-processor systems with a large number of cores, indicating that starvation is mostly a non-issue.

Resource   Is resource consumption        Is performance degradation     Is resource allocation
type       additive at low utilization?   gradual at peak utilization?   fair at peak utilization?
CPU        Yes                            Yes                            Yes
Cache      Yes                            No                             No
Network    No                             Yes                            Yes
Storage    No                             No                             Yes

Figure 14: Summary of characteristics of resource-performance tradeoffs in a virtualized system.

These observations have the following implications for VM consolidation. On the positive side, simple techniques that can account for mild performance degradation, such as reserving an unallocated slack capacity as proposed in [9], might be sufficient for hosting VMs with computation and network intensive workloads. Even if the actual resource usage differs from the estimate by a large amount, the resulting performance degradation will be commensurate with the difference and spread more or less equally over all VMs. On the negative side, consolidating VMs that rely heavily on storage or are sensitive to the size of the cache might be counterproductive.

One approach to handle resource types that show steep performance degradation when shared is to enforce strict reservations on the quantity of resources allocated to each VM. In current virtualization platforms, reservation can be done at a coarse granularity, by allocating one or more cores with shared caches to each VM exclusively or by allocating a separate disk to each VM. This approach also comes with a limitation on the maximum number of VMs a computer can host, determined by the number of cores or the number of disks the hardware supports. Resource reservation at a finer granularity, such as a fraction of a device, is supported only for CPU and memory.

The insights drawn from this study call for better I/O and cache performance isolation in virtualization platforms. A recent work called mClock [6] addresses the variability in I/O throughput in hypervisors and proposes new mechanisms to enforce proportional fair sharing for I/O. While this work is a step in the right direction, it needs to account for locality in I/O requests in order to avoid performance interference when I/O workloads are combined.

Cache performance isolation is an even more difficult problem to address. Current VM consolidation tools do not take into account the cache architecture of the servers or the cache sensitivity of their workloads. While a few hardware-level solutions have been proposed for cache performance isolation, dynamic page coloring for example, they are difficult to incorporate into hypervisors. Meanwhile, processor and cache architectures are rapidly evolving: newly planned systems have a high degree of asymmetry in cache layout, more tiers of shared caches, and irregular features, for instance, no cache coherency [17]. Effective VM consolidation in future data centers and cloud computing platforms will depend even more on addressing this issue.

4 Evaluating Heuristics

In this section, we empirically compare the various heuristics presented in Section 2. We seek to address the following questions, to guide the choice of heuristic in any given setting: 1. How well do the simpler, FFD-based heuristics perform? 2. Are the dimension-aware heuristics better, and if so, by how much? 3. What properties of the inputs affect the answers to these questions?

Towards this end, we evaluate the heuristics FFDProd, FFDSum, DotProduct, and L2 on some real and some synthetic workloads.

4.1 Real workloads

We use data from the Dryad computing cluster [8] for this study. Dryad represents a typical map-reduce type of computational framework, a common source of virtualized workloads in cloud computing clusters such as Amazon EC2. We recorded the average usage in 5 resource dimensions, namely CPU, memory, disk, inbound network traffic, and outbound network traffic, for four different sets of processes running on the Dryad cluster. The four sets of processes come from four large Dryad jobs implementing the pagerank algorithm (PgRank), a wordcount application (WordCount), a sieve algorithm for finding primes (Primes), and a machine learning based click-bot detection algorithm (Clkbot). For each process in each job, the resource usage is averaged over measurements taken every 10 seconds. We consider a VM for each process.

Figure 15 summarizes the number of VMs and the average normalized demands for each of the four sets of VMs. The capacities of all hosts are set to 400% CPU (4 cores), 4 GB of memory, 50 MBps of disk bandwidth, and 128 MBps of network bandwidth. The demands in Figure 15 are shown as fractions of these capacities. Thus we can see, for example, that PgRank is memory-intensive while Primes is CPU-intensive.

Process     #VMs   Avg. CPU   Avg. Memory   Avg. Disk   Avg. Network tx   Avg. Network rx
PgRank      3137   0.2845     0.3804        0.0253      0.0533            0.0572
WordCount   1598   0.3000     0.2702        0.0138      0.0007            0.0003
Primes       279   0.3364     0.0144        0.0085      0.0078            0.0004
Clkbot       414   0.1469     0.0308        0.1713      0.0574            0.0753

Figure 15: Characteristics of the input sets. The average values are fractions of the chosen capacities: 400% CPU, 4 GB of memory, 50 MBps disk bandwidth, and 128 MBps network bandwidth.

In order to analyze the performance of the heuristics, we need a baseline. An immediate problem in this approach is that the vector bin packing problem is NP-hard, and thus it is difficult to find the optimal solution (OPT). Instead we use a known lower bound. For any dimension, the ratio of the total demand in that dimension to the bin capacity in that dimension can be shown to be a lower bound on OPT. We let LB denote the maximum of this ratio over all dimensions. We then use the metric percentage overhead, defined as 100 * (ALG − LB) / LB, where ALG is the number of bins used by a specific algorithm, to compare the effectiveness of the heuristics. Since our experiments have widely varying optimal numbers of hosts, we normalize our results relative to the lower bound and compare how much worse the heuristics perform relative to the best possible.
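For concreteness, a small sketch of the lower bound and the comparison metric (our code; the function names are illustrative):

def lower_bound(vms, capacity):
    """LB: for each dimension, the total demand divided by the bin capacity in
    that dimension is a lower bound on OPT; take the maximum over dimensions."""
    dims = range(len(capacity))
    return max(sum(vm[i] for vm in vms) / capacity[i] for i in dims)

def percentage_overhead(hosts_used, lb):
    """The comparison metric: 100 * (ALG - LB) / LB."""
    return 100.0 * (hosts_used - lb) / lb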

We first compare the heuristics on each set of VMs corresponding to each Dryad job. Figure 16 reports the results. It shows that for the pagerank set, FFDSum does much better than FFDProd and is quite close to the lower bound. The results for the other sets are similar. The dimension-aware heuristics do not offer further improvement.

To study the efficacy of the heuristics for more heterogeneous inputs, we construct inputs by mixing these sets of VMs together. To account for the variation in the number of VMs per set, we replicate the smaller sets so that each set contributes approximately the same number of VMs to the input. The merged input consists of 1, 2, 10, and 7 copies of PgRank, WordCount, Primes, and Clkbot, respectively, leading to a total of 12021 VMs. We also consider Clkbot+Primes, consisting of 1 copy each of Clkbot and Primes, giving us 692 jobs. Since Clkbot is disk-intensive and Primes is CPU-intensive, these give us negatively correlated dimensions. Similarly, we consider Clkbot+PgRank, consisting of 1 copy of PgRank and 7 copies of Clkbot, leading to 6034 VMs.

Figure 16 shows that for the mixed sets of processes, the dimension-aware heuristics, DotProduct and L2, indeed make a difference and show less performance overhead compared to the dimension-less heuristics, FFDProd and FFDSum. There is, however, no clear winner between DotProduct and L2. The difference indeed looks substantial in relative terms. The values of the lower bound LB in the four cases shown in the graph (596, 3217, 104, and 1319, respectively) add more insight into the absolute performance. In the case of Clkbot+PgRank, the 10% difference between FFDSum and L2 translates to 132 fewer active hosts for L2. On the other hand, in the case of Clkbot+Prime, the absolute benefit of using dimension-aware heuristics over dimension-less heuristics is only about 7 hosts.

Figure 16: Empirical results on inputs generated from Dryad data (percentage overhead of FFDProd, FFDSum, DotProduct, and L2).

4.2 Synthetic workloads

To further understand the relation between the performance of the heuristics and the correlations between the demands in the various dimensions, we also evaluate the heuristics on some synthetic workloads, where we can control the correlations carefully. We consider four classes of synthetic inputs. We fix the number of dimensions to two. The first three classes are randomly generated with independent dimensions, positively correlated dimensions, and negatively correlated dimensions, respectively.

Caprara and Toth [1] proposed ten classes of inputsto benchmark the effectiveness of two-dimensionalbin-packing algorithms. Our three randomly gener-ated classes are chosen from those. The first inputclass, denoted Random has the size of each bin set toh = 150. The demand for each VM, in each of the twodimensions is drawn uniformly and independently atrandom from [20, 100]. The next two input class de-noted Positive and Negative, have correlated dimen-sions. In both cases, h = 150 as in Random, and thedemand in the first dimension v1 is drawn randomlyand independently from [20, 100]. Demand in the sec-ond dimension v2 is sampled from [v1−10, v1+10] forclass Positive and from [110 − v1, 130 − v1] for classNegative. We sample 200 VMs in each case whichdefines our input. Finally, the fourth input class Neg-ativeSynth is an artificially created example to showthe extent of the worst case difference between theheuristics. Here capacity is set to a 100 in each of thetwo dimensions. There are two sets of jobs: a set of1000 jobs with sizes (20, 1) and another set of 1000jobs with sizes (3, 20). The results are resistant to

[Figure: bar chart of percentage overhead (0-20%) for FFDProd, FFDSum, DotProduct, and L2 on the PgRank, Merged, Clkbot+Prime, and Clkbot+PgRank inputs.]

Figure 16: Empirical results on inputs generated from Dryad data.

The results are resistant to small amounts of noise being added to these sizes. It turns out that the optimal solution (OPT) can indeed be computed for these classes of inputs; therefore, we use OPT as the lower bound LB in the definition of percentage overhead. Figure 17 summarizes the results.
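
As an illustration only, a minimal sketch of generating these synthetic classes is given below; it assumes integer demands, which is typical for such vector bin-packing benchmarks, and otherwise follows the sampling ranges described above.

    import random

    H = 150        # bin capacity per dimension for Random, Positive, Negative
    H_SYNTH = 100  # bin capacity per dimension for NegativeSynth

    def random_class(n=200):
        """Class Random: both demands drawn independently from [20, 100]."""
        return [(random.randint(20, 100), random.randint(20, 100))
                for _ in range(n)]

    def positive_class(n=200):
        """Class Positive: v2 drawn from [v1 - 10, v1 + 10]."""
        vms = []
        for _ in range(n):
            v1 = random.randint(20, 100)
            vms.append((v1, random.randint(v1 - 10, v1 + 10)))
        return vms

    def negative_class(n=200):
        """Class Negative: v2 drawn from [110 - v1, 130 - v1]."""
        vms = []
        for _ in range(n):
            v1 = random.randint(20, 100)
            vms.append((v1, random.randint(110 - v1, 130 - v1)))
        return vms

    def negative_synth():
        """Class NegativeSynth: 1000 jobs of size (20, 1) and 1000 of size (3, 20)."""
        return [(20, 1)] * 1000 + [(3, 20)] * 1000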

4.3 Implications

We see that in all cases FFDProd is outperformed by the other heuristics. This is due to the fact that dimensions for which the demand is much smaller than the capacity contribute significantly to the ordering when we take the product. While we would like to ignore such dimensions and sort by decreasing sizes in the relevant dimensions, FFDProd ends up with a very noisy version of this ordering, leading to higher overhead. FFDSum gives higher weight to the important dimensions and hence is less affected by the irrelevant dimensions3.

3 In our experiments we tested three different ways to calculate the weights. In the first, w_i = D_i / (n * h_i), where D_i is the total demand across all VMs for resource i, n is the number of VMs, and h_i is the capacity of the host in dimension i; thus the first weight function is simply the average demand as a fraction of capacity. The second and third approaches aim to make the weight increase more aggressively as the resource becomes scarce. To this end, the second sets w_i = exp(1/D_i) and the third sets w_i = exp(10/D_i).
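
As an illustration only, the following sketch shows the kind of ordering these FFD variants use and how first-fit placement then proceeds; it uses the first (average-demand) weight function from footnote 3, and the function names are ours, not the paper's.

    from math import prod

    def ffd_prod_key(vm):
        """FFDProd: VMs are sorted by decreasing product of their demands."""
        return prod(vm)

    def ffd_sum_key(vm, weights):
        """FFDSum: VMs are sorted by decreasing weighted sum of their demands."""
        return sum(w * d for w, d in zip(weights, vm))

    def average_demand_weights(vms, capacity):
        """First weight function from footnote 3: w_i = D_i / (n * h_i)."""
        n = len(vms)
        return [sum(vm[i] for vm in vms) / (n * capacity[i])
                for i in range(len(capacity))]

    def first_fit_decreasing(vms, capacity, key):
        """Place VMs in decreasing key order; open a new host only when a VM
        does not fit, in every dimension, on any already-open host."""
        hosts = []  # residual capacity vector of each open host
        for vm in sorted(vms, key=key, reverse=True):
            for residual in hosts:
                if all(d <= r for d, r in zip(vm, residual)):
                    for i, d in enumerate(vm):
                        residual[i] -= d
                    break
            else:
                hosts.append([c - d for c, d in zip(capacity, vm)])
        return len(hosts)  # number of active hosts used

For example, with caps = [150, 150] and w = average_demand_weights(vms, caps), calling first_fit_decreasing(vms, caps, key=lambda v: ffd_sum_key(v, w)) gives the number of hosts FFDSum would use on one of the two-dimensional classes above.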



[Figure: bar chart of percentage overhead (0-10%) for FFDProd, FFDSum, DotProduct, and L2 on the Random, Positive, Negative, and NegativeSynth inputs.]

Figure 17: Empirical results on two-dimensional synthetic inputs. The overhead for NegativeSynth for the two FFD-based algorithms is 60%.

In the case of homogeneous VMs, the multi-dimensional heuristics do not necessarily improve on FFDSum. In the first case, this is due to the fact that the various VMs are very similar to each other and have a demand in one dimension that is much larger than in the others; hence the input is effectively one-dimensional. When we have a mix of jobs with different dominant dimensions, the dimension-aware heuristics outperform FFDSum. While we do not know OPT in these cases, it is noteworthy that the better heuristics come very close to the trivial lower bound; their performance relative to OPT is likely to be even better.
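
The paper's precise definitions of DotProduct and L2 appear earlier in the document; purely to illustrate what a dimension-aware placement rule looks like, one common vector bin-packing formulation, which may differ in detail from the one used here, is sketched below.

    def dot_product_choice(vm, hosts, weights):
        """DotProduct-style rule (one common formulation): among feasible open
        hosts, pick the one whose residual capacity has the largest weighted
        dot product with the VM's demand vector."""
        feasible = [h for h in hosts if all(d <= r for d, r in zip(vm, h))]
        if not feasible:
            return None  # caller opens a new host
        return max(feasible,
                   key=lambda h: sum(w * d * r for w, d, r in zip(weights, vm, h)))

    def l2_choice(vm, hosts):
        """L2-style rule (one common formulation): pick the feasible host that
        the VM fills most tightly, i.e. minimize the leftover residual norm."""
        feasible = [h for h in hosts if all(d <= r for d, r in zip(vm, h))]
        if not feasible:
            return None
        return min(feasible,
                   key=lambda h: sum((r - d) ** 2 for d, r in zip(vm, h)))

Because these scores look at every dimension of the residual capacity, a CPU-heavy VM tends to be steered onto a host whose remaining bottleneck is not CPU, which is exactly the pairing behavior that pays off on negatively correlated inputs.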

The results for the correlated input cases (Positive and Negative) are interesting. In the positive case, all algorithms do equally well and have less than 3% overhead. This is not surprising, since this class is effectively one-dimensional, which makes the algorithms behave essentially the same. In the negatively correlated case, DotProduct and L2 both do better than FFDSum. While we have presented results for three random two-dimensional input classes, the results are similar for other random input classes studied in the literature, and for higher-dimensional inputs. The fourth input class, NegativeSynth, is designed to show the starkest contrast between the dimension-aware and the single-dimensional algorithms, and shows a 60 percent difference. With higher-dimensional data, this worst-case gap can grow to a factor close to d, the number of dimensions.
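
As a sanity check on the 60% figure, a back-of-the-envelope packing argument, assuming the FFD orderings end up packing each job type separately while the dimension-aware rules mix them, goes as follows:

\[
4 \times (20,1) + 4 \times (3,20) = (92, 84) \le (100, 100)
\;\Rightarrow\; 2000/8 = 250 \ \text{hosts},
\]
\[
5 \times (20,1) = (100, 5), \quad 5 \times (3,20) = (15, 100)
\;\Rightarrow\; 1000/5 + 1000/5 = 400 \ \text{hosts},
\]
so the single-dimension packing uses \(400/250 - 1 = 60\%\) more hosts.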

In summary, the experiments suggest that FFDProd is dominated by FFDSum, which performs reasonably well on some classes of inputs. The dimension-aware heuristics can give up to a 10% improvement over FFDSum on realistic workloads. The experiments also suggest that the dimension-aware heuristics show the largest gain when there is more than one dominant dimension and there is a mix of VMs with different dominant resources.

5 Discussions and Conclusions

This paper addressed two inter-related questions that are critical to the design and use of VM consolidation heuristics. The first question pertains to how resource utilization and performance aggregate when VMs are co-hosted, trying to identify bottlenecks that might have adverse impacts on consolidation. The second question pertains to how resource demands and scarcities that span different dimensions, such as computation, network, and storage, should be treated with reference to VM consolidation.

Through a combination of experiments and empirical analysis, we draw the following insights on these two questions. First, performance (and resource) aggregation depends on the type of resource and the quality of the workload hosted on the VMs. For computation and network resources, performance and resource utilization stay close to their estimates, and performance degradation in the presence of high resource contention is gradual and fair. For cache and storage, however, consolidation of cache-sensitive and storage-intensive VMs is likely to lead to severely degraded performance. Second, heuristics that treat multi-dimensional resource requirements in simplistic ways appear to be reasonably effective in many common practical situations. Still, more sophisticated, dimension-aware heuristics provide additional benefits, especially for mixed workloads with negatively correlated resource demands.

These broad conclusions are evident even from the limited experiments and empirical analysis presented in this paper. Further experimentation would add more detail to these insights. (1) The experiments could be repeated on different types of hardware and hypervisor platforms with more intermediate points of resource utilization; that would generate more detailed performance models, which can be used to design better heuristics. (2) More complex benchmarks that mix different resource types and run on multiple hosts would add further insight into the consolidation of distributed workloads, throwing light on issues such as data locality and network synchronization. (3) Time-based models that capture the variation of resource demands over time are another important direction for experimentation.

Overall, the insights presented here provide useful guidance both to the customers of VM management tools and to the designers of new consolidation heuristics. These observations validate the following effective and easy-to-deploy approach to VM consolidation: employ a simple FFD-based heuristic that relies on just a few critical resources whose demands can be easily estimated.



Pin each VM to a specific core and a separate disk to minimize the interference of one VM on another and to achieve better performance isolation. For the designers of more sophisticated heuristics, and for customers seeking more aggressive benefits from VM consolidation, it points out additional requirements to be met: a more precise model of performance aggregation specific to each resource type and workload type, and more accurate estimates of the resource demands and of the quality of the workload running on each VM.

Finally, this paper also identifies fresh opportunities for engineers to improve current virtualization platforms. For instance, it highlights the need for more cache-aware resource allocation, both in the virtualization platform and in VM consolidation tools. Another much-needed improvement is better I/O performance isolation, especially for disk I/O; the recent work on mClock [6], which handles I/O throughput variability in hypervisors, is a step in the right direction. Third, virtualization platforms need to accommodate the increasing heterogeneity in the number of processors, the larger number of cores per processor, and the more complex cache models of evolving processor architectures such as the Intel Single-chip Cloud Computer [17]. Overall, we believe this paper identifies pitfalls and opportunities in virtual machine consolidation and serves as a good starting point for future research in this area.

References

[1] A. Caprara and P. Toth. Lower bounds and algorithms for the 2-dimensional vector packing problem. Discrete Applied Mathematics, 111(3):231–262, 2001.

[2] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle. Managing energy and server resources in hosting centers. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP), Banff, Canada, Oct. 2001.

[3] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive Internet services. In Proc. of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, CA, Apr. 2008.

[4] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam. Managing server energy and operational costs in hosting centers. In Proc. of the ACM SIGMETRICS Conference, Banff, Canada, June 2005.

[5] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. Approximation algorithms for bin packing: a survey. In Approximation Algorithms for NP-hard Problems, pages 46–93. PWS Publishing Co., Boston, MA, USA, 1997.

[6] A. Gulati, A. Merchant, and P. J. Varman. mClock: Handling throughput variability for hypervisor IO scheduling. In Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), Vancouver, Canada, Oct. 2010.

[7] D. S. Hochbaum, editor. Approximation Algorithms for NP-hard Problems. PWS Publishing Co., Boston, MA, USA, 1997.

[8] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of the EuroSys Conference, Lisbon, Portugal, Mar. 2007.

[9] R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-Clouds: Managing performance interference effects for QoS-aware clouds. In Proc. of the EuroSys Conference, Paris, France, Apr. 2010.

[10] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath. Load balancing and unbalancing for power and performance in cluster-based systems. In Proc. of the Workshop on Compilers and Operating Systems for Low Power (COLP), Barcelona, Spain, Sept. 2001.

[11] M. Rosenblum and T. Garfinkel. Virtual machine monitors: current technology and future trends. IEEE Computer, 38(5):39–47, May 2005.

[12] S. Srikantaiah, A. Kansal, and F. Zhao. Energy aware consolidation for cloud computing. In Proc. of the Workshop on Power Aware Computing and Systems (HotPower), San Diego, CA, Dec. 2008.

[13] C. Tang, M. Steinder, M. Spreitzer, and G. Pacifici. A scalable application placement controller for enterprise data centers. In Proc. of the International Conference on World Wide Web (WWW), Banff, Canada, May 2007.

[14] T. Wood, P. J. Shenoy, A. Venkataramani, and M. S. Yousif. Black-box and gray-box strategies for virtual machine migration. In Proc. of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), Cambridge, MA, Apr. 2007.

[15] Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/.

[16] Windows Azure. http://www.microsoft.com/windowsazure/windowsazure/.

[17] Intel research :: Single-chip Cloud Computer. http://techresearch.intel.com/ProjectDetails.aspx?Id=1.

[18] Microsoft System Center Virtual Machine Manager. http://www.microsoft.com/systemcenter/virtualmachinemanager.

[19] VMware DRS - dynamic scheduling of system resources. http://www.vmware.com/products/vi/vc/drs.html.

[20] Citrix XenServer. http://www.citrix.com/xenserver.
