Optimizing virtual machines using hybrid virtualization

Qian Lin a, Zhengwei Qi a,*, Jiewei Wu a, Yaozu Dong b, Haibing Guan a

a Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University, Shanghai, PR China
b Intel Open Source Technology Center, PR China

The Journal of Systems and Software 85 (2012) 2593–2603. doi:10.1016/j.jss.2012.05.093

Article history: Received 10 October 2011; Received in revised form 31 May 2012; Accepted 31 May 2012; Available online 9 June 2012.

Keywords: Hybrid virtualization; Hardware-assisted virtualization; Paravirtualization

* Corresponding author. Tel.: (+86) 021 34205595. E-mail address: [email protected] (Z. Qi).

Abstract

Minimizing virtualization overhead and improving the reliability of virtual machines are challenging when establishing a virtual machine cluster. Paravirtualization and hardware-assisted virtualization are the two mainstream solutions for modern system virtualization. Hardware-assisted virtualization is superior in CPU and memory virtualization and is becoming the leading solution, yet paravirtualization remains valuable in some respects, as it is capable of shortening the disposal path of I/O virtualization. Thus we propose hybrid virtualization, which runs the paravirtualized guest in the hardware-assisted virtual machine container to take advantage of both. Experimental results indicate that our hybrid solution outperforms the original paravirtualization by nearly 30% in memory intensive tests and 50% in microbenchmarks. Meanwhile, compared with the original hardware-assisted virtual machine, the hybrid guest achieves over 16% improvement in I/O intensive workloads.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

System virtualization is becoming ubiquitous in the contemporary datacenter. Consolidating physical servers by building virtual machine clusters is universally adopted to maximize the utilization of hardware resources for computing. Two fundamental but challenging requirements are to minimize virtualization overhead (Mergen et al., 2006) and to guarantee reliability when building virtualized infrastructure. Therefore, the low level design of the VM architecture is of great significance.

The conventional x86 architecture is incapable of classical trap-and-emulate virtualization, which made paravirtualization the optimal virtualization strategy in the past (Barham et al., 2003; Adams and Agesen, 2006). Recently, hardware-assisted virtualization on the x86 architecture has become a competitive alternative. Adams and Agesen (2006) compared the performance of a software-only VMM and a hardware-assisted VMM, and their statistics showed that the HVM suffered from much higher overhead than the PVM owing to frequent context switching, which had to perform an extra host/guest round trip in early HVM solutions. However, the latest hardware-assisted virtualization improvements greatly reduce this overhead. Hardware-assisted paging (Neiger et al., 2006) allows hardware to handle guest MMU operations and translate guest physical addresses to real machine addresses dynamically, accelerating memory relevant operations and improving the overall performance of the HVM.

Although hardware-assisted virtualization performs well with CPU intensive workloads, it manifests low efficiency when processing I/O events. Our experiment shows that the PVM yields up to 20% lower CPU utilization than the HVM under a 10 Gbps network workload. The interrupt controller of the HVM originates in the native environment with fast memory-mapped I/O access, but it is suboptimal in the virtual environment due to the requirement of trap-and-emulate. Frequent interrupts lead to frequent context switches and a high round trip penalty, particularly with multiple virtual machines (Menon et al., 2005).

Consequently, hardware-assisted virtualization is superior in CPU and memory virtualization, while software-only virtualization owns optimized features for I/O virtualization. In practice, the performance picture is highly workload-dependent because most real world applications mix CPU and I/O intensive tasks. Therefore, hybrid virtualization techniques (Adams and Agesen, 2006) become promising. Nevertheless, the previous Hybrid VMM prototype (Adams and Agesen, 2006) leveraged guest behavior-driven heuristics to improve performance; its performance gain heavily depended on the prediction accuracy and became marginal for modern workloads.

The contribution of this paper is a practical one. We propose a novel hybrid solution which takes the superior features of both PVM and HVM, and we implement the prototype on the Xen platform. The principal idea of our hybrid virtualization is to run the paravirtualized guest in the HVM container to reach maximum optimization. The Hybrid VM primarily features lower MMU operation latency, benefiting from the hardware-assisted paging technique, and lower interrupt disposal overhead, profiting from the paravirtualized event channel. Besides, the original hardware-assisted virtualization environment
suffers from the issue of timer synchronization, which renders different timer resources unable to keep their relative timing pace and thus threatens the timing correctness of VMs. We propose a feasible solution within the hybrid virtualization framework to solve this problem.

The rest of this paper is organized as follows. Section 2 introduces the background of virtualization as well as the software and hardware approaches with their advantages and disadvantages. Section 3 presents the hybrid virtualization architecture and design details. Section 4 analyzes in depth the key factor affecting the efficiency of system calls in the guest OS, which can be treated as a performance indicator of the VM. Section 5 discusses the issue of timer synchronization in the virtualized environment and the solution offered by hybrid virtualization. Section 6 analyzes the performance evaluation to demonstrate the performance improvements of hybrid virtualization. Section 7 summarizes related work and Section 8 concludes.

2. Pros and cons in different virtualization mechanisms

With the promotion of virtualization technology, software-only and hardware-assisted virtualization approaches display different superiorities in various fields. In this section, we first introduce the background of the two mainstream virtualization techniques, and then present the details of their respective advantages and disadvantages.

2.1. Paravirtualization

Xen (Barham et al., 2003; Clark et al., 2004) is famous for supporting paravirtualization (Whitaker et al., 2002). The Xen hypervisor is located between the physical hardware layer and the guest OS layer, as shown in Fig. 1 (Liu et al., 2006). The Xen hypervisor runs at the lowest level and owns the most privileged access to hardware. Among the various VMs, Domain0 plays an administrator role and provides service for the DomainU VMs. Domain0 also extends part of the functionality of the hypervisor. For example, Domain0 hosts back-end device drivers to manage the multi-access to devices from VMs, which use front-end device drivers and the device channel to communicate with the back-end foundation (Xen.org, 2008).

Fig. 1. Xen architecture. The Xen hypervisor manages three types of VM. Domain0 plays an administrator role and supplies service for DomainU guests, involving PVM and HVM. The front-end device drivers in DomainU communicate with the back-end drivers in Domain0 through the device channel.

The PVM guest kernel requires purposeful modifications to support efficient software-only virtualization (Barham et al., 2003). Generally, the x86 CPU privilege level is distinguished by rings, where Ring0 is the most privileged and Ring3 the least. As the hypervisor requires a higher privilege level than the VMs, the PVM guest kernel yields Ring0 to the hypervisor. Since paravirtualization does not change the application interfaces, user software can run in the Xen environment without any modification. Besides, paravirtualization uses the direct page table (DPT) (Barham et al., 2003) as its memory virtualization strategy. In order to avoid a page table switch at the time of a hypervisor/guest boundary crossing, DPT modifies the guest page table to be suitable for hardware processor usage as well as guest OS access. By modifying the guest kernel, DPT partitions the address space between guest OS and hypervisor, utilizing the segment limit check to protect the hypervisor from guest access. It reserves a certain area of address space in each guest kernel to be dedicated to hypervisor usage. Consequently, each PVM shares its page table with the hypervisor so that the hypervisor can paravirtualize the guest paging mechanism.
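To make the DPT flow concrete, the sketch below shows, in the spirit of Xen's classic PV MMU interface, how a paravirtualized kernel hands a page table update to the hypervisor instead of writing the PTE itself. The hypercall wrapper is a simplified assumption for illustration; it mirrors, but is not, the real HYPERVISOR_mmu_update binding.

```c
#include <stdint.h>

/* Simplified from Xen's public PV MMU interface: one request asks the
 * hypervisor to validate and apply a single page-table entry update. */
struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE to update */
    uint64_t val;   /* new PTE contents, validated by the hypervisor */
};

/* Hypothetical hypercall wrapper; in a real PV kernel this traps into
 * the hypervisor (cf. HYPERVISOR_mmu_update in Xen's headers). */
extern int hypervisor_mmu_update(struct mmu_update *reqs,
                                 unsigned int count,
                                 unsigned int *done,
                                 uint16_t domid);

#define DOMID_SELF 0x7FF0  /* Xen's "this domain" identifier */

/* A PV guest cannot write its (read-only, hypervisor-shared) page table
 * directly; it packages the update into a request and traps once. */
static int pv_set_pte(uint64_t pte_machine_addr, uint64_t new_val)
{
    struct mmu_update req = { .ptr = pte_machine_addr, .val = new_val };
    unsigned int done = 0;
    return hypervisor_mmu_update(&req, 1, &done, DOMID_SELF);
}
```

In practice such requests are batched so that many PTE updates cost a single boundary crossing, which is exactly the hypervisor intervention overhead the comparison below refers to.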

2.2. Hardware-assisted virtualization

Hardware-assisted virtualization simplifies the design of the virtualization management layer, i.e. the hypervisor, and enhances general performance with the help of processor virtualization. The conventional technique of dynamic binary translation was a compromise solution for system virtualization without guest OS modification. The critical issue of dynamic binary translation is its low performance efficiency and design complexity, due to the incapability of classical trap-and-emulate virtualization on previous generations of the x86 architecture. Nevertheless, modern x86 architectures with the hardware-assisted virtualization extension have fixed this trap-and-emulate virtualization hole at the architecture level, which greatly reduces the design complexity of the hypervisor. Hardware-assisted virtualization thus becomes an alternative and improved solution replacing dynamic binary translation.
Furthermore, unlike paravirtualization, hardware-assisted virtualization requires no paravirtualized modifications to the guest OS kernel to guarantee trap-and-emulate efficiency, because most of the VM state transitions are handled by hardware.

However, an unmodified guest OS using processor virtualization alone requires frequent VM traps and incurs even higher overhead, since a VM exit, the state transition from VM to hypervisor, is more costly than a fast system call (e.g., implemented as a hypercall in Xen (Barham et al., 2003)). More than 6000 CPU cycles are required to process each VM exit and its return (Barham et al., 2003). So in the early days paravirtualization dominated hardware-assisted virtualization in performance (Adams and Agesen, 2006). But as advanced hardware-assisted virtualization technology continued to be developed, with enhancements such as hardware-assisted paging (Neiger et al., 2006), the overall performance of the HVM has caught up with the PVM and currently exceeds it. Prior to hardware-assisted paging, software-only virtualization of the paging mechanism, namely the shadow page table (SPT), was widely adopted in the HVM. SPT entitles the guest OS to maintain its page table independently, i.e., the mapping from guest virtual address to guest physical address. The guest physical address is not really "physical", but pseudo. The hypervisor maintains another mapping from guest physical address to host physical address, the real machine address through which physical memory is accessed. To accelerate the memory accesses of a VM, the SPT stores the direct mapping from guest virtual address to host physical address. Alternatively, hardware-assisted paging offers an additional dimension of addressing and translates guest physical addresses to machine addresses in hardware. Whether with SPT or hardware-assisted paging, the HVM guest OS is able to use its original paging mechanism without cooperating with the hypervisor. But hardware-assisted paging helps to avoid the VM exits within page table update operations, which are the major source of virtualization overhead when SPT is in use.

2.3. Comparison between paravirtualization and hardware-assisted virtualization

Table 1 summarizes the main technical advances and defects of paravirtualization and hardware-assisted virtualization. The comparison falls into three categories: CPU, memory, and I/O virtualization.

Table 1. Comparison summary between paravirtualization and hardware-assisted virtualization.

CPU virtualization
  Paravirtualization pros: No time injection. Hypercall rather than hardware trapping.
  Paravirtualization cons: Ring compression. Kernel modification.
  Hardware-assisted pros: No modification for the guest OS. Facilitating hypervisor design.
  Hardware-assisted cons: Boundary switching between hypervisor and VM consumes remarkable extra CPU cycles.

Memory virtualization
  Paravirtualization pros: Less translation. Memory sharing.
  Paravirtualization cons: Page table updating requires hypervisor intervention. Causes considerable TLB flushes.
  Hardware-assisted pros: Guest is able to maintain its unmodified paging mechanism. Acceleration by hardware-assisted paging.
  Hardware-assisted cons: Software-only shadow page table introduces much overhead.

I/O virtualization
  Paravirtualization pros: Performance boost. Optimal for CPU utilization.
  Paravirtualization cons: Requires specialized drivers in the guest. Isolation and stability depend on implementation.
  Hardware-assisted pros: Close to native performance. Uses the original driver from the guest OS. Good isolation.
  Hardware-assisted cons: Exclusive device access. Scalability drawback with the PCI slot limitation in the system.

• CPU. Putting the guest kernel in Ring1 or Ring3 makes the PVM suffer from remarkable overheads introduced by the boundary switching within system calls and hypercalls. Hardware-assisted virtualization eliminates these overheads by putting the guest kernel back in Ring0, taking on the nature of the native OS. But its crucial weak point is the expensive VM exit overhead, even though it has been alleviated by hardware improvements.

• Memory. DPT plays an essential role in the virtualization of the paging mechanism, and it is the memory virtualization strategy unique to paravirtualization. SPT holds the great advantage of requiring no modification of the guest paging mechanism, yet its performance fails to compete with DPT. With the emergence of hardware-assisted paging, however, both its efficiency and its compatibility outweigh those of DPT and SPT: it accelerates HVM paging by simplifying the MMU address translation and reducing the number of VM exits, especially under CPU intensive workloads.

• I/O. The I/O virtualization solution can be flexible in both PVM and HVM, as virtual device sharing and direct I/O are both available. The critical difference between PVM and HVM with respect to asynchronous events is the interrupt handling mechanism. Using the event channel and virtual IRQ strategies, PVM saves CPU utilization compared with HVM, which mainly adopts the native APIC model.

In brief, it is sensible to merge the superiorities of paravirtualization and hardware-assisted virtualization for performance maximization. This is the primary motivation of our hybrid virtualization approach. Besides, reliability issues such as the legacy timer synchronization problem are also meant to be solved by the hybrid architecture.

3. Hybrid virtualization design

3.1. Overview

The performance issue with x86_64 PVM derives from the compromised architecture in which the kernel space and user space reside at the same privilege ring level (Ring3), yet use different page directories to maintain the space isolation. Consequently, when boundary switching occurs between kernel mode and user mode, the necessary TLB flushes cause overhead, and much more system call overhead is introduced as well. Thus, the primary motivation of hybrid virtualization is to eliminate these overheads by locating the guest kernel back in Ring0.

There are two probable architecture types of hybrid virtualiza-tion, termed hybrid PVM and hybrid HVM.

Fig. 2. Hybrid virtualization architecture. (a) Hybrid PVM starts from paravirtualization. It puts the paravirtualized Linux kernel back to Ring0 and introduces hardware-assisted paging (HAP) support. (b) Hybrid HVM starts from hardware-assisted virtualization. It reuses most hardware-assisted virtualization features and imports several paravirtualized components to obtain performance improvement.

3.1.1. Hybrid PVM

The hybrid PVM constructs the hybrid architecture based on PVM, as shown in Fig. 2(a). In order to eliminate the overheads due to the boundary switching between kernel space and user space, the PVM guest kernel should be moved back to Ring0. Meanwhile, some hardware-assisted virtualization features such as hardware-assisted paging should be introduced to improve the PVM performance. By the nature of the paravirtualization mechanism, the hybrid PVM guest does not need any QEMU (http://www.qemu.org/) device model, and system booting can be very quick. Nevertheless, this also contributes to its disadvantage: the hybrid PVM guest OS is incapable of native booting, i.e., the hypervisor must furnish a customized boot loader for the PVM guest.

3.1.2. Hybrid HVM

The alternative implementation of the hybrid architecture is hybrid HVM, as shown in Fig. 2(b), which enhances the current HVM with several paravirtualized components. Hybrid HVM leverages native mode and extends it to a superset. As an incremental approach, hybrid HVM can more easily reuse most hardware-assisted virtualization features, and less modification is needed compared with hybrid PVM. Linux kernel 2.6.23 and later versions merge optional paravirtualization support into the mainline, termed pv_ops. The Linux kernel can apply pv_ops to self-patch its binary code, converting sensitive instructions into non-sensitive ones for the hypervisor to do the paravirtualization work. Meanwhile, pv_ops is similar to a kernel hook, providing a paravirtualized interface to the hypervisor and facilitating changes to paravirtualization behavior. Hybrid HVM imports some Xen paravirtualization APIs and utilizes pv_ops to build up the hybrid virtualization guest.

Although there exist many differences between the hybrid PVM and hybrid HVM approaches, their ultimate goals meet the same intrinsic property: adopt the superior features of both the paravirtualization and hardware-assisted virtualization solutions in the HVM container.
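As an illustration of the pv_ops idea, the sketch below shows a cut-down operations table in C. The structure and field names are hypothetical stand-ins for the kernel's real pv_ops structures (e.g., pv_mmu_ops, pv_irq_ops); the point is only that generic kernel code calls through function pointers which a hypervisor backend can repopulate at boot.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, cut-down pv_ops-style table: each sensitive operation is
 * a function pointer the platform backend can replace at boot time. */
struct pv_ops {
    void (*halt)(void);                  /* idle/halt the CPU */
    void (*write_cr3)(uint64_t cr3);     /* load a new page table base */
    void (*flush_tlb)(void);             /* flush the local TLB */
};

/* Native backend: would execute the privileged instructions directly. */
static void native_halt(void)            { puts("hlt"); }
static void native_write_cr3(uint64_t c) { printf("mov %%cr3, %llx\n", (unsigned long long)c); }
static void native_flush_tlb(void)       { puts("reload cr3"); }

/* Paravirtual backend: would issue hypercalls instead of trapping. */
static void xen_halt(void)               { puts("hypercall: sched_op(yield)"); }
static void xen_write_cr3(uint64_t c)    { printf("hypercall: mmu_update(%llx)\n", (unsigned long long)c); }
static void xen_flush_tlb(void)          { puts("hypercall: mmuext_op(TLB_FLUSH)"); }

static struct pv_ops ops;  /* patched once, early at boot */

int main(void)
{
    int running_on_xen = 1;  /* in a kernel this comes from CPUID detection */
    ops = running_on_xen
        ? (struct pv_ops){ xen_halt, xen_write_cr3, xen_flush_tlb }
        : (struct pv_ops){ native_halt, native_write_cr3, native_flush_tlb };

    ops.write_cr3(0x1000);  /* generic code never cares which backend runs */
    ops.flush_tlb();
    ops.halt();
    return 0;
}
```

The real kernel additionally rewrites the hottest call sites into direct instructions at boot, so the indirection costs nothing on native hardware.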

3.2. Xen modification

Hybrid PVM and hybrid HVM are both feasible implementations of hybrid virtualization, and they can eventually reach the same point. But the latter is more natural and practical, given the hardware-assisted virtualization inheritance and the smaller code modification to Linux. Therefore, our hybrid extension starts from HVM, importing a component based paravirtualization feature selection such as the paravirtualized halt, paravirtualized timer, event channel and paravirtualized drivers. Consequently, a guest with the hybrid extension feature can take advantage of both paravirtualization and hardware-assisted virtualization.

3.2.1. Hardware-assisted paging

Hardware-assisted paging (e.g., Intel Extended Page Table technology and AMD Nested Page Table technology) plays a vital accelerating role in hardware-assisted virtualization (Bhargava et al., 2008). Without hardware-assisted paging support, the HVM utilizes the shadow page table (Barham et al., 2003; Adams and Agesen, 2006) and shadow TLBs for the accuracy of guest memory mapping and access. The crucial defect of the shadow series strategy is that each MMU address translation needs to be trapped into the hypervisor (such behavior is called a VM exit) and travel another long execution path to fetch the real address. During such a procedure, an inevitable and considerable round trip overhead is introduced, which takes more than 10 times the CPU cycles of a native MMU address translation. But with hardware-assisted paging support, the shadow series can be discarded and the MMU address translation in the HVM avoids triggering a great number of VM exits.
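The trade-off is that a TLB miss itself becomes more expensive: under nested paging, every guest page table level must in turn be translated through the host tables. A standard worst-case reference count for four-level guest and host tables (cf. Bhargava et al., 2008) is:

```latex
% Worst-case memory references for one TLB miss under nested paging:
% each of the n_g guest levels, plus the final guest physical access,
% must itself walk the n_h host levels.
N_{2D} = n_g n_h + n_g + n_h = 4 \cdot 4 + 4 + 4 = 24
\quad\text{vs.}\quad
N_{\mathrm{native}} = n_g = 4 .
```

Page-walk caches amortize much of this in practice, which is why eliminating the VM exits still wins overall.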

3.2.2. Interrupt disposal changes

Redundant VM exits also exist in the interrupt disposal of the HVM, yet no hardware-assisted solution can currently fix them. This can act as the bottleneck of I/O efficiency. Fortunately, paravirtualization owns a sound strategy in this situation, which offers the opportunity to import it into our hybrid solution. Event channel and QEMU device support are enabled for the hybrid guest. Each QEMU emulated I/O APIC pin is mapped to a virtual interrupt request, so that one virtual interrupt request instead of an I/O APIC interrupt will
be delivered to the guest if the device asserts the pin. The event channel is a signaling mechanism for inter-domain communication. One domain can send a signal to another domain through the event channel, and each domain can receive signals by registering an event channel handler. Besides, the disposal path of Message Signaled Interrupts and their extension (MSI/MSI-X), which conventionally relies on the local APIC, is also altered for optimization. The hybrid solution paravirtualizes the MSI/MSI-X handling so that MSI/MSI-X do not cause VM exits.
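To make the signaling pattern concrete, the sketch below mimics the event channel flow with a deliberately simplified, hypothetical API (evtchn_set_handler, evtchn_notify and evtchn_dispatch are our stand-ins; the real Xen interface is a family of EVTCHNOP_* hypercalls plus a shared pending-bit bitmap). The point is the shape of the path: a sender sets a pending bit and the receiver's registered handler runs, with no emulated APIC access to trap on.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical simplified event-channel API for illustration only. */
typedef uint32_t evtchn_port_t;
typedef void (*evtchn_handler_t)(evtchn_port_t port);

#define MAX_PORTS 64
static evtchn_handler_t handlers[MAX_PORTS];
static uint64_t pending;                 /* one pending bit per port */

/* Receiver side: register a handler for a port bound to a remote domain. */
static void evtchn_set_handler(evtchn_port_t port, evtchn_handler_t fn)
{
    handlers[port] = fn;
}

/* Sender side: mark the port pending; in Xen this is a hypercall that
 * sets a bit in shared memory and kicks the target VCPU. */
static void evtchn_notify(evtchn_port_t port)
{
    pending |= 1ULL << port;
}

/* Dispatch loop: what the guest upcall does on resume, instead of an
 * emulated APIC acknowledge cycle that would cost VM exits. */
static void evtchn_dispatch(void)
{
    while (pending) {
        evtchn_port_t port = (evtchn_port_t)__builtin_ctzll(pending);
        pending &= pending - 1;          /* clear lowest pending bit */
        if (handlers[port])
            handlers[port](port);
    }
}

static void net_rx_handler(evtchn_port_t port)
{
    printf("virtual IRQ on port %u: drain receive ring\n", port);
}

int main(void)
{
    evtchn_set_handler(7, net_rx_handler);  /* port 7: assumed binding */
    evtchn_notify(7);                       /* back-end signals front-end */
    evtchn_dispatch();
    return 0;
}
```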

3.3. Implementation

The current hybrid virtualization approach supports x86 uniprocessor guests and x86_64 uniprocessor and symmetric multiprocessor guests with MSI/MSI-X support. A single binary with hybrid feature support can run in PVM, HVM, Hybrid VM and native environments. The user can turn on the hybrid virtualization support for a DomainU by setting a special Hybrid Feature CPUID entry in the VM configuration file. As soon as the Hybrid Feature CPUID is identified during DomainU creation, the HVMOP_enable_hybrid hypercall is triggered to invoke the hybrid capability of the hypervisor, i.e., a hybrid virtualized guest is built up.

Our hybrid virtualization approach is built on the vanilla Xen 3.4.1 and the guest Linux kernel 2.6.30. The modifications are added to both the Xen hypervisor and the guest kernel, with 267 and 1003 source lines of code, respectively. We also pack the design changes as patches covering the hypervisor part and the guest kernel part. These patches have been released to the Xen open source community (http://old-list-archives.xen.org/archives/html/xen-devel/2010-3/msg00634.html).
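For concreteness, here is a guest-side sketch of the detection step: the conventional hypervisor CPUID leaf 0x40000000 returns the "XenVMMXenVMM" signature when running under Xen. The enabling call shown in the comment is a hypothetical stand-in for the HVMOP_enable_hybrid hypercall named above, not its actual invocation code.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Query a CPUID leaf (x86/x86_64, GCC/Clang inline asm). */
static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                  uint32_t *c, uint32_t *d)
{
    __asm__ volatile("cpuid"
                     : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                     : "a"(leaf), "c"(0));
}

/* The hypervisor CPUID range starts at 0x40000000; Xen fills
 * EBX/ECX/EDX with the ASCII signature "XenVMMXenVMM". */
static int running_on_xen(void)
{
    uint32_t a, b, c, d;
    char sig[13];

    cpuid(0x40000000, &a, &b, &c, &d);
    memcpy(sig + 0, &b, 4);
    memcpy(sig + 4, &c, 4);
    memcpy(sig + 8, &d, 4);
    sig[12] = '\0';
    return strcmp(sig, "XenVMMXenVMM") == 0;
}

int main(void)
{
    if (running_on_xen()) {
        /* A hybrid-capable kernel would now issue the enabling hypercall,
         * e.g. the paper's HVMOP_enable_hybrid (stand-in comment only). */
        puts("Xen detected: would invoke the hybrid-enable hypercall");
    } else {
        puts("bare metal: keep native code paths");
    }
    return 0;
}
```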

4. Efficiency of system call

The key point in paravirtualized CPU control is ring compression (also called de-privileging), putting the guest kernel at an upper ring level (less privileged). This idea actually furnished a boosted software-only virtualization solution in the past. Nevertheless, things have changed with the hardware-assisted virtualization improvements. Paravirtualization now suffers from many bottlenecks in the virtualization world, one of which is the bad performance of system calls caused by the legacy ring compression and isolation considerations. Both 32-bit and 64-bit PVM guests are subject to this issue, while the latter is more serious.

Although the 64-bit architecture (e.g., x86_64) still has four privilege rings, only Ring0 and Ring3 are currently available to separate the system kernel and user applications. Thus, in the 64-bit PVM case, the kernel space and user space have to stay in the same privilege ring (Ring3) while the hypervisor occupies Ring0. The outer line in Fig. 3 illustrates the system call path in the 64-bit PVM. When a system call occurs, it is first trapped from the guest user space into the host, and then the hypervisor injects the system call event into the guest kernel; once the guest kernel completes the service, the execution flow jumps into the hypervisor again and finally returns to the guest user space. Such hypervisor intervention introduces the considerable overhead of those round trips within the system call. Additionally, the code path overhead should also be taken into account. The same bouncing mechanism is employed when handling exceptions such as page faults (Nakajima and Mallick, 2007), which are also first intercepted by the hypervisor even if generated purely by user processes.

Fig. 3. The paths of system call. The outer line indicates the system call execution path in the 64-bit PVM, whose kernel lies in Ring3. Similarly, the inner line demonstrates that in the Hybrid VM, which puts the kernel in Ring0. The sequences of boundary switches are numbered in both cases.

Meanwhile, extra TLB flushes within this procedure further the slowdown. Normally, the transition between user and kernel space expects no TLB flush except when switching to a process with another page table, i.e., a system call usually needs no TLB flush. But in the 64-bit PVM case, guest user space and kernel space are both in Ring3 and have to be separated from each other, so they do not share the same page table as they normally would. The hypervisor is located in the high memory region and marks all its pages as global. Generally, the principal event leading to a TLB flush is an address space switch. Therefore, when the guest executes a system call, it is trapped by the hypervisor and then the hypervisor injects the system call event into the guest kernel, which requires a TLB flush due to the different page table; when execution comes back to user space, another TLB flush is needed.

In hybrid virtualization, the paravirtualized guest runs in the HVM container so that the guest kernel can be back in Ring0. Thus, a guest system call avoids hypervisor intervention and follows a curtailed code path, as demonstrated by the inner line of Fig. 3. The system call just bounces within the DomainU, so the TLB can be maintained. In the meantime, with hardware-assisted paging acceleration, the overhead of secure page table modification disappears. Hence, hybrid virtualization shows close-to-native performance with memory intensive workloads, going much beyond pure paravirtualization.
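The round trip penalty described above is straightforward to observe; the following harness (our sketch, not the paper's benchmark code) times a tight loop of a trivial system call. Under a 64-bit PVM each iteration pays the two hypervisor bounces and TLB flushes of Fig. 3, whereas in the Hybrid VM or natively it is a plain Ring3/Ring0 round trip.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ITERS 1000000L

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        syscall(SYS_getpid);  /* trivial syscall: cost is the boundary crossing */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* syscall(SYS_getpid) bypasses glibc's pid caching, so every iteration
     * really enters the kernel (and, on a 64-bit PVM, the hypervisor). */
    printf("getpid: %.1f ns per call\n", ns / ITERS);
    return 0;
}
```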

5. Timer synchronization among high volume VM cluster

Modern computer systems use a variety of timer resources to count clock ticks. The PIT, TSC and HPET are most often applied for time keeping in commodity OSes such as Windows and Linux. Although these diverse timer resources may tick at different frequencies or trigger interrupts at different intervals, they all walk forward at a fixed, related pace reflecting the external time elapsed. For example, an OS may rely on either the TSC or the HPET as its timing base. A 2 GHz TSC ticks 200 million times in a 100 ms real time interval, but a 10 MHz HPET only needs to tick 1 million times to cover the same duration. Similarly, an interrupt based interval timer such as the PIT has to trigger 100 interrupts in the same period when programmed at a 1 kHz interrupt frequency.
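The equivalence of these timebases is a simple units check; for the 100 ms example above:

```latex
\Delta t = \frac{\text{ticks}}{f}:\qquad
\frac{2\times 10^{8}}{2\ \text{GHz}}
= \frac{10^{6}}{10\ \text{MHz}}
= \frac{100\ \text{interrupts}}{1\ \text{kHz}}
= 100\ \text{ms}.
```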

None of these timer resources, however, is absolutely reliable, while timing is essential for operating systems. Consequently, through cross-referencing timer resources, an OS such as Linux is capable of correcting potential time drift due to software or hardware turbulence. For example, since the PIT uses a crystal which is more precise than the TSC oscillator, Linux uses the PIT to calibrate the TSC frequency.
However, owing to the chance of interrupt delay or miss, it is probable that two PIT interrupts are merged if the previous one has not been serviced yet when a new one comes. Therefore, some versions of Linux conversely use the TSC to detect a potential missed PIT interrupt, namely when the time between two PIT interrupts exceeds a tick period.

Fig. 4. Chaos of timer resources.

In other words, all timer resources in the real world are synchronized to indicate the external time elapsed. Virtualization, however, breaks this relationship, so the conventional HVM suffers from a timer synchronization issue. Each timer resource in the virtualization environment is emulated by the hypervisor independently. The guest TSC is pinned to the host TSC with a constant interval offset. PIT interrupts may be stacked because of VCPU scheduling. For example, when a VCPU is switched out for 30 ms (Fig. 4a), the hypervisor may inject 30 interrupts (if the PIT is programmed at 1 kHz) immediately to reflect the elapsed time when the VCPU is switched back (Fig. 4b–d). If the PIT interrupt service routine in the guest OS refers the PIT interrupt count (e.g., jiffies in Linux) to its TSC, it will immediately sense tremendous lost interrupts and pick them up, since the guest TSC has already advanced but jiffies is still staying at the value from when the VCPU was switched out.

However, the hypervisor does not know whether the guest will pick up the lost interrupts, and is consequently stuck in a dilemma over whether it should inject the entire backlog of missing PIT interrupts. An OS such as Linux calculates lost ticks on each clock interrupt according to the current TSC and the TSC of the last PIT interrupt, and then adds the lost ticks to jiffies in order to fix the inaccuracy (Fig. 4 (1–3)); but the hypervisor also injects the lost ticks into the guest (Fig. 4b–d). Consequently, the redundant compensation accounting causes the chaos of guest timer resources, as shown in Fig. 4d, (4). Furthermore, it is not wise to just depend on the guests themselves without the hypervisor's tick compensation, because other guest OSes may not support the self-compensation ability, such as Microsoft Windows and old Linux kernels (e.g., earlier than 2.6.16).

Although temporarily drifting the TSC within the PIT interrupt delivery may solve the problem to a certain degree, it introduces another problem for SMP guests. Each VCPU in an SMP guest has its own TSC, which in the real world is synchronized with the others as well as with other timers such as the PIT and HPET. Given that each VCPU is scheduled independently in Xen, if all timer resources were kept synchronized, a single VCPU whose TSC is blocked due to being scheduled out would block the platform timer resources as well. Consequently, the TSCs of the other VCPUs would be frozen for the sake of synchronization. In other words, such forcible timer synchronization paradoxically prohibits VCPUs from being scheduled.

In order to achieve a comprehensive solution for timer synchronization, the hybrid approach in this paper modifies the HVM by importing the paravirtualized timer component to establish uniform timer management (UTM). All guest timer resources are paravirtualized and redirected to the hypervisor aware field. The hypervisor does the whole synchronization work and prepares accurate time values for the guests, which are entitled to fetch them via shared memory. Hence, UTM eliminates the legacy time drift and guarantees precise timer synchronization. In the meantime, UTM also saves a great amount of unnecessary interrupt injection in the hypervisor and tick counting in the guest.
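A minimal sketch of such a shared-memory time fetch, modeled in the spirit of Xen's paravirtualized clock: the hypervisor publishes a version-stamped record and the guest retries until it reads a stable snapshot. The structure layout and field names here are illustrative assumptions, not the exact UTM format.

```c
#include <stdint.h>

/* Illustrative shared time record, in the spirit of Xen's vcpu_time_info:
 * the hypervisor bumps `version` (odd while updating); the guest retries
 * until it observes a stable, even version. */
struct shared_time {
    volatile uint32_t version;
    uint64_t tsc_stamp;       /* host TSC when system_time_ns was written */
    uint64_t system_time_ns;  /* synchronized time at tsc_stamp */
    uint32_t tsc_to_ns_mul;   /* fixed-point scale: ns = (tsc * mul) >> 32 */
};

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Guest-side read: no hypercall, no interrupt, no VM exit. */
static uint64_t read_time_ns(const struct shared_time *st)
{
    uint32_t v;
    uint64_t time, delta;

    do {
        v = st->version;                   /* snapshot the version */
        __asm__ volatile("" ::: "memory"); /* compiler barrier */
        delta = rdtsc() - st->tsc_stamp;
        time  = st->system_time_ns + ((delta * st->tsc_to_ns_mul) >> 32);
        __asm__ volatile("" ::: "memory");
    } while (v != st->version || (v & 1)); /* retry if torn or in-flight */

    return time;
}
```

Because every guest timer is derived from the one record the hypervisor keeps coherent, the cross-timer drift and double compensation described above cannot arise.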

Apart from the traditional tick based kernel, Linux can be configured as a tickless kernel (http://kerneltrap.org/node/6750) by setting CONFIG_NO_HZ in the kernel compilation options. The feature of the tickless kernel is replacing the old periodic timer interrupts with "on-demand" interrupts, whose timers are reprogrammed to calculate the time interval via per-CPU high resolution timers. Consequently, unlike the conventional mechanism of an OS heartbeat driven by a periodic tick, the tickless kernel allows idle CPUs to remain idle until a new task is queued for processing. However, similar to the issue of losing periodic timer interrupts due to delayed VM scheduling, the tickless kernel may also suffer from losing asynchronous interrupts in the virtualized environment. Unlike the strategy of cross-referencing among timers in a tick based kernel, the tickless kernel uses a "one shot" way to generate asynchronous interrupts and is not capable of detecting the interrupt losing problem.


Consequently, by the nature of the tickless kernel, no interrupt compensation is available if a VM is not scheduled in time, resulting in the problem of losing asynchronous interrupts. To address this issue, the hypervisor is able to detect and collect the missing interrupts of a VM and perform the compensation. Although this is not included in the current implementation of hybrid virtualization in this paper, it is feasible to add this feature in a further improvement of our hybrid virtualization prototype.

6. Evaluation

All experiments were conducted on a platform configured with a 3.20 GHz Intel Core i7-965 processor. The hypervisor was Xen 3.4.1. The host (Domain0) and the guest VMs ran CentOS 5.5 for x86_64. The native environment and the DomainU guests, including PVM, HVM and hybrid guests, shared the same kernel binary of Linux 2.6.30 with the hybrid virtualization extension, whereas Domain0 adopted the XenLinux 2.6.18 kernel. All the OSes were set with the same clock source (time stamp counter, TSC), the same kernel tick frequency (250 Hz) and the same memory size (2 GB).

6.1. Overall performance

OS basic operations under a variety of conditions can be a true reflection of overall system performance. UnixBench (http://code.google.com/p/byte-unixbench/) is applied to evaluate the overall performance of the Hybrid VM, HVM and PVM. The test suites of UnixBench are for local operation, not covering network performance. Fig. 5 illustrates the benchmark results of running UnixBench as a single process, with all data normalized to the non-virtualized (i.e., native) performance; higher is better. Due to the overhead introduced by virtualization, most of the data in Fig. 5 show the test programs performing worse in the virtual machines than natively. Most performance results of the Hybrid VM are very close to those of the HVM, and the gap is basically within the measurement error (less than 2%), except for the pipe-based context switching test. On the ground of the optimized system call path and memory virtualization, the Hybrid VM demonstrates remarkable performance improvement over PVM.

Fig. 5. UnixBench performance comparison among Hybrid VM, HVM and PVM. Higher is better.

The Hybrid VM scoring 7% lower than the HVM in the pipe-based context switching test is due to a slight drawback of hybrid virtualization when handling reads and writes of a virtual block device in memory, e.g. a pipe. In the hybrid solution in this paper, the performance of reads and writes against a disk block device (except DMA) is enhanced due to the changed strategy of I/O handling based on virtual interrupts. However, for I/O with a virtual block device in memory, the read() and write() system calls also trigger virtual interrupt requests, which are later treated as invalid since the destination medium is memory. Such invalid virtual interrupts are inevitable because the kernel does not know the medium type of the destination before executing read() and write(), until the later translation of the path via the file system. The overhead resulting from such invalid virtual interrupts, albeit extremely slight, is magnified by highly frequent iterations of access to a virtual block device in memory. The pipe-based context switching test measures the number of times two processes can exchange an increasing integer through a pipe. This simple test triggers tremendous reads and writes against the pipe, which exists as a kind of virtual block device in memory, leading to the performance drop of the Hybrid VM caused by invalid virtual interrupt handling.
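For reference, the core of the pipe-based context switching test can be reproduced in a few lines of C (our sketch of the benchmark pattern, not the UnixBench source): two processes bounce an increasing integer through a pair of pipes, so every iteration costs two pipe reads, two writes and two context switches, exactly the access pattern that triggers the invalid virtual interrupts discussed above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];           /* parent->child and child->parent pipes */
    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    if (fork() == 0) {            /* child: echo the counter back, +1 */
        long v;
        while (read(p2c[0], &v, sizeof v) == sizeof v) {
            v++;
            if (write(c2p[1], &v, sizeof v) != sizeof v) break;
        }
        _exit(0);
    }

    long v = 0;
    for (int i = 0; i < ROUNDS; i++) {   /* parent: send, wait, repeat */
        if (write(p2c[1], &v, sizeof v) != sizeof v) break;
        if (read(c2p[0], &v, sizeof v) != sizeof v) break;
    }
    close(p2c[1]);                /* EOF lets the child exit */
    printf("final counter: %ld\n", v);
    return 0;
}
```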

Additionally, the Shell script test should be treated as a special case, in which the performance of the HVM and Hybrid VM exceeds the native environment by more than 200%. The Shell script test program of UnixBench counts the number of execution loops of a certain Shell script within one minute. The function of this Shell script is a series of character conversions in a data file. The Xen virtual machine is optimized for disk operations by leveraging the file system cache mechanism. Consequently, read/write operations against the virtual disk image hold the superiority of lazy access to the physical disk, especially accompanied by operations with high locality such as the
UnixBench Shell script test program. Although PVM also benefits from such optimization, the overhead caused by its memory operations using the DPT technique drags down its overall performance, keeping it below that of the native environment.

6.2. Microbenchmarks

LMBench is used to evaluate the system call performance in different execution environments. All the experimental objects, involving the native machine and the VMs, are equipped with 2 CPUs. In particular, each virtualized CPU is uniquely pinned to a physical CPU in order to get rid of turbulence brought by the VCPU scheduling of the hypervisor. All benchmark running times are normalized to native execution. Figs. 6–8 present the measurement results, with a higher percentage being better. The benchmarks exhibit a low runtime variability, i.e., a standard deviation within 2.05%.

Fig. 6. LMBench – processor benchmarks. They highlight the overhead contributed by fundamental facilities of the OS like system call and signal dispatch. Higher is better.

The majority of these bars illustrate a notable performance improvement with the hybrid virtualization. The traditional direct page table used by PVM needs more hypercalls when guest page faults occur and thus introduces more TLB flushes and performance regression. The hybrid approach wipes out the costly trampoline in PVM and obtains general performance enhancements. Overall, most advancements of the Hybrid VM are brought by hardware-assisted paging, and also by avoiding unnecessarily frequent boundary switches between Ring0 and Ring3 in x86_64 mode, e.g., when executing system calls.

We witness a performance gain of about 30% on average on benchmarks that highlight the overhead of process creation (like fork and exec) in Fig. 6 and of context switching in Fig. 7. Besides, Fig. 8 indicates that hardware-assisted paging also benefits local communication bandwidth.

Fig. 7. LMBench – context switch benchmarks. They focus on the overhead added to process context switches. "2p/16K" stands for 2 processes handling data of 16 kilobytes. Higher is better.

Fig. 8. LMBench – local communication latency benchmarks. They measure the response time of I/O requests. Higher is better.

Fig. 7 presents the comparison of the micro performance of context switching between the Hybrid VM and PVM. Note that "np/mK" stands for n processes handling m KB of data in parallel. In the PVM case, the execution of a context switch from user mode to kernel mode is similar to that of a system call. Context switching in PVM causes multiple TLB flushes due to the intervention of the hypervisor, which makes user space and kernel space use different page tables to be distinguished from each other. A similar process also happens in context switches caused by interrupts. In the case of process switching, a TLB flush is inevitable because the page directory requires refreshing. But in the PVM case, one TLB flush is generated when a user-mode process requests process switching via a system call, and another TLB flush is generated when the kernel-mode service completes the response. Although the context switch carrying the semantics of process switching is merged with that of the system call return, the total number of TLB flushes in PVM is still one more than in the non-virtualized environment. Because the Hybrid VM inherits the HVM container, the execution of the guest OS running inside the Hybrid VM is identical to that in the non-virtualized environment. Therefore, no extra TLB flush is introduced by the exchange of privilege levels, so the micro performance of context switching in the Hybrid VM goes beyond that in PVM.

The best context switching result of the Hybrid VM reaches about 70% of the performance of the non-virtualized environment. This performance gap is mainly due to the overhead of address translation within the virtualization mechanism. In the non-virtualized environment, a virtual address of the program needs only one translation by the MMU. But in the virtualized environment, a virtual address in the guest OS needs one translation through the virtualized page tables (i.e., SPT, DPT or EPT/NPT) and another translation through the MMU. Although the hardware-assisted paging strategy brings the memory virtualization performance bottleneck to a minimum, its inherent overhead is inevitable compared with the non-virtualized environment. Furthermore, the execution of a context switch itself requires few CPU cycles. Consequently, the slight overhead of accessing memory in the Hybrid VM becomes the major portion of the overall overhead within the procedure of context switching. Such an execution style can be treated as an exception, since most operations in the system require significantly more CPU cycles than address translation.

Several results, e.g., sig hndl in Fig. 6 and TCP in Fig. 8, indicate that the PVM of XenLinux goes beyond the hybrid one.


XenLinux is specially optimized for Domain0 usage, which limits its representativeness for a generic VM guest. Therefore, such individual comparisons are not adequate to deduce the prominence of the XenLinux PVM guest.

6.3. CPU intensive workloads

CPU intensive performance is evaluated via KCBench, as illustrated in Fig. 9. Vanilla Linux 2.6.25 kernel build times are measured on each object with 1, 2, 4 and 8 CPUs respectively, with the number of compilation threads set equal to the CPU count. The performance of the corresponding measurement in the native environment is used as the normalization reference.

Fig. 9. Kernel compile benchmark (KCBench). All measurement results are normalized to native execution performance; higher is better. The 8 cpus data benefit from Intel Hyper-Threading technology.

KCBench is mainly a memory manipulation intensive test which mostly exercises the MMU. Taking advantage of hardware-assisted paging acceleration, the efficiency of memory access and address translation in the HVM outweighs that of the paravirtualization solution adopting the direct page table. Having inherited this superiority from the pure HVM, the hybrid solution achieves equivalent performance on this CPU intensive workload. KCBench also frequently issues various types of system calls to process the kernel building tasks. The results of this CPU intensive workload verify the theoretical analysis in Section 4.

6.4. I/O efficiency

The I/O efficiency experiment measures the CPU utilization when processing I/O intensive workloads, rather than the bandwidth capability. As a typical I/O intensive workload is accompanied by network communication, different NIC bandwidth configurations are used to evaluate the I/O efficiency of the Hybrid VM versus the pure HVM. Netperf is utilized to generate the network data and configured to saturate the available bandwidth. The guests are set to adopt a VT-d pass-through NIC rather than the Xen virtual network so that the comparison is fairer. Note that the experiment focuses on the measurement and comparison of CPU utilization, while the network throughput of both types of VM is equivalent.

Table 2 records the utilization (in percent) of the virtualized CPUs when running Netperf against the Hybrid VM and the pure HVM; lower is better.

Table 2. I/O efficiency with Ethernet workload. Each guest is configured with 4 virtualized CPUs and the data reflect their total utilization. Lower is better.

NIC bandwidth    Pure HVM    Hybrid VM
10 Mbps          4.0%        3.0%
100 Mbps         9.7%        7.1%
1 Gbps           21.2%       17.3%
10 Gbps          79.0%       62.0%

The Hybrid VM manifests better I/O efficiency than the pure HVM due to the reduced number of VM exits. Using the event channel and virtual interrupt request mechanism rather than the emulated APIC, the Hybrid VM saves more than 60% of the CPU cycles compared with the pure HVM guest when handling a single interrupt. The CPU utilization results depend on the interrupt density. In the 10 Gigabit Ethernet workload case, about 8000 interrupts per second per virtualized CPU are triggered, where the end of interrupt and MSI/MSI-X handling take up more than 60% of the interrupt processing. As a result, compared with the pure HVM, the Hybrid VM is capable of saving about 3–4% CPU utilization on each 3.20 GHz processor core, i.e., the total CPU utilization can be reduced by 12–16% for a guest with 4 virtualized CPUs. Table 2 shows that the gap between the Hybrid VM and the pure HVM is magnified with increasing interrupt workload, especially under interrupt saturation.
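As a rough consistency check on these numbers (our arithmetic, from the figures quoted above), saving 3–4% of a 3.20 GHz core over 8000 interrupts per second corresponds to:

```latex
\frac{0.03 \ldots 0.04 \times 3.2\times 10^{9}\ \text{cycles/s}}
     {8000\ \text{interrupts/s}}
\approx 1.2\times 10^{4} \ldots 1.6\times 10^{4}\ \text{cycles saved per interrupt}.
```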

7. Related work

To enhance the performance of applications running in virtualized environments, previous research has cut down the overhead by various means of optimization.

7.1. Memory virtualization optimization

Data copying and frequent remapping incur a high performance penalty. Menon et al. (2005) implemented a system-wide statistical profiling toolkit for the Xen virtual machine environment, and analyzed each domain's overhead for network applications. Their experiments showed that Xen Domain0 degraded throughput because of its high TLB miss rate, while the guest domain suffered the instruction cost of the hypervisor and the driver domain. To avoid these costly overheads, a copy in the TCP packet receive path was implemented to replace the data transfer mechanism between guests and driver domains. Santos's (2008) work reduced the inter-domain round trip latency by sharing a static circular memory buffer between domains instead of the Xen page-flipping mechanism, and sped up the second data copy between domains by using hardware support in modern NICs to copy data directly into guest memory.

Our hybrid solution saves the CPU cycles of MMU operations with the help of hardware-assisted paging support. Adams and Agesen (2006) argued that nested paging hardware should easily repay the cost of slower TLB misses, as the process of filling the TLB is more complicated than in typical virtual memory systems. Our evaluation indicates that the hybrid solution stands out in the overall performance of memory operations compared with paravirtualization. Implemented with the direct page table, PVM suffers the expensive overhead of TLB flushes each time it returns from the guest kernel to an application, for hypervisor address space protection.

7.2. I/O virtualization improvement

I/O performance is a popular research issue in the recent virtualization world (Raj and Schwan, 2007; Liu et al., 2006; Eiraku et al., 2009). For reliability, security sensitive instructions must be trapped and handled by the VMM. Frequent interrupts affect system efficiency seriously, as context switching is the primary cause of overhead. Sugerman et al. (2001) analyzed the major overhead of virtual networking with the VMware VMM and described a mechanism to reduce the number of "world switches" to improve network performance. King et al. (2003) optimized the performance
of Type-II virtual machines by avoiding context switches. Their implementation can support multiple address spaces within a single process by using different segment bounds. Our hybrid solution simplifies the interrupt handling process by employing the event channel, which can transfer interrupt information from the guest domain to Xen Domain0 directly.

High end network devices introduce new techniques, including scatter-gather DMA, TCP checksum offloading, and TCP segmentation offloading, which move functionality from the host to the NIC (Menon and Zwaenepoel, 2008). Establishing selective functionality on a programmable network device to provide higher throughput and lower latency for virtual network interfaces can also enhance system performance (Raj and Schwan, 2007). We believe that the hybrid solution would achieve better performance if we developed new functions for the virtual network interface to take advantage of these new techniques.

7.3. High performance computing virtual machines

Virtualization technology has been gaining acceptance in the scientific community due to its overall flexibility in running HPC applications (Tikotekar et al., 2008; Mergen et al., 2006). While extensive research has targeted the optimization of virtualization architecture and devices for HPC (Raj and Schwan, 2007; Liu et al., 2006), some studies focus on performance analysis by investigating the behavior and identifying the patterns of various overheads for HPC benchmark applications (Tikotekar et al., 2008). The problem of predicting performance for applications is difficult, and becomes even more difficult in virtual environments due to their complexity. One of the few tools that can be used as a system wide profiler on Xen is Xenoprof (Menon et al., 2005). Our work also utilizes Xenoprof for the experimental data analysis.

Besides, some studies have raised the question of whether novel virtual machine usage scenarios could lead to a high flexibility versus performance trade-off. Tikotekar et al.'s (2009) study showed that different VM configurations can exert diverse performance impacts on HPC virtual machines. Our future work intends to develop utilities well suited to hybrid virtualization applications, as well as to construct an optimal VM configuration for hybrid virtualized HPC virtual machines.

8. Conclusion

The HVM benefits from hardware-assisted paging, so it shows outstanding performance with memory intensive workloads, whereas the PVM is more efficient for I/O intensive applications. Hybrid virtualization is capable of merging the superiorities of paravirtualization and hardware-assisted virtualization. Our hybrid approach reuses most of the hardware-assisted virtualization features and imports several paravirtualized components into the Hybrid VM. The experiment results demonstrate that the overall performance of our hybrid solution is much superior to PVM and close to that of the pure HVM, while the I/O efficiency of the Hybrid VM outweighs that of the pure HVM.

Acknowledgements

This work is supported by the Program for PCSIRT and NCET of MOE, the National Natural Science Foundation of China (Grant No. 61073151), the 863 Program (Grant Nos. 2011AA01A202 and 2012AA010905), the 973 Program (Grant No. 2012CB723401), the International Cooperation Program of China (Grant No. 2011DFA10850), and the Ministry of Education and Intel joint research foundation (Grant No. MOE-INTEL-11-05).


References

Adams, K., Agesen, O., 2006. A comparison of software and hardware techniques for x86 virtualization. In: ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, USA, pp. 2–13, doi:http://doi.acm.org/10.1145/1168857.1168860.

Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A., 2003. Xen and the art of virtualization. In: SOSP '03: Proceedings of the 19th ACM Symposium on Operating Systems Principles. ACM, New York, NY, USA, pp. 164–177, doi:http://doi.acm.org/10.1145/945445.945462.

Bhargava, R., Serebrin, B., Spadini, F., Manne, S., 2008. Accelerating two-dimensional page walks for virtualized systems. In: ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, USA, pp. 26–35, doi:http://doi.acm.org/10.1145/1346281.1346286.

Clark, B., Deshane, T., Dow, E., Evanchik, S., Finlayson, M., Herne, J., Matthews, J.N., 2004. Xen and the art of repeated research. In: ATC '04: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.

Eiraku, H., Shinjo, Y., Pu, C., Koh, Y., Kato, K., 2009. Fast networking with socket-outsourcing in hosted virtual machine environments. In: SAC, pp. 310–317.

King, S.T., Dunlap, G.W., Chen, P.M., 2003. Operating system support for virtual machines. In: ATC '03: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA, pp. 71–84.

Liu, J., Huang, W., Abali, B., Panda, D.K., 2006. High performance VMM-bypass I/O in virtual machines. In: ATC '06: Proceedings of the Annual Conference on USENIX '06 Annual Technical Conference. USENIX Association, Berkeley, CA, USA.

Menon, A., Zwaenepoel, W., 2008. Optimizing TCP receive performance. In: ATC '08: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA, pp. 85–98.

Menon, A., Santos, J.R., Turner, Y., Janakiraman, G.J., Zwaenepoel, W., 2005. Diagnosing performance overheads in the Xen virtual machine environment. In: VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments. ACM, New York, NY, USA, pp. 13–23, doi:http://doi.acm.org/10.1145/1064979.1064984.

Mergen, M.F., Uhlig, V., Krieger, O., Xenidis, J., 2006. Virtualization for high-performance computing. ACM SIGOPS Operating Systems Review 40 (2), 8–11, doi:http://doi.acm.org/10.1145/1131322.1131328.

Nakajima, J., Mallick, A.K., 2007. Hybrid-virtualization – enhanced virtualization for Linux. In: Proceedings of the Linux Symposium, pp. 87–96.

Neiger, G., Santoni, A., Leung, F., Rodgers, D., Uhlig, R., 2006. Intel virtualization technology: hardware support for efficient processor virtualization. Intel Technology Journal 10 (3).

Raj, H., Schwan, K., 2007. High performance and scalable I/O virtualization via self-virtualized devices. In: HPDC '07: Proceedings of the 16th International Symposium on High Performance Distributed Computing. ACM, New York, NY, USA, pp. 179–188, doi:http://doi.acm.org/10.1145/1272366.1272390.

Santos, J.R., Turner, Y., Janakiraman, G., Pratt, I., 2008. Bridging the gap between software and hardware techniques for I/O virtualization. In: ATC '08: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA, pp. 29–42.

Sugerman, J., Venkitachalam, G., Lim, B.-H., 2001. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In: ATC '01: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA, pp. 1–14.

Tikotekar, A., Vallée, G., Naughton, T., Ong, H., Engelmann, C., Scott, S.L., 2008. An analysis of HPC benchmarks in virtual machine environments. In: Euro-Par Workshops, pp. 63–71.

Tikotekar, A., Ong, H., Alam, S., Vallée, G., Naughton, T., Engelmann, C., Scott, S.L., 2009. Performance comparison of two virtual machine scenarios using an HPC application: a case study using molecular dynamics simulations. In: HPCVirt '09: Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing. ACM, New York, NY, USA, pp. 33–40, doi:http://doi.acm.org/10.1145/1519138.1519143.

Whitaker, A., Shaw, M., Gribble, S.D., 2002. Scale and performance in the Denali isolation kernel. In: OSDI '02: Proceedings of the 5th Symposium on Operating Systems Design and Implementation. ACM, New York, NY, USA, pp. 195–209, doi:http://doi.acm.org/10.1145/1060289.1060308.

Xen.org, 2008. Xen Architecture Overview.

Qian Lin received his M.Eng. degree from Shanghai Jiao Tong University in 2011 and B.Sc. degree from South China University of Technology in 2008. Currently he is a Ph.D. candidate in the School of Computing, National University of Singapore. His research interests include operating systems, cloud systems, trusted computing and distributed databases.


Zhengwei Qi received his B.Eng. and M.Eng. degrees from Northwestern Polytechnical University in 1999 and 2002, and his Ph.D. degree from Shanghai Jiao Tong University in 2005. He is an Associate Professor at the School of Software, Shanghai Jiao Tong University. His research interests include static/dynamic program analysis, model checking, virtual machines, and distributed systems.

Jiewei Wu received his bachelor's degree from Shanghai Jiao Tong University in 2010. He is now a graduate student in the Shanghai Key Laboratory of Scalable Computing and Systems. His research interests include operating systems and virtualization technology.


Yaozu Dong is a senior staff member at the Intel Open Source Technology Center. His research focuses on architecture and systems, including virtualization, operating systems, and distributed and parallel computing.

Haibing Guan received his Ph.D. degree from Tongji University in 1999. He is a Professor in the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, and the director of the Shanghai Key Laboratory of Scalable Computing and Systems. His research interests include distributed computing, network security, network storage, green IT and cloud computing.