
DVMM: A Distributed VMM for Supporting Single System Image on Clusters

Jinbing Peng, Xiang Long, Limin Xiao
School of Computer Science & Engineering, Beihang University, Beijing

[email protected] [email protected] [email protected]

Abstract

Providing a single system image (SSI) on clusters has long been a hot topic in parallel computer architecture research, since SSI makes clusters easier to program and administer. Currently, most SSI studies focus on the middleware level of clusters, leading to problems such as poor transparency and low performance. This paper presents a novel solution that provides SSI on clusters using a distributed virtual machine monitor (DVMM) built on hardware-assisted virtualization technologies. The DVMM consists of symmetrical, cooperative VMMs distributed across the nodes. Cooperation among the VMMs virtualizes the distributed hardware resources to support SSI on a cluster. Thus, the DVMM can support an unmodified legacy operating system (OS) running transparently on a cluster. Compared with related work, our solution offers good transparency, high performance, and easy implementation.

Keywords: SSI, virtualization, hardware-assisted virtualization, VMM, DVMM.

1. Introduction

Parallel computer architecture has developed in two directions: the shared memory architecture, represented by the SMP (Symmetric Multiprocessor), and the distributed memory architecture, represented by the COW (Cluster of Workstations) [1]. The shared memory architecture supports the shared memory programming model and offers good programmability, but it scales poorly because of constraints such as shared memory bandwidth. The distributed memory architecture uses the message passing programming model and is harder to program, but it scales well because of its loosely coupled interconnect.

Since the advantages of the two architectures are complementary, it is natural to try to combine them. One way to do so is to present the image of a shared memory architecture on top of distributed memory hardware. DSM and SSI on clusters are both typical examples of this approach.

This paper presents a novel solution that provides SSI on clusters using a DVMM with hardware-assisted virtualization technologies. The rest of the paper is organized as follows. Section 2 describes the background of SSI and virtualization technologies and introduces related work. Section 3 describes the implementation of the DVMM. Section 4 compares the DVMM with existing solutions. Finally, Section 5 concludes.

2. Background

2.1. Single system image

SSI means that all distributed resources are organized into a uniform unit for users, so that users are unaware of the resources' distributed nature. SSI includes attributes such as a single memory space, a single process space, and a single I/O space [2].

The SSI of a cluster can be implemented at the hardware level, the underware level, the middleware level, or the application level. There are few solutions at the hardware level; among them are Enterprise X-Architecture [3], cc-NUMA [4], and DSM [5]. These solutions require special chips or hardware, so they are costly and narrowly applicable. Solutions at the underware level are also rare; the representative ones are MOSIX [6] and Sun Solaris-MC [2]. SSI implemented at this level is highly transparent to users but difficult to build, and current solutions at this level implement only some of the SSI attributes.


There are many solutions at the middleware level. Typical work includes distributed shared memory systems such as IVY [7], parallel and distributed file systems such as Lustre [8], resource management and load scheduling systems such as LSF [9], and parallel programming environments such as MPI and PVM [10]. SSI implemented at this level has poor transparency. There are few solutions at the application level; the representative one is LVS [11].

In summary, the SSI of a cluster may be implemented at the application, middleware, underware, or hardware level. From the top down, the difficulty of implementation increases, but so does the transparency for users. Most current studies focus on the middleware level, which suffers from poor transparency. The few solutions at the application, underware, and hardware levels each have their own drawbacks: hardware-level solutions are costly, underware-level solutions cannot implement the full set of SSI attributes, and application-level solutions lack flexibility.

2.2. Virtualization

Virtualization means that computation is performed on a virtual platform rather than directly on the physical one. Using virtualization techniques, a virtual platform can be constructed between the hardware and the OS, creating multiple isolated domains on one hardware platform, each of which can run its own OS and applications [12].

Virtualization techniques can be classified into full virtualization, para-virtualization, pre-virtualization, and hardware-assisted virtualization [13][14]. Hardware-assisted virtualization is the most advanced of these. VT-x [15] is a hardware-assisted virtualization technology for the IA-32 architecture. VT-x adds a new operating mode, VMX (Virtual Machine Extensions), to the processor; defines two VMX transitions, VM entry and VM exit; and adds a VMCS (Virtual-Machine Control Structure) and ten new instructions for controlling the VM. With the support of VT-x, VMM design is simplified, and full virtualization can be implemented without binary translation.
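
To make these mechanisms concrete, the sketch below (ours, not from the paper) shows the general shape of a VT-x exit-dispatch loop in C. All function names are invented placeholders; a real VMM reads the VMCS with VMREAD and re-enters the guest with VMRESUME via inline assembly, and VMRESUME does not return in the ordinary sense, so the loop is a simplification.

```c
/* Minimal sketch of a VT-x exit-dispatch loop; all names are hypothetical.
 * The exit-reason numbers follow the IA-32 architectural encoding. */
#include <stdint.h>
#include <stdio.h>

enum { EXIT_REASON_CPUID = 10, EXIT_REASON_IO_INSTR = 30 };

/* Stubs standing in for real VMREAD/VMRESUME wrappers and emulators. */
static uint32_t vmcs_read_exit_reason(void) { return EXIT_REASON_CPUID; }
static void emulate_cpuid(void) { puts("reflect the virtual CPU to the guest"); }
static void emulate_io(void)    { puts("execute locally or forward to a remote VMM"); }
static void vm_resume(void)     { /* real VMRESUME transfers control to the guest */ }

void dvmm_exit_loop(void)
{
    for (;;) {
        /* Control reaches here on every VM exit. */
        switch (vmcs_read_exit_reason() & 0xFFFFu) {
        case EXIT_REASON_CPUID:    emulate_cpuid(); break;
        case EXIT_REASON_IO_INSTR: emulate_io();    break;
        default: break;   /* interrupts, MMU faults, etc. (Section 3.2.2) */
        }
        vm_resume();      /* VM entry: the next VM exit restarts the loop */
    }
}
```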

2.3. Related work

The essence of virtualization is to separate software from hardware by abstracting the physical resources, while the goal of SSI is to hide the distributed hardware environment of the cluster. SSI can therefore be implemented by virtualization.

2.3.1. Virtual Multiprocessor. Virtual Multiprocessor [16] implements an 8-way shared memory virtual machine on a cluster of 8 PCs. The VMMs run in application space with the support of a host OS. Para-virtualization is applied to the guest OS. The shared memory space is supported by DSM, the virtual processors are emulated by dedicated processes, and I/O virtualization is implemented through cooperation between the VMMs and a dedicated I/O server. The disadvantages of this system are that application-level VMMs lead to low performance and weak flexibility, and that para-virtualization requires modifying the guest OS while only the devices in the I/O server can be used, which limits its applicability and complicates implementation. Furthermore, Virtual Multiprocessor cannot provide SSI on an SMP cluster.

2.3.2. vNUMA. vNUMA [17] implements a 2-way NUMA virtual machine on a cluster of two workstations, each with an IA-64 processor. The VMMs run directly on the hardware without host OS support. Pre-virtualization is used to modify the guest OS, which is forced to run in ring 1. The shared memory is supported by DSM. One node is the master node, from which the system boots. The disadvantages of vNUMA are that pre-virtualization requires modifying the guest OS, and that demoting the guest OS's privilege level can cause privilege conflicts. Like Virtual Multiprocessor, vNUMA cannot provide SSI on an SMP cluster.

3. Design and implementation of DVMM

3.1. Overview

The goal of the DVMM is to hide the distributed hardware attributes, provide SSI on an SMP cluster, and support a single OS running transparently on the cluster. Three essential problems must therefore be solved. First, the distributed hardware configurations of the cluster must be detected and merged into global information. Second, the global hardware resources must be virtualized and presented to the OS. Third, the OS must be able to manage, schedule, and utilize the global resources just as on a single SMP machine.


To provide SSI on a cluster, a new layer named the DVMM is added between the OS and the cluster hardware. The DVMM contains symmetrical, cooperative VMMs distributed across the cluster, and a single OS supporting the cc-NUMA architecture runs on top of it. Using hardware-assisted virtualization technologies, the DVMM detects and merges the physical resources of the cluster into global information, virtualizes the whole of the physical resources, and presents the virtual resources to the OS. The OS schedules and runs processes and manages and allocates the virtual resources; these actions are transparent to the DVMM. The DVMM intercepts the OS's resource-access operations and handles them on its behalf, for example by mapping virtual resources to physical resources and manipulating the physical resources. In this way, the OS can see, manage, and utilize the whole of the cluster's resources, the distributed nature of the hardware is hidden, and the whole cluster is presented to the OS as a cc-NUMA virtual machine.

3.2. Strategies

Providing SSI on a cluster raises the problems of detecting, presenting, and utilizing the resources of the cluster. Our strategies are: detect the physical resources of each node during VMM startup and integrate them through communication among the VMMs; virtualize the physical resources and report them to the OS through hardware-assisted virtualization; and manage and utilize the physical resources of the cluster through cooperation between the OS and the DVMM. The details are as follows.

3.2.1. Resource detection and merger. The BIOS is emulated and extended into the eBIOS (Extended Basic Input/Output System). After the eBIOS acquires the information about the physical resources of its own node, it communicates with the other nodes to collect the resource information of the whole cluster and merges it into a description of the global physical resources. From the global physical resources, the DVMM reserves some resources and virtualizes the remainder. The DVMM then organizes the virtual resources: it builds the various resource mapping tables, implements the mappings from virtual resources to physical resources and from physical resources to nodes, and creates the global virtual resource information table. The OS is then set up on top of the virtual resources; its calls to the BIOS are captured, and the information about the global virtual resources is reported to the OS, so that the OS is aware of the global virtual resources.
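As an illustration of what the merged global information might contain, here is a hedged C sketch of per-node and global resource tables. The structure and field names are our own invention; the paper does not specify the layout.

```c
/* Hypothetical layout for the global resource information built by the
 * eBIOS at startup; every name and field here is illustrative only. */
#include <stdint.h>

#define MAX_NODES 16

struct node_resources {
    uint32_t node_id;        /* cluster node this entry describes        */
    uint32_t num_cpus;       /* processors detected on that node         */
    uint64_t mem_bytes;      /* physical memory the node contributes     */
    uint64_t mem_base_gpa;   /* where that memory lands in the merged
                                guest-physical address space             */
    uint32_t num_io_devices; /* devices exported to the global I/O space */
};

struct global_resources {
    uint32_t num_nodes;
    uint64_t total_mem_bytes;               /* sum over all nodes        */
    struct node_resources node[MAX_NODES];  /* also serves as the mapping
                                               from resources back to the
                                               node that owns them       */
};

/* Each VMM broadcasts its node_resources during startup; every VMM then
 * builds the same global_resources view, from which the mapping tables
 * and the information reported to the OS are derived. */
```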

3.2.2. Resource virtualization. Resource virtualization includes ISA virtualization, interrupt mechanism virtualization, memory virtualization, and I/O device virtualization. Unlike existing virtualization techniques, the technique in this paper can virtualize resources across nodes.

The IA-32 ISA is virtualized through VT-x, using techniques similar to those in the HVM support of Xen [18]. The interrupt mechanism is virtualized as follows. The DVMM emulates interrupt controllers in software and intercepts the OS's accesses to them: if the target interrupt controller is on the local node, the DVMM manipulates the interrupt controller to reflect the guest's operation; if it is on a remote node, the DVMM sends the access request to the target node, where the target VMM manipulates the virtual interrupt controller accordingly. The DVMM also catches hardware interrupts, and the contents of the virtual interrupt controller are modified by the local or remote VMM, according to the node on which the interrupted object resides, so that the interrupt can be delivered to the OS. Memory is virtualized by combining Shadow Page Tables (SPT) with a software DSM: the memory resources of the cluster are merged into a distributed shared memory by the software DSM, and the distributed shared memory is then virtualized with the SPT. I/O operations are intercepted through VT-x: if an I/O operation targets the local node, the local VMM executes the intercepted instruction and returns the result to the OS; if it targets a remote node, the I/O instruction is sent to the target VMM for execution, and the result is sent back to the local VMM and then to the OS.

3.2.3. Resource management and utilization. The OS manages and utilizes the virtual resources, and the DVMM manages and utilizes the physical resources. The OS interacts with the DVMM through VM entry and VM exit [15]. On top of the virtual resources, the OS schedules and runs processes and manages and allocates the virtual resources independently; this is transparent to the DVMM. When the OS executes a sensitive instruction, or a trap or interrupt occurs, control is switched to the DVMM by a VM exit. The DVMM handles the event according to the exit reason, for example by allocating and manipulating physical devices. After the DVMM has handled the event that triggered the VM exit, the results and control are returned to the OS through a VM entry. Through these interactions between the OS and the DVMM, the management and utilization of the global physical resources are implemented.
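To make the cross-node I/O path concrete, the following hedged C sketch shows the decision an intercepting VMM might take on an I/O VM exit. The routing and messaging calls (owner_node_of_port, send_io_request, and so on) are invented names, not APIs from the paper.

```c
/* Hypothetical handling of an intercepted port-input instruction. */
#include <stdint.h>

extern uint32_t this_node_id;               /* identity of the local VMM   */

uint32_t owner_node_of_port(uint16_t port); /* lookup in the resource map  */
uint32_t local_port_in(uint16_t port);      /* execute IN on a local device */
uint32_t send_io_request(uint32_t node, uint16_t port); /* forward to the
                                               remote VMM, wait for result */

/* Called by the ISA virtualization module on an I/O-instruction VM exit. */
uint32_t handle_port_in(uint16_t port)
{
    uint32_t owner = owner_node_of_port(port);
    if (owner == this_node_id)
        return local_port_in(port);      /* native device: run it here     */
    return send_io_request(owner, port); /* remote device: the result is
                                            then returned to the guest via
                                            VM entry                       */
}
```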


3.3. Design and implementation

3.3.1. System architecture. The system architecture is shown in Figure 1. From the bottom up, the system contains the hardware level, the DVMM level, and the OS level. The hardware level consists of SMP nodes interconnected by gigabit Ethernet, whose CPUs support VT-x. The DVMM level contains the symmetrical, cooperative VMMs distributed across the nodes, which communicate through dedicated communication software. The OS can be any OS that supports cc-NUMA. The key element in implementing this system is constructing the DVMM.

Figure 1. System architecture

3.3.2. DVMM structure. The DVMM is composed of the VMMs distributed across the nodes and runs on the bare machines. The functions of each VMM are to detect, integrate, and virtualize the physical resources, to report the virtual resources to the OS, and to cooperate across the nodes. The structure of the DVMM is shown in Figure 2.

Figure 2. DVMM structure

The initialization module loads and runs the VMM. The eBIOS module detects and integrates the resource information of the cluster and reports it to the OS. The ISA virtualization module virtualizes the IA-32 ISA and cooperates with the interrupt virtualization module so that the OS can manage and schedule the virtual computing resources. The I/O virtualization module virtualizes the global I/O resources. The interrupt virtualization module virtualizes the interrupt control mechanism and notifies the OS of interrupt events. The MMU virtualization module virtualizes the memory resources and ensures that the OS runs correctly in the virtual physical address space. The DSM module implements a distributed shared memory transparently. The communication module provides the communication service for the cooperating VMMs.

3.3.3. DVMM mechanism. The DVMM mechanism is shown in Figure 3.

Figure 3. DVMM mechanism

The ISA virtualization module is both the entry point and the exit point of the DVMM. It may invoke every other module of the VMM except the communication module, and vice versa. When a VM exit occurs, this module analyzes the exit reason and invokes the appropriate module to handle it; when a module completes its work, it invokes this module to return to the guest system. The communication module is the basis of cooperation among the VMMs; it may invoke every other module of the VMM except the ISA virtualization module, and vice versa. The eBIOS module is used only during the initialization of the DVMM and the setup of the OS. First, the eBIOS module invokes the interrupt virtualization module, the I/O virtualization module, and the communication module to detect and build the resource information of the whole system. Second, when the ISA virtualization module captures calls to the BIOS while the OS is being set up, the eBIOS module returns the information about the virtual resources of the whole system to the OS.
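One plausible way this BIOS-call interception could work on IA-32 (our assumption; the paper does not give this level of detail) is for the eBIOS to answer the guest's INT 15h E820 memory-map query with entries covering the merged memory of every node:

```c
/* Hedged sketch: building an E820-style memory map over the merged
 * cluster memory. The entry format is the standard E820 layout; the
 * rest of the names are hypothetical. */
#include <stdint.h>

#define E820_RAM 1

struct e820_entry {
    uint64_t base;    /* guest-physical start of the range */
    uint64_t length;  /* size of the range in bytes        */
    uint32_t type;    /* E820_RAM marks usable memory      */
};

struct node_mem { uint64_t base_gpa, bytes; };  /* from the merged table */

/* Emit one usable-RAM entry per node, so the guest OS sees the whole
 * cluster's memory as a single guest-physical address space. */
int ebios_build_e820(struct e820_entry *map, int max_entries,
                     const struct node_mem *nodes, int num_nodes)
{
    int n = 0;
    for (int i = 0; i < num_nodes && n < max_entries; i++, n++) {
        map[n].base   = nodes[i].base_gpa;
        map[n].length = nodes[i].bytes;
        map[n].type   = E820_RAM;
    }
    return n;  /* number of entries reported to the guest */
}
```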


The I/O virtualization module receives instructions from the ISA virtualization module; depending on the node on which the I/O request should be performed, it either executes the I/O instruction or invokes the communication module to send the request to the target node. When the I/O virtualization module receives an I/O request from a remote node, it manipulates the local I/O device and sends the result back to the source node. The interrupt virtualization module is invoked by the ISA virtualization module to emulate the OS's operations on the virtual interrupt controller; conversely, it converts external interrupt vectors to virtual interrupt vectors and injects virtual interrupts into the OS. When the ISA virtualization module captures a sensitive instruction or a trap related to the MMU, it invokes the MMU virtualization module to handle it. When the MMU virtualization module finds that the requested page is not on the local node, it invokes the DSM module to fetch the page. Invoked by the MMU virtualization module, the DSM module requests the page from the remote node; invoked by the communication module, it serves such a request and sends the result to the remote node.
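The MMU/DSM interplay described above might look roughly like the following C sketch; page_is_local, dsm_fetch_page, and spt_map are invented names for the three steps the text describes (check locality, fetch the page via the DSM module, update the shadow page table).

```c
/* Hedged sketch of the shadow-page-fault path; all names are
 * placeholders for the MMU virtualization and DSM modules. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t gpa_t;  /* guest-physical address */
typedef uint64_t hpa_t;  /* host-physical address  */

bool  page_is_local(gpa_t gpa);      /* does this node own the page?        */
hpa_t local_frame_of(gpa_t gpa);     /* backing frame for a local page      */
hpa_t dsm_fetch_page(gpa_t gpa);     /* DSM module: copy the page from its
                                        home node into a local frame        */
void  spt_map(gpa_t gpa, hpa_t hpa); /* install the shadow-page-table entry */

/* Invoked by the ISA virtualization module on an MMU-related VM exit. */
void handle_shadow_page_fault(gpa_t faulting_gpa)
{
    hpa_t frame = page_is_local(faulting_gpa)
                ? local_frame_of(faulting_gpa)
                : dsm_fetch_page(faulting_gpa); /* remote page: fetched over
                                                   the interconnect          */
    spt_map(faulting_gpa, frame);
    /* A VM entry then resumes the guest at the faulting instruction. */
}
```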

Through cooperation among the modules of the DVMM, built on resource virtualization, SSI is implemented on the SMP cluster.

4. Discussion

There are many existing solutions for providing SSI on clusters; a few are based on virtualization techniques, and the others are not. To highlight the features of our solution, we compare it with the existing solutions as follows.

Table 1. Comparison among DVMM, Virtual Multiprocessor, and vNUMA

System                 | Level       | Technique                        | Difficulty | Transparency | Symmetry | Performance | SMP support | ISA
-----------------------|-------------|----------------------------------|------------|--------------|----------|-------------|-------------|------
Virtual Multiprocessor | Application | Para-virtualization              | High       | Poor         | No       | Low         | No          | IA-32
vNUMA                  | Underware   | Pre-virtualization               | High       | Good         | No       | Moderate    | No          | IA-64
DVMM                   | Underware   | Hardware-assisted virtualization | Moderate   | Good         | Yes      | High        | Yes         | IA-32

As Table 1 shows, the DVMM has advantages over Virtual Multiprocessor and vNUMA. First, the DVMM can implement SSI on SMP clusters, while Virtual Multiprocessor and vNUMA cannot. Second, the DVMM uses hardware-assisted virtualization to implement full virtualization and does not need to modify the guest OS, so its design and implementation are of moderate difficulty; Virtual Multiprocessor and vNUMA adopt para-virtualization and pre-virtualization respectively, both of which require modifying the guest OS, so they are harder to implement and narrower in application. Third, the DVMM is implemented with hardware assistance and runs on the bare metal, so it has high performance, whereas Virtual Multiprocessor and vNUMA are implemented purely in software and thus perform worse; Virtual Multiprocessor, being implemented at the application level, must additionally pass through several software layers, lowering its performance further. Finally, the nodes of the DVMM are fully symmetrical, while the nodes of Virtual Multiprocessor and vNUMA are not, one node being the master. Besides, the DVMM is implemented at the underware level while Virtual Multiprocessor is implemented at the application level, so the DVMM is more transparent than Virtual Multiprocessor; and because IA-32 is more widely used than IA-64, the DVMM has wider applicability and higher practical value than vNUMA.

Compared with the existing solutions discussed in Section 2.1, the DVMM also has advantages. First, the DVMM does not require special hardware, so it is cheaper and more widely applicable than hardware-level solutions. Second, the DVMM can implement the full set of SSI attributes, while existing underware-level solutions implement only some of them, so the DVMM has higher practical value. Third, the DVMM has better transparency and higher performance than middleware-level solutions. Finally, the DVMM has better flexibility and higher performance than application-level solutions.

5. Conclusions and future work

The DVMM implements the SSI of clusters at the underware level using hardware-assisted virtualization technologies, so it can support an unmodified legacy OS running transparently on a cluster. Compared with existing solutions for implementing SSI on clusters, the DVMM offers better transparency, higher performance, and easier implementation.


Further improvements remain to be made: first, using the newer VT-d [19] and EPT (Extended Page Tables) [20] techniques to reduce implementation difficulty, and adopting the processor consistency model instead of the sequential consistency model for higher performance; second, detecting the physical resources dynamically to support changes in the number of nodes; third, adding resource management and load scheduling functions to the DVMM so that multiple guest OSes can run transparently and separately on a cluster.

Acknowledgment

This work is supported by the Hi-tech Research and Development Program of China (863 Program, No. 2006AA01Z108).

References

[1] D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. China Machine Press, 1999.

[2] R. Buyya, T. Cortes, and H. Jin. Single System Image (SSI). The International Journal of High Performance Computing Applications, 15(2):124-135, Summer 2001.

[3] IBM. Enterprise X-Architecture Technology [OL]. http://www.unitech-ie.com/ole/doc_library/xArchitecture%20technology%202.pdf

[4] B. C. Brock, G. D. Carpenter, et al. Experience with building a commodity Intel-based ccNUMA system. IBM Journal of Research and Development, 45(2), March 2001.

[5] A. Itzkovitz and A. Schuster. Distributed Shared Memory: Bridging the Granularity Gap. In Proceedings of the 1st Workshop on Software Distributed Shared Memory, 1999.

[6] L. Amar, A. Barak, and A. Shiloh. The MOSIX Direct File System Access Method for Supporting Scalable Cluster File Systems. Cluster Computing, 7(2):141-150, 2004.

[7] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, 1989.

[8] Sun Microsystems. Lustre File System [OL]. http://www.sun.com/software/products/lustre/

[9] Platform Computing. LSF Reference [OL]. http://support.sas.com/rnd/scalability/platform/lsf_ref_6.0.pdf

[10] A. Geist and V. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 1990 [OL]. http://www.epm.ornl.gov/pvm/

[11] W. Zhang. Linux virtual servers for scalable network services. Ottawa Linux Symposium, Canada, 2000 [OL]. http://www.LinuxVirtualServer.org/

[12] J. E. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes. Elsevier, 2006.

[13] VMware. Understanding Full Virtualization, Paravirtualization, and Hardware Assist. 2007 [OL]. http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf

[14] J. LeVasseur, et al. Pre-Virtualization: Slashing the Cost of Virtualization. Technical report, 2005 [OL]. http://l4ka.org/publications/2005/previrtualization-techreport.pdf

[15] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3: System Programming Guide. 2007.

[16] K. Kaneda, Y. Oyama, and A. Yonezawa. A Virtual Machine Monitor for Providing a Single System Image (in Japanese). In Proceedings of the 17th IPSJ Computer System Symposium (ComSys '05), pages 3-12, November 2005.

[17] M. Chapman and G. Heiser. Implementing transparent shared memory on clusters using virtual machines. In USENIX Annual Technical Conference, Anaheim, CA, USA, April 2005.

[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM SOSP, pages 164-177, October 2003.

[19] Intel. Intel Virtualization Technology for Directed I/O [OL]. http://www.intel.com/technology/itj/2006/v10i3/2-io/7-conclusion.htm

[20] G. Neiger. Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization. Intel Technology Journal, 10(3), 2006.
