Virtual Machine Storage Performance using
SR-IOV
by
© Michael J. Kopps
B.S., University of Colorado, 2007
A project submitted to the Graduate Faculty of the
University of Colorado at Colorado Springs
in partial fulfilment of the
requirements for the degree of
Master of Science
Department of Computer Science
December 2013
This project report for the Master of Science degree by
Michael J. Kopps
has been approved for the
Department of Computer Science
by
Jia Rao, Chair
Xiaobo Zhou
C. Edward Chow
Date
Abstract
This paper presents research on the performance differences for various storage
models on KVM virtual machines when using Single Root I/O Virtualization (SR-
IOV) versus using the traditional hypervisor assisted storage techniques such as virtio.
SR-IOV presents physical hardware devices to the virtual machines and reduces the
overhead of the hypervisor when making disk I/O operations while traditional virtu-
alization techniques use software to provide access to shared hardware.
Contents

Abstract
List of Figures
1 Introduction
2 Background
  2.1 Linux SCSI Subsystem, a.k.a. Bare Metal
  2.2 Hypervisor Based Virtual Disks
    2.2.1 Virtio
    2.2.2 IDE Emulation
    2.2.3 Image Storage
  2.3 SR-IOV Hard Drive Controller Operation
  2.4 Benchmarks
    2.4.1 Bonnie++
    2.4.2 Postmark
    2.4.3 Dos
    2.4.4 dd
3 Discussion
  3.1 Issues
  3.2 Benchmarks
  3.3 Questions
  3.4 Projections
4 Evaluation
  4.1 Test Methodology
  4.2 Independent Disks
  4.3 Disk Sharing
  4.4 Performance on Single Workload
5 Conclusion
6 Future Work
Bibliography
List of Figures

4.1 Average VM Write Throughput with Independent Disks using Bonnie++
4.2 Average VM Read Throughput with Independent Disks using Bonnie++
4.3 Average VM Write CPU % with Independent Disks using Bonnie++
4.4 Average VM Read CPU % with Independent Disks using Bonnie++
4.5 Average VM Write Throughput Comparing SR-IOV Configurations using Independent Disks and Bonnie++
4.6 Average VM Read Throughput Comparing SR-IOV Configurations using Independent Disks and Bonnie++
4.7 Average VM Read Throughput with Independent Disks using Postmark
4.8 Average VM Throughput Standard Deviations with Independent Disks using Postmark
4.9 Average VM Write Throughput with Independent Disks using Postmark
4.10 Average VM Throughput Comparing SR-IOV Configurations using Independent Disks and Postmark
4.11 Average VM Write Throughput with Shared Disks using Bonnie++
4.12 Average VM Read Throughput with Shared Disks using Bonnie++
4.13 Average VM Read Throughput with Shared Disks using Postmark
4.14 Average VM Write Throughput with Shared Disks using Postmark
4.15 Average VM Sequential Read Throughput with Shared Disks using Dos
4.16 Average VM Random Read Throughput with Shared Disks using Dos
4.17 Average VM Throughput with Shared Disks using dd
4.18 dd Throughput per VM with Shared Disks and 4 VMs
Chapter 1
Introduction
Kernel-based Virtual Machine (KVM) is a virtualization hypervisor that uses the
Quick Emulator (QEMU) as its virtual machine monitor and is rapidly gaining in-
dustry adoption for server virtualization. The architectures of the underlying hardware
this hypervisor runs on vary widely, but there is always a need for the basics: CPU,
memory, network, and persistent storage. KVM’s CPU and memory virtualization
implementation is able to take advantage of CPU hardware such as Intel’s VT-x and
VT-d to simplify virtualization by the hypervisor operating system. Persistent
storage, in any computer, virtualized or native, has always been the
largest, least expensive per gigabyte, and slowest form of data storage technology.
This subsystem has struggled to find an easy hardware-accelerated virtualization ar-
chitecture to speed up an already slow process. On virtualization platforms such as
KVM, the storage presented to the guest is in reality either a file on the hypervisor's file system
or a partition on a physical disk set up by the hypervisor. These methods have ben-
efits such as simplicity and ease of management, but they also require the hypervisor
to be involved in all disk I/O operations, which adds an extra layer and therefore
extra latency to the already slow persistent storage.
When the storage is really a partition on a disk partitioned by the hypervisor, the
overhead is small because the hypervisor merely needs to forward the requests on to
the disk. On the other hand, when the storage is really a file, every request must
be translated and then passed through the hypervisor's file system, and each
additional layer a request traverses adds latency. The latter approach has the benefit of being
able to migrate the storage (and the associated virtual machine) easily since the file
just needs to be copied. The former approach must copy at least the
entire disk partition to the destination disk in order to provide room for the
migration.
Single Root I/O Virtualization (SR-IOV) was developed by the Peripheral Com-
ponent Interconnect Special Interest Group (PCI-SIG) to allow multiple virtual ma-
chines to access the same physical PCI-Express device as if each virtual machine had
its own dedicated device [1]. This technology is growing in use in network interface
cards, where its performance benefits have been widely demonstrated [2]. It has
not seen wide adoption in the hard disk host bus adapter (HBA) space, however,
and little is known about the performance benefits of allowing virtual ma-
chines to communicate directly with the HBA. This work will set up a sample system
with an SR-IOV capable HBA and measure its performance against the same
system without SR-IOV enabled and against Linux running natively without virtualization.
Even with SR-IOV, the hypervisor is still involved in the I/O path in two ways.
First, it must handle and route interrupts: the hardware still issues interrupts to the
hypervisor operating system, which must then pass them on to its guests.
Second, since SR-IOV requires direct memory access by the hardware, protection must
be provided to prevent the hardware from corrupting or accessing memory belonging
to another virtual machine. This partitioning and security may be accomplished using
Input/Output Memory Management Unit (IOMMU) hardware, which is available
in newer CPUs from both Intel and AMD.
Chapter 2
Background
2.1 Linux SCSI Subsystem, a.k.a. Bare Metal
The Linux SCSI Subsystem is the baseline of performance in this research. It should
provide the least amount of overhead since there are no virtualization layers for the
I/O to pass through, and it has the most direct route to the hardware. Since this is
running on a non-virtualized Linux system, it is referred to as “Bare Metal.”
There are three layers in the Linux SCSI subsystem. The topmost layer provides
the block and character device nodes used by filesystems for I/O; it also
provides the /dev nodes for the devices. The middle layer provides routing between
the top layer interfaces and the hardware devices and their drivers in the bottom
layer. It also provides a buffer between the different transport classes available, such
as iSCSI, Serial Attached SCSI (SAS), and Fibre Channel. The bottom layer is
where the hardware specific drivers exist. This layer is responsible for taking requests
from the middle layer, forming hardware readable commands and issuing them to the
hardware and then passing the responses back to the middle layer [3]. The lower layer
is also responsible for notifying the middle layer of new devices as they are added to
the physical topology.
The SAS transport is an evolution of the older parallel SCSI transport. SAS is
able to support thousands of devices on a single topology at line speeds of up to
12 Gbps. It does away with the shared bus architecture, so disk drives (sometimes re-
ferred to as the target) may be attached directly to the controller (sometimes referred
to as the initiator) or they may be attached to the controller through one or more ex-
pander devices, which operate much the same way as an Ethernet switch. Expanders
may be connected to the controller through wide ports, allowing the controller to
communicate with more than one device at a time.
2.2 Hypervisor Based Virtual Disks
In this research, QEMU is used to provide device virtualization services for KVM.
QEMU can provide at least two types of disk drives to the guest
operating systems. The first, virtio, is a high performance platform for virtualizing
I/O operations and requires specialized drivers in the guest to operate. It can be
used for several classes of I/O devices in addition to disk-based storage, such as network devices
[4].
2.2.1 Virtio
Virtio was developed to move virtualization towards a single, standard method of
handling paravirtualized I/O devices and transporting the I/O between the guest
and hypervisor. Paravirtualization is a technique where the guest driver and the
hypervisor use a virtualization specific means of passing the I/O rather than the
hypervisor presenting a completely emulated device found on real hardware devices.
There are several existing implementations, one each for Xen, KVM, VMware, and
so on. A unified technique makes new development easier by providing an existing
infrastructure and design, and makes maintenance easier by having a single, common
in some of the more proprietary formats, which can overlap functionality between the
guest drivers, the transport with the hypervisor, and the configuration of the devices
[5].
Virtio implements its transport abstraction for I/O requests using a ring buffer
with three portions: the descriptor table, whose entries let the guest express (address,
length) tuples; the available ring, which lists the descriptors ready for use; and the
used ring, where the hypervisor places items it has consumed [5].
The virtio block device driver uses a small header describing the request as a
simple read or write, or a more generic SCSI command. The header also contains an
as-yet unused priority field and a sector field describing where on the virtual
disk the I/O is targeted. The header is placed into a free descriptor, which is chained
to another descriptor containing the memory information about the stored data. If
the memory is discontiguous, multiple descriptors will be chained together to form a
scatter gather list. The completed descriptor chain is placed on the available ring and
the hypervisor is notified there is a new entry available to process. The hypervisor
then processes the request by formulating a native I/O to the image storage location
and passing that to its own storage driver. On the return path, the I/O is completed
to the hypervisor, which triggers the hypervisor to place the request descriptor on
the used queue followed by a notification to the guest the I/O has completed [5].
2.2.2 IDE Emulation
QEMU also provides support for a non-paravirtualized storage interface which em-
ulates the legacy IDE interface. This interface is provided to allow guest operating
systems which do not have a driver written to support virtio to still be hosted on
a KVM virtualization environment. Its use is universally advised against for
guests which have support for the virtio interface, due to the low performance of
the IDE interface [6].
2.2.3 Image Storage
The actual data can be stored on disk in several formats, each offering different
features and performance trade-offs.
raw - This format stores the disk exactly as it would be on a physical drive, is simple
to emulate, and is easy to export to other emulators [7]. This method involves
the least amount of overhead in the hypervisor when translating between the
virtual sector and the physical location of the data store.
qcow2 - This format is the QEMU native format and provides features such as zlib
compression and AES encryption [7]. These features add extra complexity in
the I/O path and may perform worse than the raw format.
The data may be cached by the hypervisor, based on user-selectable options. The image
may be opened write-through, where data is written to the cache and the disk simultaneously;
write-back, where data is written to the cache and flushed to the disk at some
later time; or with caching disabled entirely [8].
2.3 SR-IOV Hard Drive Controller Operation
Implementing SR-IOV on a hard drive controller requires three components: 1) the
controller itself; 2) the hypervisor driver; and 3) the guest driver.
The controller is the most complex portion of the SR-IOV stack. First and fore-
most, it must present multiple PCI-Express functions, both the physical
function for the hypervisor and the virtual functions for the guests, to the PCI-Express
host to allow independent software stacks to access the hardware simultaneously. Each of
these functions must have the queues necessary for the drivers to submit requests to
the controller and for the controller to submit replies to the drivers. Finally, each
function must be able to maintain the required interrupt configuration, most com-
monly the multiple MSI-X interrupt vectors associated with the function. With all
of these resources, the controller must then monitor the queues for new requests and
submit them to the SAS topology in a fair manner.
The hypervisor driver is responsible for configuring the controller and available
resources for use by the guests. To provide data security, the hypervisor instructs the
controller as to which virtual functions have access to which SAS devices. The hypervisor
driver may also provide traditional controller resources to the hypervisor through the
physical function so the hypervisor operating system may access disk resources in
parallel with the guests.
The hypervisor itself is responsible for assigning virtual functions to guests. A
guest may access multiple virtual functions, but a virtual function may only be ac-
cessed by a single guest at any one time. All that is required of the hypervisor is
to pass the physical PCI-Express memory space for that virtual function through to
the guest. When the guest drivers allocate interrupts for the virtual function, the
hypervisor must also assist the guest in registering the interrupts and then passing
them between the physical interrupt controller and the guest operating system.
The guest driver is identical to the default non-SR-IOV host driver for the con-
troller and it need not be aware it is running on a virtualized system at all. The
guest driver sees the same PCI-Express registers as it would when running on bare
metal Linux and a standard controller. The hypervisor hides the complexity of the
interrupts below the guest operating system itself. The MSI-X interrupts appear to arrive
in the same manner as they would in a normal bare metal system.
2.4 Benchmarks
This research used several storage benchmarking tools. Some are quite simple while
others strive to show a more real world I/O profile. All of these benchmarking tools
access the storage through a filesystem, so it was important to clear any operating
system cache before running them.
2.4.1 Bonnie++
The benchmarking tool Bonnie++ is an expansion of the original Bonnie benchmark
program, written by Tim Bray, which adds the capability of testing large storage space
using a 32-bit program [9]. The tests are designed to investigate common bottlenecks
in I/O subsystems when under heavy load. Two test types were used in this research:
sequential output and sequential input.
Sequential output performs three types of write I/O to the disk subsystem. It
first performs character-sized I/O using the putc() function provided by the standard
I/O library in C/C++. The test then re-opens the same file and uses the write(2)
function to perform 1 KB block-sized write operations to disk. Transferring large
blocks tends to test write throughput, while the smaller I/O sizes tend
to test the filesystem caches and the raw number of I/O operations per second the
underlying storage subsystem can support. The benchmark then reopens the file
and rewrites it: each block is read with read(2), one word is dirtied, and the block
is written back with write(2).
Sequential input performs two types of read operations from the storage. It first uses
getc() to read individual characters. The file is then reopened and
read using read(2) in 1 KB blocks of sequential data from the disk.
2.4.2 Postmark
The Postmark benchmarking utility was created to fill a gap in measuring the per-
formance of programs that access many ephemeral small files, such as email servers
and online shopping web sites. (Here, KB refers to 1024 bytes of data; 1 MB refers
to 2^20 bytes, or 1,048,576 bytes.) Previous benchmarks utilize static large files to
perform their testing, which
does not account for the transient small files used by the aforementioned application
types.
Postmark generates large numbers of small files that are constantly being created,
written to, read, and deleted. The size of the pool of files and the size of
each file are configurable by the user. File creation causes a randomly selected
amount of data, drawn from previously generated random data, to be written.
Reads are done by randomly selecting a file, opening it, and reading the
entire file into memory using a configured block size [10].
The benchmark records statistics including the total time, time for creating the
initial files, time to read the files, time to append to the files, and the time taken
to delete the files. It also provides bandwidth results from these tests, showing the
amount of data read and written.
2.4.3 Dos
Dos is a benchmark developed to provide a simple application against which to
compare the various environments. It can issue sequential and random reads in
varying block sizes using the standard C library.
2.4.4 dd
The GNU/Linux utility called dd is used informally as a quick benchmark for storage
by users across the world. Its ubiquity provides a universal benchmark by which
results may be compared across multiple systems. It is by no means a thorough
benchmark, but it is able to provide a perspective on the performance of the storage
subsystem without going through a filesystem.
Chapter 3
Discussion
3.1 Issues
When implementing a virtualized environment and hosting tenant guests, it is im-
portant to provide a standard level of service to tenants as described by the service level
agreement. As ever, the trick is to provide these services at the specified level with-
out adding more overhead than is absolutely necessary: the more overhead consumed
by fairness enforcement, the fewer resources are available to tenants, and therefore
the fewer resources generate revenue through the work the guests perform.
Disk Sharing A problem arises when using a single hard disk drive for multiple
virtual machines. If the virtual machines are accessing different parts of the disk
simultaneously, the disk head must seek across the media repeatedly, causing a higher
latency for each access. If the hypervisor assigns each guest a different partition on
the disk, then each guest's accesses are bounded by its partition. If each guest is
accessing only the lower portion of its assigned disk area, there will be bands of hot
zones corresponding to the lower portions of each partition, and the disk head must
fly across the cold areas of the disk to reach the hot zones for the other guests. If
the disk has a sufficiently large cache, it may be able to reorder the accesses based
on relative locality, i.e. keep the accesses to the hot zones together.
This can produce uneven performance across the various virtual machines.
The partitions located on the outside of the disk will be closer together (assuming
equal partition sizes) and will be optimized better than those closer to the interior
of the disk. Short stroking, a technique that utilizes the faster outer cylinders of a
disk to improve performance, is the reason organizations sometimes restrict drives to
their outer tracks; here, some guests benefit from effective short stroking while other
guests lose out.
3.2 Benchmarks
I/O loads on servers can take on several different profiles. File servers tend to generate
a heavy load of sequential reads and writes, often using larger block sizes. Database
and mail servers, on the other hand, tend to generate a much larger set of very small
block random read and write operations, because these loads need to write small email
messages or database-sized blocks to the disk. The access sequence of these
small blocks has a relatively small amount of locality.
When testing the performance of the systems at the level of the operating system
and guest operating systems, smaller block I/O as found in databases will test the
overhead imposed by the storage subsystem, while larger block sequential I/O will
test the throughput of the data.
In all cases tested in this research, the data path is zero-copy, so the only real test
is of the overhead. Overhead can be found between the guest and hypervisor, inside
the hypervisor, and between the hypervisor and its underlying storage system. There
are numerous ways of describing the location of the data in memory, but all of them
are some form of scatter-gather list (SGL). However, each layer tends to use a different
SGL format, which means the more fragmented the sequential data is in memory, the
more overhead is created going through the layers of the hypervisor. For small
I/O operations, it is unlikely the request will use more than one element in the SGL,
since the operating system's page size is 4096 bytes. For I/O requests larger
than 4 KB, the SGL overhead becomes more pronounced as the I/O grows
larger. Bonnie++ and Postmark only use I/O sizes smaller than 4 KB, so they will not
see much of this overhead. Dos and dd transfer more than a page of data, and the
impact will be felt more significantly there.
Random versus sequential I/O only matters to the disk itself; the software layers add
no additional overhead for random I/Os versus sequential I/Os. When testing on a single
disk, random I/Os could make the disk work harder, increasing the response time of
individual I/Os. On the other hand, since the disk head is already moving rapidly
across the disk, randomness could help place I/Os from different VMs closer together
and spread accesses across the disk. Sequential I/Os nearly guarantee a
fixed distance on the disk between the sequential areas for each partition.
Any benchmark used must be able to gauge the breadth of these types of I/O
profiles, or a profiler must use multiple tools to gather the information required.
Bonnie++ is an excellent model of file server type workloads because it tends to use
fairly large block sizes with a mix of sequential and random accesses. Postmark on
the other hand was written specifically to test the workloads of database servers and
mail servers. Postmark exclusively performs random I/O.
Because all of the benchmarks used in this research operated at the filesys-
tem level, it is important to clear the filesystem caches at all levels prior to running or
re-running tests; otherwise the benchmarks may measure the speed of the filesys-
tem cache rather than the actual disk subsystem. Some of the benchmarks attempted
to overcome the effects of file caching by using files that are significantly larger than
the available memory on the system to make it impossible for the filesystem to cache
the entire file used in the test.
3.3 Questions
This research aims to answer the following questions within the bounds of the envi-
ronments tested.
1. Which storage virtualization approach gives the least overhead?
2. Which approach gives the best performance in a multi-tenant environment?
3. Given a specific workload, which approach performs the best?
3.4 Projections
As with all software-based virtualization techniques, the most important factor in
determining performance will be the overhead involved. Bare metal has the small-
est amount of overhead possible while providing flexibility and configurability for
the operating system and its applications. This interface has undergone decades of
refinements and analysis to become as streamlined as it possibly can be. Having
multiple processes access different partitions can still be optimized to achieve higher
performance as a whole.
Hypervisor based storage is relatively new and does not have nearly as much
research devoted to the same system as the bare metal case. This methodology
must take care of transferring data to the guest operating systems, translating the
virtual disk into a physical location on physical storage, formulating a physical I/O
operation, and passing that to the hardware. It must do this while arbitrating access
across multiple guests. At the same time, it could be possible for the hypervisor to
cache disk accesses (though this will not be used during the testing for this research),
which will help disk accesses. It could also be possible for the hypervisor to use the
arbitration overhead to its advantage by optimally scheduling the disk to minimize
the head movement.
Direct hardware access by the guests using SR-IOV offers close to bare metal overhead
in a virtualized environment. The guests deliver I/O requests directly to the
hardware without hypervisor intervention by using the IOMMU. At the same
time, if multiple guests access the same physical disk, then there will be no intelligent
scheduling of those accesses except for any intelligence provided by the disk itself.
Chapter 4
Evaluation
4.1 Test Methodology
The testing performed in this research was executed using the three flavors of I/O
configuration discussed previously: using bare metal Linux; using hypervisor virtio
disks; and finally using SR-IOV physical hardware. The tests were performed on a
Dell PowerEdge T620 server (BIOS version 1.5.3) with a single Intel Xeon E5-2620
processor with 6 hyperthreaded cores running at 2.0 GHz. The server had 8 GB of
1333 MHz DDR3 RAM. The server and operating system were configured to use the
available IOMMU to accelerate memory access by the hardware.
The SAS controller used in the experiments was an LSI SAS 3108, a 3rd generation
SAS controller made by LSI Corporation. This controller can act as either a standard
non-SR-IOV controller or an SR-IOV controller, depending on the firmware installed.
When testing SR-IOV, the controller was using SR-IOV firmware and in the bare
metal and virtio environments, the controller was in non-SR-IOV mode to accurately
represent the real-world usage of all three of the test cases. In both configurations,
the firmware version was 02.00.01.00. The number of virtual functions presented to
the system can also be configured, along with the resource allocation between the
functions, both physical and virtual. Except where noted, the number of virtual
functions was set to four and the available resources were equal across the virtual
functions. The physical function received reduced resources because it was not needed
for regular I/O operations. The resources in question are the depth of the request
queues provided to the driver to submit requests to the controller. Each function has
up to 8 MSI-X vectors available to it. The driver used was version 3.00.00.00 for non-
SR-IOV mode and 10.165.01.00 for the SR-IOV hypervisor driver. It should be noted
this SR-IOV driver and firmware are pre-production versions. Both the non-SR-IOV
firmware and driver are production versions.
When in SR-IOV mode, the controller must be configured with how to divide its
available resources between the virtual functions. It was decided to configure the
controller to provide each virtual function with 21% of the resources, with the physical
function keeping the remaining 16% for its own usage (4 × 21% + 16% = 100%). This
division is rather arbitrary, and is discussed further in Chapter 6. A small amount of testing was done
changing the resource allocation and is discussed later in this chapter.
Attached directly to the controller were Seagate Savvio ST9146803SS 6Gbps
10,000 RPM 146 GB SAS drives. Two access types to the physical disks were used. In
one case, each guest had access to its own exclusive disk; no machine other than the
one to which the disk was assigned could make any accesses to that disk. The other
paradigm used was sharing a single physical disk across multiple virtual machines.
The disk would be partitioned into four primary partitions with each virtual machine
given access to its assigned partition. Each virtual machine could format and mount
its own partition.
CentOS 6.3 with Linux kernel 2.6.32-279.el6.x86_64 was used as both the host
and guest operating system for these tests. KVM with libvirt version 0.9.10 was
installed as the hypervisor. The file system used in all cases was the simple and low
overhead ext3 file system. Virtio using the raw storage format was used in all cases
to keep the testing in line with the current recommendations put forth by the KVM
community.
Scripts were written to ensure the tests were executed the same way every time
and to ensure the operating system caches of all of the systems involved were
cleared before running the tests. These scripts allowed the tester to easily execute six runs
of the same test in order to average out temporal irregularities in the data. All
of the tests were run with 1, 2, 3, and 4 virtual machines or in the case of bare metal
testing, numbers of simultaneous processes running the same test. In the cases where
virtual machines were used, when the test execution reduced the number of guests
needed, the extra guest was shut down before continuing so as to ensure it would not
steal resources from the running virtual machines.
4.2 Independent Disks
To determine which paradigm has the least overhead, testing was performed using
independent disks for each virtual machine. The only resources the guests would be
in contention for would be CPU, interrupt, and hard disk controller related.
Bonnie++ The Bonnie++ benchmark showed that for write operations, bare metal
had the worst performance while SR-IOV and virtio had relatively similar perfor-
mance, as seen in Figure 4.1. This seems counterintuitive since the bare metal case
should have the least amount of overhead and therefore the best performance. All
three also showed a non-monotonic decrease in throughput as the number of virtual
machines increased; however, all of the results were similar enough to be considered
identical. Large block write throughput also showed a non-monotonic decrease in
throughput, except for virtio. Single VM throughput was worse on bare metal than
in SR-IOV or virtio, suggesting the processes were interfering with one another in
either the filesystem or the middle layer of the Linux SCSI subsystem when not put
into separate guests, a suspicion reinforced by the severe drop-off in throughput when
the number of simultaneous processes was increased to four.
The read performance showed very different results between small I/O and larger
block I/O, shown in Figure 4.2. The SR-IOV and bare metal cases show very compara-
ble performance in the small block tests, while virtio turned in consistently lower
throughput across all of the VM counts tested. This supports the idea that virtio has
much higher overhead, since no caching could take place and the small I/O size tends
to expose the overhead of processing each individual request. As could be expected,
the large block throughput was significantly higher than the smaller block through-
put, reflecting the smaller overhead involved in transferring a given amount of data.
Unexpectedly, however, bare metal had much lower performance once more than one
process was active simultaneously. As with the write case, this is likely due to
undesirable interaction between the processes causing interference in those test cases.
those test cases. SR-IOV turned in the highest throughput of all of the environments,
21
0
20000
40000
60000
80000
100000
120000
Bare Metal Per Chr SR-IOV Per Chr Virtio Per Chr Bare Metal Block SR-IOV Block Virtio Block
KB
/s
1 VM
2 VM
3 VM
4 VM
Figure 4.1: Average VM Write Throughput with Independent Disks using Bonnie++
0
20000
40000
60000
80000
100000
120000
140000
160000
Bare Metal Per Chr SR-IOV Per Chr Virtio Per Chr Bare Metal Block SR-IOV Block Virtio Block
KB
/s
1 VM
2 VM
3 VM
4 VM
Figure 4.2: Average VM Read Throughput with Independent Disks using Bonnie++
22
besting virtio across the board, again indicative of the higher overhead in transferring
the data between the hardware, the hypervisor, and the guest, though in this case
it could be attributed more to the scatter gather list translation over the individual
request overhead.
Comparing each method against itself with a varying number of guests, write per-
formance was fairly flat across all test configurations. Read performance in bare
metal, on the other hand, showed a steady decline as the number of processes in-
creased. SR-IOV and virtio showed a much less pronounced decline in performance
as the number of guests increased, especially for large block requests.
Looking at the CPU utilization during each test in Figure 4.3, the write tests indicated
that bare metal had the lowest usage while the two virtualization techniques showed
similar CPU usage. During the read tests in Figure 4.4, small block reads caused the
highest usage in bare metal and SR-IOV, with virtio having the lowest average usage.
For larger block I/Os, all techniques had similar CPU utilization, with writes having
higher utilization than reads.
What happens if the resources match what will be used? Figure 4.5 shows a com-
parison of small and large block writes using SR-IOV in the same configuration as
seen in Figure 4.1 along with the firmware configured to allocate resources with 16%
going to the physical function and the remaining resources evenly split between the
utilized virtual functions. Note the four VM case is not tested as the configuration
would be the same in both cases. Figure 4.6 shows a similar figure with read I/O
operations. In both the standard SR-IOV and the minimum SR-IOV configurations,
the performance is similar for both read and write. This result indicates that, for
this workload, the controller resources were not the bottleneck.
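The minimum allocation scheme described above — 16% of the controller's resources reserved for the physical function and the remainder split evenly across the virtual functions in use — works out as in the sketch below. The actual firmware configuration knobs are vendor-specific; this merely shows the arithmetic.

```python
def vf_share(n_vfs, pf_fraction=0.16):
    """Fraction of controller resources each active virtual function
    receives when the physical function keeps pf_fraction and the
    rest is split evenly among n_vfs virtual functions."""
    if n_vfs < 1:
        raise ValueError("need at least one virtual function")
    return (1.0 - pf_fraction) / n_vfs

# With three guests, each VF gets 28% of the controller's resources;
# with four, the split matches the default configuration, which is
# why the four VM case is not retested.
```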
Figure 4.3: Average VM Write CPU % with Independent Disks using Bonnie++

Figure 4.4: Average VM Read CPU % with Independent Disks using Bonnie++
Figure 4.5: Average VM Write Throughput Comparing SR-IOV Configurations using Independent Disks and Bonnie++ (KB/s)

Figure 4.6: Average VM Read Throughput Comparing SR-IOV Configurations using Independent Disks and Bonnie++ (KB/s)
Postmark In comparing the performance of the various methods under the Post-
mark benchmark, virtio on the whole performed the best on both read and write op-
erations. Read and write performance on this benchmark, though scaled differently,
was interestingly identical in shape, as seen in Figures 4.7 and 4.9. This property
suggests the use of many small files made it much more difficult to cache the write
operations in the operating systems' file systems.
Figure 4.7: Average VM Read Throughput with Independent Disks using Postmark (MB/s)
Far more interesting in this benchmark were the effects of adding additional guests.
Bare metal had the best performance when only one benchmark was running, but
that performance quickly tailed off as additional processes were added. SR-IOV was
able to keep performance constant as the number of guests increased, although it was
not able to match the average performance of either bare metal or virtio. Virtio was
unable to achieve the peak numbers seen on bare metal nor the consistency seen with
SR-IOV. Its average performance was better than SR-IOV's, but its standard devia-
tion was much higher, as seen in Figure 4.8. The standard deviations indicate the
virtualization methods are able to maintain better uniformity in performance than
the bare metal case. Since SR-IOV was able to keep performance high across all of
its guests, the hardware is evidently much better at enforcing boundaries than the
other two cases, though at the cost of a slight performance loss versus virtio.
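The fairness comparison in Figure 4.8 is simply the standard deviation of the per-guest throughput averages. The sketch below illustrates the computation with made-up numbers, not the measured results:

```python
import statistics

def fairness(per_guest_throughput):
    """Mean and population standard deviation of per-guest throughput
    (MB/s). A lower standard deviation means more even service."""
    mean = statistics.mean(per_guest_throughput)
    stdev = statistics.pstdev(per_guest_throughput)
    return mean, stdev

# Hypothetical numbers: SR-IOV serves guests evenly, virtio unevenly.
sriov = [1.1, 1.0, 1.1, 1.0]
virtio = [2.0, 1.5, 0.8, 0.5]
print(fairness(sriov))   # low stdev: even service across guests
print(fairness(virtio))  # higher mean, but much higher stdev
```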
Figure 4.8: Average VM Throughput Standard Deviations with Independent Disks using Postmark (MB/s)
The flat performance in the SR-IOV case suggests the test was bound by the
controller resources for 1 or 2 VMs, and was forced to share the resources when the
number of VMs increased.

Figure 4.9: Average VM Write Throughput with Independent Disks using Postmark (MB/s)

The lack of performance in bare metal suggests there was
some inefficiency in servicing the multiple processes in the single operating system.
Virtio was able to avoid the single operating system bottleneck while more fully
utilizing the resources available on the hardware, though overhead could still be an
issue for small numbers of guest systems.
Figure 4.10 shows how the resources can affect performance when using the Post-
mark benchmark. Unlike the Bonnie++ results, here the resource allocation in the
minimum SR-IOV configuration shows a performance benefit in the one and two
guest configurations. Since the Postmark benchmark uses many more files, it is much
more likely to issue more requests and therefore consume more resources than the
Bonnie++ benchmark.
Figure 4.10: Average VM Throughput Comparing SR-IOV Configurations using Independent Disks and Postmark (MB/s)
4.3 Disk Sharing
Sharing disks between virtual machines or processes increases the I/O stress on the
hard disk drive.
Figure 4.11: Average VM Write Throughput with Shared Disks using Bonnie++ (KB/s)
Bonnie++ Looking at write performance in a shared disk configuration, it is clear
that adding guests causes a significant degradation in performance. In both big and
small block writes, once more than one guest was operational at the same time, per-
formance was even across the board, as seen in Figure 4.11. The only difference is in
the single VM testing, which showed performance similar to the independent disk
tests, as would be expected since the test is effectively the same. It is clear the disk
is the bottleneck in this configuration.
Figure 4.12: Average VM Read Throughput with Shared Disks using Bonnie++ (KB/s)
Read performance with a shared disk, as with the write tests, shows a significant
drop in performance as the number of guests increases. As opposed to the write tests,
big block and small block tests showed different throughput. In the small block tests,
the performance level was similar across all environments. In large block tests, SR-
IOV and bare metal showed similar levels, while virtio had problems keeping up.
Since the performance levels were different between the environment types, it is clear
virtio suffers from its additional overhead even when the disk is being shared across
multiple guests. This is an interesting result, as the hypervisor should be able to
perform some scheduling on the disk to keep accesses localized; however, SR-IOV,
which has absolutely no centralized disk scheduling, was able to outperform it. Either
the overhead of virtio is extremely high or the scheduling mechanism in the hypervisor
is actually degrading performance.
Figure 4.13: Average VM Read Throughput with Shared Disks using Postmark (MB/s)

Figure 4.14: Average VM Write Throughput with Shared Disks using Postmark (MB/s)
Postmark As with the independent disk testing, the read and write performance
was identical except for a scaling factor, as seen in Figure 4.13 and Figure 4.14. Unlike
Bonnie++, the throughput did not experience an incremental decrease in performance
as the number of guests increased. For bare metal, the performance significantly
degraded when the first additional process was added, but as more processes were
added, the throughput actually improved. The first additional guest for both SR-IOV
and virtio severely impacted the throughput of both tests, and subsequently added
guests did not have quite as much of a negative impact.
4.4 Performance on Single Workload
To test a single workload, this research used a shared disk and a custom micro-bench-
mark called dos to look at how each profile would perform. Finally, a common
GNU/Linux utility called dd was used to show simple sequential read throughput on
the three environments.
Dos As seen in Figure 4.15, SR-IOV and bare metal performed identically for se-
quential reads across all block sizes, with throughput dropping 17% from one simulta-
neous worker to four. When running virtio based storage on the guests, performance
dramatically degraded as the number of guests increased. Adding even a single guest
caused a performance drop of 57%. Adding one more guest dropped performance an
additional 15% from the single guest performance. With four guests, the total drop
was 78% from the single guest performance. This was true
for block sizes ranging from 16KB to 1024KB.

Figure 4.15: Average VM Sequential Read Throughput with Shared Disks using Dos (MB/s)

Since bare metal and SR-IOV did not suffer nearly as significantly when adding
workload, it can be concluded the disks
were not the limiting factor and CPU speed was also not the limiting factor since
SR-IOV was able to achieve much higher performance for its guests. Interestingly, for
single guest workloads, virtio had the highest throughput of all of the environments
for all block sizes.
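A minimal version of the sequential read measurement that dos performs might look like the sketch below. The real tests would target the raw device with caches dropped between runs; this illustration reads an ordinary scratch file, so the numbers it prints reflect the page cache rather than the disk.

```python
import os
import tempfile
import time

def sequential_read_mb_s(path, block_size):
    """Read `path` start to finish in `block_size` chunks; return
    (bytes_read, throughput in MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, total / (1024 * 1024) / max(elapsed, 1e-9)

# Demo on a small scratch file, swept over the same block sizes used
# in the dos tests: 16KB, 128KB, and 1024KB.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(os.urandom(4 * 1024 * 1024))  # 4 MB of data
    scratch = tf.name
for bs in (16 * 1024, 128 * 1024, 1024 * 1024):
    n, mbs = sequential_read_mb_s(scratch, bs)
    print(f"{bs // 1024:5d}KB blocks: {mbs:9.1f} MB/s ({n} bytes)")
os.remove(scratch)
```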
Figure 4.16 displays the results under random read operations. SR-IOV and virtio
performed nearly identically at the 16KB and 128KB block sizes, both showing about
a 77% drop. When the block size of the read operation was increased to 1024KB, all
three environments had similar results for single guest performance; for two guests
virtio showed results similar to SR-IOV, and for three and four guests, the through-
put on virtio was significantly lower than the other two environments. The similarity
between the environments in the smaller two block sizes indicates the disk was the
bottleneck for performance.

Figure 4.16: Average VM Random Read Throughput with Shared Disks using Dos (MB/s)

For the largest block size, the environment was the limiting
factor. The high performance on bare metal indicates the file system scheduler was
able to more optimally schedule the requests to improve response time. The lower
performance at high VM counts in virtio again shows there is some bottleneck in the
virtio path that arises as the number of VMs increases.
dd Figure 4.17 shows the throughput as measured by the dd utility provided by the
Linux distribution while using shared storage. With one copy of the utility running,
all environments show roughly the throughput available at the disk, a little more
than 120 MB/s, though virtio and SR-IOV have slightly higher results than the bare
metal test. Running more than one copy of the utility in bare metal shows a drop in
performance, but the number of additional workers does not affect the throughput,
suggesting a problem at a higher layer.

Figure 4.17: Average VM Throughput with Shared Disks using dd (MB/s)

SR-IOV and virtio both show similar results
as the number of workers increases, with SR-IOV having a very small advantage. It is
clear the disk scheduler in the bare metal operating system is helping improve
the throughput as the number of workers increases. Any optimization provided by
virtio is ineffective, since the numbers in that environment are so similar to SR-IOV,
which has no optimizations.
Looking at the individual workers in Figure 4.18 and not the average of all of
the workers, virtio shows nearly perfectly even results across all of the workers. Bare
metal had a fairly consistent drop across all of the individual workers. SR-IOV shows
excellent service for the first guest, significantly less for the next, and so on. This
is in contrast with the other tests where SR-IOV had consistently the most even
performance for all of the guests. It is surprising that the fourth worker saw the
lowest performance, because it was allocated the outer cylinders and so received the
benefit of effectively short-stroking the disk. Since bare metal and SR-IOV share no portion
of the I/O path except what is in the hardware, the conclusion is the hard drive is the
bottleneck. Virtio provides consistent performance across guests at the cost of good
performance for any guest. This test was run a second time and the results there
showed very little variation between guest throughput. The huge variability between
runs indicates a small variation in the initial conditions of the test can change the
results dramatically.
Figure 4.18: dd Throughput per VM with Shared Disks and 4 VMs (MB/s)
Chapter 5
Conclusion
By most measures, SR-IOV was able to stay on par with the bare metal environ-
ment. Under workloads with many small files, it provided the most even performance
of all of the environments. In all measures except for the Bonnie++ write through-
put, virtio was unable to keep up with bare metal or SR-IOV in terms of throughput.
In some tests, it had abysmal performance which could only be attributed to a mis-
behaving optimization, but this also shows the inherent problem in adding overhead
in order to virtualize hardware resources. This overhead is expected, and therefore
the performance impact was anticipated. What was unanticipated was the uneven
performance between VMs. Since virtio adds a software layer, it should be able to
enforce even performance and optimally schedule the disk when there is contention
between multiple guests and one physical resource. This second point should be the
most important, because scheduling the I/O requests optimally for the disk will in-
evitably force even performance across guests.
It can also be concluded that the SR-IOV implementation reduces the need for I/O
fairness scheduling in the hypervisor. All of the overhead of formulating the
I/O to submit to the hardware is borne by the guest itself instead of on behalf of
the guest by the hypervisor. This should significantly reduce the complexity of the
scheduler routines used by the hypervisor to schedule the guests to run on the CPU.
It should also make fairness easier to achieve between I/O intensive tenants and CPU
intensive tenants, since the only resource arbitrated by the hypervisor is access to the
CPU. This property could help large cloud hosting organizations achieve better fair-
ness and reduce the importance of I/O scheduling in the hypervisor scheduler algorithms.
Chapter 6
Future Work
While running the tests described in this research, it was observed that configuring
permissions for the guests on the SR-IOV controller was difficult. It would be useful
for future developers to implement some means of having the SR-IOV resources be
configured by the hypervisor configuration utilities, rather than through separate
means provided in a vendor-unique manner by the SR-IOV driver. This would provide
a much more flexible interface that could be used by many vendors and would make
the job of administrators much easier when faced with a changing environment.
Another possible improvement would be to pin the affinity of SR-IOV guests to
specific CPUs. In the bare metal environment, the driver allocates one MSI-X vector
per CPU in order to route the I/O replies back to the CPU which initiated the
request. When virtualization is used, the CPU the guest is running on may change
at the whim of the hypervisor, reducing the ability to keep recently used data in the
CPU cache.
Another thread of future research would be to investigate means of issuing hardware
interrupts directly to a running guest, rather than needing to break to the hypervisor
to route and map the interrupt to the guest.
This work did not investigate many optimizations in the hard disk controller card
itself. What is the performance of the bare metal case when the resources are spread
thin just as they were for the guests accessing the controller through SR-IOV? Why
was the performance limited so much in the benchmarks when using SR-IOV? At the
conceptual level, the performance should have matched the performance seen when
using bare metal. Was the interrupt routing through the hypervisor really significant
enough to disrupt performance? If not, what is the bottleneck?
A small performance improvement could be found if all entities used the same SGL
format. The virtio library defines its own format [5], the Linux kernel uses another,
and the driver and the hardware use yet another. If the entire stack converged on
a uniform SGL format, the overhead dedicated to converting SGLs could be elimi-
nated and the performance of the entire stack, with or without virtualization, could
be improved.
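To illustrate the kind of conversion meant here, the sketch below rewrites one scatter-gather list representation into another: a flat list of (address, length) entries, split into entries no larger than a fixed segment size, as a controller with a maximum segment length might require. Both representations are invented for illustration; they are not the actual virtio, kernel, or hardware layouts.

```python
def to_segmented(flat, max_len):
    """Translate a flat SGL — a list of (address, length) entries —
    into one whose entries never exceed max_len bytes, splitting large
    entries. This per-request rewriting is the kind of overhead a
    stack with mismatched SGL formats pays on every I/O."""
    out = []
    for addr, length in flat:
        while length > max_len:
            out.append((addr, max_len))
            addr += max_len
            length -= max_len
        if length:
            out.append((addr, length))
    return out

# A 40KB buffer plus a 512-byte buffer, split into 4KB hardware segments.
sgl = [(0x1000, 10 * 4096), (0x100000, 512)]
hw_sgl = to_segmented(sgl, 4096)
# The translation preserves the total byte count; only the entry
# structure (and the work of building it) changes.
assert sum(l for _, l in hw_sgl) == sum(l for _, l in sgl)
```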
Bibliography
[1] PCI-SIG. Single Root I/O Virtualization. http://www.pcisig.com/specifications/iov/single_root/, Jan 2010. Accessed: 2013-04-17.

[2] Jiuxing Liu. Evaluating standard-based self-virtualizing devices: A performance study on 10 GbE NICs with SR-IOV support. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, 2010.

[3] James Bottomley and Rob Landley. SCSI Interfaces Guide. Linux Foundation, 660 York Street, Suite 102, San Francisco, CA 94110, 2007. https://www.kernel.org/doc/htmldocs/scsi/index.html. Accessed: 2013-11-02.

[4] Ismael Luceno. QEMU/Devices/Virtio. https://en.wikibooks.org/wiki/QEMU/Devices/Virtio, Dec 2012. Accessed: 2013-11-02.

[5] Rusty Russell. virtio: towards a de-facto standard for virtual I/O devices. ACM SIGOPS Operating Systems Review, 42(5):95–103, 2008.

[6] Avi Kivity and Anthony Liguori. Tuning KVM. http://www.linux-kvm.org/page/Tuning_KVM, September 2012. Accessed: 2013-11-02.

[7] Anthony Liguori. QEMU emulator user documentation. http://wiki.qemu.org/download/qemu-doc.html#disk_005fimages, Jan 2010. Accessed: 2013-11-02.

[8] QEMU/Devices/Storage. https://en.wikibooks.org/wiki/QEMU/Devices/Storage, May 2012. Accessed: 2013-11-02.

[9] Tim Bray and Russell Coker. Bonnie++ documentation. http://www.coker.com.au/bonnie++/readme.html, 1999. Accessed: 2013-10-29.

[10] Jeffrey Katcher. Postmark: A new file system benchmark. Technical Report TR3022, Network Appliance, 1997.