
High Throughput Virtualization

Susinthiran Sithamparanathan

Thesis submitted for the degree of Master in Programming and Networks

60 credits

Department of Informatics

Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

Spring 2018

High Throughput Virtualization

Susinthiran Sithamparanathan

© 2018 Susinthiran Sithamparanathan

High Throughput Virtualization

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

Abstract

Virtualization is one of the key technologies used in the era of Big Data and the Cloud. In this thesis, we look at achieving high throughput network performance with 40Gb/s Ethernet (40GbE) adapters with support for Remote Direct Memory Access (RDMA), the Single Root I/O Virtualization and Sharing specification (SR-IOV) and RDMA over Converged Ethernet (RoCE).

The study looks at the challenges, issues and achieved benefits of such SR-IOV-enabled high throughput network adapters in a virtualized environment intended for high throughput networking. The study shows that SR-IOV and RoCE are able to deliver close to bare metal and line rate network throughput. The results show that the combination of SR-IOV and TCP/IP delivers 91.3% higher bandwidth compared to paravirtualization, and that RoCE delivers 80.2% higher bandwidth than TCP/IP. The results also show that SR-IOV and TCP/IP are able to deliver an increase of 18.4% over bare metal in terms of achieved network throughput. However, the increased performance of SR-IOV does come at a cost of increased system load as well as higher memory usage, as the study will further detail.


Acknowledgements

I’d like to express my appreciation to the following people and institutions, and recognise their support:

• Simula Research Laboratory for hosting this very interesting project and providing a suitable research environment throughout the multiple project extensions.

• The University of Oslo (UiO) for offering the master’s degree program and providing a high quality study environment and facilities.

• Professor Tor Skeie for being my secondary supervisor at Ifi/UiO.

• Adjunct research scientist Ernst Gunnar Gran for giving me the opportunity to work on such an interesting topic with the mentioned resources available at the Department of Advanced Computing and System Performance (CASPER) at Simula, and for his guidance even after partially leaving Simula!

• Dr. Vangelis Tasoulas for being my primary supervisor, even after finishing his Ph.D. at Simula. Thank you very much for the time you spent giving me guidance and advice throughout the extended project time! Your help is invaluable to me!

• The administration at my former employer, the Institute of Theoretical Astrophysics at UiO, for motivating me to complete a master’s degree, and for giving me the opportunity and support to study and take the mandatory exams for the master’s program at UiO.

• My wife Kalpana, my mother Ambi and cousin Asha for your invaluable support and help while I was away from home and could not take care of our kids. No words can express how thankful I am for the countless hours you spent with our kids while I had to study, take exams and work on the master’s thesis! And our much beloved daughters Nitara and Mayraa, for letting me work on the master’s degree and spend less time with you. Now you can look forward to no longer having to ask: "Pappa, skal du til skolen idag og?" ("Daddy, are you going to school today too?") when I drop you off at home after nursery!


Contents

I Introduction

1 Motivation
  1.1 Problem Statement
  1.2 Thesis structure

2 Background
  2.1 Cloud computing
    2.1.1 Cloud Computing Characteristics
    2.1.2 Cloud Layered Architecture
    2.1.3 Cloud Deployment Models
  2.2 Virtualization
    2.2.1 Server Virtualization
  2.3 I/O Virtualization
    2.3.1 Full Virtualization
    2.3.2 Paravirtualization
    2.3.3 Direct Device Assignment
  2.4 SR-IOV
  2.5 RDMA and RoCE
  2.6 KVM
    2.6.1 How does KVM work?
  2.7 OpenStack
    2.7.1 Components
  2.8 Related works
    2.8.1 Studying Performance of 1GbE SR-IOV-enabled NIC In Virtualized Environment
    2.8.2 Studying Performance of SR-IOV In Virtualized Environment
    2.8.3 Dynamic Reconfiguration
    2.8.4 High Performance Network Virtualization
    2.8.5 Big Data and Data Protocols
    2.8.6 Accelerating OpenStack Swift with RDMA
    2.8.7 RoCEv2 At Scale

II The Project

3 Methodology
  3.1 Objectives
  3.2 Testbed
  3.3 Experiments
    3.3.1 Experiment Factors
    3.3.2 Experiment Design and Phases
    3.3.3 Experiment Tools
    3.3.4 Experiment Key Factors
    3.3.5 Data Collection and Evaluation

4 Results
  4.1 Test
    4.1.1 MTU Considerations
    4.1.2 NUMA Topology Considerations and Tuning
    4.1.3 Experiments
  4.2 Bare metal to Bare metal
    4.2.1 iPerf and Netperf
    4.2.2 RoCE
  4.3 VM to Bare metal
    4.3.1 Paravirtualized NIC
    4.3.2 Enabling SR-IOV and VFs

5 Analysis
  5.1 Different methods with bare metal
  5.2 Different methods with virtualization
    5.2.1 Paravirtualized NIC
    5.2.2 SR-IOV and VF passthrough

III Conclusion

6 Discussion and Future Work
  6.1 Evolution of the project as a whole
  6.2 Bare metal to bare metal
  6.3 VM to bare metal
  6.4 Changes in initial plan
    6.4.1 Libvirt Bug
    6.4.2 Issue with booting VM with VF PCI passthrough
  6.5 Future Work

7 Conclusion

Appendices

A System setup and configuration

B Scripts and Automation Tools

C Graphs


List of Figures

1.1 Cloud adoption 2017
1.2 Three types of cloud computing

2.1 Type 1 VM architecture (native)
2.2 Type 2 VM architecture (hosted)
2.3 How SR-IOV works
2.4 RDMA architecture
2.5 Architecture of Infiniband, RoCE and TCP/IP
2.6 KVM Guest Execution Loop
2.7 OpenStack Architecture

4.1 iPerf vs Perftest B/W
4.2 TCP/IP Paravirtualization vs Bare Metal
4.3 Adding Virtual Hardware from Virtual Machine Manager
4.4 VM Average Memory Usage: Paravirtualization (PV) vs SR-IOV (TCP/IP and RoCE)

5.1 TCP/IP Average Bandwidth for different MTU sizes
5.2 TCP/IP Bandwidth and System Load for MTU 1500 and 9000
5.3 TCP/IP Bandwidth for different MTU sizes
5.4 RoCE and TCP/IP Mean Bandwidth for all MTUs
5.5 RoCE and TCP/IP Mean System Load for all MTUs
5.6 TCP/IP IRQ Generation Server and Client
5.7 TCP/IP IRQ Affinity Core 1-7 and CPU Affinity Core 0
5.8 VirtIO TCP/IP B/W Different MTU
5.9 Paravirtualization TCP/IP IRQs generation Server and Hypervisor
5.10 B/W & System Load MTU 1500 vs 9000
5.11 CPU Context Switches Client (VM) and Hypervisor MTU 1500 vs 9000
5.12 SR-IOV TCP/IP Bandwidth for all MTU sizes
5.13 SR-IOV TCP/IP Bandwidth and System Load for MTU 1500 and 9000
5.14 SR-IOV: RoCE vs TCP/IP Average Bandwidth for MTUs
5.15 SR-IOV: RoCE vs TCP/IP Average System Load for all MTUs
5.16 Paravirtualization and SR-IOV: IRQ Generation on Hypervisor and VM
5.17 Left: % CPU load hypervisor. Right: % Fraction of CPU time used for servicing guest

C.1 Bare metal memory usage RoCE and TCP/IP
C.2 Hypervisor memory usage Paravirtualization and SR-IOV (RoCE and TCP/IP)


List of Tables

3.1 Physical Servers
3.2 Experiment Phases

4.1 Developed Scripts


Part I

Introduction


Chapter 1

Motivation

Cloud computing is a relatively recent paradigm that is changing the landscape of IT processes in many enterprises. Traditionally, IT was mainly an in-house portfolio for most businesses. It has been predicted that cloud computing would grow [1], that it would become the preferred choice for many businesses, and that businesses should develop a strategy for which workloads could be moved out to the cloud and which should be kept in-house [2][3]. In recent years this has indeed become the trend, as businesses have moved out workloads that are suitable to run in the cloud. Today, cloud computing is among the preferred technologies for many academic environments, enterprises and service providers. Service providers such as Amazon, Google, Salesforce and Microsoft have already established data centers in various locations around the world for the purpose of delivering cloud computing services to the public with redundancy and reliability.

Cloud computing mainly delivers computing resources on the following levels [4]:

• software: such as Dropbox, offering a software service to store and sync files.

• platform: such as Microsoft Azure or Google App Engine, enabling tenants to run their applications.

• infrastructure: such as Amazon Web Services (AWS), delivering infrastructure services (storage, compute, network etc.).

As cloud computing has become well established and mature, enterprises are not waiting to benefit from the services it offers, whether in the form of public, private or hybrid cloud. A survey by IDG [5] in 2016 showed that enterprise companies (with more than 1000 employees) had already moved 45% of their applications and computing infrastructure to the cloud, and that these companies anticipated having 60 percent of their total IT environment in a mix of public, private and hybrid clouds by 2018. Another survey, by RightScale [6] in 2017, showed that enterprises run 32 percent and 43 percent of their workloads in public and private clouds, respectively. The same survey showed that hybrid cloud is the trend and the preferred strategy for decision makers among enterprises. As we can see in figure 1.1, taken from the survey, 95 percent of the respondents were taking advantage of cloud computing in 2017. Yet another report, from Technology Business Research [7], estimates that total business spending on private cloud will grow to $69 billion by 2018, a compound annual growth rate (CAGR) of 14 percent from 2014.

This is not surprising, as cloud service providers allow tenants to scale as demand increases, letting tenants start with smaller workloads. With such adoption rates, we need more research and studies on the subject of I/O and high speed networking. The network of a cloud is the tenants’ highway into the cloud resources, and it is what pushes the tenants’ data in and out of the cloud. Resources such as CPU, memory and storage can be negotiated at an SLA level with QoS; this way, tenants’ applications are given a means of guarantee for the available resources. Therefore, it is critical that the high speed network adapters used in cloud infrastructure do not take up an unfair amount of resources, such as CPU and memory, in order to perform as close as possible to their design specifications.


Figure 1.1: Cloud adoption 2017

Many types of cloud computing exist today. The most common cloud types are public, private and hybrid cloud [4]. These three types, as well as some other types of cloud, are further explained in chapter 2 under subsection 2.1.3.

Figure 1.2 illustrates the three different cloud types. A typical characteristic of a public cloud is that it is provided by a third-party commercial provider over the Internet to many tenants. The tenants share the resources (CPU, network and storage) within the cloud, and the providers have infrastructure with data centers and servers to provide such a cloud service. Examples of public cloud providers are Amazon, Google, Salesforce and Microsoft. Private clouds are typically dedicated cloud computing resources within a business’ private network and are managed either by the company’s IT department or by an external cloud provider. A hybrid cloud is a mix of the former two. Typically, it can be a strategic business choice to have some applications running in a public cloud, while keeping the others inside a private cloud with the business’ own infrastructure. There are multiple factors that need to be taken into consideration when deciding which applications to run in the different cloud types. These factors involve, but are not limited to, laws, regulations and the business model.

Private clouds are deployed by research environments and IT businesses around the world. This type of cloud provides researchers with a testbed for their research, and businesses with an infrastructure to develop their applications on. As of this writing, there are several orchestration tools (also known as cloud frameworks) available as Open Source Software (OSS) [8], with OpenStack being a mature one.

OpenStack has gained significant popularity and support from many leading technology companies, and it has become the de facto standard for open source IaaS cloud deployments. OpenStack consists of well known open source components; it started as a NASA and Rackspace project back in 2010 and was adopted by Ubuntu Linux developers in 2011. Today, OpenStack lists over 200 companies and organizations as members, sponsors and supporters of the OpenStack Foundation on its website [9]. Among the most notable are the founders Rackspace and NASA. Other supporters include names such as AT&T, HP, Dell EMC, IBM, Canonical, Red Hat, Cisco and SUSE.

Figure 1.2: Three types of cloud computing [4]


Cloud provider services are layered in the following SPI (SaaS, PaaS, IaaS) model:

• Software as a Service (SaaS)

• Platform as a Service (PaaS)

• Infrastructure as a Service (IaaS)

These are layered on top of each other, with IaaS at the bottom. On top of IaaS reside PaaS and SaaS, with the latter on top. The enabling technologies at the core of cloud computing are virtualization, networking and storage. As the SPI layers are a combination of the building blocks of clouds, hardware and software, issues like availability, performance and efficiency have to be taken into consideration. Specifically, as both PaaS and SaaS reside on top of IaaS, their performance and Quality of Service (QoS) are tightly coupled with the performance and QoS of IaaS. As applications running in the cloud have become more data-intensive [1], the underlying infrastructure now has to move data between VMs at the bandwidth required by the SLA agreement between the tenants and the cloud provider. Therefore, we will deploy such an IaaS cloud using OpenStack with 40GbE network adapters in order to study how it might perform in such a scenario.

In order to achieve a higher degree of efficiency and server (and hardware) consolidation, virtualization technology is heavily used in cloud deployments. Physical resources like processing units (CPU/GPU), memory, storage and networks are virtualized and shared between virtual machines in order to achieve a high degree of resource utilization and efficiency.

It is not uncommon to host a significant number of VMs on a physical host in an attempt to increase data center resource utilization. In such a scenario, the VMs share processing units and Input/Output (I/O) interfaces, which in turn might affect the overall performance of the cloud in terms of computational power and communication (network). An undesired effect of this could be loss of availability of the services.

The processing capabilities of today’s processing units are very high, and with powerful dynamic resource allocation techniques such as dynamic memory allocation [10], the main performance consideration is I/O, and specifically networking [11], in order to provide high QoS. This strongly emphasizes the significance of the network in today’s cloud infrastructure. Physical hosts and VMs within a data center depend on the QoS provided by the network layer to perform at a satisfactory level. Networks within a cloud can be both physical and virtualized. Cloud computing is highly parallel and distributed in nature, and resources are shared through physical and virtual networks. As VMs reside on different physical hosts, they are managed through the networks, and communication between physical hosts and VMs also occurs over the network, be it physical or virtual. The reliance upon the network in such cloud environments ultimately affects the overall QoS, which is again directly related to the performance and QoS of the networks. A VM uses a virtualized version of the physical Network Interface Card (NIC), hereafter called network adapter, present in the host server. There are multiple techniques to expose a network adapter from a physical host to a VM, which will be discussed later in this thesis, but the network I/O in and out of a VM will pass through this virtualized network adapter inside the VM. As of this writing, it is quite common to see 10GbE adapters used between physical hosts in cloud deployments, as well as 1GbE network adapters. As today’s high end processors have a significant number of cores and can address a vast amount of memory, the density of hosted VMs within a physical host is increasing. This also means the physical host’s network adapter needs to process a higher volume of network traffic/packets. Without paying close attention to the networking capacity of a virtualization host, it can quickly become the bottleneck of the cloud deployment.

Emulation is one of the techniques used to expose a network adapter from the physical host to the VMs. Using emulation for an interface means simulating the complete interface fully in software, hence adding significant processing overhead as well as higher resource utilization to the Virtual Machine Monitor (VMM) [12][13][14][15]. Paravirtualization and full virtualization are two other approaches to expose interfaces from the hypervisor to a VM. With paravirtualization, the VM does not emulate/simulate hardware, but the guest OS is required to be modified. Full virtualization, on the other hand, requires no such modification of the guest OS and can take advantage of built-in hardware support from Intel (VT-x) and AMD (AMD-V) processors. With the above mentioned ways of exposing interfaces from a hypervisor to a VM, there is a significant amount of hypervisor intervention involved [11] in the data processing of the I/O devices, which also affects the running VM. Such overhead can saturate the processor in a high speed network, such as 40 Gigabit Ethernet (GbE), impairing the overall system performance.

In order to minimize hypervisor intervention and keep the overhead at a minimal level, several techniques have been proposed, such as PCI passthrough and SR-IOV. These techniques allow the VMs to directly access the physical I/O resource without the need for emulation by the hypervisor [11]. PCI passthrough refers to a PCI device being directly and exclusively exposed to one dedicated VM and its guest OS. PCI passthrough yields a significant increase in performance, close to what we can get with the native device, while at the same time minimizing the processing overhead introduced by device virtualization. But the major downside of exposing a device to a VM using PCI passthrough is that only one VM can utilize the device at a time and the physical I/O resource cannot be shared between VMs, hence it is not a scalable solution. Servers have a finite number of PCI Express (PCIe) lanes and slots, hence the number of VMs utilizing PCI passthrough for devices will be limited. Additionally, direct assignment of a physical I/O resource to a VM is an issue for live migration [16], which is a key characteristic of the cloud.

SR-IOV is an extension to the PCIe standard that allows VMs to directly access shared I/O devices without intervention from the hypervisor. It is similar to PCI passthrough in the way that an I/O device is made available to the VM directly. While with PCI passthrough only one VM can be given exclusive direct access to the I/O device, the SR-IOV extension adds the (huge) benefit of allowing the device to be used by multiple VMs simultaneously. This way of passing through an I/O device yields performance benefits [17], but requires SR-IOV-enabled hardware. A single SR-IOV-capable PCIe device exposes multiple endpoints that are lightweight PCI functions called virtual functions (VFs). Each VF can then be passed to a VM in a virtualized environment. SR-IOV not only improves performance by bypassing hypervisor intervention, but also addresses the sharing and scalability limitation we mentioned for the PCI passthrough technique. One of the interesting questions is whether an SR-IOV-enabled I/O device delivers this performance out of the box, without any further system tuning, in a virtualized environment.

1.1 Problem Statement

In a multi-core virtualized cloud environment, applications that use TCP or UDP are processed by the CPU, and they all need to wait in line with other applications and system processes for their turn to use CPU cycles. In addition to the negative network performance impact this creates, those CPU cycles would be better utilized for the tenants’ workloads.


High throughput network adapters such as 40GbE adapters move data at 40 gigabits per second, which at small frame sizes corresponds to tens of millions of packets per second. One major problem in such a scenario is that if the CPU cores have to be involved in processing every packet, it will firstly lead to higher latency and reduced network throughput, but it will also adversely affect the performance of the VMs the hypervisor is hosting. When there is no data channel steadily available to service the network I/O, a system becomes less predictable with regard to when the CPU will be available to the tenant’s workload and when it will have to dedicate processing power to servicing the network I/O data.
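To make these packet rates concrete, the short calculation below (an illustrative sketch, not taken from the thesis) estimates the theoretical maximum frame rate of a 40Gb/s Ethernet link for a few frame sizes, assuming the standard 20 bytes of per-frame preamble and inter-frame gap on the wire.

```python
# Illustrative calculation: theoretical max frame rate on a 40 Gb/s Ethernet link.
# Assumes standard Ethernet wire overhead: 7 B preamble + 1 B SFD + 12 B inter-frame
# gap = 20 B, and frame sizes that already include the 18 B of header + FCS.

LINE_RATE_BPS = 40e9          # 40 Gb/s
WIRE_OVERHEAD_BYTES = 20      # preamble + SFD + inter-frame gap

def max_frames_per_second(frame_bytes: int) -> float:
    """Frames per second at line rate for a given on-wire frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return LINE_RATE_BPS / bits_on_wire

for label, frame in [("64 B (minimum)", 64),
                     ("1518 B (MTU 1500)", 1518),
                     ("9018 B (MTU 9000)", 9018)]:
    print(f"{label:>18}: {max_frames_per_second(frame) / 1e6:6.2f} Mpps")
```

Even with jumbo frames (MTU 9000) the adapter has to handle roughly half a million frames per second at line rate, and over three million at the default MTU of 1500, which illustrates why per-packet CPU involvement quickly becomes the limiting factor.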

Based on the discussion in the current chapter, the research questions (RQs) for this study are:

• RQ1: What are the challenges of using an SR-IOV-enabled network adapter in a virtualized environment?

• RQ2: What are the challenges and issues of deploying VMs for high performance and high throughput networking?

• RQ3: How can we achieve close to 40Gb/s without creating significant load on the CPU cores?

• RQ4: What are the considerations to be made when deploying VMs for high throughput networking in a cloud environment?

1.2 Thesis structure

The layout of this thesis is as follows. The background chapter (2) comes after this introduction; there, related works and literature are collected, and it aims to give a brief overview of the different tools used throughout the project, as well as relevant technologies. The methodology chapter (3) then explains the objectives and methods of the study and describes the project plan, including some important parameters and calculations related to the method. It is followed by the results and analysis chapters (4 and 5), where the actual results are presented and analyzed in detail. In the discussion chapter (6), an overall evaluation of what has been done, problems encountered and future work are discussed. The conclusion comes at the end, where the questions asked in the problem statement section (1.1) are answered based on the findings in this study and the knowledge acquired. Finally, an appendix with URL links to all of the important scripts created, some configuration details of the NIC, as well as some graphs can be found at the very end of this document.


Chapter 2

Background

2.1 Cloud computing

In the midst of major trends like Big Data and IoT, cloud computing is a critical technology needed to process, move and store data. Recall from chapter 1 that several surveys showed that cloud computing is considered mature enough for businesses to adopt the technology as part of their IT portfolio. Cloud computing is known as the sixth paradigm of computing technology, as shown in figure ?? (retrieved from Voas and Zhang).

The infrastructure behind a cloud consists of elastic pools of a large number of servers connected through a network. Resources in the cloud are elastic in the sense that they can be dynamically allocated to a tenant on demand and released after usage, then becoming available to other tenants of the same cloud. The elastic resources in a cloud environment are of the following types:

• Processing units

• Storage

• Networking

• Applications

The cloud infrastructure contains, in addition to the physical layer of hardware resources, an abstraction layer of software that is deployed across the physical layer. By offering such elastic resource allocation to the tenants, the cloud eliminates the initial cluster setup cost and time [19].

Today, there are many definitions of cloud computing, but according to Mell et al., cloud computing is:

“a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” [20]

From Mell et al.’s definition of the cloud and various others [21][22][23], it can be understood that cloud deployments aim to achieve highly scalable and available on-demand computing services delivered through the Internet.

2.1.1 Cloud Computing Characteristics

The cloud computing paradigm has brought the following characteristics and features [4][24][20][1]:

• On-demand self service: Tenants’ resource needs are automatically taken care of without manual intervention by the service provider.

• Broad network access: Services are accessible over the Internet from client platforms such as mobile phones, tablets, laptops or desktops. Additionally, high network performance and localization are achieved by cloud service providers by leveraging the geo-diversity of data centers around the world.

• Resource pooling: The provider’s computing resources are pooled to serve multiple tenants in a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. The tenant generally has no control over or knowledge of the exact location of the provided resources, but may be able to specify location at a higher level of abstraction (e.g., country, state or data center). Examples of resources include storage, processing, memory and network bandwidth.

• Rapid elasticity: Resources are automatically scaled and released based on demand. Resource capabilities appear unlimited to the tenant at any time.

• Measured service: A pay-per-use pricing model is employed. Resource usage is monitored and tenants are billed according to their resource usage only.

Amazon, a leading public cloud provider, offers elastic resources within its Amazon EC2 tier. A tenant utilizing a virtual machine within Amazon EC2 can, whenever needed, reconfigure elastic resources as mentioned above and then release the additional resources when they are no longer needed. In other words, scaling of resources in the cloud is reversible, in contrast to adding hardware resources to a bare metal server, which is typically done once and never removed. This is covered by resource pooling, on-demand self service and rapid elasticity. Access to the tenant’s virtual machine and resources is provided over the Internet, covering the broad network access part of the cloud characteristics. Since the service is measured, the "usual" utilization of the initial resources as well as the extra resource additions will be added to the tenant’s invoice.

2.1.2 Cloud Layered Architecture

In section 2.1, we briefly described what a cloud infrastructure looks like. The architecture of a cloud computing environment is divided into four layers [24][4]:

• the hardware/datacenter layer: consists of the physical resources of the cloud within a data center, such as physical servers, routers/switches, cabling, power and cooling systems.

• the infrastructure layer: a pool of virtualized resources created on top of the physical resources.

• the platform layer: operating systems and application frameworks built on top of the virtualized infrastructure layer.

• the application layer: the actual cloud applications residing on top of the platform layer and used by the tenants.

For the delivery of cloud services, a service-driven business model is employed [4][24][20]. As mentioned in chapter 1, these services can be grouped into three categories:

• Infrastructure as a Service (IaaS): refers to scalable computing resources such as VMs, servers, storage, load balancers and networks provided to the tenants so that they can deploy and run arbitrary software such as operating systems and applications.

• Platform as a Service (PaaS): refers to computing platform services such as operating system, programming language execution environment, database and web server that are readily available to the tenants.

• Software as a Service (SaaS): refers to the delivery of applications over the Internet that run in the cloud. There is no burden on the tenant to manage or control the underlying cloud infrastructure; that is all taken care of by the service provider. Today, smart mobile phones are widely used, and they typically run "apps", which are a great example of this type of cloud software delivery. While an app’s front end resides locally on the phone, the back end resides in the cloud.

Each layer above IaaS abstracts the details of the underlying layer. IaaS being at the bottom means it is the layer that the upper layers of PaaS and SaaS depend on. This thesis intends to study the network performance of IaaS, as it is a critical layer of the architecture.

In addition to the above-mentioned cloud "aaS" architecture, there are also other models such as Database as a Service (DBaaS) [25] and even the concept of Everything as a Service (XaaS) [26]. XaaS was mentioned by Armbrust et al. already in 2009 [27].

2.1.3 Cloud Deployment Models

According to [20] and other research papers there are four types of cloud deployment models [4][24][26][25], but there are also other emerging deployment types [28].

Public cloud is a type of cloud where computing resources are dynamically provisioned off-site by a third-party provider. The computing resources are present in the cloud service provider’s data center and are shared with the tenants in a multi-tenant architecture. The services offered by a public cloud vary from infrastructure and storage to applications, delivered to the general public over the Internet. Some well-known examples of public clouds include Amazon Elastic Compute Cloud (EC2), Google AppEngine and the Windows Azure Services Platform.

Public clouds enable enterprises to cut the initial investment in hardware and software, thus reducing the economic risk. As elasticity is one of the main characteristics of the cloud, an enterprise can carefully start with a small amount of resources and grow its resource utilization based on demand. However, not every type of workload is well suited to be put into the public cloud. As the public cloud is operated by a third party off-site, security and privacy, along with laws and regulations, are of high concern. Not to mention the connectivity, which depends solely on the availability of the Internet on both sides, the provider’s and the tenant’s.

Private cloud is a type of cloud intended for internal use by an enterprise. In contrast to public clouds, private clouds are not publicly available to everyone. The cloud infrastructure can either be owned by and be on premises within the enterprise, or be off premises at a cloud provider and exclusively reserved for usage by the enterprise. In both cases, the management can be done by the enterprise’s IT department, by a third party, or a combination of both. On-premises private cloud deployment has the huge benefit of giving the enterprise security and privacy, enabling it to comply with required laws and regulations. As of today, there are several popular private cloud platforms available: OpenStack (discussed in detail in section 2.7), CloudStack, Eucalyptus, OpenNebula, VMware vCloud Suite and Microsoft Azure Stack.

Hybrid cloud is a type of cloud that combines the two above-mentioned types. As such, a hybrid cloud can be a good trade-off between the limitations of a private cloud and the security and privacy issues of a public cloud. As we could see from some of the recent surveys, this type of cloud is on the rise and a trend as of today.

Community cloud is a type of cloud that differs slightly from public cloud, although it is used by several constituencies. The definition of a community cloud is:

“The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.” [20]

As such, a community cloud involves cooperation and integration of IT infrastructure and resources from multiple organizations. It requires interoperability and compliance between the participating organizations and their resources, including identity and access management (IAM). One example is the community cloud shared between scientists from different organizations at the CERN Large Hadron Collider (LHC).

Inter cloud, also known as federated or multi cloud, is a type of cloud that provides a basis for provisioning heterogeneous multi-provider resources for various workloads on demand with respect to QoS [28]. It aims to provide seamless integration of public clouds from different providers.

2.2 Virtualization

Virtualization is one of the key technologies leveraged by the data centers delivering cloud computing services, and it has virtually transformed today’s data centers. Within the concept of virtualization lies the ability to emulate another computer system, in software, on the same hardware. However, the concept of virtualization is not new; it has existed since the 1960s [12][29][30][31]. At that time, more powerful hardware, called mainframes, was used with hypervisors to partition and isolate each VM running simultaneously within one mainframe. Fast forwarding to 2005, as enhancements in hardware technology steadily improved, hypervisors gained traction in academia and industry.

Today, the term "virtualization" has become ambiguous [32]. For instance, mobile device emulators are a type of virtualization, due to the fact that the OS is running on emulated hardware, hence removing the OS’s binding to the hardware [31]. In this study, we look at virtualization in the context of cloud computing and data centers. There are multiple definitions of the term "virtualization" [33][34][35]. One definition, by Sahoo et al., is:

“Virtualization is a technology that introduces a software abstraction layer between the hardware and the operating system and applications running on top of it”.

The objectives of virtualization technology [36][34][37][38][39] are to:

• Add an abstraction layer between the application and the hardware


• Enable consolidation and reduce cost as well as complexity.

• Provide isolation of computer resources for improved reliability and security

• Improve service level as well as the QoS

• Better align IT processes with business goals

• Eliminate redundancy in, and maximize utilization of, IT infrastructure

There are numerous approaches to virtualization [31][33], such as mobile, data, memory, Virtual Desktop Infrastructure (VDI), storage, network and application virtualization, but this study will focus on server and I/O virtualization in particular. These approaches can be classified into three categories:

• Infrastructure virtualization: network and storage

• System Virtualization: server and desktop (VDI)

• Software Virtualization: application and high level language

2.2.1 Server Virtualization

In order to maximize resource utilization and efficiency, a physical server can be used to run multiple operating systems in isolation, independently of each other. This is generally the most common form of virtualization known today, and when people use the term "virtualization", they usually refer to server virtualization. It hides the physical characteristics of computing resources, such as CPU, memory and storage, from the software running on them and the entity using them.

There are also multiple definitions of server virtualization [31].

Common types of server virtualization are as follows [36][33][31]:

• Hardware virtualization, aka HVM

• Paravirtualization, aka PVM

• Operating system virtualization, aka containers

Hardware Virtualization: Also known as Hardware Virtual Machines (HVMs), this is a virtualization technique that relies on special hardware support to achieve computer virtualization. An HVM works by intercepting privileged calls from a VM and handing these calls to the hypervisor. The hypervisor decides how to handle the call, ensuring security, fairness and isolation between running VMs. The use of hardware to trap privileged calls from the VMs allows multiple unmodified OSes to run. This provides tremendous flexibility, as system administrators can now run both proprietary and legacy OSes in the VMs. In 2005, the first HVM-compatible CPU became available, and as of 2012 nearly all server-class and most desktop-class CPUs support HVM extensions. Both Intel and AMD implement HVM extensions, referred to as Intel VT-x and AMD-V, respectively. Since HVMs must intercept each privileged call, considerably higher overhead can be experienced than with PVMs. This overhead can be especially high when dealing with input/output (I/O) devices such as the network card, a problem that has led to the creation of paravirtualization drivers. Paravirtualization drivers such as the VirtIO package for KVM allow a VM to reap the benefits of HVMs, such as an unmodified OS, while mitigating much of the overhead. Examples of HVM-based virtualization systems include KVM and VMware servers.
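As a concrete illustration (not part of the original text), a Linux host’s support for these HVM extensions can be checked by looking for the vmx (Intel VT-x) or svm (AMD-V) flags in /proc/cpuinfo; the minimal sketch below assumes a Linux host.

```python
# Minimal sketch: check whether the host CPU advertises hardware virtualization
# extensions (Intel VT-x -> "vmx", AMD-V -> "svm"). Assumes a Linux host with /proc.

def hvm_support() -> str:
    with open("/proc/cpuinfo") as f:
        flags = {
            flag
            for line in f
            if line.startswith("flags")
            for flag in line.split(":", 1)[1].split()
        }
    if "vmx" in flags:
        return "Intel VT-x"
    if "svm" in flags:
        return "AMD-V"
    return "no HVM extensions found"

print(hvm_support())
```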

Paravirtualization: Also known as Paravirtualized Machines (PVMs), this is a virtualization technique that was the first form of full computer virtualization and is still widely deployed today. The roots of paravirtualization run very deep indeed, with the first production system, known as VM/370, created by IBM and available in 1972, many years before paravirtualization became a mainstream product. The VM/370 was a multi-user system which provided the illusion that each user had their own operating system. Paravirtualization requires no special hardware and is implemented by modifications to the VM’s operating system. The modifications instruct the operating system (OS) to access hardware and make privileged system calls through the hypervisor; any attempt to circumvent the hypervisor will result in the request being denied. Modifying the OS does not usually create a barrier for open source OSes such as Linux; however, proprietary OSes such as Microsoft Windows can pose a considerable challenge. The flagship example of a PVM system is Xen, which first became an open source project in 2002. The Xen hypervisor is also a keystone technology in Amazon’s successful cloud service EC2.

Operating System Virtualization: There is also a third class of virtualization, known as container virtualization or OS virtualization, which allows each user to have a secure container and run their own programs in it without interference. It has been shown to have the lowest overhead when compared to PVMs and HVMs. This low overhead is achieved through the use of a single kernel shared between all containers. Such a shared kernel does have significant drawbacks, however, in that all users must use the same OS. For many architectures, such as public utility computing, container virtualization may not be applicable, as each individual user wants to use their own operating system in their VM; it is therefore not the focus of this thesis [40].

The management of VMs in such a virtualized platform is done by a hypervisor or VMM. There are two types of hypervisor architecture, as illustrated in figures 2.1 and 2.2:

• Type 1 (also known as native): The VMM or hypervisor runs directly on top of the hardware, with the VMs or guest OSes, be it Windows or Linux, running above the hypervisor. The applications run inside each VM, forming the layer above the VMs, as seen in figure 2.1. From the same figure we can see that since the hypervisor runs directly on top of the hardware, there is no additional OS layer between them, hence yielding better performance. However, hardware support can be an issue. Examples of this Type 1 virtualization architecture are Kernel-based Virtual Machine (KVM), VMware ESXi, Xen and Microsoft Hyper-V.

• Type 2 (also known as hosted): In this virtualization architecture, the hypervisor sits on top of an OS that controls the hardware resources. The VMs run on top of the hypervisor and the applications on the top layer, as seen in figure 2.2. Because the hypervisor runs on top of another OS, a performance penalty with this type of virtualization architecture is inevitable. Seen from a VM, access to hardware resources goes through two OSes. On the other hand, the hardware support is as good as the hardware supported by the OS running the hypervisor. Examples of this Type 2 virtualization architecture are KVM, VMware Workstation and VirtualBox.


Figure 2.1: Type 1 VM architecture (native)

Figure 2.2: Type 2 VM architecture (hosted)

Figures 2.1 and 2.2 illustrate the architectural differences between the above-mentioned hypervisor types.

Note that KVM is listed as both a Type 1 and a Type 2 hypervisor. KVM is a kernel module in Linux that supports hardware virtualization, hence it depends fully on the Linux kernel. However, given a bare minimal installation of a Linux distribution of any kind, one might argue that adding the KVM module makes it a Type 1 hypervisor.

2.3 I/O Virtualization

The rate at which data can flow from one device or server to another, commonly referred to as I/O for Input and Output, is becoming a bottleneck in the ever growing virtualized infrastructure. In a virtualized cloud environment, the I/O devices, such as NICs or storage devices (e.g. HDD/SSD), must be shared between the VMM and the VMs. It is the responsibility of the VMM to expose the needed I/O device to the running VMs being hosted, and at the same time provide isolation and security for device access routing between the VMs and the physical I/O devices. When such an I/O device is virtualized and exposed to a VM, the VMM must be able to intercept all I/O operations that are issued by the guest OS running inside the VM, in order for the VMM to execute those I/O operations on the physical I/O device. As such, these I/O operations are trapped and handled by the privileged VMM on behalf of the guest OS. Such interceptions by the VMM inevitably create overhead. The overhead issue is even more significant when a high speed network adapter is virtualized and shared between VMs, because of the high rate of packet arrivals and departures. By virtualizing I/O devices we achieve multiplexing and demultiplexing, isolation, portability and interposition [16].

With the increasing processor core counts and higher addressable memory of today’s server hardware, VM density is increasing and is consolidating more I/O traffic onto the servers used as virtualization hosts. If the I/O capacity is not sufficient, it could limit all the gains brought about by the virtualization process.

Further, I/O virtualization can be divided into full virtualization, paravirtualization and direct device assignment [14][41][42]. The following subsections explain these three types of I/O virtualization.

2.3.1 Full Virtualization

Full virtualization provides a complete simulation of the underlying hardware, enabling software that can run on the physical hardware to run inside the VM. It is also referred to as emulation or software emulation, as there is an emulation layer sitting between the VM and the underlying hardware. This type of virtualization has the widest range of support for guest OSes.

Pros:

• Offers a higher level of flexibility, as the guest OS need not be modified

• Provides complete isolation between VMs, and between VMs and the VMM

• Provides near-native performance


Cons:

• The on-the-fly translation of instructions from the guest OS to the host OS (hypervisor) causes significant performance degradation

• Complex on x86 architecture as not all privileged instructions can be trapped.

2.3.2 Paravirtualization

Paravirtualization uses a technique that provides a partial simulation of the underlying hardware. One of the key features is address space virtualization, offering each VM its own unique address space. Most of the hardware features are simulated, although not all.

Pros:

• Easier to implement compared to full virtualization

• Provides complete isolation between VMs, and between VMs and the VMM

• Provides high performance for network and disk I/O when no HW assistance is available

Cons:

• Guest operating systems inside the VMs need modification

• Lack of backward compatibility and low portability

2.3.3 Direct Device Assignment

Also called device passthrough, direct assignment, PCI (device) passthrough and Direct (Access) I/O, these terms all refer to the technique of assigning a physical PCI device to a specific VM so that it can directly access the physical resource without intervention of the VMM [43]. The VM will, using its own device driver, be able to communicate with the device without requiring a device driver in the VMM. The I/O Memory Management Unit (IOMMU) is a hardware unit that enables the mapping of device DMA addresses inside the VM to physical memory addresses [43]. Additionally, IOMMUs provide another significant security benefit in the form of isolation for VMs with direct device access.
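As an illustrative sketch (assuming a Linux host with the IOMMU enabled, for example via the intel_iommu=on or amd_iommu=on kernel parameters), the devices eligible for direct assignment can be inspected through the IOMMU groups the kernel exposes under /sys/kernel/iommu_groups:

```python
# Minimal sketch: list IOMMU groups and their member PCI devices on a Linux host.
# An empty /sys/kernel/iommu_groups usually means the IOMMU is disabled in
# firmware or on the kernel command line (e.g. missing intel_iommu=on).
import os

IOMMU_ROOT = "/sys/kernel/iommu_groups"

if not os.path.isdir(IOMMU_ROOT) or not os.listdir(IOMMU_ROOT):
    print("No IOMMU groups found - IOMMU support appears to be disabled.")
else:
    for group in sorted(os.listdir(IOMMU_ROOT), key=int):
        devices = os.listdir(os.path.join(IOMMU_ROOT, group, "devices"))
        print(f"IOMMU group {group}: {', '.join(sorted(devices))}")
```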


As mentioned above, both full virtualization and paravirtualization have significant overhead, using software to emulate devices to the VMs. Since direct device assignment bypasses the VMM for its operation, the overhead is significantly reduced [14] compared to full or paravirtualization. Today, CPU offerings from both Intel (Intel VT-d) and AMD (AMD IOMMU) come with hardware support for direct device assignment [44][45]. The following are the advantages and disadvantages of direct device assignment.

Pros:

• Reduced intervention by the VMM, hence less overhead compared to full virtualization and paravirtualization

• Better security by providing isolation of mapped memory regions for devices and VMs.

• Device driver not required at VMM level.

Cons:

• One device can only be used by one VM

• The number of PCIe slots available in a server is limited

2.4 SR-IOV

Compared to native hardware performance, the I/O performance of virtualized environments has been significantly worse and can quickly become the bottleneck of such a system. The I/O performance of virtual machines has suffered because high performance I/O is enabled by the I/O device’s ability to perform direct memory access (DMA), whereby the I/O device can write directly to the VMM’s memory without interrupting the VMM’s CPU. From within a VM, however, DMA is more complex, due to the fact that the memory address space inside the VM is not the same as the real memory space of the VMM. Every DMA transfer triggered by a VM requires the VMM’s intervention for VM-to-host memory address translation. For this, the VM uses interrupts against the VMM’s CPU so that the VMM can perform the address translation. Further, when there are multiple VMs being hosted by the same VMM, the VMM also has to act as a virtual network switch when the I/O is bound to a network adapter, ultimately leading to higher latency for such I/O operations.


As we have seen earlier, there exist multiple techniques to virtualize and share an I/O device with the VMs from a VMM. We have mentioned the full and paravirtualization techniques as well as device passthrough, each with its pros and cons. However, hardware vendors such as Intel and AMD have been providing increased support for virtualization within their hardware, with the IOMMU being an example of such. As part of continuous research and development to further minimize the overhead involved and improve the sharing capabilities of I/O devices in a virtualized environment, an extension to the PCIe standard was introduced by PCI-SIG [46]. As mentioned previously, the new technique is called SR-IOV and allows VMs to directly access shared I/O devices without intervention from the VMM, hence contributing to a reduction in overhead. With the standardization and adoption of SR-IOV, virtualized cloud environments have gained a desired increase in terms of high performing I/O virtualization. Figure 2.3 is an illustration of how SR-IOV works.

Figure 2.3: How SR-IOV works

With the SR-IOV specification [46], two new function types were introduced, namely Physical Functions (PFs) and Virtual Functions (VFs):

• PFs: These are full PCIe functions that include the SR-IOV Extended Capability. The capability is used to configure and manage the SR-IOV functionality.

• VFs: These are lightweight PCIe functions that have just enough resources for supporting data movement. Each VF appears as a separate PCI function that can be assigned to a VM; a minimal sketch of how VFs are typically enabled on a Linux host is shown below.
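The sketch below uses the standard sriov_numvfs sysfs interface to enable VFs on the physical function; the interface name enp4s0 is a placeholder for an SR-IOV-capable adapter, not the thesis testbed, and the commands require root privileges.

```python
# Minimal sketch: enable SR-IOV Virtual Functions on a Linux host via sysfs.
# "enp4s0" is a placeholder interface name; run as root on an SR-IOV-capable NIC.
from pathlib import Path

IFACE = "enp4s0"                      # placeholder: the SR-IOV-capable PF
dev = Path(f"/sys/class/net/{IFACE}/device")

total = int((dev / "sriov_totalvfs").read_text())   # max VFs the PF supports
print(f"{IFACE} supports up to {total} VFs")

# Request 4 VFs (must be <= sriov_totalvfs); writing 0 first clears any old value,
# since the kernel rejects changing a non-zero VF count directly.
(dev / "sriov_numvfs").write_text("0")
(dev / "sriov_numvfs").write_text("4")

# Each VF now appears as its own PCI function (virtfn0, virtfn1, ...) that can be
# handed to a VM, e.g. with KVM/libvirt PCI passthrough.
for vf in sorted(dev.glob("virtfn*")):
    print(vf.name, "->", vf.resolve().name)
```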

2.5 RDMA and RoCE

RDMA [47][48][49][50] as a technology has been utilized in InfiniBand intra data center networks for quite some time, where low latency and high throughput are key requirements. For instance, in the research field, when scientists run code that requires a high degree of parallelism using the Message Passing Interface (MPI), low latency is critical, as MPI passes small messages back and forth between the nodes in a large-scale cluster. This is in contrast to TCP/IP-based network communication, which requires copy operations, causing higher latency, increased CPU utilization and higher memory usage. As of the time of writing, RDMA is supported by the following protocols:

• InfiniBand (IB): a network protocol which has supported RDMA natively from the beginning. Often used in HPC environments where low latency and high throughput are requirements.

• RDMA over Converged Ethernet (RoCE): a network protocol that allows performing RDMA over Ethernet networks and existing Ethernet infrastructure.

• Internet Wide Area RDMA Protocol (iWARP): a network protocol that allows performing RDMA over the Transmission Control Protocol (TCP). iWARP can be seen as the competing protocol to RoCE, although it initially had the ability to work over Wide Area Networks (WANs).


Figure 2.4: RDMA architecture


Figure 2.5: Architecture of Infiniband, RoCE and TCP/IP

Figure 2.4 illustrates the RDMA architecture, while figure 2.5 illustrates the different stacks of InfiniBand, RoCE and TCP/IP. The former two protocols are both based on RDMA, while the latter is not. While Direct Memory Access (DMA) is the ability of a device to access host memory directly without the intervention of the CPU, RDMA is the ability to access memory on a remote system without interrupting the processing of the CPU(s) on that system, effectively bypassing the remote system’s operating system kernel and CPU.

In brief, RDMA offers the following advantages (a small sketch of inspecting a host’s RDMA devices follows the list):

• Zero-copy: applications can perform data transfers without the involvement of the network software stack. Data is sent and received directly to the buffers without being copied between the network layers.

• Kernel bypass: applications can perform data transfers directly from user-space without kernel involvement.

• No CPU involvement: applications can access remote memory without consuming any CPU time in the remote server. The remote memory will be read without any intervention from the remote process (or processor). Moreover, the caches of the remote CPU will not be filled with the accessed memory content.
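As a small illustration (assuming a Linux host with the RDMA stack loaded), the RDMA devices present on a node and the link layer they run over can be read from sysfs; a RoCE device reports an Ethernet link layer, whereas a native InfiniBand HCA reports InfiniBand:

```python
# Minimal sketch: list RDMA devices and their link layer on a Linux host.
# RoCE devices report "Ethernet" as the link layer, native IB HCAs report "InfiniBand".
from pathlib import Path

for dev in sorted(Path("/sys/class/infiniband").glob("*")):
    for port in sorted((dev / "ports").glob("*")):
        link = (port / "link_layer").read_text().strip()
        rate = (port / "rate").read_text().strip()
        print(f"{dev.name} port {port.name}: link_layer={link}, rate={rate}")
```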

In recent years, RoCE [47][48][51][52][53][54] has emerged as an interesting RDMA technology promising to keep latency low, while at the same time running data movement over the well known switched Ethernet fabric instead of InfiniBand Host Channel Adapters (HCAs) and switches. RDMA efficiently allows supported systems to communicate with low overhead and latency and with significantly reduced CPU utilization. It does so by offloading the transport to a hardware RDMA engine implementation and by bypassing the operating system kernel to communicate directly between applications. RoCE is a standard protocol defined in the InfiniBand Trade Association (IBTA) standard [52]. One of the main ideas of the RoCE protocol is to allow organizations to keep utilizing their existing Ethernet infrastructure while leveraging the benefits of RDMA. The ability to do RoCE requires RoCE-capable network interface cards, such as the Mellanox adapters used in this study.

Since RoCE is a sibling technology of InfiniBand, it also requires a lossless fabric to deliver what it promises. A lossless fabric, such as an InfiniBand network, is a fabric where packets on the wire are not regularly dropped. Standard Ethernet is designed as best effort, where packet loss can occur, and there are mechanisms at the TCP transport layer to retransmit lost packets, which in turn adds overhead, latency, memory consumption and CPU utilization. InfiniBand, on the contrary, uses a technique known as link-level flow control to ensure that packets are not dropped in the fabric under normal circumstances.

In order to achieve a lossless fabric with RoCE, a set of enhancements to the Ethernet protocol exists under the term Data Center Bridging (DCB). DCB comprises five new specifications from the IEEE which, taken together, provide almost the same lossless characteristics as InfiniBand's link-level flow control. One of the adopted and notable enhancements is Priority Flow Control (PFC). PFC is a link-level flow control mechanism that can be controlled independently for each frame priority to ensure lossless transmission when a DCB network is congested. This requires the RoCE infrastructure to support recent versions of Ethernet, meaning that the switches, NICs and HCAs must implement the important parts of these new IEEE specifications. Since this study's infrastructure is based on back-to-back connected HCAs, DCB is not within its scope.


2.6 KVM

Kernel-based Virtual Machine (KVM) [55][29] brings an easy-to-use, open source and fully featured integrated virtualization solution to Linux. It relies fully on the Linux kernel to be usable: compared to type 1 hypervisors that are installed directly on top of the hardware, KVM requires a running Linux kernel. Its origin goes back to the Israeli company Qumranet Inc, which developed and maintained KVM. KVM made its debut in 2007 [56][57], when it was merged into the Linux mainline kernel and released with kernel version 2.6.20 on February 5, 2007. Qumranet Inc was acquired by Red Hat Inc in September 2008, and further development has been organized by the open source community under the supervision of Red Hat Inc.

KVM is provided as a kernel module to the Linux kernel, or the Linux operating system [58][55], turning the operating system into a hardware-accelerated hypervisor. It started with support for the x86 architecture (Intel and AMD) and later expanded with support for additional architectures. As of the time of writing, KVM virtualization is supported on the following hardware architectures [59]:

• Intel with the extension VT-x and the vmx CPU flag.

• AMD with the extension AMD-V and the svm CPU flag.

• ARM: 32-bit System on Chip (SoC) ARMv7-A (Cortex-A7, Cortex-A15, Cortex-A17) as well as 64-bit ARMv8-A SoCs.

• PowerPC is supported by a number of selected embedded cores.

• S390 is supported for the 64-bit versions such as z9.

Virtual machines created using KVM appear as normal Linux processes and integrate seamlessly with the rest of the system. The tight integration of KVM into Linux enables developers to reuse existing kernel functionality such as the scheduler and NUMA support. It also enables us to reuse the existing process management infrastructure in Linux, for instance top to monitor CPU usage, taskset to pin virtual machines to specific CPUs and kill(1) to pause or terminate virtual machines.


2.6.1 How does KVM work?

We can turn any Linux distribution into a hypervisor by installing the kernel module kvm.ko, which provides the necessary hardware acceleration for the virtualized resources. For each processor architecture there is an accompanying kernel module: for Intel processors the module is called kvm-intel.ko, and for AMD processors it is called kvm-amd.ko. It is important to emphasize that KVM alone is not enough for a usable hypervisor. A KVM-based Linux virtualization solution consists of KVM itself, QEMU and libvirt. KVM needs the assistance of QEMU for the creation of VMs, and for the management of VMs and virtual resources libvirtd [60][61] is commonly used. A quick check of these prerequisites is sketched below.
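As a minimal sketch using standard Linux commands, the presence of hardware virtualization support and of the KVM modules can be verified as follows:

# Does the CPU expose hardware virtualization (vmx on Intel, svm on AMD)?
grep -Eo 'vmx|svm' /proc/cpuinfo | sort -u
# Are the KVM modules loaded?
lsmod | grep kvm
# The character device used by QEMU/libvirt to create and run VMs.
ls -l /dev/kvm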

QEMU is a generic and open source machine emulator and virtualizer [62][63]. Together with KVM, QEMU is the user-space component that emulates machine devices, providing an emulated BIOS, PCI bus, USB bus and a standard set of devices such as IDE and SCSI disk controllers, network cards, etc. Since QEMU executes the guest code directly on the host CPU, its performance is close to bare metal. Without the hardware acceleration provided by KVM, QEMU was around four to ten times slower at executing code [63].

Libvirt is a C toolkit to interact with the virtualization capabilities of recent versions of Linux (and other OSes). Libvirtd, QEMU and KVM is the combination commonly found in various Linux distributions to enable virtualization and perform various operations on VMs.

As already mentioned, virtual machines created by KVM are regular Linux processes that are scheduled by the operating system scheduler (the Linux scheduler). Regular Linux processes execute in either user mode or kernel mode, with the former being the default execution mode for an application running as a Linux process. An application changes into kernel mode only when it requires a service from the Linux kernel, such as an I/O service.

KVM adds another execution mode called guest mode, which also has both user mode and kernel mode within its context. In other words, a process executing in guest mode is a process that runs inside of a virtual machine. Figure 2.6 shows a conceptual view of the KVM virtualization architecture.

The guest execution loop proceeds as follows:


• At the outermost level, userspace calls the kernel to execute guest code until it encounters an I/O instruction, or until an external event such as the arrival of a network packet or a timeout occurs. External events are represented by signals.

• At the kernel level, the kernel causes the hardware to enter guest mode. If the processor exits guest mode due to an event such as an external interrupt or a shadow page table fault, the kernel performs the necessary handling and resumes guest execution. If the exit reason is an I/O instruction or a signal queued to the process, then the kernel exits to userspace.

• At the hardware level, the processor executes guest code until it encounters an instruction that needs assistance, a fault, or an external interrupt.

The guest execution loop is illustrated in figure 2.6 (retrieved from Lublin et al.). As mentioned, KVM requires CPU hardware support and exposes a character special device, /dev/kvm, that is available to userspace for creating and running virtual machines through a set of ioctl()s. The following are the operations provided by the /dev/kvm device:

• Creation of a new VM

• Memory allocation to a VM

• Reading and writing virtual CPU registers

• Interrupt injection into a virtual CPU

• Ability to run virtual CPU

2.7 OpenStack

In the field of open source cloud computing platforms, OpenStack is a mature and well-known software stack. OpenStack provides an IaaS solution that is composed of a set of loosely coupled, but rapidly evolving, FOSS projects that support a wide set of technologies and configuration options. The integration between the components is facilitated by the Application Programming Interfaces (APIs) offered by each component [64]. OpenStack supports all types of cloud environments. At the time of the writing of this thesis, OpenStack is the leading FOSS platform for building public and private IaaS clouds and has gained very good traction among academia and businesses around the world.


Figure 2.6: KVM Guest Execution Loop


The project is backed by many big names in the industry, as mentioned in the motivation part, and today there are over 200 businesses behind the project, collaboratively driving it forward with an active and vibrant community. A significant number of businesses provide both public and private cloud services based on OpenStack.

The OpenStack IaaS framework is composed of the following three core FOSS projects:

• OpenStack Compute, known as Nova, handles VM instantiation and termination, among other things, based on VM images from Glance.

• OpenStack Object Storage, known as Swift, provides distributed and redundant object storage similar to Amazon S3.

• OpenStack Image Service, known as Glance, provides an API for VM images to Nova, for instance to create VM instances.

In addition to the three core projects mentioned above, there are other projects such as the OpenStack Identity service (Keystone), OpenStack Block Storage (Cinder), OpenStack Networking (Neutron) and the OpenStack Dashboard (Horizon), the latter providing a web user interface (UI) for management purposes in addition to the command line interface (CLI).

The code base of OpenStack is developed and released around a 6-month release cycle. After the initial release, additional stable point releases are published in each release series [65]. During the development cycle, the release is identified using a codename, and codenames are ordered alphabetically for consecutive releases. The releases are also referred to by a numerical version number that consists of the release year followed by a 1 or a 2, depending on whether it is the first or second release of the year in question. For instance, at the time of the writing of this thesis, the current stable release of OpenStack is 2017.2, codenamed Pike. The other release from 2017 is 2017.1, codenamed Ocata. The codenames are cities near where the corresponding OpenStack design summit took place [66].

An illustration of OpenStack's modular architecture with its various components is depicted in figure 2.7.


Figure 2.7: OpenStack Architecture


2.7.1 Components

Some of the major components of the OpenStack framework are explained in detail below [67][68][69][70][71].

Nova is a cloud computing fabric controller and, being the Compute service, the main part of the IaaS framework. It is aimed at the management and automation of pools of resources. It interacts with other components such as Keystone for authentication and Horizon for the user interface. The KVM, VMWare, Hyper-V and Xen hypervisors are supported, as well as Linux container (LXC) technology. Nova offers four main services, as follows [69] (see the CLI sketch after this list):

• The API service: receives user requests and translates them into cloud actions through web services.

• The Compute service: mainly handles communication with the local hypervisor, so as to enable VM instantiation and termination, as well as queries to VM load indicators and performance metrics.

• The Network service: handles all aspects related to network configuration and communication. In particular, for each server, an instance of this service is in charge of creating the virtual networks that let VMs communicate between themselves and with the outside of the cloud. However, due to some limitations of this service, a more recent networking component (Neutron, explained below) has emerged with advanced networking capabilities. This network service is now considered legacy.

• The Scheduler service: decides, based on policies, the node on which a new VM has to be instantiated and launched.
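For illustration only (the image, flavor and network names below are placeholders and do not describe this study's setup), a tenant typically drives these services through the OpenStack CLI, for instance:

# Sketch: boot a VM through the Nova API; names are hypothetical placeholders.
openstack server create --image centos7 --flavor m1.small --network private vm-test
# List instances and their state as reported by Nova.
openstack server list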

Glance is the Image service, a lookup and retrieval system for VM images. It is an essential part of the OpenStack IaaS, enabling tenants to discover, register and retrieve VM images. Through its Image REST API, tenants can query VM image metadata and retrieve an actual image. The VM images can, for instance, be stored in the Object Storage provided by Swift. Glance provides support for multiple disk and container formats; the Amazon Machine Image (ami) is one of the supported disk image formats.


Neutron is the advanced networking part of OpenStack. It provides management of networks and IP addresses for the IaaS, ensuring that the network is neither a bottleneck nor a limiting factor in a cloud deployment scenario. Additionally, it provides tenants with self-service configuration of networks.

Cinder is the Block Storage service of the OpenStack environment. It provides and manages persistent block storage and can attach a logical volume to a VM like a local disk. Cinder also has the ability to back up VMs by interacting with Swift. In previous releases of OpenStack, this service was integrated into Nova and was called nova-volume.

Keystone, being the Identity and Access Management (IAM) component of OpenStack, provides authentication and authorization services to entities. It can integrate with existing backend directory services such as the Lightweight Directory Access Protocol (LDAP). It supports multiple forms of authentication, including standard username and password credentials, token-based systems and AWS-style (i.e. Amazon Web Services) logins. Elements of OpenStack including Swift, Glance and Nova are authenticated and authorized by Keystone.

Swift is the Object Storage service of OpenStack. It aims to provide a highly scalable and redundant object store that is conceptually similar to Amazon's S3 service. Multiple replicas (copies) of each object are distributed to multiple storage nodes to achieve scalability and redundancy. Swift is one of the oldest and most mature components of OpenStack.

Trove is Database as a Service for OpenStack. It is designed to run entirely on OpenStack, with the goal of allowing users to quickly and easily utilize the features of a relational or non-relational database without the burden of handling complex administrative tasks. Cloud users and database administrators can provision and manage multiple database instances as needed. Initially, the service focuses on providing resource isolation at high performance while automating complex administrative tasks including deployment, configuration, patching, backups, restores and monitoring.


Horizon is the web UI that provides a dashboard for cloud management purposes, in addition to the CLI provided by OpenStack. It interacts with the other components through their respective APIs.

Heat is an orchestration service used to create or update virtual resource instances with Nova, Cinder or other building blocks based on a text template.

Ceilometer, being a newer component of OpenStack, provides metering of virtual resource usage, such as virtual CPU and network I/O, as a foundation for billing systems.

2.8 Related works

To the best of our knowledge, there have been no studies evaluating the performance of 40GbE RDMA- and RoCE-capable NICs in a virtualized environment. There are multiple studies and papers where SR-IOV, RDMA and RoCE have been studied. However, they were either limited to previous-generation 10GbE NICs without support for RDMA, or to 40GbE RoCE-capable NICs where the focus was high throughput and low latency, mostly concentrating on the improved performance of hardware-enabled SR-IOV versus various software-based I/O virtualization techniques.

They are nevertheless interesting studies because they examine very important aspects of SR-IOV-capable NICs as well as many system-wide key aspects such as IRQ affinity, CPU utilization and latency. Such studies can be the foundation for further research as the technology advances.

2.8.1 Studying Performance of a 1GbE SR-IOV-enabled NIC in a Virtualized Environment

Network interface virtualization: challenges and solutions [40].

In this study, Ryan Shea et al. presented a performance evaluation of a 1GbE SR-IOV-capable NIC in a virtualized environment with regard to bandwidth, CPU cycles, context switches, last level cache (LLC) usage and interrupts. The study compared the


performance of the network interface on bare metal with various I/O virtualization techniques such as a paravirtualized driver (VirtIO), an emulated network device (rtl8139) and hardware-assisted virtualization with SR-IOV. The hypervisor used in the paper was KVM. Although the study mentioned another hardware-assisted virtualization technique, Virtual Machine Device Queues (VMDq), it was not studied further, SR-IOV being the industry standard.

The findings in the paper's experiments showed that, even though SR-IOV provides a significant performance boost over the paravirtualized VirtIO driver, there was still a big gap to bare metal performance in terms of used CPU cycles, LLC references, context switches and interrupt generation. Since the network adapter used in that study did not support RDMA operations, the reported overhead from CPU utilization, memory copies, context switches and interrupt generation was as expected.

2.8.2 Studying Performance of SR-IOV in a Virtualized Environment

Evaluating Standard-Based Self-Virtualizing Devices [72].

In this study, Jiuxing Liu presented a performance evaluation of a 10GbE SR-IOV-capable NIC in a virtualized environment with regard to bandwidth, latency, CPU utilization, memory access, VM exits, host/guest interrupts, MTU size, multiple CPU sockets, IRQ affinity and IRQ distribution. The hypervisor used in the paper was KVM. The findings in the paper's experiments showed that SR-IOV provides a significant performance boost over other software-based virtualization techniques and provides better CPU utilization.

2.8.3 Dynamic Reconfiguration

Performance analysis and dynamic reconfiguration of a SR-IOV enabled OpenStack cloud [16].

In this study, Mohsen Ghaemi evaluated the network performance and dynamic reconfiguration of the underlying infrastructure of an OpenStack IaaS cloud using an SR-IOV-capable NIC. SR-IOV was compared to three other I/O virtualization techniques (paravirtualization, emulation and passthrough). The study also had an emphasis on SR-IOV NIC-attached VMs and the live migration challenges caused by the hardware driver


dependency of SR-IOV VFs inside the VMs. The study showed that SR-IOV delivers near line rate when the Maximum Transmission Unit (MTU) is increased to 9500, at the cost of an increase in CPU utilization.

2.8.4 High Performance Network Virtualization

High Performance Network Virtualization with SR-IOV [11].

In this study, Yaozu Dong et al. designed, implemented and tuned a generic virtualization architecture for an SR-IOV-capable network device. Their architecture supported reusable PF and VF drivers across multiple hypervisors as well as dynamic network interface switching to facilitate migration. One of their findings relates to interrupt handling: the tasks consuming most of the time are the emulation of guest interrupt mask and unmask operations and End Of Interrupt (EOI).

The study also discusses the application of different techniques and optimizations to reduce the virtualization overhead of interrupt mask and unmask operations and EOI. The study further predicted that SR-IOV-based I/O virtualization will meet the scalability demands of the future. Their work has been submitted to both Xen and KVM.

2.8.5 Big Data and Data Protocols

Efficient data transfer protocols for big data [47].

In this study, Brian Tierney et al. evaluated the performance and scalability of TCP, UDP, UDT and RoCE over high-latency 10Gbps and 40Gbps paths.

The study showed that RoCE-based data transfers can fill a 40Gbps path with low CPU utilization compared to the other protocols, and that the Linux zero-copy system calls can improve TCP performance significantly. However, RoCE requires hardware support in the NIC and a congestion-free layer-2 circuit to work well. The study finds that TCP and UDP using traditional UNIX sockets use too much CPU to be able to scale to the data rates needed for tomorrow's scientific workflows.


2.8.6 Accelerating OpenStack Swift with RDMA

Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud [73].

In this study, Gugnani et al. evaluated the performance and scalability of OpenStack's object storage service Swift and proposed an RDMA-based implementation called Swift-X.

The study's analysis of get and put operations identified hashsum computation, communication and I/O as the major factors involved in the performance of Swift. One of the proposed designs is client-oblivious, where users benefit from improvements without any modification of the client library and without needing RDMA-capable networking devices on the client node. The second design is a metadata server-based design, which completely overhauls the existing design of put and get operations in Swift. In both of the above-mentioned designs, the proposed high-performance implementations of the network communication and I/O modules are based on RDMA to provide the fastest possible object transfer. Different hashing algorithms are also explored to improve object verification performance in Swift.

Although this study used OpenStack Swift and brought in RDMA support to improve Swift, its focus was on storage I/O. This thesis will not focus on storage, but look at the pure networking capabilities delivered to and from an OpenStack VM as well as between VMs and the host. Another significant difference is that InfiniBand was used in their evaluation, although with a Mellanox ConnectX-3 adapter, and not RoCE mode as in this study.

2.8.7 RoCEv2 At Scale

RDMA over commodity ethernet at scale [74].

In this study, Guo et al. described experiences from the deployment of RoCEv2 for intra data center (intra-DC) communication in Microsoft data centers.

The study shared valuable experience from deploying 40Gb/s RoCEv2 in an existing Ethernet intra-DC network. Bugs and other challenges were addressed in their early phases thanks to the RDMA management and monitoring capabilities deployed from the start. Other challenges and bugs were also discussed, such as


Priority Flow Control (PFC) deadlock and PFC pause frame storms. The latter can cause a whole network segment to be disconnected from the rest of the network.

Although it is a full-scale evaluation of a RoCE deployment using PFC-enabled Ethernet switches, that study did not get into SR-IOV or RoCE in a virtualized environment, and is thus quite a different type of study than this thesis.


Part II

The Project


Chapter 3

Methodology

The methodology chapter explains how the problem statements are approached. It also addresses the research questions, including the environment design, tools and hardware, the planned workflow and the analytical procedures used to achieve the final goal.

3.1 Objectives

Based on the problem statement in section 1.1, this study aims to address the issue of improving network performance in a cloud environment by utilizing the proposed method.

3.2 Testbed

A testbed is vital for this study, hence it needs to be set up and configured as a means to approach the problem statements. The testbed consists of two physical servers.

Table 3.1 shows the technical information of the equipment used as infrastructure for the testbed. In addition to 1GbE and 10GbE NICs, the servers are equally equipped with one dual-port Mellanox ConnectX-3 VPI (MCX354A-FCBT) adapter each. Mellanox ConnectX-3 VPI adapters support transmission rates of 56Gb/s for InfiniBand and 40Gb/s for Ethernet (RoCE). The NICs are connected back-to-back using QSFP cables. Although both ports were connected, the experiments were carried out over port 1 on each NIC.


Table 3.1: Physical Servers

Vendor - Model            CPU            Cores/Threads  Memory  NICs
HP ProLiant DL360p Gen8   Xeon® E5-2609  4C/4T          32GB    4x 1GbE, 2x 10GbE, 1x 40GbE
HP ProLiant DL360p Gen8   Xeon® E5-2609  4C/4T          32GB    4x 1GbE, 2x 10GbE, 1x 40GbE


3.3 Experiments

The experiments conducted in this thesis will be evaluated by us. Therefore, we need to identify the factors and key parameters involved in the design of our experiments. The important factors identified for the experiments are explained briefly in the following section.

3.3.1 Experiment Factors

• Bandwidth: refers to the amount of data that can be transferred per second; a measurement of the bit-rate of available or consumed data communication resources, expressed in bits per second or multiples of it (bit/s, kbit/s, Mbit/s, Gbit/s, etc.) [74]. In this study, it is a measure of how much data can be transferred between two nodes over the RoCE network, e.g. from compute8 to compute7, denoted in gigabits per second (Gbps). Usually the achieved bandwidth differs from the nominal bandwidth since it depends on other contributors in the network (connection) such as cables, switches and other nodes. Sometimes the delivered bandwidth also differs from the expected bandwidth due to environmental effects such as noise, collisions and intentional traffic decline by load balancers.

• Latency: can be understood as how long it takes data to travel between its source and destination, usually measured in milliseconds. In this study, the


network latency is an expression of how much time it takes for a packet of data to get from one node to another and back, e.g. from compute8 to compute7. This is known as the Round Trip Time (RTT). There are several important factors that can contribute to network latency, such as the medium the packets travel through, propagation delay, and routers and switches.

• Overhead: this term refers to the processing overhead of the different configurations, i.e. the extra processing load and memory usage imposed on the system by each of the I/O virtualization techniques. The processing load can, for instance, be the CPU cycles used to run a process as well as the CPU cycles used for servicing IRQs. The overhead can be calculated by monitoring the processing and memory resources of the system during all experiments and comparing the results.

3.3.2 Experiment Design and Phases

For the operating system, CentOS 7 was chosen during all phases of the experiment, for the host as well as the guests. MLNX_OFED drivers for the ConnectX-3 adapters were downloaded from the vendor's (Mellanox) website and installed on both hosts and guests during the different phases of the experiment. It is also possible to install OFED drivers from the CentOS repositories, but the vendor's OFED drivers were recommended since they were the only way we could get support for SR-IOV, which is critical in this study. The installed version of the MLNX_OFED driver on the testbed is MLNX_OFED_LINUX-4.1-1.0.2.0.

The environment for this phase of the experiment was prepared by installing the appropriate MLNX_OFED driver according to the OS major and minor version and the kernel version. The version downloaded and installed from the vendor's web page is MLNX_OFED_LINUX-4.1-1.0.2.0 (OFED-4.1-1.0.2). The driver package provides an install script that was used to install the driver, as the listing below shows.

./install
[root@compute8 MLNX_OFED_LINUX-4.1-1.0.2.0-rhel7.3-x86_64]# ./mlnxofedinstall
Logs dir: /tmp/MLNX_OFED_LINUX.17033.logs
General log file: /tmp/MLNX_OFED_LINUX.17033.logs/general.log
Verifying KMP rpms compatibility with target kernel...
This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.

Do you want to continue?[y/N]:y

After the installation of the MLNX_OFED drivers, the network adapter shows up as ens2 and ens2d1, since this adapter has two ports. Only port 1, named ens2, was utilized during the experiments.
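As a hedged sanity check after the installation (the interface name follows this testbed; ibdev2netdev is a helper script shipped with MLNX_OFED), the driver and the mapping between RDMA device and network interface can be inspected as follows:

# Which driver and version back the 40GbE interface?
ethtool -i ens2
# Map RDMA devices to their network interfaces and link state (MLNX_OFED helper).
ibdev2netdev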

Table 3.2: Experiment Phases

Experiment Phase 1 (Server 1: bare metal, Server 2: bare metal): Both bare metal servers with a vanilla CentOS 7 installation. Monitor and measure related system metrics on both systems in order to achieve the highest possible throughput with low latency.

Experiment Phase 2 (Server 1: bare metal, Server 2: VM): VM provisioned with the KVM hypervisor and virt-manager. There are multiple ways of virtualizing a network adapter: emulation, paravirtualization, PCI passthrough and SR-IOV. Monitor and measure related system metrics for each chosen way of virtualizing the network adapter to assist tuning of the VM and/or hypervisor parameters. A crucial step to acquire the needed knowledge about 40GbE and SR-IOV inside a VM.

Memory usage observation and collection were insignificant during the phase of


the RDMA measurements. Although the measurement scripts collected memory usage data from /proc/meminfo during all three phases of the experiment, there will be no graphs showing system memory usage during phase 1 with RDMA measurements.

The measurements with iperf will be discussed in chapter 5. For experiment phase 1, the data collection scripts were run on compute7 (the server) and compute8 (the client). For experiment phase 2, the data collection scripts were run on compute7 (the server), compute8 (hypervisor 1) and vm_compute8 (the client).

The data collection scripts all collected the same metrics across all the server nodes involved, with the exception that no bandwidth measurements were collected on the "server" side of the measurement tools.

3.3.3 Experiment Tools

• iperf [75] is a widely known and used IP network bandwidth measurement tool with support for multi-threading (in its client mode). It can act as a server as well as a client and listen on a user-specified port.

• rperf [76] is a less-known network bandwidth measurement tool similar to iperf. Leveraging the original iperf software architecture, it adds many important RDMA extensions to it.

• qperf [77] is also a network bandwidth measurement tool that can work over TCP/IP as well as the RDMA transports. Using a server-client model, it supports measurements of bandwidth, latency and CPU utilization.

• Perftest [78] for the Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED) is a collection of tests written over uverbs, intended for use as a performance micro-benchmark. The perftest package consists of multiple measurement tools, such as ib_send_bw and ib_send_lat, to measure bandwidth and latency on RDMA-supported devices such as the ConnectX-3 network adapter (see the usage sketch after this list).

• strace [75] is a diagnostic, instructional and debugging tool that can trace system calls and signals. It intercepts and records the system calls made by a process and the signals received by it.


• Ansible [79] is an automation tool used to automate repetitive tasks, for instance the configuration and installation of software onto the nodes involved in the study.
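As a hedged usage sketch for the perftest tools referenced above (the device name, port and address are placeholders for this testbed, and the exact option set depends on the perftest version):

# Server side: wait for an RDMA send bandwidth test on the ConnectX-3 device, port 1.
ib_send_bw -d mlx4_0 -i 1
# Client side: run the bandwidth test against the server and report the achieved rate.
ib_send_bw -d mlx4_0 -i 1 192.168.100.1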

3.3.4 Experiment Key Factors

• Bandwidth: a measurement of the bit-rate of available or consumed data communication resources, expressed in bits per second or multiples of it (bit/s, kbit/s, Mbit/s, Gbit/s, etc.). Usually the achieved bandwidth differs from the nominal bandwidth since it depends on other contributors in the network (connection) such as cables, switches and other nodes. Sometimes the delivered bandwidth also differs from the expected bandwidth due to environmental effects such as noise, collisions and intentional traffic decline by load balancers.

• Network throughput: the achieved bandwidth of each of the experiment phases with different configurations is collected. It is compared with the theoretically expected bandwidth according to the hardware specifications, as well as with the bandwidth achieved by other configurations under the same conditions. In this study, the actual bandwidth is collected as reported by the bandwidth measurement tools in the different experiments.

• Overhead: the processing overhead of the different configurations, i.e. the extra processing load and memory usage imposed on the system by each of the I/O virtualization techniques. The overhead can be calculated by monitoring the processing and memory resources of the system during all experiments and comparing the results. For instance, the Linux kernel keeps track of system load, CPU time, the number of IRQs generated and memory usage, among others. For RoCE-based measurements using RDMA, due to its offloading and kernel-bypass capabilities, we will not be able to get some of these metrics from the Linux kernel.

3.3.5 Data Collection and Evaluation

Scripts were written to automate the gathering of data from the experiments. The script measure.sh was created to facilitate the gathering of measurement data that is significant for later analysis. The different measurement data are recorded as comma-separated


values in a file to ease later processing and analysis. Data were gathered at intervals from files under /proc, covering system load, CPU time, memory usage and network activity. To ease the analysis of the collected data, the R statistics software was used to extract statistics about the data. The provided statistics, such as average, median and standard deviation, and the plots are all obtained using R.
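The script itself is not reproduced here; the following is a minimal, hedged sketch of the kind of collection loop measure.sh implements (the output file name, interval and set of fields are illustrative only):

#!/bin/bash
# Sketch: append one comma-separated sample per interval to a results file.
OUT="$(date +%Y%m%d_%H%M%S)_samples.csv"
while true; do
    ts=$(date +%s)
    load1=$(awk '{print $1}' /proc/loadavg)
    memfree=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    rx=$(awk '/ens2:/ {sub(/.*:/, ""); print $1}' /proc/net/dev)
    echo "$ts,$load1,$memfree,$rx" >> "$OUT"
    sleep 1
done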

The file /proc/stat contains various pieces of information about kernel activities, and the data provided in this file are aggregates since the system was booted. The file consists of multiple columns and rows. The rows starting with "cpu" followed by a number provide the amount of time (in hundredths of a second) each specific CPU core has spent performing a number of tasks. Being idle is also counted as a task among the data provided in this file. The following truncated listing is an example of the content of /proc/stat from compute8 while being used as a hypervisor:

[root@compute8 ~]# cat /proc/stat
cpu  4054433 6 127681 32330479 9929 0 92 0 4043806 0
cpu0 559398 0 43015 3957118 1908 0 70 0 558364 0
cpu1 523531 0 16786 4024281 581 0 7 0 522626 0
cpu2 2432683 0 12567 2120700 427 0 4 0 2431783 0
cpu3 243650 0 12166 4308991 359 0 5 0 242888 0
cpu4 78456 0 15858 4471234 1165 0 0 0 76975 0
cpu5 71372 1 12904 4479372 2597 0 1 0 69663 0
cpu6 78605 2 7150 4478212 1630 0 0 0 76302 0
cpu7 66733 1 7232 4490567 1260 0 2 0 65202 0
intr 153142017 53 3 0 0 0 0 0 0 1 0 0 0 4 0 0 0 0 0 0 0 75 29 0 0 0 0 1 0 4404 2393 2623
..
ctxt 29435273
btime 1526932630
processes 124384
procs_running 1
procs_blocked 0
softirq 51188770 6 45975225 15242 398268 32901 0 1869 2897418 0 1867841
[root@compute8 ~]#

The following listing shows the ten column names for the first cpu line:

     user    nice system idle     iowait irq softirq steal guest   guest_nice
cpu  4054433 6    127681 32330479 9929   0   92      0     4043806 0

The description of the ten columns is given below, together with the other metrics collected from the systems. From /proc/stat, the ten columns of each CPU core were recorded by the data collection scripts during the experiments. Also, for the analysis of


CPU context switches in phase 1, the row starting with "ctxt" was recorded.

The output files contain the following comma-separated columns, which are explained in detail below: Timestamp, Bandwidth, CPULoad1m, CPULoad5m, CPULoad15m, NetDevRX, NetDevTX, IRQens2-0, IRQens2-1, IRQens2-2, IRQens2-3, IRQens2-4, IRQens2-5, IRQens2-6, IRQens2-7, user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice, MemTotal, MemFree, MemAvailable, Buffers, Caches. The following are some exceptions and variations to the columns for data collection and recording:

1. The IRQ columns look different for the VM in phase 2, since it has 4 CPU cores assigned to it (from NUMA node 0). It has 3 and 4 IRQs assigned for paravirtualization and SR-IOV, respectively.

2. The ctxt field of /proc/stat was recorded in some of the experiments in phase 2 to analyze the CPU context switches.

• timestamp: the start time of measurement collection

• bandwidth: the network transfer throughput as reported by the bandwidth measurement tool

• cpuload1m, cpuload5m and cpuload15m: the system load average for the last one, five and fifteen minute(s) of the time series, respectively.

• NetDevRX and NetDevTX: the amount of received and transmitted network traffic, respectively, in bytes for the network adapter and the port used

• IRQens2-0 to IRQens2-7: the number of interrupts generated for each of the IRQ lines belonging to the ens2 adapter

• user: normal processes executing in user mode

• nice: niced processes executing in user mode

• system: processes executing in kernel mode

• idle: twiddling thumbs

• iowait: the percentage of time the CPU is in the idle state while there is at least one I/O in progress. Each CPU can be in one of four states: user, sys, idle and iowait.

• irq: servicing interrupts


• softirq: time servicing softirqs

• steal: involuntary wait

• guest: time spent running a virtual CPU for guest operating systems under the control of the Linux kernel. Measured in clock ticks.

• guest_nice: time spent running a niced guest (a virtual CPU for guest operating systems under the control of the Linux kernel).

• MemTotal: total usable RAM in kilobytes (i.e. physical memory minus a few reserved bytes and the kernel binary code)

• MemFree: the amount of physical RAM left unused by the system.

• MemAvailable: an estimate of how much memory is available for starting new applications without swapping [80].

• Buffers: the amount of physical RAM used for file buffers.

• Caches: the amount of physical RAM used as cache memory: memory in the pagecache (diskcache) minus SwapCache.

• ctxt: the total number of context switches across all CPU cores.

For the purpose of analyzing CPU usage in phase 2, when the experiment is conducted from a VM, the following formulas were used to calculate the (average) percentage of CPU load (not system load) and the share of CPU load used by the hypervisor for servicing the guest, based on data from /proc/stat. The calculations are based on the following formulas from a question and answer by Vangelis at Stack Overflow [81].

PrevIdle = previdle + previowait
Idle = idle + iowait

PrevNonIdle = prevuser + prevnice + prevsystem + previrq + prevsoftirq + prevsteal
NonIdle = user + nice + system + irq + softirq + steal

PrevTotal = PrevIdle + PrevNonIdle
Total = Idle + NonIdle

# differentiate: actual value minus the previous one
totald = Total - PrevTotal
idled = Idle - PrevIdle

CPU_Percentage = (totald - idled) / totald x 100
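As a hedged, self-contained sketch of the same calculation (sampling the aggregate "cpu" line of /proc/stat twice; the one-second interval is arbitrary):

#!/bin/bash
# Read user nice system idle iowait irq softirq steal from the first /proc/stat line.
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 rest < /proc/stat

previdle=$((i1 + w1)); idle=$((i2 + w2))
prevnonidle=$((u1 + n1 + s1 + q1 + sq1 + st1))
nonidle=$((u2 + n2 + s2 + q2 + sq2 + st2))
totald=$(( (idle + nonidle) - (previdle + prevnonidle) ))
idled=$(( idle - previdle ))

echo "CPU usage: $(( 100 * (totald - idled) / totald ))%"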

With the help of the R software, we could then use the mean() function to get the average CPU usage.

Based on the above formulas, we can now get the percentage of CPU time that was used to service the guest, using the guest data (column 10). The idea is to take the guest time, divide it by the CPU non-idle time and multiply by 100 to get the percentage, as shown below. Note that nonidleTime below does not include idle and iowait from /proc/stat. Another note is that guest (column 10) and guest_nice (column 11) are already accounted for in user (column 2) and nice (column 3), respectively, hence these two column values are not part of nonidleTime below.

nonidleTime = user + nice + system + irq + softirq + steal

guestTime = guest + guest_nice

We can now use the following formula to get the percentage of CPU time the hypervisor was using for servicing the guest:

guestTimePercentage = ( guestTime / nonidleTime) x 100

Again, using the mean() function in R, we were able to get the fraction of the average CPU load used to service the guest.

The second metric to consider for calculating system overhead is memory usage. The file /proc/meminfo reports a large amount of information about the Linux system's memory. Among that information, the measure script records MemTotal, MemFree, Buffers and Cached. The following listing is an example of the first 5 rows of /proc/meminfo from compute7:

[root@compute7 ~]# head -n5 /proc/meminfo
MemTotal:       32804244 kB
MemFree:        31802068 kB
MemAvailable:   31697108 kB
Buffers:            2116 kB
Cached:           185068 kB

The following formula was used to calculate the used memory for the purpose of plotting with the R statistics software:


Used Memory = MemTotal - ( MemFree + Buffers + Cached )
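For illustration, a hedged one-liner computing the same quantity directly from /proc/meminfo (values in kB):

awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
     END {printf "Used memory: %d kB\n", t - (f + b + c)}' /proc/meminfo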

A description of the recorded fields of this file is also given further below in this subsection.

The third metric to consider is the system load average, which is also recorded by the scripts in this study. The data about system load is provided by the Linux kernel in the file /proc/loadavg and consists of one row with multiple columns. The first three columns provide the system load average for the last one, five and fifteen minute(s) of the time series, respectively. It is important to pay attention to the fact that the system load values not only account for CPU thread utilization, such as running processes or processes in the run queue (state R) waiting for a CPU share, but also for disk I/O utilization (waiting for disk, state D). The system load metric is given as floating point numbers. A value of zero means there is no load on the system (CPUs idling without any outstanding disk I/O). A process using or waiting for the CPU (the ready queue or run queue) increments the load number by 1. On a single-CPU system, any value above 1 means there are processes waiting: a value of 1 means there is no headroom, and 1.3 means the CPU is fully loaded and there are processes waiting for CPU amounting to 30% of the CPU's capability. Another important fact is that on a multi-processor system, the load is relative to the number of processor cores available. In a system setup like in this thesis, with 8 CPU cores and threads, a system load of 1 means a utilization of 1/8 (12.5%), while 8 means the system is fully (100%) utilized. The following listing is an example of /proc/loadavg from compute7:

[root@compute7 ~]# cat /proc/loadavg
0.00 0.01 0.10 1/203 35693

The network traffic statistics are provided by the system in the file /proc/net/dev. This file contains information about the traffic to/from the configured network interfaces. The following truncated listing is an example of /proc/net/dev from compute7:

[root@compute7 ~]# cat /proc/net/dev | egrep '(^Inter|ens2:|face)'
Inter-|   Receive
 face |bytes       packets  errs drop fifo frame compressed multicast
 ens2: 45908555318 30372306 0    0    0    0     0          0
      |   Transmit
      |bytes       packets  errs drop fifo colls carrier compressed
       20975322    349584   0    0    0    0     0       0

The columns recorded from this file were the received and transmitted network traffic


in bytes; these are provided by columns number two and ten, respectively. A description of the recorded fields of this file is given below. Although this file provides data about bandwidth statistics, among others, the bandwidth data was also provided and recorded by the measurement tools used in this study. This study relies on the bandwidth data provided by the measurement tools, such as iperf and ib_send_bw, for the purpose of plots and analysis. Also, for the RoCE and RDMA bandwidth measurements, there is no related data in /proc/net/dev since RDMA bypasses the kernel completely.
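As a hedged illustration of how these two counters can be pulled out of /proc/net/dev for the ens2 interface (field positions follow the standard layout of this file):

# Strip the interface prefix; field 1 is then RX bytes and field 9 is TX bytes.
awk '/ens2:/ { sub(/.*:/, ""); print "RX bytes:", $1, "TX bytes:", $9 }' /proc/net/dev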


Chapter 4

Results

4.1 Test

This chapter presents a short description of the system setup, some system tuning considerations and the functionality of the written scripts, as well as descriptions of the bandwidth measurement tools used in the different experiment phases. It also includes a description of the implementation of the proposed method and some results.

4.1.1 MTU Considerations

Standard Ethernet frames are 1518 bytes, including the MAC header and the CRC trailer. Jumbo frames, supported by modern network interfaces and switches, increase the size of an Ethernet frame to 9000 or 9500 bytes [82]. To use jumbo frames, the network interfaces on the servers and the VMs had to be reconfigured during the different phases of the experiments by changing the MTU size to the desired value. Based on some basic tests, we discovered that the NIC used in the study supports MTU sizes of up to 9900 for Ethernet transport. To get a better understanding of how the MTU size might affect the network throughput, MTU sizes of 1500, 9000, 9500 and 9900 bytes were used in the TCP/IP-based experiments throughout all the phases.

For experiment phase 1, the MTU sizes were configured directly on the physical interfaces in the OS. For phase 2, we had to take into consideration that the VM had the NIC virtualized using paravirtualization and, later, using one of the VFs. When exposing


the NIC to the VM with paravirtualization, we have to configure the same MTU size consistently on the host as well as in the VM; in this scenario the NIC is a device shared between the host and the VM. The inability of libvirt to pass the host MTU size through has been the subject of multiple discussions [83][84][85] at Red Hat's Bugzilla. When the NIC was exposed to the VM using one of the VFs, it became an independent PCI device as seen from the VM; hence the MTU size can be set independently from the host.

Since the setup in this study does not use a switch, we could freely use an MTU size of 9900 without having to consider that switches also have to support and be configured with the same MTU size as the NICs for optimal performance. For the RoCE-based experiments, we have to consider the MTU sizes defined by the InfiniBand (and hence RoCE) protocol: 256, 512, 1024, 2048 and 4096 bytes. The MLNX_OFED driver selects an "active" MTU size, which is the largest value from this list that is smaller than the Ethernet MTU in the system. For instance, with the default Ethernet MTU size of 1500 bytes, RoCE would use 1024 as the "active" MTU size, since additional RoCE transport headers and CRC fields must also be transported within the 1500 bytes. In order to leverage the maximum MTU capability of the NIC and RDMA supported by the MLNX_OFED driver for RoCE applications, an Ethernet MTU size of 4200 was chosen to ensure that 4096 was selected as the "active" MTU for the RDMA-based experiments. The following command shows how we can change the MTU size for the NIC named ens2:

[root@compute8 ~]# ifconfig ens2 mtu 4200
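To verify the effect, the "active" RoCE MTU reported by the adapter can be checked; a hedged example follows (the RDMA device name mlx4_0 is assumed for the ConnectX-3 adapter):

# After raising the Ethernet MTU to 4200, the port should report an active RDMA MTU of 4096.
ibv_devinfo -d mlx4_0 | grep -E 'active_mtu|link_layer'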

4.1.2 NUMA Topology Considerations and Tuning

According to Red Hat's libvirt NUMA tuning guide [86],

the performance impacts of NUMA misses are significant, generally startingat a 10% performance hit or higher. vCPU pinning and numatune should beconfigured together.

To get an overview of the NUMA layout of the system used in this study, we can use numactl --hardware:

[root@compute8 scripts]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16349 MB
node 0 free: 14915 MB
node 1 cpus: 4 5 6 7
node 1 size: 16383 MB
node 1 free: 15559 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The above listing shows that the compute8 node has two NUMA nodes, named node 0 and node 1, with CPU cores 0,1,2,3 and 4,5,6,7 assigned to NUMA nodes 0 and 1, respectively. The listing also shows the cost of going from one NUMA node to another. For instance, the cost of going from node 0 to node 1 is 21, which is higher than going from node 0 to node 0, i.e. staying within the same NUMA node. This will have an impact on memory accesses and performance in general if not taken into consideration during the setup and tuning phase of the experiments.

The following NUMA and CPU affinity optimizations were done while the VM was running:

[root@compute8 ~]# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1
[root@compute8 ~]# lscpu | grep numa
[root@compute8 ~]# lscpu | grep NUMA
NUMA node(s):        2
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7
[root@compute8 ~]#
[root@compute8 ~]# virsh vcpupin vm8
VCPU: CPU Affinity
----------------------------------
   0: 0-7
   1: 0-7
   2: 0-7
   3: 0-7

[root@compute8 ~]#

[root@compute8 ~]# for i in {0..3}; do virsh vcpupin vm8 $i $i; done

[root@compute8 ~]# virsh vcpupin vm8
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1
   2: 2
   3: 3

[root@compute8 ~]# virsh emulatorpin vm8 0-3

[root@compute8 ~]# virsh emulatorpin vm8
emulator: CPU Affinity
----------------------------------
       *: 0-3

[root@compute8 ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs) for PID 3819 (qemu-kvm)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0      0     0
Heap         15     76    91
Stack         0      2     2
Private     167   3012  3178
        ------ ------ -----
Total       182   3090  3272
[root@compute8 ~]# virsh numatune vm8
numa_mode    : strict
numa_nodeset :

[root@compute8 ~]# virsh numatune vm8 --nodeset 0

[root@compute8 ~]# virsh numatune vm8
numa_mode    : strict
numa_nodeset : 0
[root@compute8 ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs) for PID 3819 (qemu-kvm)
         Node 0 Node 1 Total
         ------ ------ -----
Huge          0      0     0
Heap         91      0    91
Stack         2      0     2
Private    3178      0  3178
        ------ ------ -----
Total      3272      0  3272

In the listing above, each vCPU is pinned to its own physical core, the QEMU emulator threads are pinned to cores 0-3, and numatune binds the guest's memory to NUMA node 0; numastat confirms that the guest memory moves from node 1 to node 0 after the change.

4.1.3 Experiments

To ease and automate the planned experiments, a couple of scripts were first developed and tested. The scripts, their functionality and their outputs are listed in the following table:

Table 4.1: Developed Scripts

Name                 Function                                                                   Output
c8toc7_c8_iperf.sh   Recording operating system details of compute8 during TCP/IP experiment   date_time_c8toc7_c8_iperf.txt
c8toc7_c7_iperf.sh   Recording operating system details of compute7 during TCP/IP experiment   date_time_c8toc7_c7_iperf.txt
c8toc7_c8_roce.sh    Recording operating system details of compute8 during RDMA experiment     date_time_c8toc7_c8_roce.txt
c8toc7_c7_roce.sh    Recording operating system details of compute7 during RDMA experiment     date_time_c8toc7_c7_roce.txt

To be certain that the operating system cache and buffers, as well as memory usage, were cleared, the servers and VMs were rebooted between each phase and sub-phase of the experiments. Since the CPU load reported by the kernel could vary after and between different experiments, a reboot would clear such data reported by the


kernel. Still, the observed CPU load right after a reboot was significant, and a sleep time of ten minutes was therefore put in place before running any experiment, even though the "boot" CPU load would have had less impact on the reported CPU load over time.

4.2 Bare metal to Bare metal

The environment for this phase of the experiment was prepared by going through the tuning guide [87] by Mellanox. Among other things, we needed to ensure that some critical settings regarding optimal performance were in place:

• Ensuring that IRQ and CPU affinity are set correctly

• Disabling CPU C-state to avoid CPU going into power save mode

• Making kernel-level parameter tuning persistent across reboots by inserting the keys and values into /etc/sysctl.conf

• Activating network-throughput profile using tuned-adm

CPU affinity, also known as CPU pinning, enables binding a process or multiple processes to a specific CPU core in such a way that the process(es) will run on that specific core only. On Linux, a process's CPU affinity can be altered with the taskset command, and throughout the experiments the bandwidth measurement tools will be assigned to CPU core 0 using taskset.

By default, all interrupts generated by hardware in a system go to the first core, which is core 0 in modern server systems. IRQ affinity is the affinity of an interrupt request, defined as the set of CPU cores that can service that interrupt. To improve application scalability and latency, it is recommended to distribute IRQs between the available CPU cores. The hardware in this study has four cores per socket, hence eight cores in total, since there are two CPU sockets in both servers. The IRQ affinity was spread over all available cores on the systems using the script set_irq_affinity.sh that is supplied with the MLNX_OFED driver. The following confirms the distribution of interrupts among the available cores in the system:

[root@compute7 ~]# show_irq_affinity.sh ens2
61: 00000000,00000001
62: 00000000,00000002
63: 00000000,00000004
64: 00000000,00000008
65: 00000000,00000010
66: 00000000,00000020
67: 00000000,00000040
68: 00000000,00000080

4.2.1 iPerf and Netperf

Bandwidth measurements using iPerf were done in multiple phases, since there are some important factors and parameters to consider according to its documentation [88]. The first measurements were done using different MTU sizes. Data were collected with iPerf using Ethernet's default MTU size of 1500 as well as 9000, 9500 and 9900 bytes, as mentioned in section 4.1.1. In the TCP world, the equivalent of Ethernet's MTU is the TCP maximum segment size (MSS), and iPerf can automatically detect this based on Path MTU Discovery to make the transmission most efficient. We also have to take the TCP/IP headers into consideration, hence the MSS is usually the MTU minus 40 bytes (the TCP/IP header size). For Ethernet, the MSS is 1460 bytes (1500-byte MTU) and 8960 bytes for an MTU size of 9000. We can confirm the MSS as reported with the -m option on both the server and the client side of an iPerf transmission:

MSS size 8960 bytes (MTU 9000 bytes, unknown interface)
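For reference, a hedged sketch of the kind of invocation used throughout phase 1 (the address is a placeholder; the tool is pinned to core 0 with taskset as described above):

# Server side: listen and report the negotiated MSS.
taskset -c 0 iperf -s -m
# Client side: a single-threaded 30-second TCP run, repeated to collect samples.
taskset -c 0 iperf -c 192.168.100.1 -t 30 -m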

iPerf also supports running bandwidth measurements with multiple threads. This creates multiple parallel TCP streams on different ports, and the sum of the streams is reported as the final bandwidth. This study focuses on offloading the CPU and its cores so that applications can utilize them. However, one bandwidth measurement will be done with two threads, and we will observe how more threads might affect the bandwidth measured by iPerf.

An early, but nevertheless very important, observation was the inability of iPerf2 to support RDMA operations. iPerf3 was not recommended by posts [89] at the Mellanox forums. According to a post by Vangelis [90] at the Mellanox forum, it is reasonable to assume that iPerf3 had support for RDMA at an earlier point in time, which was later removed. The reason for this assumption is that Vangelis was able to achieve near line rate using iPerf3 and a single thread.

These concerns resulted in some questions that were posted [91] in the Mellanox forums without getting any reply or engaging other forum members as of the time of writing.


Some basic measurements showed that iPerf3 was unable to get anywhere near the line rate with a single thread, even when used with the --zerocopy option. This, and the fact that the above-mentioned Mellanox post recommends not to use it, made us choose iPerf2 as the tool for the TCP/IP bandwidth measurements.

Netperf by Hewlett Packard (HP) was another network bandwidth measurement tool discovered in the initial research for benchmark tools. It is a server-client based package that consists of the server netserver and the client netperf. It has support for many features, such as CPU utilization reporting. Since such data will be collected from the kernel and Netperf is similar to iPerf, this tool was not used in any of the experiment phases.

All iPerf measurements were run for 30 seconds at a time using the -t 30 option and repeated 100 times to obtain 100 samples for the analysis. Figure 4.1 shows an initial measurement using single-threaded iPerf with an MTU size of 1500.

Figure 4.1: iPerf vs Perftest B/W (left panel: iPerf measurements, MTU 1500; right panel: Perftest ib_send_bw measurements, MTU 4200; y-axis: bandwidth in Gbps)


The graph shows the huge difference in achieved bandwidth between the TCP/IP iPerf measurements and the RDMA-based ib_send_bw tool from the Perftest package. Using an MTU size of 1500 bytes, the box plot on the left shows a median well below 15.0Gb/s, while the box plot on the right shows a median above 39.0Gb/s. This would have a huge impact on performance were we to use applications that are not RDMA-aware.

For measurements with jumbo frames, we can see early indications that, although there is a performance gain, there is a penalty to pay in terms of CPU usage. Additionally, other resource utilization such as memory and IRQ generation will be discussed in further detail in chapter 5.

4.2.2 RoCE

During the early phase of research and evaluation of benchmarking tools, it was discovered that neither iPerf2 nor iPerf3 had support for RDMA. However, other benchmark tools such as rperf [76] and qperf [77] existed at the time of evaluation, with claimed support for RDMA. Rperf's code was downloaded, compiled and built on both compute7 and compute8. It was clear that it was based on iPerf2, but with RDMA patches for use with the IB and RoCE protocols. Some basic tests showed that it indeed was capable of running RDMA transfers, but the main reason for not using this tool for the RoCE-based measurements was the fact that it had not been maintained for some years. The developer was contacted through email with some questions, but no reply was received as of the time of writing.

Qperf might have been a tool to use in this study for the RDMA measurements, and since there are fewer tools with RDMA support than traditional TCP/IP-based ones, its code was cloned from the GitHub repository, compiled and built on both compute7 and compute8. At the time of writing, Qperf was at version 0.4.10. While we were able to do some basic TCP/IP measurements, none of the supported RDMA measurements we tested worked. The qperf server came up with the following error:

[root@compute7 qperf]# qperf
libibverbs: GRH is mandatory For RoCE address handle
failed to create address handle

And the qperf client came up with the following error:


[root@compute8 qperf]# qperf 192.168.100.1 rc_bi_bw
rc_bi_bw:
failed to modify QP to RTR: Network is unreachable
[root@compute8 qperf]# qperf 192.168.100.1 ud_lat ud_bw
ud_lat:
libibverbs: GRH is mandatory For RoCE address handle
failed to create address handle
server: failed to create address handle

No further research was done to investigate the experienced issues, and qperf was not used for any of the experiments in this study.

4.3 VM to Bare Metal

In this phase of the experiment, referred to as phase 2 earlier, we introduce a VM using KVM as the hypervisor. The KVM hypervisor was installed onto our second node, compute8. We utilized our Ansible playbook [92] to install the necessary software packages and turn compute8 into a hypervisor. Ansible playbooks, or automation with Ansible in general, are idempotent, so running a playbook multiple times does not cause unwanted configuration changes. The VM itself was installed manually using the virt-manager GUI. A minimal version of CentOS was installed and the necessary packages were installed using Ansible.
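Because the playbook is idempotent, the hypervisor setup can be re-applied at any time simply by re-running it; a sketch of such an invocation (the inventory and playbook file names are hypothetical, not the actual names used in the study):

ansible-playbook -i hosts.ini hypervisor.yml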

This phase of the experiment is further divided into multiple parts in order to study the effect of different MTU sizes and of different IRQ and CPU affinity settings, both when the NIC is exposed to the VM as a paravirtualized NIC and when SR-IOV is used.

4.3.1 Paravirtualized NIC

KVM's and libvirtd's network model supports multiple device models for a VM's network device. These are e1000, rtl8139 and VirtIO. Choosing either the e1000 or the rtl8139 emulated device model for the VM resulted in ethtool reporting both as only capable of 1Gb/s. In phase 2, some basic measurements were done to observe the results, and the bandwidth using e1000 with an MTU size of 9000 was below 3Gb/s. Using the rtl8139 device model, which did not allow the MTU size to be changed from 1500, we observed much lower results, in the ballpark of well below 500Mb/s. Since our RoCE


link operates at 40Gb/s, these two device models were inappropriate and were dropped from further experimentation in the study. Therefore, the first part of this phase was done by exposing the Mellanox network adapter using the paravirtualization driver VirtIO to collect data for the analysis, and comparing the results with the second part using SR-IOV, TCP/IP and RDMA with the vendor supplied MLNX_OFED driver. The measurements in this part were done using the developed scripts running on compute7 (the server), compute8 (the hypervisor) and vm_compute8 (the client).

Bandwidth was measured using iperf with default settings. This means only a single thread was used, although iperf in client mode supports multiple threads. Using multiple threads means taking away more CPU cores from the VM and the hypervisor for the network load, leaving fewer CPU cores for the applications the VMs are intended to run. Among other goals, this study aims to offload the CPU from network workload as much as possible. Based on this reasoning, measurements were done with a single core only. Figure 4.2 shows the average bandwidth with TCP/IP using the VirtIO driver compared to bare metal for different MTU sizes. In this early phase, we observe from the plot that bare metal network throughput is significantly higher than when using the VirtIO driver, except for the MTU size of 1500, where the difference is smaller.


Figure 4.2: TCP/IP Paravirtualization (PV) vs Bare Metal (BM) — average bandwidth in Gb/s for MTU sizes 1500, 9000, 9500 and 9900

4.3.2 Enabling SR-IOV and VFs

To be able to use SR-IOV's Virtual Functions (VFs), some changes were required on the NIC as well as in the system settings, according to the vendor's documentation and recommendations. To enable the VFs, the following command was run:

[root@compute8 ~]# mlxconfig -d /dev/mst/mt4099_pciconf0 set SRIOV_EN=1


And to set the number of VFs, the following command was run:

[root@compute8 ~]# mlxconfig -d /dev/mst/mt4099_pciconf0 set NUM_OF_VFS=8

And the following will confirm that the desired changes are in place:

[root@compute8 ~]# mlxconfig -d /dev/mst/mt4099_pciconf0 q

Device #1:
----------

Device type:    ConnectX3
PCI device:     /dev/mst/mt4099_pciconf0

Configurations:                  Next Boot
  SRIOV_EN                       True(1)
  NUM_OF_VFS                     8
  LINK_TYPE_P1                   VPI(3)
  LINK_TYPE_P2                   VPI(3)
  LOG_BAR_SIZE                   3
  BOOT_PKEY_P1                   0
  BOOT_PKEY_P2                   0
  BOOT_OPTION_ROM_EN_P1          True(1)
  BOOT_VLAN_EN_P1                False(0)
  BOOT_RETRY_CNT_P1              0
  LEGACY_BOOT_PROTOCOL_P1        PXE(1)
  BOOT_VLAN_P1                   1
  BOOT_OPTION_ROM_EN_P2          True(1)
  BOOT_VLAN_EN_P2                False(0)
  BOOT_RETRY_CNT_P2              0
  LEGACY_BOOT_PROTOCOL_P2        PXE(1)
  BOOT_VLAN_P2                   1
  IP_VER_P1                      IPv4(0)
  IP_VER_P2                      IPv4(0)

A reboot is required for the changes to take effect. The other required system setting was put in place with the help of our Ansible playbook; the change is the addition of the following line to /etc/modprobe.d/mlnx.conf:

options mlx4_core num_vfs=8 port_type_array=2,2 probe_vf=0 enable_sys_tune=1
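After the reboot, a quick way to double-check that the driver picked up the num_vfs option is to read the module parameter back from sysfs (a sketch; it assumes the mlx4_core module exposes the parameter as readable):

cat /sys/module/mlx4_core/parameters/num_vfs    # should report the configured value, here 8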

We can now verify that the VFs are actually working and that the chosen number of 8 VFs show up in the system after the reboot:


[root@compute8 ~]# lspci -vv | grep Mellanox
07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Subsystem: Mellanox Technologies Device 0050
07:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:00.4 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:00.5 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:00.6 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:00.7 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
07:01.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0

One of the VFs was exposed to the VM using the "Add Hardware" button in the "Show virtual hardware details" section of the virt-manager GUI. In the following "Add New Virtual Hardware" dialog, choosing "PCI Host Device" gives us a list of PCI devices with their bus, device and function numbers, as shown in figure 4.3. We scroll down until we see the PF and VFs of the NIC. Since the first VF has the bus:device.function number 07:00.1, it was chosen for the VM from the list and applied by clicking "Finish".


Figure 4.3: Adding Virtual Hardware from Virtual Machine Manager

Some device assignments can be done live while the VM is running, and the assignment of the VF was done while the VM was running. Now we can confirm that the VF is visible with the lspci command.
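The same assignment can also be done without the GUI by attaching a hostdev definition with virsh; a sketch, assuming the VF at PCI address 07:00.1 and a guest named vm8 (the guest name and the XML fragment are illustrative, not the exact definitions generated by virt-manager):

cat > vf-hostdev.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x07' slot='0x00' function='0x1'/>
  </source>
</hostdev>
EOF
virsh attach-device vm8 vf-hostdev.xml --live    # hot-plug the VF into the running VM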

For the VM, an assigned VF is a PCI networking device and will behave as a full physical network device. As such, the vendor supplied MLNX_OFED driver was installed in the VM in exactly the same way as for the host itself. Although a newer version of the driver was available from the vendor in this phase of the experiment, the installed driver version was the same as for the other nodes.

[root@vm scripts_vm8c8]# lspci -vv | grep Mellanox
00:09.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
        Subsystem: Mellanox Technologies Device 61b0
[root@vm scripts_vm8c8]#

The experiment measurements for RoCE and TCP/IP were carried out using the tools


ib_send_bw and iperf, respectively. Each measurement ran for 30 seconds and was repeated 100 times.
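A minimal sketch of how these two tools can be invoked for one such 30 second sample (the device name, server address and exact options are illustrative; the study's scripts may differ):

# RoCE: ib_send_bw server on compute7, client in the VM
ib_send_bw -d mlx4_0 -D 30                    # server side
ib_send_bw -d mlx4_0 -D 30 192.168.100.1      # client side

# TCP/IP: iperf server on compute7, client in the VM
iperf -s
iperf -c 192.168.100.1 -t 30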

Early outcomes of the information recorded from the system showed that although SR-IOV delivers better network throughput, it comes at the cost of higher memory consumption. The CPU usage information also indicated that SR-IOV imposed significant CPU load, and that using bigger MTU sizes did not impose considerable processing overheads on the different configurations. Figure 4.4 shows memory usage with paravirtualization and SR-IOV. The SR-IOV experiments in this study were conducted with both TCP/IP and RoCE.


Figure 4.4: VM Average Memory Usage: Paravirtualization (PV) vs SR-IOV (TCP/IP and RoCE) — memory usage in MB for PV MTU 1500/9000/9500/9900, SR-IOV RoCE MTU 1200/2200/4200 and SR-IOV TCP/IP MTU 1500/9000/9500/9900


Chapter 5

Analysis

In this chapter, the experiments conducted in this study are analyzed. The first section analyzes the experiment results obtained from bare metal to bare metal, both for TCP/IP and RoCE. The second section is divided into two sub phases. Phase 1 analyzes the experiment results obtained with paravirtualization and TCP/IP, while phase 2 covers the analysis of results obtained using an SR-IOV VF for TCP/IP and RoCE in a virtualized environment.

5.1 Different methods with bare metal

In this part, the results obtained from the bare metal to bare metal experiments are analyzed. Looking at the bandwidth numbers obtained with different MTU sizes, we make an important observation from figure 5.4: RoCE using RDMA transfers is able to deliver close to the line rate, while TCP/IP-based transfers are far from RoCE in terms of achieved bandwidth.


Figure 5.1: TCP/IP Average Bandwidth for different MTU sizes (bare metal; B/W in Gb/s for MTU 1500, 9000, 9500 and 9900)

The graph in figure 5.1 shows that the optimal MTU size for TCP/IP transfers is 9000, belonging to Ethernet's jumbo frame specification. The lowest throughput is delivered by an MTU size of 1500. Changing the MTU from 1500 to 9000 led to a bandwidth increase of 39%. This is a significant increase in average bandwidth.


Figure 5.2: TCP/IP Bandwidth and System Load for MTU 1500 and 9000 (box plots of B/W in Gbps and system load per measurement)

Figure 5.2 shows box plots of the bandwidth and system load as measured from the client generating data for the TCP/IP measurements. The median bandwidth for MTU 1500 and 9000 is 14.37Gb/s and 19.88Gb/s, respectively. This is a bandwidth increase of 38.3% by going from an MTU size of 1500 to 9000. As we recall from sub section 3.3.5, the reported system load average in Linux includes CPU and disk I/O utilization. Most of the disk I/O during the experiments in this thesis occurs when the data is written to a file every 30 seconds.


Comparing the medians of the metrics iowait and idle for the measurement with an MTU size of 9000, the values 2408 and 1714732, respectively, were extracted from the measurements. Sub section 3.3.5 also gives an explanation of the iowait metric, and based on that explanation as well as the collected data, we can see that the disk I/O share is very low compared to the CPU idle share (roughly 2408/1714732, about 0.14%). In this regard, the system load averages mainly consist of CPU thread utilization.

The following listing from R statistics software gives us some details about quartiles:

> quantile(dfclient9000$CPULoad1m)
  0%  25%  50%  75% 100%
0.13 0.41 0.48 0.56 0.69

As we can see, the lower and upper quartiles are 0.41 and 0.56, respectively. This indicates that 50% of the collected system load data is found within that range. The lower quartile is also known as the first quartile or 25th percentile, and is denoted Q1. The upper quartile is also known as the third quartile or 75th percentile, and is denoted Q3.

The box plots show that achieving the higher median bandwidth with an MTU of 9000 also increases the system load to a median of 0.48, from 0.34 for the MTU size of 1500. System loads of 0.34 and 0.48 correspond to imposed loads of 4.25% and 6% respectively, taking into consideration that each server has eight CPU threads. That is an increase of 1.75 percentage points, and given that the bandwidth gain is 38.3%, such an optimization should be strongly considered for TCP/IP-based applications running on bare metal using a high throughput NIC like the Mellanox ConnectX-3 VPI.


Figure 5.3: TCP/IP Bandwidth for different MTU sizes (bare metal; B/W in Gb/s per measurement for MTU 1500, 9000, 9500 and 9900)


Figure 5.4: RoCE and TCP/IP Mean Bandwidth for all MTUs (B/W in Gb/s for RoCE MTU 1200/2200/4200 and TCP/IP MTU 1500/9000/9500/9900)

Figure 5.4 shows the averages of delivered bandwidth for the different MTU sizes used in this part of the experiment, for RoCE and TCP/IP.

The highest average bandwidth for TCP/IP is 20.01Gb/s using an MTU of 9000, while RoCE is able to perform up to 39.18Gb/s on average using an MTU size of 4200. The 95.73% higher average bandwidth can be attributed to the RoCE and RDMA technology and its clever way of transferring data between applications, bypassing the OS kernel completely.

For the TCP/IP measurements, the standard deviations for the MTU sizes of 1500, 9000, 9500 and 9900 are 9.39Gb/s, 3.70Gb/s, 2.38Gb/s and 2.46Gb/s, respectively. This indicates that for the MTU size of 1500 bytes, the measured bandwidth is more spread than in the rest of the measurements. This is something any cloud


provider would like to avoid in terms of performance predictability. In order to comply with a given SLA, this part of the experiment tells us that an MTU size of 1500 will give a tenant the least stable performance in terms of network throughput.

For the RoCE measurements, the standard deviation was very low and insignificant. For instance, for the MTU size of 4200, the standard deviation is 0.000003616126Gb/s. This tells us that the distribution of collected data is very close to its mean value of 39.18Gb/s. Such stable network throughput performance makes it ideal for cloud providers to meet and conform with tenants' SLAs in terms of predictable network throughput. As such, in all likelihood, tenants will get the expected network throughput.

The graphs also show that the optimal MTU for RoCE-based transfers is 4200 bytes. Changing the MTU from 1200 to 4200 increased bandwidth by 6.9% with RoCE. This is a less significant increase than the one we saw with TCP/IP, although it gives us valuable knowledge about how vital MTU size settings are for achieving optimal network throughput.


Figure 5.5: RoCE and TCP/IP Mean System Load for all MTUs (RoCE MTU 1200/2200/4200 and TCP/IP MTU 1500/9000/9500/9900)

Although the RoCE measurements delivered the highest throughput numbers, they also generate a higher CPU load, as we can see from the plot in figure 5.5. For measurements involving RDMA transfers, as explained in section 2.5, the system load does not come from the CPU cores servicing IRQs, but from the CPU time used by the bandwidth measurement tool ib_send_bw for the data generation. It requires more CPU time to generate the data for a RoCE measurement that achieves close to line rate network throughput. Hence the higher system load observed in the RoCE results is imposed by the "application" itself, and the network stack does not contribute to the system load.

For the TCP/IP bandwidth measurements, CPU time is not only used for the generation of data by the bandwidth measurement tool iperf, but also for servicing IRQs, for instance for copying data between user and kernel space. That is, in the case of


TCP/IP, the system load does not mainly consist of CPU time used by the "application"; the network stack contributes to the system load as well. The CPU load for the TCP/IP measurements was considerably lower, as we can see from the plot. The highest CPU load average of 0.47 is seen with an MTU size of 9000. The CPU load here includes both the CPU cores' IRQ servicing and the data generation by the measurement tool iperf.

Figure 5.6: TCP/IP IRQ Generation Server and Client (IRQs/s for MTU sizes 1500, 9000, 9500 and 9900)

Figure 5.6 shows the IRQ generation per second on the server and the client side, respectively, for the given MTU sizes. Except for the MTU size of 1500, we can see that the server was generating considerably more IRQs per second. Based on this observation, we understand that servicing IRQs on the server is an important factor. Although the client has considerably lower IRQ generation per second, it is also important to have optimal settings to service the IRQs without CPU core


contention. This will be discussed further below.

From the graph in figure 5.3 of TCP/IP in this part of the experiment, we observed spikes and dips that we needed to investigate in order to understand why the plot looked the way it did. The IRQ affinity tool shipped with MLNX_OFED assigns each IRQ of the NIC port to a CPU core in the system. This is also one of the recommended steps in the Mellanox documentation for tuning the NIC for optimal performance. Researching further into the issue, we saw signs of IRQs being serviced by the same core that the iPerf process was running on when it was started with a specific CPU affinity (e.g. using taskset -c 0). This observation led us to further investigation. To collect data to establish and confirm what we were observing, the following was done:

• The IRQ affinity was set so that it would not use or interfere with the core iPerf would be running on, and we isolated the cores involved in the IRQ affinity setup to CPU cores 1 to 7, distributed among both NUMA nodes. iPerf was started with its CPU affinity set to core 0, belonging to NUMA node 0.

In order to set the IRQ affinity as described above, we used the vendor supplied set_irq_affinity.sh script and then made a manual change as follows:

[root@compute7 ~]# echo 7 > /proc/irq/61/smp_affinity_list

The following listing shows that IRQ number 61 is set to be serviced by CPU mask 00000080, which is the eighth core (core 7) in the system:

[root@compute7 ~]# show_irq_affinity.sh ens2
61: 00000000,00000080
62: 00000000,00000002
63: 00000000,00000004
64: 00000000,00000008
65: 00000000,00000010
66: 00000000,00000020
67: 00000000,00000040
68: 00000000,00000080
[root@compute7 ~]#

The IRQ and CPU affinity were consistently set on both the server (compute7) and the client (compute8) side following the steps provided above. Also, for both types of measurements above, 100 samples were collected, similar to the rest of the experiment data collection. Figure 5.7 shows the plot for all the mentioned MTU sizes. From


the plot we can see that the bandwidth is more consistent and stable, without the big spikes seen with the default IRQ settings. With these settings, we can observe that the delivered bandwidth for all the MTU sizes is very close. In particular, the MTU size of 1500 performs as well as the rest here.

Figure 5.7: TCP/IP IRQ Affinity Core 1-7 and CPU Affinity Core 0 (B/W in Gb/s per measurement for MTU 1500, 9000, 9500 and 9900)

Based on the data gathered during the experiments, we are unable to explain why we observed similar bandwidth measurements across all the MTU sizes. This behavior should be investigated further in future work.

For other potential server systems with more than eight cores, using the above mentioned way of assigning IRQ and CPU affinity, we could make sure each of the eight IRQs was serviced by a different core and possibly further minimize the smaller dips observed. As mentioned, in the custom IRQ and CPU affinity settings above, the


eighth CPU core (core 7) had to service two different IRQs, which is not optimal should there be contention.
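On a host with more cores, the same manual assignment could be scripted; a sketch (the IRQ numbers follow the listing above, and the core numbering is illustrative):

# assign the eight NIC IRQs to cores 1-8, one core per IRQ
core=1
for irq in 61 62 63 64 65 66 67 68; do
    echo $core > /proc/irq/$irq/smp_affinity_list
    core=$((core+1))
done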

5.2 Different methods with virtualization

In this part, obtained results from experiments of VM to bare metal are analyzed.

Considering only the bandwidth for the different virtualization techniques and MTU sizes, we can see that SR-IOV using RoCE transfers delivers close to the line rate, albeit slightly slower than bare metal performance. Some very interesting observations in this part of the experiment are which MTU sizes delivered the highest bandwidth numbers with paravirtualization using VirtIO, as well as the spikes and dips we observed in phase 1. Also, the TCP/IP bandwidth numbers using SR-IOV and a VF exposed to the VM are considerably higher.


5.2.1 Paravirtualized NIC

Figure 5.8: VirtIO TCP/IP B/W for different MTU sizes (VM, default IRQ affinity, CPU affinity core 3; B/W in Gb/s for MTU 1500, 9000, 9500 and 9900)

For the plot in figure 5.8, the average bandwidth for the MTU size of 1500 is 12.90Gb/s. Among the MTU sizes, 9000 bytes has the lowest mean bandwidth of 5.01Gb/s. These results are in stark contrast to the bare metal results, where MTU sizes of 9000 and 9500 bytes showed the highest mean bandwidth, while 1500 bytes showed the lowest. We expected to see a higher average bandwidth for the MTU size of 9000 than for 1500.


Figure 5.9: Paravirtualization TCP/IP IRQ Generation on Server and Hypervisor (IRQs/s for MTU sizes 1500, 9000, 9500 and 9900)


The standard deviations for the MTU sizes of 1500, 9000, 9500 and 9900 are 0.64Gb/s, 1.48Gb/s, 0.90Gb/s and 2.60Gb/s, respectively. This indicates that for the MTU size of 1500 bytes, the measured bandwidth samples are all close to the mean value of 12.90Gb/s, while for the MTU sizes of 9000 and 9900 they are more spread. The MTU size of 9000 has the next lowest standard deviation, meaning the distribution of the bandwidth values is in the ballpark of the mean of 5.01Gb/s.

Using the paravirtualized VirtIO driver, the IRQ generation was lower on the server as well as on the hypervisor, as we can see from figure 5.9. Compared to the bare metal IRQ generation shown in figure 5.6 from experiment phase 1, we can see that IRQ generation is considerably lower on the server in this part of the experiment. From phase 1, we saw that the MTU size of 9000, with IRQ generation on the server side above 120000/s, showed the highest mean bandwidth, while in this part, the highest mean bandwidth is achieved with an MTU size of 1500 with IRQ generation below 30000/s, as we can see from figure 5.9. This leads us to believe that it is not the server's ability to service the IRQs that is the bottleneck for achieving higher bandwidth.


Figure 5.10: B/W & System Load MTU 1500 vs 9000 (paravirtualization; B/W in Gbps and system load per measurement)

Figure 5.10 shows box plots of the bandwidth and system load as measured from the client generating data for the TCP/IP measurements. The median bandwidth for MTU 1500 and 9000 is 13.10Gb/s and 4.76Gb/s, respectively. Surprisingly, this is a bandwidth decrease of 63.7% by going from an MTU size of 1500 to 9000.


The following listing from R statistics software gives us some details about quartiles:

> quantile(dfclient1500$BW/1000**3)
      0%      25%      50%      75%     100%
10.08051 12.96023 13.09608 13.21649 13.53049

As we can see, the lower and upper quartiles are 12.96 and 13.22, respectively. This indicates that 50% of the collected bandwidth data is found within that range. The box plot does, however, show outliers that lie more than 1.5 times the interquartile range below the lower quartile of 12.96 for the MTU size of 1500.

In an attempt to understand why higher MTU sizes showed lower average bandwidth, we did some investigation. First of all, when using the VirtIO driver we wanted to make sure VirtIO supported jumbo frame sizes. From the libvirt project pages [93][94] we could see examples of configuring the guest eXtensible Markup Language (XML) file for guest networking. This gave us confidence that VirtIO does support jumbo frames and that the problem was not related to the virtual networking device fragmenting the jumbo frames into smaller frames before transmission, which would cause degradation of network throughput.
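For illustration, a VirtIO interface with a jumbo frame MTU could be defined and attached with virsh roughly as follows (a sketch only: the bridge name, guest name and the <mtu> element are assumptions and may differ from the configuration used in this study):

cat > virtio-if.xml <<'EOF'
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
  <mtu size='9000'/>
</interface>
EOF
virsh attach-device vm8 virtio-if.xml --config    # applied on the next boot of the guest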

Another factor we looked into was the different segmentation offloading capabilities of the NIC used in this study. We learned [95] that, for the purpose of sending and receiving packets, there are some HW offloading capabilities built into the NIC: TCP Segmentation Offload (TSO) and Large Receive Offload (LRO). Additionally, the Linux kernel has a similar offloading capability called Generic Segmentation Offload (GSO). Also according to [95], GSO should increase the throughput. We were able to verify that all the above mentioned offloading capabilities were enabled on the VM (client), the hypervisor and the server. The following listing shows that the paravirtualized network device inside the VM supports and has enabled the discussed, as well as some other, offloading capabilities:

[root@vm ~]# ethtool -k eth1 | grep -i segment
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: on
        tx-tcp-mangleid-segmentation: off
generic-segmentation-offload: on
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]

We carried out some simple experiments with different combinations of the segmentation offloading capabilities turned off, to see if any of them could adversely affect the network throughput when used with paravirtualization. But we were unable to observe any significant improvement in network throughput for the MTU size of 9000.
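Individual offloads can be toggled with ethtool; a sketch of the kind of commands used in these tests (the interface name and which offloads to disable are illustrative):

ethtool -K eth1 tso off           # disable TCP Segmentation Offload
ethtool -K eth1 gso off           # disable Generic Segmentation Offload
ethtool -K eth1 tso on gso on     # re-enable both afterwards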

Another factor we looked into was whether CPU context switches could adversely affect the results. As shown in figure 2.6, with the introduction of the KVM hypervisor and virtualization in this part of the experiments, there is now an additional CPU execution mode called guest mode. Context switches are expensive operations and lead to performance degradation. For instance, the difference in the number of context switches between MTU sizes 1500 and 9000 could be interesting to look at on the host (hypervisor) and the VM. Figure 5.11 shows the number of context switches per second for MTU sizes 1500 and 9000 on the client and the hypervisor. The plots show that there is no significant difference between the MTU sizes, keeping in mind that 1500 gave a more stable bandwidth performance.
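Context switches can be sampled from /proc/stat, whose ctxt line holds the total number of context switches since boot; sampling it twice and taking the difference gives switches per second (a sketch; the study's own collection scripts may have used a different method):

a=$(awk '/^ctxt/ {print $2}' /proc/stat); sleep 1
b=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo $((b - a))    # approximate context switches during the last second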

Although we spent some time trying to understand the lower average bandwidth for jumbo frames, we were unfortunately not able to get to the bottom of the issue. It should be investigated deeper in the future to get closer to the root cause.

Based on the measurements, we saw that the median for the MTU size of 1500 was 13.10Gb/s and considerably lower for all the other MTU sizes. Even though the MTU size of 1500 showed the highest median, we also observed outliers below the lower quartile for this MTU size. Given the observations and analysis so far in this part of the experiment, we understand that a single VM using paravirtualization to virtualize the NIC will not provide predictable performance for any cloud or virtualized environment. Cloud providers will hardly be able to meet tenants' SLAs when it comes to network throughput using this type of I/O virtualization.


Figure 5.11: CPU Context Switches per second on Client (VM) and Hypervisor, MTU 1500 vs 9000

5.2.2 SR-IOV and VF passthrough

In this sub part of the experiment, the NIC was exposed to the VM using a VF and SR-IOV. The VF was enabled and set up as described in section 4.3.2. The assignment of the VF was done using virt-manager as shown in figure 4.3. A total of eight VFs were configured, but only one of the VFs was passed through to the VM using the PCI passthrough support of the KVM hypervisor. Since one VF is exposed to the


VM using PCI passthrough, we installed the MLNX_OFED drivers inside the VM. The following is an analysis of both the TCP/IP and RoCE results obtained from the measurements in this part of the experiment.

Figure 5.12: SR-IOV TCP/IP Bandwidth for all MTU sizes (B/W in Gb/s per measurement for MTU 1500, 9000, 9500 and 9900)

Figure 5.12 shows the achieved bandwidth for the TCP/IP measurements with SR-IOV and VF passthrough, using different MTU sizes. At first glance, we can observe that for the MTU size of 1500, the graph does not dip as far down as it did in the bare metal plot in figure 5.3, except for some of the measurements just below and above number 25.


Figure 5.13: SR-IOV TCP/IP Bandwidth and System Load for MTU 1500 and 9000 (box plots of B/W in Gbps and system load per measurement)

Figure 5.13 shows box plots of the bandwidth and system load as measured from the client generating data for the TCP/IP measurements. The median bandwidth for MTU 1500 and 9000 is 25.06Gb/s and 23.54Gb/s, respectively. A notable decrease of 6% in bandwidth is observed from the plot by going from an MTU size of 1500 up to 9000.


The following listing from R statistics software gives us some details about quartiles:

> quantile(dfclient9000$BW/1000**3)
      0%      25%      50%      75%     100%
15.85009 17.60366 23.53587 23.62892 24.08420
> quantile(dfclient1500$BW/1000**3)
       0%       25%       50%       75%      100%
 5.210922 17.517460 25.066597 25.424397 25.687900

As we can see, the lower and upper quartiles for the MTU size of 9000 are 17.60 and 23.63, respectively. From the plot we can see that quartile groups 2 and 3 are both well above the 23.5 mark. From this we understand that 50% of the measurements fall within this bandwidth range. In comparison, for the MTU size of 1500, with a median of 25.07, quartile group 2 stretches all the way from the median down to the lower quartile of 17.52, which tells us that 25% of the measurements fall within this range, while quartile group 3 only stretches from the median to the upper quartile of 25.43.

The median system load is slightly lower for the MTU size of 9000, with a value of 0.83. Both MTU sizes have outliers, with the MTU size of 1500 having considerably more outliers at the lower end of the scale. This indicates that the system load was lower for the lower bandwidth numbers.

Compared to the bare metal TCP/IP measurements in phase 1 as shown in figure 5.5, we see that the bandwidth has increased with SR-IOV, but at the cost of increased system load. Comparing the MTU size of 9000 from phase 1 and this part of the experiment, the increase in median bandwidth is 3.66Gb/s. This is a significant increase of 18.4% compared to the phase 1 bare metal measurement results. Similarly, the increase in CPU load is 0.35, which is 72.9% higher than in the bare metal part of the experiment.

Compared to the measurements with the paravirtualization technique in subsection 5.2.1 and figure 5.10, we can see higher bandwidth for both MTU sizes, although paravirtualization showed lower system load. Comparing the MTU size of 1500 from subsection 5.2.1 and this part of the experiment, the increase in median bandwidth is 11.96Gb/s. This is a significant increase of 91.3% compared to paravirtualization. However, we observe that the CPU load increase is 0.56, which is 175% higher than in the paravirtualization part of the experiment in section 5.2.


Figure 5.14: SR-IOV: RoCE vs TCP/IP Average Bandwidth for MTUs (VM with VF passthrough; B/W in Gb/s for RoCE MTU 1200/2200/4200 and TCP/IP MTU 1500/9000/9500/9900)

Figure 5.14 shows the averages of delivered bandwidth for the different MTU sizes used in this part of the experiment, both for RoCE and TCP/IP.

The highest average bandwidth for TCP/IP is 21.68Gb/s using an MTU of 9000, while RoCE is able to perform up to 39.06Gb/s on average using an MTU size of 4200. The difference is 80.2% higher bandwidth using RoCE. Compared to the RoCE results from bare metal, the difference is a 0.31% decrease in bandwidth. This decrease is very small and negligible for the RoCE results.

The standard deviation for the RoCE measurements was low and insignificant. For instance, for the MTU size of 4200, the standard deviation is 0.02964947Gb/s. This tells us that the distribution of collected data is very close to its mean value of 39.06Gb/s. The graphs also show that the optimal MTU for RoCE-based transfers is 4200 bytes.


Changing the MTU from 1200 to 4200 increased bandwidth by 6.9% with RoCE.

Figure 5.15: SR-IOV: RoCE vs TCP/IP Average System Load for all MTUs (RoCE MTU 1200/2200/4200 and TCP/IP MTU 1500/9000/9500/9900)

Similar to the bare metal RoCE results, figure 5.15 shows that RoCE imposes a higher system load than the TCP/IP-based measurements. The average system load for the MTU size of 4200 is 0.9854. Compared to the bare metal average of 0.983 for the same MTU size, the difference is as low as 0.24%. In terms of CPU usage by RoCE, we see the same pattern between bare metal and SR-IOV.


Figure 5.16: Paravirtualization and SR-IOV: IRQ Generation on Hypervisor (left) and VM (right), IRQs/s per MTU size and transport

Figure 5.16 shows the average IRQ generation per second on the hypervisor (left) and the VM (right). We can see that paravirtualization causes a high number of IRQs per second on the hypervisor, while it is very low on the VM. A high level of IRQ generation on the hypervisor will adversely affect CPU performance, as the CPU cores have to engage in servicing the IRQs. This is evident from figure 5.17, where we can see that for paravirtualization and for all MTU sizes, the comparable right side plots show a low percentage of the CPU share being used to service the guest.


An important note when comparing the plots in figure 5.17 is that the guest servicing time in percent on the right side is based on the overall CPU non-idle time on the left side. The calculation is based on the description in section 3.3.5.

By changing the MTU size from 1500 to 9000, the average IRQ generation decreases from 37999 to 21013.21, which is a 44.7% decrease, but as we saw from figure 5.13, the bandwidth in this case also decreased by 6%. Referring to figure 5.12, this is the most stable configuration with SR-IOV and TCP/IP.


Figure 5.17: Left: % CPU load on the hypervisor. Right: fraction of CPU time used for servicing the guest (in %), per MTU size and transport

SR-IOV and TCP/IP caused even higher IRQ generation per second, but on the VM, with no notable IRQs on the hypervisor side. The latter reduces hypervisor intervention; with SR-IOV, the virtual interrupts are forwarded to the VM by the hypervisor. This scenario also has the highest CPU load on the hypervisor, and well over 50% of the CPU non-idle time went to servicing the guest. The latter detail was already observed in the analysis of figure 5.13.


The plots also show that RoCE causes no notable IRQ generation, neither on the hypervisor nor on the VM. This aligns with our knowledge about RoCE and RDMA, and is as we expected. Most of the CPU usage during the RoCE experiment in this phase was used to service the guest, as figure 5.17 shows. In this scenario, where the CPU cores did not have to service IRQs, the CPU time was consumed by the RDMA bandwidth measurement tool ib_send_bw.

SR-IOV and RoCE impose considerably higher memory usage compared to paravirtualization, which is an additional source of overhead. Figure 4.4, figure C.1 and figure C.2 show the memory usage on the VM, bare metal and hypervisor, respectively. The latter figure shows a significant increase in memory usage on the hypervisor with SR-IOV compared to paravirtualization. There is a possibility that this high memory usage is related to our discussion in section 6.4.2 in chapter 6 about the VM hanging during boot with an SR-IOV VF passed through.

Since this study focused on IRQ and other CPU related metrics, which were much more important and relevant here, we did not put further effort into investigating the high memory usage mentioned above. This issue should be investigated in detail in future work.

The analysis of the RoCE measurement results from the VM gives us the understanding that the network throughput performance of SR-IOV and RoCE is very close to bare metal. SR-IOV and RoCE deliver a very stable network throughput from the VM. Although the RoCE experiments show higher CPU usage, we have to keep in mind that RoCE achieves close to line rate network throughput. Such a performance level makes it ideal for cloud providers to meet and conform with tenants' SLAs in terms of predictable network throughput in the VMs.


Part III

Conclusion


Chapter 6

Discussion and Future Work

6.1 Evolution of the project as a whole

High Throughput Virtualization: The focus of the study was mainly on utilizing the SR-IOV technique for I/O virtualization to achieve high throughput network performance in a virtualized environment. This is to improve the networking performance, and ultimately the efficiency, of IaaS, which is one of the layers of cloud environments. As already mentioned in chapter 1, cloud environments leverage the benefits of virtualization to deliver services to the users. Achieving higher efficiency is entirely dependent on improving the performance of the system, and the overall performance of the system is the result of the performance of activities in subsystems such as networking. A cloud consists of different technologies and techniques arranged in different layers. Hierarchically higher layers are served by lower layers, so their functionality and performance depend on the lower layers. The lowest layer is IaaS, which is the infrastructure of the cloud environment. IaaS mainly consists of networking and virtualization. There are still ongoing studies to improve the performance of IaaS based on these two elements. SR-IOV as a technology has been around for some time as of the writing and is designed to improve I/O virtualization and scalability. Through the literature survey of this study, it was found that there are few studies around SR-IOV and its utilization. However, this study used a high throughput NIC with support for RDMA and RoCE to study and analyze the behavior and challenges of using such a NIC in a virtualized environment, and to compare it to bare metal performance.


The biggest challenge throughout the study was mostly technical, related to OS kernel level tuning and learning the R programming language, as well as some other technical challenges. A part of the initial plan was to demonstrate the utilization of SR-IOV using multiple VM deployments on top of an OpenStack cloud to study the behavior of this NIC, but this was unfortunately left out due to time constraints. Some configurations and changes were time consuming. For instance, due to the issue described in section 6.4.2 with booting the VM after exposing the VF, the OSes of all the systems involved in this study had to be upgraded, along with a new version of the NIC driver. Consequently, the experiments were conducted again in order to get consistent measurements throughout all the phases.

6.2 Bare metal to bare metal

As a part of approaching the research questions, phase 1 of this study conducted experiments between two nodes without introducing any virtualization layer, to observe different aspects of the system behavior. Among others, experiment factors such as network throughput, system load, CPU share time, IRQs and memory usage were analyzed.

For the RDMA-based bare metal to bare metal measurements, the observed bandwidth performance was as expected. The IRQ handling was completely taken care of by the dedicated hardware in the NIC, and we could observe only an insignificant number of IRQs being generated by the system during this part of the experiment compared to the TCP/IP bandwidth measurements.

The MTU size of 9000 gave the highest average bandwidth of 20.01Gb/s, which was an improvement of 39% over an MTU size of 1500. The imposed system load difference was 1.75% while the bandwidth increase was 39%. The MTU size of 1500 also had the highest standard deviation of 9.39Gb/s, while it was 3.70Gb/s for the MTU size of 9000.

The RoCE bandwidth numbers were significantly higher than the TCP/IP measurements, which was expected due to RoCE's offloading and kernel bypass capabilities. With the optimal MTU size of 4200, the bandwidth was 39.18Gb/s on average while showing a very low standard deviation. The RoCE results showed a stable and predictable network throughput throughout all of the tested MTU sizes. Here too, the MTU size had an


impact in terms of achieved bandwidth. For instance, a change from an MTU size of 1200 to 4200 showed a bandwidth increase of 6.9%.

RoCE's impressive network throughput did come at the cost of a higher system load compared to TCP/IP. Due to RoCE's and RDMA's offloading and kernel bypass abilities, the system load can be attributed to the RDMA measurement tool ib_send_bw.

Data collected from this study's experiments gave us some very interesting insights regarding IRQ and CPU affinity. For the initial system settings and tuning of the NIC, we followed the recommendations from the vendor. Using these settings, we observed highly varying bandwidth in the TCP/IP measurements. Similar patterns were seen in both phase 1 and phase 2. Based on further research and the varying results in our plots, we discovered a way to mitigate the issue by setting the IRQ affinity, for instance, to NUMA node 1 while the measurement process, iperf, had its CPU affinity set to NUMA node 0. Using this setting, one CPU core had to service two IRQ lines, but we kept the CPU cores of NUMA node 0 dedicated to running iperf only, without being interrupted to service any IRQs by the system. As we can see from the analysis chapter, this gave a more consistent bandwidth performance without the high spikes and low dips. This shows the significance of carefully tuning IRQ and CPU affinity in a NUMA system.

6.3 VM to bare metal

The second part, phase 2, was further divided into two sub parts. Part 2.1 ran experiments with paravirtualization as the way to virtualize the NIC and expose it to the VM. The results and analysis from this part of the experiment were interesting, since only an MTU size of 1500 showed the highest network throughput, while 9000, 9500 and 9900 resulted in far worse throughput. In an attempt to understand this observation better, we produced plots and an analysis of CPU context switching. The latter showed stable graphs of the context switches for the different MTU sizes, and based on this we were not able to find a clear answer to why MTU sizes higher than 1500 resulted in worse network throughput.

For the MTU size of 1500, the average bandwidth achieved with paravirtualization is 12.90Gb/s, while an MTU size of 9000 gave the lowest average bandwidth of


5.01Gb/s. The former also had the lowest standard deviation of 0.64Gb/s among all the MTU sizes, which tells us that it was the MTU size of 1500 that had the most stable performance with the paravirtualization technique. In this part of the experiments, we observed that the IRQ generation was lower on both the server and the client (VM), compared to the bare metal results from phase 1. The system load was relatively low, and for the MTU size of 1500 the median was around 0.325 as measured from the client VM.

The second part of phase 2 ran experiments with an SR-IOV Virtual Function as the way to virtualize the NIC and expose it to the VM. One of the VFs was assigned to the VM using PCI passthrough. In this part, the VM had direct access to a PCI device, hence the MLNX_OFED drivers were installed in the VM. The driver installation enabled us to do both TCP/IP and RoCE based measurements from the VM to a bare metal node, compute7, which was also enabled with RoCE as well as TCP/IP. The TCP/IP measurements showed an average bandwidth of 21.68Gb/s for an MTU size of 9000, which is 18.4% higher than bare metal. Also, compared to the paravirtualization technique, the difference in average bandwidth for MTU size 1500 is 91.3% for the TCP/IP measurements.

Compared to the bare metal results from phase 1, SR-IOV and TCP/IP achieved an increased bandwidth of 18.4% with an MTU size of 9000, while also increasing the system load by 72.9%. Compared to paravirtualization in phase 2.1, we observed a 175% higher system load as well as a bandwidth increase of 91.3%.

What we observed from the analysis of SR-IOV and RoCE is a pattern similar to that of bare metal. The performance of SR-IOV and RoCE was close to bare metal, with an achieved average bandwidth of 39.06Gb/s with the MTU size of 4200. Compared to the TCP/IP results with an MTU size of 9000, the difference is 80.2% higher bandwidth with RoCE. Here too we observed a low standard deviation of 0.02964947Gb/s for the bandwidth results using an MTU size of 4200. In terms of imposed system load, the numbers are similar to the bare metal results. RoCE caused no notable IRQ generation on the hypervisor, as per our expectation, although the system load was significant. As we have seen from figure 5.17, the CPU load plot related to RoCE showed us that most of the CPU time went to servicing the guest.


6.4 Changes in initial plan

It should be mentioned that we experienced an unexpected value of zero in the irq column of the data collected from /proc/stat during the experiment phases. The discovery of the zero values in this column happened during the analysis part of this study. This behavior seemed consistent across compute7, compute8 and the VM. Also, since we had access to several servers running Redhat Enterprise Linux (RHEL) 7, a check was done on around 50 servers. They all showed the same behavior, except for a few that showed non-zero numbers in the irq column, although those were relatively low with regard to uptime.

In order to investigate the issue, we decided to look at the kernel code to try to understand why IRQs are not accounted for. The investigation showed that there are some kernel config parameters that control the IRQ accounting, and we had to look at the code in cputime.c from the Linux kernel scheduler [96]. Among other details, we could see that the IRQ servicing time share of the CPU cores is only accounted for when the Linux kernel configuration boolean item CONFIG_IRQ_TIME_ACCOUNTING is set (to "Y"). Additionally, the Linux kernel configuration boolean item

CONFIG_HAVE_IRQ_TIME_ACCOUNTING must also be set. For instance, the following listing shows that CONFIG_IRQ_TIME_ACCOUNTING is not set on compute8, where the hypervisor was installed:

[root@compute8 ~]# grep IRQ_TIME /boot/config-3.10.0-693.21.1.el7.x86_64
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y

The following listing is from a system where CONFIG_IRQ_TIME_ACCOUNTING isset:

[susinths@thinkpadx230 ~]$ grep IRQ_TIME /boot/config-4.13.11-100.fc25.x86_64
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
[susinths@thinkpadx230 ~]$

On CentOS systems, the Linux kernel is installed as a binary package distributed through the Redhat Package Manager (RPM). In order to change any of the Linux kernel configuration items, it is required to obtain the source package (SRPM), make the modifications and rebuild in order to get a modified binary RPM package, which can


then be installed. Due to time constraints, we did not attempt to do this.
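For reference, a sketch of what such a rebuild would involve on CentOS (package names and versions are illustrative; this was not carried out in the study):

yumdownloader --source kernel                      # fetch the kernel SRPM
rpm -ivh kernel-3.10.0-693.21.1.el7.src.rpm        # unpack it into ~/rpmbuild
# enable CONFIG_IRQ_TIME_ACCOUNTING in the kernel config used by the spec file,
# then rebuild the binary packages:
rpmbuild -bb ~/rpmbuild/SPECS/kernel.spec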

6.4.1 Libvirt Bug

At a point in time when the phase 2 experiments were being carried out, we ran into an issue with booting the VM. Looking through the system journal log, we could see that the following lines might be directly related to the issue:

May 09 10:23:42 compute8.olympuscloud kernel: kvm [17275]: vcpu2 unhandled rdmsr: 0x60d
May 09 10:23:42 compute8.olympuscloud kernel: kvm [17275]: vcpu2 unhandled rdmsr: 0x3f8
May 09 10:23:42 compute8.olympuscloud kernel: kvm [17275]: vcpu2 unhandled rdmsr: 0x3f9
May 09 10:23:42 compute8.olympuscloud kernel: kvm [17275]: vcpu2 unhandled rdmsr: 0x3fa
May 09 10:23:42 compute8.olympuscloud kernel: kvm [17275]: vcpu2 unhandled rdmsr: 0x630
May 09 10:23:42 compute8.olympuscloud kernel: qemu-kvm[17281]: segfault at 30 ip 00007fe780f31345 sp 00007fe775b60880 error 4 in libspice-server.so.1.8.0[7fe780ed6000+11d000]
May 09 10:23:43 compute8.olympuscloud libvirtd[4109]: 2018-05-09 08:23:43.188+0000: 4109: error : qemuMonitorIO:697 : internal error: End of file from qemu monitor
May 09 10:23:44 compute8.olympuscloud kvm[17466]: 0 guests now active
May 09 10:23:44 compute8.olympuscloud systemd-machined[17276]: Machine qemu-2-vm8 terminated.

Based on the system journal log lines above, we did some research and found out that the error message "error : qemuMonitorIO:697 : internal error: End of file from qemu monitor" is indeed a bug [83] in the upstream libvirt project. The bug affects libvirt versions prior to 3.4.0; the version used in the experiment is 3.2.0. The commit [84] to the upstream project fixes the issue, and the fix is included in the upcoming CentOS 7.5, which was not available at the time of writing. The error lines with "vcpu2 unhandled rdmsr" are harmless according to [97] [98]. According to Wikipedia [99], a model-specific register (MSR) is:

any of various control registers in the x86 instruction set used for debugging, program execution tracing, computer performance monitoring, and toggling certain CPU features.

KVM does not provide access to all MSRs [97], and when the application or OS running in the guest attempts to read or write an MSR that KVM does not handle, a warning is logged.


6.4.2 Issue with booting VM with VF PCI passthrough

In phase 2.2 of the experiment, we experienced an issue with booting the VM after the VF had been passed through to it. The VM would hang, with nothing notable to see in the system journal log. In an attempt to improve the situation, we chose to upgrade the OS on the hypervisor host, compute8. This forced us to upgrade MLNX_OFED, since the drivers are bound to specific OS distribution and kernel versions. Subsequently, the OS and MLNX_OFED were also upgraded on the VM. After the upgrades, the VM would still hang during boot, but at least we were able to boot it, although it took quite some time due to the hang.

When the VM was up and running with the VF assigned to it, the memory usage on the hypervisor was unstably high, as can be seen from figure C.2 in the appendix. As the figure shows, without an SR-IOV VF assigned to the VM, the memory usage on the hypervisor lies more or less in the 2.5GB range. With an SR-IOV VF assigned to the VM, the memory usage spikes to above 17.5GB. This could be a sign of memory leakage in libvirt or the device driver.

The OS and MLNX_OFED were also upgraded for consistency, and because different versions of Perftest are incompatible. For instance, we were not able to run ib_send_bw using different MLNX_OFED versions on the client and server, since the Mellanox version of the Perftest package comes bundled with MLNX_OFED. The issue was not investigated in any further detail due to time constraints. And apart from the fact that it caused inconvenience and consumed more time, we were able to conduct the experiment with the VM after all.

6.5 Future Work

This study was an initial work to investigate the use of a 40Gb/s high throughput SR-IOV enabled NIC and to utilize it in a virtual environment. Due to time constraints, it was not possible to investigate further and conduct some of the experiments. For instance, further study conducting measurements from one VM to another VM would give us more valuable knowledge about the performance characteristics in such a scenario, which is likely to be found in the cloud.

Also, conducting experiments using multiple VMs on each hypervisor, for instance using OpenStack as the VM orchestration tool, would enable us to gain knowledge


about systems under higher load, but also about the scalability of SR-IOV. Hence parts of the idea for this study, and some other issues, need to be examined and evaluated in future works.

1. This study revealed an interesting issue with paravirtualization (VirtIO) and jumbo frames. We observed that the jumbo frame MTU sizes (9000 and 9500) and the MTU size of 9900 achieved significantly lower bandwidth, while the MTU size of 1500 showed the highest bandwidth. The analysis of the experiment results from paravirtualization mentioned some of the possible factors we looked at without getting to the bottom of the issue. This issue should be studied in detail in future works.

2. During the bare metal to bare metal experiments, we observed some significant spikes and dips in the bandwidth results that we were unable to explain based on the collected data and some simple investigation. Another interesting behavior we observed in this study is that by isolating the application CPU core from the IRQ servicing CPU cores, all the MTU sizes came close to each other in terms of achieved bandwidth. For instance, we have seen that the MTU size of 1500 achieved slightly higher bandwidth than the jumbo frames. As we were unable to explain either of the above, these observations could be the basis for a future study.

3. Scalability of SR-IOV: how well does SR-IOV scale, considering the performance of each VF when multiple VMs are involved? This would be typical for a cloud infrastructure with multiple VMs and tenants. Such a study could deploy OpenStack as the VM orchestration tool on two hosts and conduct different experiments. The API provided by OpenStack can be used to provision multiple VMs for the purpose of the study. This would also enable VM to VM measurements, as well as VMs to VMs in parallel, to study the system when the underlying components are under higher load and utilization. Also, since this study revealed some unusually high memory usage on the hypervisor when an SR-IOV VF was assigned to the VM, studying hypervisor memory usage with SR-IOV, also with multiple VFs assigned to multiple VMs, could give us more understanding of this behavior.

4. Achieving higher throughput using TCP/IP tools and applications. With the emergence and announcement of 100, 200 and even 400Gbps RoCE rates, we need to be able to achieve higher throughput than what this study was able to achieve using a tool like iperf. Will these come with hardware offloading capabilities to assist TCP/IP tools and applications?


5. Power and energy consumption when using VFs and SR-IOV compared to just using PFs. Since green computing is a hot topic for reducing the carbon footprint, such an addition to this thesis could give us some insight into the power and energy consumption differences. Server consolidation is one of the reasons behind virtualization, and it would be interesting to study the added power and energy consumption the SR-IOV technology brings compared to using physical devices. This has been studied [16] for a 10Gb/s NIC, but since higher speed NICs are available at the time of writing, such NICs need to be included in further studies.
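
As a starting point for item 1 above, the jumbo frame comparison can be reproduced with a few commands. The sketch below is illustrative only; the interface name ens2 and the server address 10.0.0.1 are placeholders, and iperf is used as in the rest of this study:

# On both endpoints: set the MTU under test (repeat for 1500, 9000, 9500 and 9900).
ip link set dev ens2 mtu 9000

# On the receiving side: start the iperf server.
iperf -s

# On the sending side: run a 60 second test against the server and note the reported bandwidth.
iperf -c 10.0.0.1 -t 60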


Chapter 7

Conclusion

In this chapter, a brief summary of the key findings from the research is given, covering the initial expectations compared to the findings, the importance of this research, and recommendations for future research.

In section 1.1 of chapter 1, we came up with the following research questions (RQs):

• RQ1: What are the challenges of using an SR-IOV enabled network adapter in a virtualized environment?

• RQ2: What are the challenges and issues of deploying VMs for high performance and high throughput networking?

• RQ3: How can we achieve close to 40Gb/s without creating significant load to the CPU cores?

• RQ4: What are the considerations to be made when deploying VMs for high throughput networking in a cloud environment?

To understand the features and capabilities of SR-IOV, this technique was compared to another form of I/O virtualization, namely paravirtualization, as well as to bare metal. In addition, both RDMA and TCP/IP measurements were conducted in each scenario, and different MTU sizes were evaluated for both types of measurements. The two available methods of providing a virtual network interface to the VM were implemented in a virtual environment using the KVM hypervisor, and their functionality was tested and evaluated from different aspects during similar experiments. Scripts were written to run the experiments and collect data from them. All of the initial questions have been answered, but while running the project some new questions emerged, which are suggested for future work.

The following list answers the corresponding research questions from section 1.1:

• RQ1: SR-IOV's VFs are a great technique for assigning PCI devices directly to VMs in a hypervisor like KVM, as used in this study. However, in order to leverage the maximum capabilities of an SR-IOV-enabled high throughput NIC such as the Mellanox ConnectX-3 VPI used in this study, the NIC must be prepared by installing the vendor supplied driver, following the vendor's setup and tuning guide carefully, and enabling the Virtual Functions (VFs); a typical sequence is sketched after this answer. Multiple reboots of the hypervisor host are required during such a setup stage. In other words, downtime is required for the hypervisor. In a virtualized production cloud environment, an installation of an SR-IOV-enabled NIC must be carefully planned, taking the above mentioned factors into consideration.

Another important factor is the maintenance of such an SR-IOV-enabled NIC, which needs more thought in terms of patching, upgrades and other OS related maintenance. As we experienced in this study, an upgrade of the OS and its kernel forced us to re-install an updated version of the NIC driver corresponding to the updated kernel. Ideally, a virtualized cloud environment would have a staging/testing environment as close to the production environment as possible, where any changes can be tested before implementation in order to avoid surprises and extended downtime.
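
To make the preparation steps above concrete, the following is a minimal sketch of how SR-IOV and a number of VFs are typically enabled on a ConnectX-3 adapter once MLNX_OFED is installed. The MST device name /dev/mst/mt4099_pci_cr0 and the choice of four VFs are examples only and depend on the actual adapter and deployment:

# Enable SR-IOV and the desired number of VFs in the adapter firmware (MFT tools required).
mst start
mlxconfig -d /dev/mst/mt4099_pci_cr0 set SRIOV_EN=1 NUM_OF_VFS=4

# Tell the mlx4_core driver to create the VFs, with both ports in Ethernet mode.
echo "options mlx4_core num_vfs=4 port_type_array=2,2" > /etc/modprobe.d/mlx4_core.conf

# A reboot of the hypervisor host is needed before the new VFs show up in lspci.
reboot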

• RQ2: There are multiple key factors to consider when configuring and deploying VMs for high performance and high throughput networking. The hypervisor host should be put into the appropriate tuning profile depending on the intended workload. CPU and IRQ affinity are two major factors to take carefully into consideration. Even though we used the vendor supplied script for IRQ affinity, we observed that nothing prevents one and the same CPU core from being used both by a given application (the bandwidth measurement tool in this study) and for servicing hardware IRQs; even with multiple CPU cores available in the system, this can occur. Example commands for the tuning, pinning and NUMA placement steps are sketched at the end of this answer.

Another important factor is the NUMA architecture considerations mentioned in sub section 4.1.2. VMs should ideally be assigned to the NUMA node the NIC belongs to. This is challenging when there are multiple VMs, and because one NUMA node only has a limited amount of resources, such as CPU cores and memory, available. Depending on the virtual environment and the workload requirements, there will likely be a need to utilize more CPU cores and memory than the single NUMA node the NIC belongs to can provide.

SR-IOV's VFs provide significantly higher TCP/IP network throughput compared to paravirtualization, but using a VF requires driver installation as well as maintenance inside the VMs. As experienced in this study, the vendor's drivers can be bound to a specific OS distribution and kernel version, making upgrades of both the hypervisor host and the VMs challenging, since they require an upgrade of the NIC driver on the hypervisor host as well as within the VM. This was the case in this study. With paravirtualization, the VM's OS has the needed drivers in place and no additional steps are required inside the VM.
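
The affinity and NUMA considerations above can be illustrated with the kind of hypervisor commands used in this type of setup. This is a sketch only; the interface name ens2, the VM name vm1, the CPU lists and the NUMA node number are placeholders:

# Apply a throughput oriented tuned profile on the hypervisor.
tuned-adm profile network-throughput

# Pin the NIC interrupts to a dedicated set of cores (helper script shipped with MLNX_OFED).
set_irq_affinity_cpulist.sh 1-3 ens2

# Check which NUMA node the NIC is attached to.
cat /sys/class/net/ens2/device/numa_node

# Pin the VM's memory and its first vCPU to resources on that NUMA node (here assumed to be node 0).
virsh numatune vm1 --mode strict --nodeset 0 --live
virsh vcpupin vm1 0 0

# For bare metal runs, keep the measurement tool away from the IRQ cores in the same spirit.
taskset -c 0 iperf -c 10.0.0.1 -t 60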

• RQ3: This study shows that we can achieve close to the line rate of 40Gb/s by leveraging RoCE and RDMA. The offloading and kernel bypass capabilities of RDMA, as explained in section 2.5, contribute to achieving such high throughput performance. The experiments in this thesis also showed that RoCE achieved a more predictable performance on bare metal as well as on the VM using SR-IOV, compared to paravirtualization in sub section 5.2.1 and SR-IOV TCP/IP in sub section 5.2.2, both in chapter 5.

Although this study showed higher CPU usage for RoCE, the CPU time was mainly spent by the bandwidth measurement tool generating data for the measurements. To achieve a network throughput close to 40Gb/s with a NIC such as the Mellanox ConnectX-3 VPI, the network applications must be built with an RDMA library in order to leverage all the benefits of RDMA, as in the kind of measurement sketched below.
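
A minimal example of the kind of RoCE bandwidth measurement used in this study is shown below, using the perftest tools bundled with MLNX_OFED. The device name mlx4_0, the port number and the server address 10.0.0.1 are placeholders:

# On the server side: wait for a connection on RDMA device mlx4_0, port 1.
ib_send_bw -d mlx4_0 -i 1 --report_gbits

# On the client side: run a 30 second send bandwidth test against the server and report Gb/s.
ib_send_bw -d mlx4_0 -i 1 --report_gbits -D 30 10.0.0.1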

• RQ4: Although we did not get as far as deploying OpenStack instances to study SR-IOV in such a cloud environment with multiple VMs (a minimal provisioning sketch is given after this list), this study has gathered some valuable knowledge from our experiences with the virtualized environment in phase 2. The following is a list of considerations to make before deploying VMs for high throughput networking in a cloud environment:

– Since the vendor drivers for SR-IOV enabled NICs are bound to a Linux distribution version and its kernel, this limits the number of VM images a cloud provider can offer to the tenants. For VMs intended for high throughput networking, a cloud provider will not be able to provide VM images with Linux distributions that are not supported by the NIC vendor.

– The maintenance of the VM images will be time consuming, since with every kernel upgrade for any given VM image, the SR-IOV enabled NIC driver must also be upgraded. As we have seen in this study, upgrading from CentOS 7.3 to 7.4 required a new NIC driver download and installation.

– Live migration is not supported when using an SR-IOV enabled NIC, which is a major concern for any cloud provider. Live migration can be achieved using different techniques and significant effort, as Ghaemi's study [16] shows.

– To achieve a network throughput close to 40Gb/s with a NIC such as the Mellanox ConnectX-3 VPI, application compatibility with and support for RDMA is a major concern.

Ultimately, a cloud provider's intention in deploying SR-IOV enabled NICs is to leverage the benefits they provide. The benefits of SR-IOV in an IaaS cloud, as observed in this study, are:

– Scalability through Virtual Functions, avoiding the use of multiple PCI slots on the hypervisor host.

– Higher TCP/IP network throughput compared to paravirtualization.

– The ability to use RDMA supported applications to achieve a network throughput close to 40Gb/s.
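
As a starting point for the OpenStack based follow-up work mentioned under RQ4 (and in future work item 3), the sketch below shows how an SR-IOV backed instance is typically provisioned once the compute hosts have been configured for PCI passthrough and the Neutron SR-IOV agent. The network, flavor, image and instance names are placeholders:

# Create a Neutron port that requests an SR-IOV VF (direct vNIC type) on the provider network.
openstack port create --network provider-net --vnic-type direct sriov-port1

# Boot an instance attached to that port.
openstack server create --flavor m1.large --image centos7 --port sriov-port1 vm-sriov-1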


Appendices


Appendix A

System setup and configuration

Some of the system configuration is listed in this appendix. The upgrade of the MLNX_OFED drivers:

[root@compute8 MLNX_OFED_LINUX-4.3-1.0.1.0-rhel7.4-x86_64]# ./mlnxofedinstall
Logs dir: /tmp/MLNX_OFED_LINUX.17033.logs
General log file: /tmp/MLNX_OFED_LINUX.17033.logs/general.log
Verifying KMP rpms compatibility with target kernel...
This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.

Do you want to continue?[y/N]:y

rpm --nosignature -e --allmatches --nodeps mft

Starting MLNX_OFED_LINUX-4.3-1.0.1.0 installation ...

17 Installing mlnx-ofa_kernel RPM18 Preparing... ########################################19 Updating / installing...20 mlnx-ofa_kernel-4.3-OFED.4.3.1.0.1.1.g########################################21 Installing kmod-mlnx-ofa_kernel 4.3 RPM22 Preparing... ########################################23 kmod-mlnx-ofa_kernel-4.3-OFED.4.3.1.0.########################################24 Installing mlnx-ofa_kernel-devel RPM


25 Preparing... ########################################26 Updating / installing...27 mlnx-ofa_kernel-devel-4.3-OFED.4.3.1.0########################################28 Installing kmod-kernel-mft-mlnx 4.9.0 RPM29 Preparing... ########################################30 kmod-kernel-mft-mlnx-4.9.0-1.rhel7u4 ########################################31 Installing knem RPM32 Preparing... ########################################33 Updating / installing...34 knem-1.1.3.90mlnx1-OFED.4.3.0.1.4.1.g8########################################35 Installing kmod-knem 1.1.3.90mlnx1 RPM36 Preparing... ########################################37 kmod-knem-1.1.3.90mlnx1-OFED.4.3.0.1.4########################################38 Installing kmod-iser 4.0 RPM39 Preparing... ########################################40 kmod-iser-4.0-OFED.4.3.1.0.1.1.g8509e4########################################41 Installing kmod-srp 4.0 RPM42 Preparing... ########################################43 kmod-srp-4.0-OFED.4.3.1.0.1.1.g8509e41########################################44 Installing kmod-isert 4.0 RPM45 Preparing... ########################################46 kmod-isert-4.0-OFED.4.3.1.0.1.1.g8509e########################################47 Installing mpi-selector RPM48 Preparing... ########################################49 Updating / installing...50 mpi-selector-1.0.3-1.43101 ########################################51 Cleaning up / removing...52 mpi-selector-1.0.3-1.41102 ########################################53 Installing user level RPMs:54 Preparing... ########################################55 ofed-scripts-4.3-OFED.4.3.1.0.1 ########################################56 Preparing... ########################################57 libibverbs-41mlnx1-OFED.4.3.0.1.8.4310########################################58 Preparing... ########################################59 libibverbs-devel-41mlnx1-OFED.4.3.0.1.########################################60 Preparing... ########################################61 libibverbs-devel-static-41mlnx1-OFED.4########################################62 Preparing... ########################################63 libibverbs-utils-41mlnx1-OFED.4.3.0.1.########################################64 Preparing... ########################################65 libmlx4-41mlnx1-OFED.4.1.0.1.0.43101 ########################################66 Preparing... ########################################67 libmlx4-devel-41mlnx1-OFED.4.1.0.1.0.4########################################68 Preparing... ########################################


69 libmlx5-41mlnx1-OFED.4.3.0.2.1.43101 ########################################70 Preparing... ########################################71 libmlx5-devel-41mlnx1-OFED.4.3.0.2.1.4########################################72 Preparing... ########################################73 librxe-41mlnx1-OFED.4.1.0.1.7.43101 ########################################74 Preparing... ########################################75 librxe-devel-static-41mlnx1-OFED.4.1.0########################################76 Preparing... ########################################77 libibcm-41mlnx1-OFED.4.1.0.1.0.43101 ########################################78 Preparing... ########################################79 libibcm-devel-41mlnx1-OFED.4.1.0.1.0.4########################################80 Preparing... ########################################81 libibumad-43.1.1.MLNX20171122.0eb0969-########################################82 Preparing... ########################################83 libibumad-devel-43.1.1.MLNX20171122.0e########################################84 Preparing... ########################################85 libibumad-static-43.1.1.MLNX20171122.0########################################86 Preparing... ########################################87 libibmad-1.3.13.MLNX20170511.267a441-0########################################88 Preparing... ########################################89 libibmad-devel-1.3.13.MLNX20170511.267########################################90 Preparing... ########################################91 libibmad-static-1.3.13.MLNX20170511.26########################################92 Preparing... ########################################93 ibsim-0.6mlnx1-0.8.g9d76581.43101 ########################################94 Preparing... ########################################95 ibacm-41mlnx1-OFED.4.1.0.1.0.43101 ########################################96 Preparing... ########################################97 librdmacm-41mlnx1-OFED.4.2.0.1.3.43101########################################98 Preparing... ########################################99 librdmacm-utils-41mlnx1-OFED.4.2.0.1.3########################################

100 Preparing... ########################################101 librdmacm-devel-41mlnx1-OFED.4.2.0.1.3########################################102 Preparing... ########################################103 opensm-libs-5.0.0.MLNX20180219.c610c42########################################104 Preparing... ########################################105 opensm-5.0.0.MLNX20180219.c610c42-0.1.########################################106 Preparing... ########################################107 opensm-devel-5.0.0.MLNX20180219.c610c4########################################108 Preparing... ########################################109 opensm-static-5.0.0.MLNX20180219.c610c########################################110 Preparing... ########################################111 dapl-2.1.10mlnx-OFED.3.4.2.1.0.43101 ########################################112 Preparing... ########################################


113 dapl-devel-2.1.10mlnx-OFED.3.4.2.1.0.4########################################114 Preparing... ########################################115 dapl-devel-static-2.1.10mlnx-OFED.3.4.########################################116 Preparing... ########################################117 dapl-utils-2.1.10mlnx-OFED.3.4.2.1.0.4########################################118 Preparing... ########################################119 perftest-4.2-0.4.g848b0a2.43101 ########################################120 Preparing... ########################################121 mstflint-4.9.0-1.2.gb839ec8.43101 ########################################122 Preparing... ########################################123 mft-4.9.0-38 ########################################124 Preparing... ########################################125 srptools-41mlnx1-4.43101 ########################################126 Preparing... ########################################127 ibutils2-2.1.1-0.94.MLNX20180214.g4b02########################################128 Preparing... ########################################129 ibutils-1.5.7.1-0.12.gdcaeae2.43101 ########################################130 Preparing... ########################################131 cc_mgr-1.0-0.35.g0ac39b8.43101 ########################################132 Preparing... ########################################133 dump_pr-1.0-0.31.g0ac39b8.43101 ########################################134 Preparing... ########################################135 ar_mgr-1.0-0.36.g0ac39b8.43101 ########################################136 Preparing... ########################################137 ibdump-5.0.0-1.43101 ########################################138 Preparing... ########################################139 infiniband-diags-5.0.0.MLNX20180124.df########################################140 Preparing... ########################################141 infiniband-diags-compat-5.0.0.MLNX2018########################################142 Preparing... ########################################143 qperf-0.4.9-9.43101 ########################################144 Preparing... ########################################145 mxm-3.7.3111-1.43101 ########################################146 Preparing... ########################################147 ucx-1.3.0-1.43101 ########################################148 Preparing... ########################################149 ucx-devel-1.3.0-1.43101 ########################################150 Preparing... ########################################151 ucx-static-1.3.0-1.43101 ########################################152 Preparing... ########################################153 sharp-1.5.2.MLNX20180220.e7ab6fd-1.431########################################154 Preparing... ########################################155 hcoll-4.0.2127-1.43101 ########################################156 Preparing... ########################################


openmpi-3.1.0rc2-1.43101              ########################################
Preparing...                          ########################################
libibprof-1.1.44-1.43101              ########################################
Preparing...                          ########################################
mlnx-ethtool-4.2-1.43101              ########################################
Preparing...                          ########################################
mlnxofed-docs-4.3-1.0.1.0             ########################################
Preparing...                          ########################################
mpitests_openmpi-3.2.19-84f02b3.43101 ########################################
Device (07:00.0):
        07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
        Link Width: x8
        PCI Link Speed: 8GT/s

Installation finished successfully.

Preparing...                          ################################# [100%]
Updating / installing...
   1:mlnx-fw-updater-4.3-1.0.1.0      ################################# [100%]

Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf

Attempting to perform Firmware update...
Querying Mellanox devices firmware ...

Device #1:
----------

Device Type:      ConnectX3
Part Number:      MCX354A-FCB_A2-A5
Description:      ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/s; RoHS R6
PSID:             MT_1090120019
PCI Device Name:  07:00.0
Port1 MAC:        f452147cdb41
Port2 MAC:        f452147cdb42
Versions:         Current        Available
   FW             2.42.5000      2.42.5000
   PXE            3.4.0752       3.4.0752

Status:           Up to date

Log File: /tmp/MLNX_OFED_LINUX.17033.logs/fw_update.log

WARNING: Original /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave

To load the new driver, run:
/etc/init.d/openibd restart
[root@compute8 MLNX_OFED_LINUX-4.3-1.0.1.0-rhel7.4-x86_64]#

After running a TCP/IP based measurement, the following NIC statistics show that there was no packet loss on the server receiving (and discarding) the packets:

NIC statistics:
     rx_packets: 2929854735
     tx_packets: 240471851
     rx_bytes: 26351881113486
     tx_bytes: 14428313538
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     rx_lro_aggregated: 2929852715
     rx_lro_flushed: 1063758904
     rx_lro_no_desc: 0
     tso_packets: 0
     xmit_more: 4
     queue_stopped: 0
     wake_queue: 0
     tx_timeout: 0
     rx_alloc_failed: 0
     rx_csum_good: 2929854330
     rx_csum_none: 405
     rx_csum_complete: 0
     tx_chksum_offload: 240471445
     rx_pause: 0
     rx_pause_duration: 16
     rx_pause_transition: 3
     tx_pause: 6
     tx_pause_duration: 0
     tx_pause_transition: 0

And likewise for the client generating and sending the data (the sender):

NIC statistics:
     rx_packets: 240471849
     tx_packets: 2929854735
     rx_bytes: 15390200814
     tx_bytes: 26363600532426
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     rx_lro_aggregated: 0
     rx_lro_flushed: 0
     rx_lro_no_desc: 0
     tso_packets: 422111480
     xmit_more: 151
     queue_stopped: 0
     wake_queue: 0
     tx_timeout: 0
     rx_alloc_failed: 0
     rx_csum_good: 240471445
     rx_csum_none: 404
     rx_csum_complete: 0
     tx_chksum_offload: 424987950
     pf_rx_packets: 240471849
     pf_rx_bytes: 15390200814
     pf_tx_packets: 2929854735
     pf_tx_bytes: 26363600532426
     rx_pause: 6
     rx_pause_duration: 0
     rx_pause_transition: 0
     tx_pause: 0
     tx_pause_duration: 16
     tx_pause_transition: 3
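
Counters like the ones listed above are typically collected with ethtool on the interface in question, for instance (the interface name is an example):

ethtool -S ens2 | egrep 'packets|bytes|dropped|errors|pause'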

Output of the Mellanox mlnx_tune script, used to check whether the system is well tuned for high network throughput:

1 [root@compute7 ~]# mlnx_tune -p HIGH_THROUGHPUT2 2018-05-07 12:25:40,783 INFO Collecting node information3 2018-05-07 12:25:40,784 INFO Collecting OS information4 2018-05-07 12:25:40,796 INFO Collecting cpupower information5 2018-05-07 12:25:40,799 INFO Collecting watchdog information6 2018-05-07 12:25:40,802 INFO Collecting abrt-ccpp information7 2018-05-07 12:25:40,805 INFO Collecting abrtd information8 2018-05-07 12:25:40,808 INFO Collecting abrt-oops information9 2018-05-07 12:25:40,811 INFO Collecting alsa-state information

10 2018-05-07 12:25:40,813 INFO Collecting anacorn information11 2018-05-07 12:25:40,816 INFO Collecting atd information12 2018-05-07 12:25:40,819 INFO Collecting avahi-daemon information13 2018-05-07 12:25:40,822 INFO Collecting bluetooth information14 2018-05-07 12:25:40,825 INFO Collecting certmonger information15 2018-05-07 12:25:40,828 INFO Collecting cups information16 2018-05-07 12:25:40,831 INFO Collecting halddaemon information17 2018-05-07 12:25:40,834 INFO Collecting hidd information18 2018-05-07 12:25:40,837 INFO Collecting iprdump information19 2018-05-07 12:25:40,840 INFO Collecting iprinit information20 2018-05-07 12:25:40,843 INFO Collecting iprupdate information21 2018-05-07 12:25:40,845 INFO Collecting mdmonitor information22 2018-05-07 12:25:40,848 INFO Collecting polkit information23 2018-05-07 12:25:40,851 INFO Collecting rsyslog information24 2018-05-07 12:25:41,013 INFO Collecting CPU information25 2018-05-07 12:25:41,054 INFO Collecting memory information26 2018-05-07 12:25:41,054 INFO Collecting hugepages information27 2018-05-07 12:25:41,077 INFO Collecting IRQ Balancer information28 2018-05-07 12:25:41,080 INFO Collecting Firewall information


29 2018-05-07 12:25:41,083 INFO Collecting IP table information30 2018-05-07 12:25:41,086 INFO Collecting IPv6 table information31 2018-05-07 12:25:41,089 INFO Collecting IP forwarding information32 2018-05-07 12:25:41,093 INFO Collecting hyper threading information33 2018-05-07 12:25:41,093 INFO Collecting IOMMU information34 2018-05-07 12:25:41,096 INFO Collecting driver information35 ^[[O2018-05-07 12:25:44,989 INFO Collecting Mellanox devices information36 2018-05-07 12:25:46,396 INFO Applying High Throughput profile.37 2018-05-07 12:25:46,439 INFO Some devices' properties might have changed - re-query

system information.38 2018-05-07 12:25:46,439 INFO Collecting node information39 2018-05-07 12:25:46,439 INFO Collecting OS information40 2018-05-07 12:25:46,439 INFO Collecting cpupower information41 2018-05-07 12:25:46,442 INFO Collecting watchdog information42 2018-05-07 12:25:46,445 INFO Collecting abrt-ccpp information43 2018-05-07 12:25:46,448 INFO Collecting abrtd information44 2018-05-07 12:25:46,451 INFO Collecting abrt-oops information45 2018-05-07 12:25:46,454 INFO Collecting alsa-state information46 2018-05-07 12:25:46,457 INFO Collecting anacorn information47 2018-05-07 12:25:46,460 INFO Collecting atd information48 2018-05-07 12:25:46,463 INFO Collecting avahi-daemon information49 2018-05-07 12:25:46,465 INFO Collecting bluetooth information50 2018-05-07 12:25:46,468 INFO Collecting certmonger information51 2018-05-07 12:25:46,471 INFO Collecting cups information52 2018-05-07 12:25:46,474 INFO Collecting halddaemon information53 2018-05-07 12:25:46,477 INFO Collecting hidd information54 2018-05-07 12:25:46,480 INFO Collecting iprdump information55 2018-05-07 12:25:46,483 INFO Collecting iprinit information56 2018-05-07 12:25:46,486 INFO Collecting iprupdate information57 2018-05-07 12:25:46,489 INFO Collecting mdmonitor information58 2018-05-07 12:25:46,492 INFO Collecting polkit information59 2018-05-07 12:25:46,495 INFO Collecting rsyslog information60 2018-05-07 12:25:46,657 INFO Collecting CPU information61 2018-05-07 12:25:46,698 INFO Collecting memory information62 2018-05-07 12:25:46,698 INFO Collecting hugepages information63 2018-05-07 12:25:46,721 INFO Collecting IRQ Balancer information64 2018-05-07 12:25:46,724 INFO Collecting Firewall information65 2018-05-07 12:25:46,727 INFO Collecting IP table information66 2018-05-07 12:25:46,729 INFO Collecting IPv6 table information67 2018-05-07 12:25:46,732 INFO Collecting IP forwarding information68 2018-05-07 12:25:46,737 INFO Collecting hyper threading information69 2018-05-07 12:25:46,737 INFO Collecting IOMMU information70 2018-05-07 12:25:46,739 INFO Collecting driver information71 2018-05-07 12:25:47,661 INFO Collecting Mellanox devices information


Mellanox Technologies - System Report

Operation System Status
        CENTOS
        3.10.0-514.26.2.el7.x86_64

CPU Status
        Intel Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz Sandy Bridge
        Warning: Frequency 2400.0MHz

Memory Status
        Total: 31.29 GB
        Free: 30.34 GB

Hugepages Status
        On NUMA 1:
        Transparent enabled: always
        Transparent defrag: always

Hyper Threading Status
        INACTIVE

IRQ Balancer Status
        NOT PRESENT

Firewall Status
        NOT PRESENT

IP table Status
        NOT PRESENT

IPv6 table Status
        NOT PRESENT

Driver Status
        OK: MLNX_OFED_LINUX-4.1-1.0.2.0 (OFED-4.1-1.0.2)

ConnectX-3 Device Status on PCI 07:00.0
        FW version 2.40.7000
        OK: PCI Width x8
        OK: PCI Speed 8GT/s
        PCI Max Payload Size 256
        PCI Max Read Request 4096
        Local CPUs list [0, 1, 2, 3]

ens2 (Port 1) Status
        Link Type eth
        OK: Link status Up
        Speed 40GbE
        MTU 1500
        OK: TX nocache copy 'off'

ens2d1 (Port 2) Status
        Link Type eth
        OK: Link status Up
        Speed 40GbE
        MTU 1500
        OK: TX nocache copy 'off'

2018-05-07 12:25:49,040 INFO System info file: /tmp/mlnx_tune_180507_122540.log
[root@compute7 ~]#


Appendix B

Scripts and Automation Tools

Scripts and Ansible playbooks used as part of this study are pushed to GitHub and can be found at [100] and [92].



Appendix C

Graphs

[Chart omitted from the text version: "Bare Metal Memory Usage", memory usage in MB (scale 0 to 1000) per configuration: RoCE 1200, RoCE 2200, RoCE 4200, TCP/IP 1500, TCP/IP 9000, TCP/IP 9500, TCP/IP 9900.]

Figure C.1: Bare metal memory usage, RoCE and TCP/IP


[Chart omitted from the text version: "Hypervisor Memory Usage: Paravirtualization (PV) vs SR-IOV", memory usage in MB (scale 0 to 15000) per configuration: PV 1500, PV 9000, PV 9500, PV 9900, SR-IOV RoCE 1200, SR-IOV RoCE 2200, SR-IOV RoCE 4200, SR-IOV TCP/IP 1500, SR-IOV TCP/IP 9000, SR-IOV TCP/IP 9500, SR-IOV TCP/IP 9900.]

Figure C.2: Hypervisor memory usage, Paravirtualization and SR-IOV (RoCE and TCP/IP)


Literature

[1] Michael Armbrust et al. 'A view of cloud computing'. In: Communications of the ACM 53.4 (2010), pp. 50–58.

[2] J. Erbes, H. R. Motahari Nezhad and S. Graupner. 'The Future of Enterprise IT in the Cloud'. In: Computer 45.5 (May 2012), pp. 66–72. ISSN: 0018-9162. DOI: 10.1109/MC.2012.73.

[3] Sean Marston et al. 'Cloud computing—The business perspective'. In: Decision support systems 51.1 (2011), pp. 176–189.

[4] Borko Furht. 'Cloud computing fundamentals'. In: Handbook of cloud computing. Springer, 2010, pp. 3–19.

[5] IDG. Cloud Computing Survey. URL: http://www.idgenterprise.com/resource/research/2016-idg-enterprise-cloud-computing-survey (visited on 30/03/2017).

[6] RightScale. rightscalecloudsurvey. URL: http://www.rightscale.com/blog/cloud-industry-insights/cloud-computing-trends-2017-state-cloud-survey (visited on 30/03/2017).

[7] tbri. 70 percentage of private cloud adopters utilize third parties to manage their environments. URL: http://tbri.com/analyst-perspectives/press-releases/pgView.cfm?release=10438 (visited on 29/04/2017).

[8] Gregor Von Laszewski et al. 'Comparison of multiple cloud frameworks'. In: Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE. 2012, pp. 734–741.

[9] OpenStack. Companies » OpenStack Open Source Cloud Computing Software. URL: https://www.openstack.org/foundation/companies/ (visited on 28/12/2016).


[10] Evangelos Tasoulas, Hårek Haugerund and Kyrre Begnum. 'Bayllocator: a proactive system to predict server utilization and dynamically allocate memory resources using bayesian networks and ballooning'. In: Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques. USENIX Association. 2012, pp. 111–122.

[11] Yaozu Dong et al. 'High performance network virtualization with SR-IOV'. In: Journal of Parallel and Distributed Computing 72.11 (2012). Communication Architectures for Scalable Systems, pp. 1471–1480. ISSN: 0743-7315. DOI: http://dx.doi.org/10.1016/j.jpdc.2012.01.020. URL: //www.sciencedirect.com/science/article/pii/S0743731512000329.

[12] Mendel Rosenblum and Tal Garfinkel. 'Virtual machine monitors: Current technology and future trends'. In: Computer 38.5 (2005), pp. 39–47.

[13] Aravind Menon et al. 'Diagnosing performance overheads in the xen virtual machine environment'. In: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments. ACM. 2005, pp. 13–23.

[14] Jose Renato Santos et al. 'Bridging the Gap between Software and Hardware Techniques for I/O Virtualization.' In: USENIX Annual Technical Conference. 2008, pp. 29–42.

[15] Jeremy Sugerman, Ganesh Venkitachalam and Beng-Hong Lim. 'Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor.' In: USENIX Annual Technical Conference, General Track. 2001, pp. 1–14.

[16] Mohsen Ghaemi. 'Performance analysis and dynamic reconfiguration of a SR-IOV enabled OpenStack cloud'. MA thesis. 2014.

[17] Andre Richter et al. 'Resolving Performance Interference in SR-IOV Setups with PCIe Quality-of-Service Extensions'. In: Digital System Design (DSD), 2016 Euromicro Conference on. IEEE. 2016, pp. 454–462.

[18] Jeffrey Voas and Jia Zhang. 'Cloud computing: New wine or just a new bottle?' In: IT professional 11.2 (2009), pp. 15–17.

[19] Abhishek Gupta and Dejan Milojicic. 'Evaluation of hpc applications on cloud'. In: Open Cirrus Summit (OCS), 2011 Sixth. IEEE. 2011, pp. 22–26.

[20] Peter Mell, Tim Grance et al. 'The NIST definition of cloud computing'. In: (2011).


[21] Wenying Zeng et al. 'Research on cloud storage architecture and key technologies'. In: Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human. ACM. 2009, pp. 1044–1048.

[22] Ian Foster et al. 'Cloud computing and grid computing 360-degree compared'. In: Grid Computing Environments Workshop, 2008. GCE'08. IEEE. 2008, pp. 1–10.

[23] Tharam Dillon, Chen Wu and Elizabeth Chang. 'Cloud computing: issues and challenges'. In: Advanced Information Networking and Applications (AINA), 2010 24th IEEE International Conference on. IEEE. 2010, pp. 27–33.

[24] Qi Zhang, Lu Cheng and Raouf Boutaba. 'Cloud computing: state-of-the-art and research challenges'. In: Journal of internet services and applications 1.1 (2010), pp. 7–18.

[25] Abhinivesh Jain and Niraj Mahajan. 'Introduction to Database as a Service'. In: The Cloud DBA-Oracle. Springer, 2017, pp. 11–22.

[26] Yucong Duan et al. 'Everything as a service (XaaS) on the cloud: origins, current and future trends'. In: Cloud Computing (CLOUD), 2015 IEEE 8th International Conference on. IEEE. 2015, pp. 621–628.

[27] Michael Armbrust et al. Above the clouds: A berkeley view of cloud computing. Tech. rep. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.

[28] Rajkumar Buyya, Rajiv Ranjan and Rodrigo Calheiros. 'Intercloud: Utility-oriented federation of cloud computing environments for scaling of application services'. In: Algorithms and architectures for parallel processing (2010), pp. 13–31.

[29] Uri Lublin et al. kvm: the Linux Virtual Machine Monitor.

[30] Mendel Rosenblum. 'The Reincarnation of Virtual Machines'. In: Queue 2.5 (July 2004), pp. 34–40. ISSN: 1542-7730. DOI: 10.1145/1016998.1017000. URL: http://doi.acm.org/10.1145/1016998.1017000.

[31] Radhwan Y. Ameen and Asmaa Y. Hamo. 'Survey of Server Virtualization'. In: CoRR abs/1304.3557 (2013). URL: http://arxiv.org/abs/1304.3557.

[32] Alan Murphy. 'Virtualization defined-eight different ways'. In: White paper. 2010.


[33] Jyotiprakash Sahoo, Subasish Mohapatra and Radha Lath. 'Virtualization: A survey on concepts, taxonomy and associated security issues'. In: Computer and Network Technology (ICCNT), 2010 Second International Conference on. IEEE. 2010, pp. 222–226.

[34] Andi Mann and EMA Senior Analyst. 'Virtualization 101: Technologies, Benefits, and Challenges'. In: Enterprise Management Associates, Inc (2006).

[35] Amit Singh. 'An introduction to virtualization'. In: kernelthread.com, January (2004).

[36] Rogier Dittner and David Rule Jr. The Best Damn Server Virtualization Book Period: Including Vmware, Xen, and Microsoft Virtual Server. Syngress, 2011.

[37] Simon Grinberg and Shlomo Weiss. 'Architectural virtualization extensions: A systems perspective'. In: Computer Science Review 6.5 (2012), pp. 209–224.

[38] Qian Lin et al. 'Optimizing virtual machines using hybrid virtualization'. In: Journal of Systems and Software 85.11 (2012), pp. 2593–2603.

[39] Jean S Bozman and Gary P Chen. 'Optimizing hardware for x86 server virtualization'. In: IDC White Paper (2009).

[40] Ryan Shea and Jiangchuan Liu. 'Network interface virtualization: challenges and solutions'. In: IEEE Network 26.5 (2012).

[41] Joshua LeVasseur et al. 'Standardized but flexible I/O for self-virtualizing devices'. In: Proceedings of the First conference on I/O virtualization. USENIX Association. 2008, pp. 9–9.

[42] Abel Gordon et al. 'ELI: Bare-metal Performance for I/O Virtualization'. In: SIGPLAN Not. 47.4 (Mar. 2012), pp. 411–422. ISSN: 0362-1340. DOI: 10.1145/2248487.2151020. URL: http://doi.acm.org/10.1145/2248487.2151020.

[43] Muli Ben-Yehuda et al. 'Utilizing IOMMUs for virtualization in Linux and Xen'. In: OLS'06: The 2006 Ottawa Linux Symposium. Citeseer. 2006, pp. 71–86.

[44] AMD. AMD I/O Virtualization Technology. URL: https://support.amd.com/TechDocs/48882_IOMMU.pdf (visited on 12/06/2017).

[45] Intel. Intel® Virtualization Technology for Directed I/O. URL: https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf (visited on 12/06/2017).

[46] PCI-SIG. Specifications. URL: https://pcisig.com/specifications/iov/ (visited on 26/11/2017).


[47] Brian Tierney et al. 'Efficient data transfer protocols for big data'. In: E-Science (e-Science), 2012 IEEE 8th International Conference on. IEEE. 2012, pp. 1–9.

[48] Michael Oberg et al. 'Evaluation of rdma over ethernet technology for building cost effective linux clusters'. In: 7th LCI International Conference on Linux Clusters: The HPC Revolution. 2006.

[49] Dotan, Barak. Introduction to Remote Direct Memory Access (RDMA). URL: http://www.rdmamojo.com/2014/03/31/remote-direct-memory-access-rdma/ (visited on 20/11/2017).

[50] Unknown. Quick Concepts Part 1 – Introduction to RDMA. URL: https://zcopy.wordpress.com/tag/rdma/ (visited on 20/11/2017).

[51] Mellanox. RoCE in the Data Center. URL: http://www.mellanox.com/related-docs/whitepapers/roce_in_the_data_center.pdf (visited on 20/11/2017).

[52] infinibandta.org. RDMA Over Converged Ethernet (RoCE). URL: https://cw.infinibandta.org/document/dl/7148 (visited on 08/03/2017).

[53] Nichole Boscia, Harjot S. Sidhu. Comparison of 40G RDMA and Traditional Ethernet Technologies. URL: https://nas.nasa.gov/assets/pdf/papers/NAS_Technical_Report_NAS-2014-01.pdf (visited on 20/11/2017).

[54] Paul, Grun. RoCE and InfiniBand: Which should I choose. URL: http://blog.infinibandta.org/2012/02/13/roce-and-infiniband-which-should-i-choose/ (visited on 20/11/2017).

[55] linux-kvm.org. Kernel-based Virtual Machine. URL: https://www.linux-kvm.org/page/Main_Page (visited on 20/11/2017).

[56] Todd Deshane et al. 'Quantitative comparison of Xen and KVM'. In: Xen Summit, Boston, MA, USA (2008), pp. 1–2.

[57] wikipedia. Kernel-based Virtual Machine. URL: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine (visited on 20/11/2017).

[58] Binbin Zhang et al. 'Evaluating and optimizing I/O virtualization in kernel-based virtual machine (KVM)'. In: Network and Parallel Computing (2010), pp. 220–231.

[59] linux-kvm.org. Kernel-based Virtual Machine. URL: https://www.linux-kvm.org/page/Processor_support (visited on 20/11/2017).


[60] Matthias Bolte et al. 'Non-intrusive virtualization management using libvirt'. In: Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association. 2010, pp. 574–579.

[61] Libvirt. Libvirt Wiki. URL: https://wiki.libvirt.org/ (visited on 08/03/2017).

[62] wiki.qemu.org. QEMU. URL: https://wiki.qemu.org/Main_Page (visited on 20/11/2017).

[63] Fabrice Bellard. 'QEMU, a fast and portable dynamic translator.' In: USENIX Annual Technical Conference, FREENIX Track. 2005, pp. 41–46.

[64] OpenStack. OpenStack Docs: Overview. URL: https://docs.openstack.org/pike/ (visited on 20/08/2017).

[65] OpenStack. OpenStack Releases: OpenStack Releases. URL: https://releases.openstack.org/ (visited on 20/08/2017).

[66] OpenStack. OpenStack Releases: OpenStack Releases. URL: https://governance.openstack.org/tc/reference/release-naming.html (visited on 20/08/2017).

[67] Tiago Rosado and Jorge Bernardino. 'An Overview of Openstack Architecture'. In: Proceedings of the 18th International Database Engineering & Applications Symposium. IDEAS '14. Porto, Portugal: ACM, 2014, pp. 366–367. ISBN: 978-1-4503-2627-8. DOI: 10.1145/2628194.2628195. URL: http://doi.acm.org/10.1145/2628194.2628195.

[68] Yoji Yamato et al. 'Development of template management technology for easy deployment of virtual resources on OpenStack'. In: Journal of Cloud Computing 3.1 (June 2014), p. 7. ISSN: 2192-113X. DOI: 10.1186/s13677-014-0007-3. URL: https://doi.org/10.1186/s13677-014-0007-3.

[69] Antonio Corradi, Mario Fanelli and Luca Foschini. 'VM consolidation: A real case based on OpenStack Cloud'. In: Future Generation Computer Systems 32 (2014), pp. 118–127.

[70] George Almási et al. 'Toward building highly available and scalable OpenStack clouds'. In: IBM Journal of Research and Development 60.2-3 (2016), pp. 5–1.

[71] OpenStack. OpenStack Docs. URL: https://docs.openstack.org/ocata/install-guide-rdo/overview.html (visited on 20/08/2017).


[72] Jiuxing Liu. 'Evaluating standard-based self-virtualizing devices: A performance study on 10 GbE NICs with SR-IOV support'. In: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on. IEEE. 2010, pp. 1–12.

[73] S. Gugnani, X. Lu and D. K. Panda. 'Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud'. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). May 2017, pp. 238–247. DOI: 10.1109/CCGRID.2017.103.

[74] Chuanxiong Guo et al. 'RDMA over commodity ethernet at scale'. In: Proceedings of the 2016 ACM SIGCOMM Conference. ACM. 2016, pp. 202–215.

[75] Jon Dugan, Seth Elliott, Bruce A. Mah, Jeff Poskanzer, Kaustubh Prabhu. iPerf is a tool for active measurements of the maximum achievable bandwidth on IP networks. URL: https://iperf.fr/ (visited on 26/02/2017).

[76] rperf. rperf - RDMA performance evaluation. URL: http://ftp100.cewit.stonybrook.edu/rperf/ (visited on 16/11/2017).

[77] qperf. GitHub qperf. URL: https://github.com/linux-rdma/qperf/ (visited on 16/11/2017).

[78] Mellanox. Perftest Package. URL: https://community.mellanox.com/docs/DOC-2802 (visited on 25/02/2018).

[79] Redhat. Ansible is Simple IT Automation. URL: https://www.ansible.com/ (visited on 25/02/2018).

[80] Rik van Riel. /proc/meminfo: provide estimated available memory. URL: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773 (visited on 08/05/2017).

[81] Vangelis. Accurate calculation of CPU usage given in percentage in Linux? URL: https://stackoverflow.com/questions/23367857/accurate-calculation-of-cpu-usage-given-in-percentage-in-linux (visited on 20/05/2018).

[82] Patricia Gilfeather and Todd Underwood. 'Fragmentation and High Performance IP.' In: IPDPS. 2001, p. 165.

[83] Christian Ehrhardt. qemu: monitor: do not report error on shutdown. URL: https://libvirt.org/git/?p=libvirt.git;a=commit;h=aeda1b8c56dc58b0a413acc61bbea938b40499e1 (visited on 25/02/2018).


[84] Redhat. Error 'internal error: End of file from qemu monitor' is received after shutting down the VM. URL: https://bugzilla.redhat.com/show_bug.cgi?id=1523314 (visited on 25/02/2018).

[85] Red Hat Bugzilla. Bug 1460760 - Virtio-net interface MTU overwritten to 1500 bytes. URL: https://bugzilla.redhat.com/show_bug.cgi?id=1460760 (visited on 10/05/2018).

[86] Redhat. libvirt NUMA Tuning. URL: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-numa_and_libvirt (visited on 25/02/2018).

[87] Mellanox. Performance Tuning Guidelines for Mellanox Network Adapters. URL: https://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters_Archive.pdf (visited on 20/11/2017).

[88] Jon Dugan, Seth Elliott, Bruce A. Mah, Jeff Poskanzer, Kaustubh Prabhu. iPerf - The ultimate speed test tool for TCP, UDP and SCTP. URL: https://iperf.fr/iperf-doc.php (visited on 26/02/2017).

[89] Mellanox. iperf, iperf2, iperf3. URL: https://community.mellanox.com/docs/DOC-2851 (visited on 29/04/2017).

[90] Vangelis. Cannot get 40Gbps on Ethernet mode with ConnectX-3 VPI. URL: https://community.mellanox.com/thread/1860 (visited on 29/04/2017).

[91] Susinthiran, Sithamparanathan. Cannot get 40Gbps on Ethernet mode with ConnectX-3 VPI. URL: https://community.mellanox.com/thread/3963 (visited on 01/01/2018).

[92] Susinthiran, Sithamparanathan. Ansible Playbook for master thesis. URL: https://github.com/susinths/master/ansible/ (visited on 25/02/2018).

[93] Libvirt.org. Networking. URL: https://wiki.libvirt.org/page/Networking (visited on 20/05/2018).

[94] Libvirt.org. Network XML format. URL: https://libvirt.org/formatnetwork.html (visited on 20/05/2018).

[95] Mark McLoughlin. Checksums, Scatter-Gather I/O and Segmentation Offload. URL: https://blogs.gnome.org/markmc/category/virtio/ (visited on 20/05/2018).

[96] Linux kernel developers. cputime.c. URL: https://github.com/torvalds/linux/blob/master/kernel/sched/cputime.c (visited on 20/05/2018).


[97] Ubuntu community. vcpu0 unhandled rdmsr. URL: https://bugs.launchpad.net/ubuntu/+source/kvm/+bug/1583819 (visited on 25/02/2018).

[98] Redhat Archives. BSOD occuring with 1 game in VM. URL: https://www.redhat.com/archives/vfio-users/2016-May/msg00099.html (visited on 25/02/2018).

[99] Wikipedia. Model-specific register. URL: https://en.wikipedia.org/wiki/Model-specific_register (visited on 25/02/2018).

[100] Susinthiran, Sithamparanathan. Developed scripts for UiO master study. URL: https://github.com/susinths/uiomaster/ (visited on 20/05/2018).
