
Efficient Workload and Resource Management

in Datacenters

by

Hong Xu

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2013 by Hong Xu


Abstract

Efficient Workload and Resource Management

in Datacenters

Hong Xu

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2013

This dissertation focuses on developing algorithms and systems to improve the efficiency of operating mega datacenters with hundreds of thousands of servers. In particular, it seeks to address two challenges: first, how to distribute the workload among the set of datacenters geographically deployed across the wide area; and second, how to manage the server resources of datacenters using virtualization technology.

In the first part, we consider the workload management problem in geo-distributed datacenters. We first present a novel distributed workload management algorithm that jointly considers request mapping, which determines how to direct user requests to an appropriate datacenter for processing, and response routing, which decides how to select a path among the set of ISP links of a datacenter to route the response packets back to a user. In the next chapter, we study some key aspects of cost and workload in geo-distributed datacenters that have not been fully understood before. Through extensive empirical studies of climate data and cooling systems, we make a case for temperature aware workload management, where the geographical diversity of temperature and its impact on cooling energy efficiency can be used to reduce the overall cooling energy. Moreover, we advocate for holistic workload management for both interactive and batch jobs, where the delay-tolerant, elastic nature of batch jobs can be exploited to further reduce the energy cost. A consistent 15% to 20% cooling energy reduction, and a 5% to 20% overall cost reduction, are observed in extensive trace-driven simulations.

In the second part of the thesis, we consider the resource management problem in virtualized datacenters. We design Anchor, a scalable and flexible architecture that efficiently supports a variety of resource management policies. We implement a prototype of Anchor on a small-scale in-house datacenter with 20 servers. Experimental results and trace-driven simulations show that Anchor is effective in realizing various resource management policies, and its simple algorithms can practically solve virtual machine allocation with thousands of VMs and servers in just ten seconds.

Dedication

to my parents


Acknowledgements

The first one is easy: Baochun Li. I started working with him as a Master's student back in 2007. Throughout the years, his enthusiasm and vision have spurred my curiosity about many different topics in computer networking, and his patience and careful revisions have polished almost every single paper of mine. His non-technical advice has also served me well in various areas, such as investments and musical instruments. Of course, there were also the regular workouts at Hart House that improved my 2 km jogging record many times.

I could not overstate my colleague Chen Feng's contribution to this work. I benefited hugely from many discussions in which my rough thoughts and ideas would be challenged and completed by his clear insights and broad knowledge of mathematics. His efforts led to the technical proofs of the convergence of the m-block ADMM algorithm. In fact, ADMM was first introduced to me when he shared Emmanuel Candes's talk with the group.

I am also grateful to the many members of BA4176 who added more flavor to my six-year stay here. The list is in order of seniority: Mea Wang (with her beagle Snoopy), Chuan Wu, Xinyu Zhang, Jin Jin (with his wife Amber and son Aaron), Hui Wang, Di Niu, Zimu Liu, Jiahua Wu, Elias Kehdi, Boyang Wang, Yuan Feng, Wei Wang, Yuefei Zhu, and Yiwei Pu.

Finally, it is always the family. My work would not have been possible without the unconditional support and love of my parents and my girlfriend. This thesis is dedicated to them.


Contents

1 Introduction
  1.1 Workload Management
    1.1.1 Contribution 1: Joint Request Mapping and Response Routing
    1.1.2 Contribution 2: Temperature Awareness and Capacity Allocation
  1.2 Resource Management
    1.2.1 Contribution 3: Anchor — A Flexible and Scalable Resource Management System
  1.3 Thesis Organization

2 Related Work
  2.1 Workload Management
  2.2 Resource Management

3 Joint Request Mapping and Response Routing
  3.1 An Optimization Framework
    3.1.1 Infrastructure
    3.1.2 Performance
    3.1.3 Costs
    3.1.4 Problem Formulation
    3.1.5 Existing Approaches
  3.2 A Primer on ADMM
  3.3 Algorithm Design
    3.3.1 Our Algorithm
    3.3.2 A Parallel Implementation in the Cloud
    3.3.3 Case Study: Affine Utility Functions
  3.4 Evaluation
    3.4.1 Setup
    3.4.2 Performance
    3.4.3 Convergence
  3.5 Summary

4 Temperature Aware Workload Management
  4.1 Motivation
  4.2 Background and Empirical Studies
    4.2.1 Datacenter Cooling
    4.2.2 Outside Air Cooling
    4.2.3 An Empirical Climate Study
  4.3 Model
    4.3.1 System Model
    4.3.2 Energy Cost and Cooling Efficiency
    4.3.3 Utility Loss
    4.3.4 Problem Formulation
    4.3.5 Transforming to the ADMM Form
  4.4 Theory
    4.4.1 Algorithm
    4.4.2 Assumptions
    4.4.3 Convergence
  4.5 A Distributed Algorithm
  4.6 Evaluation
    4.6.1 Setup
    4.6.2 Temperature Aware Workload Management
    4.6.3 Algorithm Convergence
  4.7 Summary

5 Resource Management in Virtualized Datacenters
  5.1 Motivation
  5.2 Background and Model
    5.2.1 A Primer on Stable Matching
    5.2.2 Models and Assumptions
  5.3 Theoretical Challenges of Job-Machine Stable Matching
    5.3.1 Non-existence of Strongly Stable Matchings
    5.3.2 Failure of the DA Algorithm
    5.3.3 Optimal Weakly Stable Matching
  5.4 A New Theory of Job-Machine Stable Matching
    5.4.1 A Revised DA Algorithm
    5.4.2 A Multi-stage DA Algorithm
    5.4.3 An Online Algorithm
  5.5 Showcases of Resource Management Policies
    5.5.1 Cloud Operators
    5.5.2 Cloud Clients
    5.5.3 Additional Commonly Used Policies
    5.5.4 Discussion
  5.6 Implementation and Evaluation
    5.6.1 Setup
    5.6.2 Efficiency of Resource Allocation
    5.6.3 Anchor's Performance at Scale
    5.6.4 Evaluation of Online DA
  5.7 Summary

6 Concluding Remarks
  6.1 Contributions
  6.2 Future Directions

7 Proofs
  7.1 Proof of Lemma 3.1
  7.2 Proof of Lemma 3.3
  7.3 Proof of Inequality (4.21)
  7.4 Proof of Inequality (4.19)
  7.5 Proof of Inequality (4.17)
  7.6 Per-datacenter Sub-problem (4.28) is a SOCP
  7.7 Proof of Theorem 5.7
  7.8 Proof of Theorem 5.8

Bibliography

List of Tables

3.1 2011 annual average day-ahead on-peak prices ($/MWh) in different regional markets. Source: [37].
3.2 Tiered bandwidth prices. Source: Amazon EC2.
4.1 Efficiency of Emerson's DSE™ cooling system with an EconoPhase air-side economizer [33]. Return air is set at 29.4°C (85°F).
4.2 Percentages of hours outside air cooling can be used, when temperature is less than 21°C (70°F).
4.3 Table of notations.
4.4 Power prices ($USD/MWh) at different locations.
5.1 Revised DA
5.2 Multi-stage DA
5.3 Online DA
5.4 Anchor's policy interface.

List of Figures

1.1 A cloud service running on geographically distributed datacenters.
3.1 Total request traffic of the Wikipedia traces [101].
3.2 The U.S. electricity market and our datacenter map. Source: [37].
3.3 Optimal average utility gain.
3.4 Optimal average latency performance.
3.5 Optimal average costs per request.
3.6 CDF of per-request latency.
3.7 CDF of per-request costs.
3.8 CDF of the number of stub datacenters per client.
3.9 CDF of the number of iterations to achieve convergence for our ADMM algorithm and the subgradient method.
4.1 Daily average temperature at three Google datacenter locations. Data from the Global Daily Weather Data of the National Climatic Data Center (NCDC) [78]. Time is in UTC. Notice that Chile is in the Southern Hemisphere and experiences a different seasonal trend.
4.2 Hourly temperature variations at three Google datacenter locations. Data from the Hourly Global Surface Data of NCDC [78]. Time is in UTC.
4.3 The relationship between correlation coefficients of hourly temperatures and latitude for 13 Google datacenter locations.
4.4 Model fitting of pPUE as a function of the outside temperature T for Emerson's DSE™ CRAC [33]. Small circles denote empirical data points.
4.5 Cooling energy savings. Time is in UTC.
4.6 Utility loss reductions. Time is in UTC.
4.7 Baseline and Cooling optimized induce larger average latency.
4.8 Overall cost saving is insensitive to seasonal changes of the climate.
4.9 Convergence results of our distributed ADMM algorithm compared against subgradient methods.
5.1 The Anchor architecture.
5.2 A simple example where there is no strongly stable matching. Recall that p() denotes the preference of an agent.
5.3 An example where a possible execution of the DA algorithm produces a type-2 blocking pair (b, A).
5.4 An example where simply running the revised DA algorithm multiple times will produce a new type-2 blocking pair (c, C).
5.5 VM1 CPU usage on S1 when using the resource hunting policy.
5.6 VM1 CPU usage on S2 when using the consolidation policy.
5.7 VM2 memory usage on S1 and S2.
5.8 S1 CPU and memory usage.
5.9 Consolidation.
5.10 Load Balancing.
5.11 VM happiness in a static setting.
5.12 Server happiness in a static setting.
5.13 Running time in the static setting.
5.14 Number of iterations in the static setting.
5.15 VM happiness in a dynamic setting.
5.16 Server happiness in a dynamic setting.
5.17 Running times of Online DA.
5.18 Convergence of Online DA.

Chapter 1

Introduction

The unprecedented growth of datacenters, in which machines are assembled to process massive amounts of data for Internet-scale services, has been driving the evolution of computing. An industry census reports that global investment in datacenter facilities amounted to $30–$35 billion in 2012 and will continue to grow at a strong pace [27]. These datacenters run hundreds of thousands of servers, consume megawatts of power with a massive carbon footprint, and incur electricity bills of millions of dollars [40, 88]. In 2010, it was disclosed that Google had grown to over 800K servers, and that the company was redesigning its infrastructure to scale to millions of servers [30]. Similarly, Facebook's infrastructure was estimated to have 180K servers running as of August 2012 [28]. These datacenters are used to help us solve large-scale "big data" problems. Google, for example, uses its datacenters to index 20 billion pages a day, process over 100 billion search queries a month, and provide email for 425 million Gmail users [52].

The scale of the infrastructures and their increasing prevalence with the rise of the cloud computing model pose many technical challenges in efficiently managing and operating these datacenters. Motivated by this trend, my doctoral research has focused on the modeling, analysis, and design of algorithms and systems that improve the efficiency and manageability of datacenters. Specifically, they address two challenges that face today's datacenter operators:

1. Workload Management. Increasingly, large technology companies such as Google and Facebook are deploying datacenters across geographical areas in order to provide better latency and reliability. How do we design an efficient workload management system to optimally disseminate the requests from tens of millions of users among the set of datacenters?

2. Resource Management. Datacenters are increasingly virtualized, where the resources of physical machines are shared across tenants in the form of virtual machines. How do we build a flexible and practical resource management system that satisfies the diverse and often conflicting goals of the operator and tenants?

Before we dig into the details of these algorithms and systems as our main contributions, we first give the high-level idea underlying each of them and their relation to prior work.

1.1 Workload Management

Figure 1.1: A cloud service running on geographically distributed datacenters. (Figure labels: clients, requests, mapping nodes, datacenters.)

We consider a cloud service provider, such as Google, Facebook, or Microsoft, that runs geo-distributed datacenters across the globe. The provider deploys mapping nodes, as shown in Figure 1.1, to map client requests to an appropriate datacenter based on certain criteria. This is the request mapping decision that distributes workload geographically. In practice, these mapping nodes can be authoritative DNS servers, as used by Akamai and most CDNs, or HTTP ingress proxies, as used by Google and Yahoo [82, 110]. When a client accesses the URL of the service, the request is sent to a mapping node, which then returns the IP address(es) of a particular datacenter. A mapping node can arbitrarily split a client's request traffic among the set of datacenters; DNS servers and HTTP proxies can achieve such flexibility in commercial products [61, 110]. To cope with dynamic request traffic, request mapping decisions are re-computed periodically, on an hourly or daily basis.
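To make the splitting mechanism concrete, the following is a minimal sketch, not taken from the thesis or any production resolver, of a mapping node that splits request traffic across datacenters according to configurable weights. The class name and IP addresses are hypothetical; in practice, the weights would be the output of the workload management optimization, re-computed each period.

    import random

    # Hypothetical mapping node: splits request traffic across datacenters
    # in proportion to configurable weights, as a DNS server or HTTP ingress
    # proxy might.
    class MappingNode:
        def __init__(self, weights):
            # weights: dict of datacenter IP -> relative share of traffic
            total = sum(weights.values())
            self.weights = {dc: w / total for dc, w in weights.items()}

        def resolve(self):
            # Pick a datacenter with probability proportional to its weight.
            r, acc = random.random(), 0.0
            for dc, w in self.weights.items():
                acc += w
                if r < acc:
                    return dc
            return dc  # guard against floating-point rounding

    node = MappingNode({"203.0.113.10": 0.6, "198.51.100.20": 0.4})
    print(node.resolve())  # one datacenter IP per client request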

By tuning how and which datacenter IP addresses are returned for a request, we essentially control how the workload is distributed across the datacenters. For this reason, request mapping (also referred to as geographical load balancing) is generally deemed to represent the entire workload management problem, and has been the focus of research on this subject. However, we show that the problem design space embraces several other important dimensions, by revisiting the key characteristics of datacenter cost and workload.

1.1.1 Contribution 1: Joint Request Mapping and Response Routing

For a datacenter housing on the order of 50,000 servers and using good-quality commodity off-the-shelf equipment, Greenberg et al. [45] calculate that power draw and networking (including network gear and transit bandwidth) each represent 15% of the total cost of building and running the facility. Greenberg et al. also report that wide-area transit bandwidth costs more than building and maintaining the internal network of a datacenter [46].

Therefore, workload management should consider both energy and bandwidth costs.


As we will see in detail in Sec. 2.1, many efforts have been devoted to saving energy cost by designing intelligent request mapping schemes. The basic idea is to exploit the geographical diversity of electricity prices and map requests to cheaper locations whenever possible. Bandwidth cost is largely ignored, because it is determined by the traffic engineering decision that dictates the routing of packets after a user request has been processed.¹ This is termed response routing, and it is usually managed separately from request mapping in current practice.

¹ It is almost always the case that the request is of negligible size compared to the response, which can be a video clip or a webpage, sent from the datacenter.

A datacenter is multi-homed to many ISPs that provide transit bandwidth with varying capacities and prices, similar to a multi-homed ISP. Thus it uses the same response routing solutions designed for multi-homed ISPs to optimize the bandwidth cost [43]. These solutions assume the egress traffic demand is given, and compute the routing accordingly. While this holds for multi-homed ISPs, which are in general not the source of the traffic and cannot control the egress demand, it is not the case for geo-distributed datacenters, whose egress demand is tunable and precisely controlled by the request mapping decision.

We thus advocate a joint optimization of request mapping and response routing for geo-distributed datacenters. The joint consideration is mutually beneficial: request mapping decisions are made taking into account the bandwidth price diversity of the ISP links of datacenters, saving both energy and bandwidth costs; response routing decisions no longer have to deal with long queueing delays and poor performance due to excessive demand caused by uncoordinated request mapping mechanisms. The design space of workload management is augmented with response routing.

Our first contribution in this thesis is a study of joint request mapping and response routing for workload management of geo-distributed datacenters, which has been published in [116]. We study the workload management problem under a general model. Performance is abstracted using a general convex utility function of the average latency, which may incorporate various considerations such as the notion of fairness. Cost, including electricity and bandwidth, is captured with geographical diversity to reflect the electricity and bandwidth price differences across locations and ISPs. The optimization achieves a desired cost-performance trade-off specified by the operator. By exploiting the intrinsic structure of the problem, we develop a novel distributed algorithm based on the alternating direction method of multipliers (ADMM). It decomposes the problem into many small sub-problems that can be solved independently, leading to a parallel implementation in datacenters. We demonstrate, through extensive simulations with Wikipedia workload traces [101], that our algorithm optimally solves the problem in only tens of iterations, while traditional subgradient methods take hundreds of iterations. Moreover, it can be terminated prematurely with a remarkably small performance loss.
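For readers unfamiliar with ADMM, the generic two-block iteration from the standard literature (the notation here is generic, not the thesis's specific instantiation) for minimizing $f(x) + g(z)$ subject to $Ax + Bz = c$ alternates

\begin{align*}
x^{k+1} &:= \arg\min_{x} \; L_\rho(x, z^k, y^k), \\
z^{k+1} &:= \arg\min_{z} \; L_\rho(x^{k+1}, z, y^k), \\
y^{k+1} &:= y^k + \rho \left( A x^{k+1} + B z^{k+1} - c \right),
\end{align*}

where $L_\rho(x, z, y) = f(x) + g(z) + y^\top (Ax + Bz - c) + (\rho/2)\,\|Ax + Bz - c\|_2^2$ is the augmented Lagrangian with penalty parameter $\rho > 0$. In our algorithm, the independent per-datacenter sub-problems roughly play the role of these block minimizations, solved in parallel.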

1.1.2 Contribution 2: Temperature Awareness and Capacity Allocation

Through the previous study of workload management, we find that some key aspects of datacenter cost and workload have not been fully understood and explored. First, cooling systems account for 30% to 50% of the energy consumption of a datacenter [87, 121]. In existing work, these energy-gobbling cooling systems are modeled with a constant energy efficiency factor to calculate the total energy consumption. Yet this tends to be an over-simplification at best. Through extensive empirical studies, we find that their actual energy efficiency depends directly on the ambient temperature, which exhibits a significant degree of geographical diversity. Temperature diversity can be used along with price diversity in making request routing decisions to reduce the overall cooling energy overhead for geo-distributed datacenters. The design space of workload management is augmented with temperature awareness.

Second, many measurements find that datacenters generally run two kinds of workload simultaneously: delay-sensitive interactive jobs driven by external user requests, and delay-tolerant batch jobs driven by back-end batch processing tasks such as indexing and data mining [91]. This phenomenon makes intuitive sense if one understands the workings of Internet and cloud services. For example, a Google datacenter needs to handle requests for the entire suite of its Internet services, such as Gmail and YouTube. At the same time, it also runs many batch jobs in the back-end to support search and targeted advertising in these services.

The mixed nature of datacenter workloads actually provides more opportunities to utilize the cost diversity of energy. Since batch jobs are delay tolerant, we may optimize the capacity allocation between batch and interactive jobs together with price-aware request mapping to further utilize the geographical diversity of electricity prices. The design space of workload management is thus augmented with a new dimension, capacity allocation, which fundamentally changes the problem formulation: capacity for interactive jobs becomes an optimization variable instead of a given constant as in prior work, as sketched below.
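As a minimal illustration, with hypothetical notation rather than the thesis's: let $\lambda_i$ be the interactive workload mapped to datacenter $i$, $c_i$ the capacity allocated to interactive jobs, and $b_i$ the capacity consumed by batch jobs, with total capacity $C_i$. A schematic formulation is

\begin{align*}
\min_{\lambda,\, c,\, b} \quad & \sum_i \mathrm{EnergyCost}_i(\lambda_i, c_i, b_i) + \mathrm{UtilityLoss}(\lambda) \\
\text{s.t.} \quad & \lambda_i \le c_i, \qquad c_i + b_i \le C_i \quad \forall i,
\end{align*}

whereas prior formulations treat $c_i$ as a given constant and optimize only over the mapping $\lambda$.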

Our second contribution is a holistic workload management approach that considers both interactive and batch jobs with temperature awareness, presented in [115]. We motivate our idea with extensive empirical studies of the dependence of cooling energy efficiency on temperature, using daily and hourly climate data for 13 Google datacenter locations over a one-year period. We develop a novel distributed m-block (m ≥ 3) ADMM algorithm that extends the classical 2-block algorithm used in [116] to solve the problem. We conduct extensive trace-driven simulations with real-world electricity prices, historical temperature data, and an empirical cooling efficiency model to realistically assess the potential of our approach. We find that it consistently delivers a 15%–20% cooling energy reduction and a 5%–20% overall cost reduction for geo-distributed datacenters. The distributed m-block ADMM algorithm converges quickly, within 70 iterations, while subgradient methods fail to converge within 200 iterations.
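In schematic form, again with generic notation, the m-block extension performs a Gauss-Seidel sweep over the blocks before a single dual update, for minimizing $\sum_{j=1}^{m} f_j(x_j)$ subject to $\sum_{j=1}^{m} A_j x_j = c$:

\begin{align*}
x_j^{k+1} &:= \arg\min_{x_j} \; L_\rho\big(x_1^{k+1}, \ldots, x_{j-1}^{k+1}, x_j, x_{j+1}^{k}, \ldots, x_m^{k}, y^k\big), \quad j = 1, \ldots, m, \\
y^{k+1} &:= y^k + \rho \Big( \textstyle\sum_{j=1}^{m} A_j x_j^{k+1} - c \Big).
\end{align*}

Unlike the 2-block case, convergence for $m \ge 3$ does not follow from the classical theory, which is why the dedicated convergence analysis in Chapter 4 is needed.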

1.2 Resource Management

1.2 Resource Management

Modern datacenters rely heavily on virtualization [9] to flexibly multiplex applications

onto physical servers, in order to e�ciently utilize resources. With virtualization, the

resources of a physical server, including CPU, memory, storage, and network bandwidth,

can be sliced into so-called virtual machines. These virtual machines can run their own

operating systems and applications that are di↵erent from the operating system of the

hosting server.

Virtualization enables the boom of cloud computing, a new computing paradigm that allows applications, as tenants, to dynamically request resources on demand in the form of virtual machines. If an application demands more resources, we simply deploy more virtual machines; if the demand decreases, we simply shut down the redundant virtual machines, and the vacant server resources can be used to support other applications in need. Tenants may do so by going to public cloud providers that sell virtual machines for profit, such as Amazon EC2 [5]. Alternatively, enterprises can deploy virtualization technology on their own premises and manage their own virtualized environments with their own IT support, which is often termed a private cloud.

No matter what type of cloud we look at, public or private, one thing is common: applications are packaged and run in the form of virtual machines (VMs) that share the infrastructure, and these applications are owned by different tenants of the cloud. Due to this multi-tenant nature, resource management becomes a major challenge for cloud operators. The key question is: how can we efficiently allocate the resources of physical servers to virtual machines with different resource requirements and performance objectives? According to a 2010 survey [119], resource management is the second most pressing concern that CTOs express, after security.

Most existing resource management solutions are designed to achieve some notion of performance optimality. We illustrate the related work in more detail in Sec. 2.2. Instead, we advocate designing for generality and manageability, for several reasons. Operators often have different definitions of optimality; even the same operator may wish to optimize for different objectives in different scenarios. Also, individual tenants have vastly different performance considerations. Thus, it is more important to design a flexible architecture that supports a variety of resource management policies with a single underlying mechanism. The clean separation between policy and mechanism is the key design principle that makes the system practical.

1.2.1 Contribution 3: Anchor — A Flexible and Scalable Resource Management System

Our third contribution in this thesis is the design and implementation of Anchor, a flexible and scalable resource management substrate based on this design principle [117]. We draw upon stable matching theory as the underpinning of resource management in virtualized datacenters. The concept of preferences is used so that the operator and tenants can configure the way preferences are generated to express their goals. Rather than optimality, stability is used as the solution concept to resolve possible conflicts of interest, in the sense that no alternative pair of VMs and servers can both be better off outside the matching.

The highlight of our contribution here is a new theory of stable matching that models VMs with varying resource requirements as jobs with heterogeneous sizes. Size heterogeneity, which has not been studied before in either computer science or economics, greatly complicates the problem, since even the classical definition of stability becomes unclear. We propose a new stability concept to resolve this challenge, design offline and online algorithms to efficiently find stable matchings with respect to the new definition, and prove convergence results. We implement a prototype of Anchor, consisting of a resource monitor, a policy API that allows users to configure policies, and a matching engine that runs our algorithms, on a small-scale in-house datacenter with 20 servers. Experimental results and trace-driven simulations show that Anchor is effective in realizing various resource management policies, and its simple algorithms can practically solve problems with thousands of VMs and servers in just ten seconds.
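For background, the following is a minimal sketch of the classical deferred acceptance (DA) procedure of Gale and Shapley, with jobs proposing to machines. This is the textbook unit-capacity version, not the thesis's size-aware revised or multi-stage variants; the VM and server names are made up.

    # Classical deferred acceptance (Gale-Shapley), jobs proposing to
    # machines. Unit capacity only; the thesis's algorithms additionally
    # handle heterogeneous job sizes, which this sketch does not.
    def deferred_acceptance(job_prefs, machine_prefs):
        # job_prefs: {job: [machines, most preferred first]}
        # machine_prefs: {machine: [jobs, most preferred first]}
        rank = {m: {j: r for r, j in enumerate(prefs)}
                for m, prefs in machine_prefs.items()}
        matched = {}                      # machine -> job currently held
        next_choice = {j: 0 for j in job_prefs}
        free = list(job_prefs)
        while free:
            j = free.pop()
            if next_choice[j] >= len(job_prefs[j]):
                continue                  # j exhausted its list; unmatched
            m = job_prefs[j][next_choice[j]]
            next_choice[j] += 1
            held = matched.get(m)
            if held is None:
                matched[m] = j            # machine holds its first proposal
            elif rank[m][j] < rank[m][held]:
                matched[m] = j            # machine prefers j; reject held
                free.append(held)
            else:
                free.append(j)            # j rejected; proposes again later
        return {j: m for m, j in matched.items()}

    # Example: two VMs (jobs) and two servers (machines).
    jobs = {"vm1": ["s1", "s2"], "vm2": ["s1", "s2"]}
    machines = {"s1": ["vm2", "vm1"], "s2": ["vm1", "vm2"]}
    print(deferred_acceptance(jobs, machines))  # {'vm2': 's1', 'vm1': 's2'}

The resulting matching is stable in the classical sense: no VM-server pair would both prefer each other over their assigned partners.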

1.3 Thesis Organization

This thesis is organized as follows. Chapter 2 summarizes prior work on both workload and resource management in datacenters. Chapter 3 presents our study of joint request mapping and response routing. Chapter 4 presents our study of temperature aware workload management with capacity allocation. Chapter 5 discusses the design and implementation of Anchor. Chapter 6 offers concluding remarks and possible future directions. Chapter 7 contains detailed proofs of the results in the previous chapters.


Chapter 2

Related Work

This chapter surveys the landscape of research in workload and resource management for datacenters.

2.1 Workload Management

Industry Practices. Request mapping is an indispensable component of any large-scale distributed system. Most small- and medium-scale production systems adopt a primitive round-robin DNS technique that distributes the workload by rotating IP addresses in DNS responses to each user [32], while large-scale systems often balance a number of concerns, such as network latency and bandwidth cost, in mapping users to clusters of machines [20, 88, 120].

Related traffic engineering schemes. Before the rise of datacenters, a body of related work investigated how to jointly optimize routing for bandwidth costs and performance. Goldberg et al. [43] propose a method to optimize network cost and performance for multihomed users. They develop algorithms, within an optimization framework, to take advantage of burst billing schemes for bandwidth, such as 95th-percentile charging; a sketch of this billing scheme appears after this paragraph. Zhang et al. [120] describe a comprehensive system to jointly optimize performance and bandwidth cost for a geo-distributed cloud service. They first develop a method to measure, in real time, the performance and cost of routing traffic to a destination prefix via any one of the many alternative paths not currently being used, without actually redirecting traffic to those paths. They then design an optimization algorithm to compute the optimal traffic engineering strategy based on the measurements. These two works focus on bandwidth cost for multihomed systems, and do not consider energy costs. Finally, Agarwal et al. [3] describe a method to optimize the placement of objects and replicas (or shards) in a large geo-distributed cloud service.
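As an aside for readers unfamiliar with it, here is a minimal sketch of how a 95th-percentile bandwidth bill is typically computed; this is standard industry practice rather than a method from the cited works. Usage is sampled (commonly every 5 minutes) over the billing month, the top 5% of samples are discarded as free bursts, and the highest remaining sample is billed at the contracted rate.

    # 95th-percentile bandwidth billing: sort the month's traffic samples,
    # drop the top 5%, and bill the highest remaining sample.
    def percentile_95_bill(samples_mbps, price_per_mbps):
        ordered = sorted(samples_mbps)
        idx = max(int(len(ordered) * 0.95) - 1, 0)
        return ordered[idx] * price_per_mbps

    # Example: a month of mostly ~100 Mbps with a short burst to 900 Mbps.
    samples = [100] * 95 + [900] * 5            # e.g., 5-minute samples
    print(percentile_95_bill(samples, 2.0))     # bills at 100 Mbps -> 200.0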

Request mapping to save energy cost. It may appear that request mapping has nothing to do with energy cost, since no matter how we distribute the workload, the total workload and the incurred energy consumption are the same. The authors of [88] seminally point out that, to the contrary, electricity prices vary substantially across geographical areas, and this variation can and should be used to aid the request mapping optimization in reducing energy cost. They conduct an extensive and careful analysis of empirical wholesale electricity prices, and show that electricity prices exhibit both temporal and geographic variation, due to regional demand differences, transmission inefficiencies, and generation diversity. In those parts of the U.S. with wholesale electricity markets, prices vary on an hourly basis and are often not well correlated across locations. Moreover, these variations are substantial, as much as a factor of 10 from one hour to the next. It is then possible for large-scale distributed systems, such as datacenters, to dynamically move demand (i.e., route requests) to places with lower prices in order to reduce energy cost. They propose a simple formulation that uses energy cost minimization as the objective, and incorporates bandwidth and performance goals as constraints. Through trace-driven simulations with 24 days of Akamai request traces and 39 months of hourly electricity prices from 29 U.S. locations, they show that existing systems can potentially save millions of dollars on the electricity bill without any increase in bandwidth costs or significant reduction in performance [88].

Inspired by [88], request mapping in geo-distributed datacenters has gained increasing attention in recent years [40, 65, 68, 72, 90, 110]. [110] focuses on algorithmic design, and develops a decentralized request mapping system that supports a variety of configurable policies for different cloud service providers. The authors design a simple interface where each service location specifies a split weight or a bandwidth cap, and a performance penalty, to take both cost and performance into account. This API is shown to capture diverse request mapping goals based on client proximity, server capacity, 95th-percentile billing, and so on. It also leads to a formal linear program. The optimization is decomposed into a distributed algorithm that requires only minimal coordination between the nodes, and the decentralized algorithm provably converges to the solution of the optimization problem. A prototype system is also provided and evaluated in detail. [90] takes a queueing-theoretic approach to model the load-dependent processing delay of requests, but ignores network latency. The request mapping problem incorporates a quality-of-service guarantee on the total delay a user experiences, and is formulated as a mixed integer program. A polynomial-time approximation algorithm based on the minimum cost flow problem is developed.

[65] considers the effect of request mapping on providing environmental gains by using green renewable energy. The authors propose a general model where each datacenter minimizes its cost, which is a linear combination of an energy cost and the lost revenue due to the delay of requests (including both network propagation delay and load-dependent queueing delay within a datacenter). The optimal geographical load balancing solutions are characterized, and shown to have practically appealing properties, such as sparse routing tables. Two distributed algorithms are then developed that provably compute the optimal routing and provisioning decisions. The distributed algorithms are evaluated through trace-driven simulation of a realistic, distributed, Internet-scale system. The results show that a cost saving of over 40% during light-traffic periods is possible. The feasibility and benefits of using geographical load balancing to facilitate the integration of renewable sources into the grid are then studied. It is shown that when the datacenter incentive is aligned with the social objective of reducing brown energy, by dynamically pricing electricity proportionally to the fraction of the total energy from brown sources, a renewable-aware request mapping scheme brings significant social benefits.

In a similar vein, [72] shows that request mapping is also helpful for the smart grid. It is instrumental in controlling the impact of datacenter energy consumption on the grid, and in helping balance the electric grid and make it more robust with respect to link breakage and load demand variations. For example, we can reduce the computation load, and consequently the energy consumption, at a datacenter in an area where the grid is prone to circuit overflow. In this regard, the authors of [72] formulate the request mapping problem jointly with the power flow analysis in the smart grid within an optimization framework, and explain how these problems are related. Simulation results based on the settings of the IEEE 24-bus Reliability Test System show that a grid-aware request mapping design can significantly help achieve better load balancing and more robustness in the grid.

Following [65], Gao et al. [40] use carbon footprint as the explicit metric of a datacenter's environmental impact, instead of putting a dollar cost on carbon emissions as in [65]. They consider optimizing request mapping to reduce both the energy cost and the carbon footprint of geo-distributed datacenters. The key idea is that electricity generation sources also have geographical diversity similar to prices: some datacenter locations have access to greener electricity generated from renewable sources such as solar and wind. Through empirical studies of the electricity fuel mix and price in each of the 50 U.S. states, they first demonstrate that there is no correlation between the "cleanness," i.e., the carbon footprint of electricity generation, and the price. Consequently, schemes that reduce the electricity cost of powering an Internet service using request routing do not also reduce the carbon emissions of datacenter operations. By profiling and modeling the composition of generation sources through empirical data, the average carbon footprint of a location can be calculated. The authors then propose FORTE, a Flow Optimization based framework for request Routing and Traffic Engineering, that offers a principled approach to assigning users and data objects to datacenters. FORTE performs user assignment by weighting each assignment's effect on three metrics: access latency, electricity cost, and carbon footprint. By making this three-way trade-off explicit, FORTE enables providers to determine their optimal operating point balancing performance against cost and carbon footprint. Fast approximation algorithms are proposed to solve the problem in large-scale settings, where users are aggregated into tens of thousands of locations. Using simulations with traces from the Akamai CDN, it is shown that FORTE can reduce Akamai's carbon footprint by 10% without increasing electricity cost, while simultaneously bounding the access latency.

Other issues related to request mapping, such as fairness of performance among users and the dynamics of request traffic, have also been studied [93, 113, 114]. Ren et al. [93] propose a provably efficient online scheduling algorithm based on Lyapunov optimization that optimizes for both energy cost and performance fairness among users. However, they focus on batch jobs, which may be difficult to re-distribute from one location to another due to data availability constraints. [113, 114] propose similar ideas to exploit the statistical multiplexing of traffic demands from different users to reduce the overall resources required. Although we do not currently account for multiplexing explicitly, this aspect can be incorporated into our framework fairly easily.

Reducing energy consumption of datacenters. Instead of looking at energy cost, another thread of related work focuses on reducing the total energy consumption of datacenters. Power management of servers has long been studied, where the processor frequency is adjusted according to load to save energy, a technique known as dynamic speed/voltage scaling [6, 66, 109]. A more aggressive idea is to dynamically turn off servers when capacity is not needed, and turn them back on when demand increases. This is termed dynamic right-sizing in [63], where the authors formulate the problem as an optimization that considers the server energy cost and a switching cost modeling the cost of toggling a server between active and energy-saving modes. It is proved that the optimal offline algorithm for this optimization has a simple structure when viewed in reverse time, and this structure is exploited to develop a new online algorithm, which is proved to be 3-competitive. The algorithm is validated using traces of two real workloads, from Hotmail and a Microsoft Research datacenter, and 40% cost savings are observed.

Dynamic right-sizing is further studied in [68, 112] towards a power-proportional datacenter design. In [68], several online and offline algorithms are proposed to turn off CDN servers during periods of low load, while seeking to balance three key design goals: maximizing energy reduction, minimizing the impact on client-perceived service availability (SLAs), and limiting the frequency of on-off server transitions to reduce the impact on reliability. Trace-driven simulation results show that it is possible to reduce the energy consumption of a CDN by more than 55% while ensuring a high level of availability that meets customer SLA requirements, incurring an average of one on-off transition per server per day. In [112], Xu et al. explicitly differentiate delay-sensitive and delay-tolerant jobs when considering dynamic right-sizing. They focus on the problem of using delay-tolerant jobs to fill the extra capacity of datacenters, referred to as trough/valley filling. Giving higher priority to delay-sensitive jobs, their schemes complement the previous right-sizing schemes. Specifically, they propose two joint dynamic speed scaling and traffic shifting schemes, and use simulations to validate the algorithms. Similarly, virtualization and server consolidation have been utilized to reduce the number of active servers, e.g., in [19, 77, 105]. It is worth noting, however, that due to concerns about the wear-and-tear cost of transition periods and the reliability of the service, operators often leave all servers on in practice [22, 45].

A large body of work addresses thermal and temperature management to reduce the cooling energy consumption of datacenters. Approaches that have been investigated include, for example, methods to minimize air flow inefficiencies in server rooms [86, 99], load balancing and the incorporation of temperature awareness into workload placement among servers [10, 36, 73, 89], and the optimization and control of various cooling technologies with load-dependent energy efficiency [64, 121].

All these lines of work focus on the energy consumption of a single datacenter. Thus they are orthogonal and complementary to our work, which considers geo-distributed datacenters.

2.2 Resource Management

We now present related work on resource management in virtualized datacenters.

VM Placement. VM placement on a shared infrastructure has been extensively studied. Current industry solutions from virtualization vendors, such as VMware vSphere [107] and Eucalyptus [34], and open-source efforts, such as Nimbus [79] and CloudStack [24], only provide a limited set of pre-defined placement policies and simple greedy heuristics [97]. Existing papers develop specifically crafted algorithms and systems for specific scenarios, such as consolidation based on CPU usage [21, 70, 102], energy consumption [19, 77, 106], bandwidth multiplexing [16, 56, 69], and storage dependence [60]. The problem is usually formulated as a difficult combinatorial optimization, which leads to many different heuristic designs.

Specifically, in [102], Urgaonkar et al. present techniques for provisioning CPU and network bandwidth with overbooking in a shared environment. The primary contribution of this work is to demonstrate the feasibility and benefits of overbooking resources in shared platforms to maximize the platform yield: the revenue generated by the available resources. OS-kernel-based application profiling is used to develop a token-bucket model of application resource needs, which then guides the placement of application components. Experiments on a Linux cluster implementation indicate that overbooking resources by as little as 1% can double the utilization of the cluster, and a 5% overbooking yields a 300%–500% improvement, while still providing useful resource guarantees to applications. In [21, 70], the authors exploit statistical multiplexing among the workload patterns of multiple VMs, i.e., the peaks and valleys in one workload pattern do not necessarily coincide with those of the others. The idea is that the unused resources of a lightly utilized VM can be borrowed by co-located VMs with high utilization. Compared to individual-VM based provisioning, joint-VM provisioning can lead to much higher resource utilization. An algorithm for estimating the aggregate size of multiplexed VMs, and a VM selection algorithm that seeks to find VM combinations with complementary workload patterns, are developed and evaluated.

In [19, 77, 106], minimizing power consumption is the primary objective of VM placement. Substantial implementation efforts are presented in [77] to support the isolated and independent operation assumed by VMs, and to make it possible to control and globally coordinate the effects of the diverse power management policies these VMs apply to virtualized resources. Algorithms and heuristics for exploiting the energy efficiency differences of server hardware to reduce power consumption are developed in [19, 106]. Further, [16, 56, 69] consider the communication patterns and bandwidth cost between VM pairs when making placement decisions: VMs that frequently talk to each other tend to be co-located in order to save network bandwidth.

These techniques and proposals are complementary to our system Anchor, as their insights and strategies can be incorporated as policies to serve different purposes, without the need to design new algorithms from the ground up.

OpenNebula [97], a resource management system for virtualized infrastructures, is the only related work to our knowledge that also decouples management policies from mechanisms. It is an open source system that deploys virtualized services on both a local pool of resources and external IaaS clouds. It uses a simple first-fit algorithm, based on a ranking scheme configurable by the administrator, to place VMs on physical resources. We instead use the stable matching framework for VM placement. Our algorithms are able to consider the interests of both the operator and clients and to orchestrate the conflicts between them, as will be demonstrated in Chapter 5.

There is a small literature on online VM placement. [44, 58, 96] develop systems to predict the dynamic resource demand of VMs and guide the placement process. [56] considers minimizing the long-term routing cost between VMs. These works consider various aspects to refine the placement process, and are orthogonal to our work that addresses the fundamental problem of VM size heterogeneity.

Job Scheduling. There have been tremendous efforts on job scheduling in a computer cluster or grid. [54, 95] represent several early works that focus on minimizing the expected completion time of all jobs by strategically choosing their execution sequence. [62, 74] discuss the online algorithm design problem for grid computing. For more complete coverage of the literature, readers are directed to [15, 75]. The focus of this line of work is on the completion times of jobs, which in our case are not known a priori to the scheduler of a cloud, since users may use VMs for as long as they wish.


Chapter 3

Joint Request Mapping and Response Routing

As seen in Sec. 2.1, previous work on workload management has largely focused on request mapping to reduce the energy cost. Bandwidth, another significant cost contributor, is managed by the response routing decision, which determines how the response packets are sent back to the user through one of the multi-homed ISP links connected to a datacenter [43]. These two decisions are managed and studied separately in today's practice, leading to poor performance and high costs in many cases [61, 76]. For example, too many requests may be directed to a datacenter whose upstream links then become congested, resulting in long queueing delays and poor performance. The objectives of the two decisions can also be misaligned and lead to sub-optimal equilibria.

In this chapter, we consider the joint request mapping and response routing problem in geo-distributed datacenters, with the objective of reducing the overall energy and bandwidth costs. Specifically, we formulate the problem as a general workload management optimization, where key performance and cost issues are realistically modeled. We use a utility function of the average latency [122] to capture the various performance goals providers wish to achieve for their services. We consider both the electricity and bandwidth costs, which exhibit significant location and provider diversity [88, 98, 103] and together account for the majority of the datacenter operational expense (OPEX) [45].

The workload management problem is a convex optimization, and can be solved in a centralized way. However, it is inherently a very large-scale problem that makes a centralized algorithm inefficient. In a production system, the problem typically has millions of variables and hundreds of thousands of constraints, as we shall illustrate in Sec. 3.1.4. Commercial solvers with centralized algorithms take hours to solve even a simpler linear program of the same size [50]. A centralized algorithm also introduces a single point of failure, and makes the system less responsive to sudden changes in request rates (e.g., flash crowds) or network conditions (e.g., link failures). In these situations, a solution with fast computation and modest accuracy is more desirable, and can take advantage of the abundant server resources in a datacenter to parallelize the computation for such large-scale problems.

Thus, for reasons of performance, scalability, and robustness, we are motivated to develop a distributed solution for the workload management problem. Our algorithm is based on the alternating direction method of multipliers (ADMM), a simple yet powerful algorithm that has recently found practical use in many large-scale distributed convex optimization problems [13]. ADMM works by first separating the objective and variables into two parts, and then alternately optimizing one set of variables that accounts for one part of the objective, iteratively reaching the optimum. The merits of ADMM, compared to conventional methods such as subgradient methods [12], are its fast convergence to modest accuracy, its insensitivity to step sizes, and its robustness without strong assumptions such as strict convexity of the objective function [11, 13].

This chapter is organized as follows. First, Sec. 3.1 presents a general formulation of the joint request mapping and response routing problem for geo-distributed datacenters. We use utility functions to capture various performance objectives, and consider the location diversity of the associated electricity and bandwidth costs. After Sec. 3.2 introduces the basics of ADMM, Sec. 3.3 develops a novel distributed algorithm based on ADMM to solve the large-scale optimization problem efficiently. We demonstrate that, after a transformation, the problem can be decomposed into many small sub-problems whose solutions are coordinated to find the global optimum, and which can be efficiently solved in the general case. We further provide solutions in analytical form for the case when the utility function is affine, and discuss issues pertaining to a parallel implementation of the algorithm in the cloud. Sec. 3.4 details an empirical evaluation of the algorithm using the Wikipedia workload traces [101], as well as real-world latency measurements [67]. It is demonstrated that our algorithm offers near-optimal performance within 20 iterations.

We stress that the techniques developed in this chapter to transform the problem and apply ADMM are fairly general. They will also be used in Chapter 4, and may be applicable to problems in datacenters and other domains where an efficient parallel algorithm is required to solve large-scale convex optimization problems.

3.1 An Optimization Framework

Let us start by presenting our model and optimization framework.

3.1.1 Infrastructure

We consider a provider that runs her cloud service over a set of datacenters $\mathcal{N}$ in distinct geographical regions. Each datacenter $n$ is multi-homed to a set of ISP links $\mathcal{M}_n$, each with a fixed capacity. Let $\mathcal{I}$ denote the set of clients, where in this work a client $i$ is simply a unique IP prefix, similar to [82].

The provider deploys a number of mapping nodes as shown in Fig. 1.1 to map client requests to an appropriate datacenter based on certain criteria. This is the request mapping decision. When a datacenter finishes serving a request, it sends the response packets back to the client through one of the available ISP links. This corresponds to the response routing decision. Today's BGP routing picks a single egress ISP link for each IP prefix. We relax this constraint and allow the provider to arbitrarily split the response traffic among all ISP links, which is commonly accepted in the literature [76, 81]. Such fractional routing can be achieved by hash-based traffic splitting in practice [18]. Since the datacenter does not provide IP transit, such routing changes do not cause BGP convergence issues.

Without loss of generality, we view every possible combination of datacenter and ISP link as a virtual stub datacenter, a concept we use to facilitate our analysis in the sequel. We let $j \in \mathcal{J}$, $\mathcal{J} := \mathcal{N} \times \{\mathcal{M}_n\}$, denote a stub datacenter, i.e., the tuple $\langle n, m \rangle$, $n \in \mathcal{N}$, $m \in \mathcal{M}_n$. Each stub datacenter then has a finite capacity $C_j$ determined by its corresponding ISP link's capacity. Here we implicitly assume that the link capacity, rather than the datacenter's computational capability, is the bottleneck of the service, which is generally the case in reality. The request mapping and response routing decisions can then be treated jointly as a single workload management optimization between the clients and the stub datacenters.
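To make the construction concrete, the following Python fragment (a minimal sketch with a hypothetical two-datacenter inventory) enumerates the stub datacenters and their capacities:

    # Hypothetical inventory: datacenter n -> its ISP links M_n with capacities.
    links = {"dc-east": {"ispA": 4e5, "ispB": 6e5},
             "dc-west": {"ispA": 5e5, "ispC": 3e5}}

    # J := N x {M_n}: one stub datacenter j = <n, m> per (datacenter, link)
    # pair, whose capacity C_j is inherited from the link.
    stubs = {(n, m): cap for n, ms in links.items() for m, cap in ms.items()}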

The provider periodically, e.g., hourly or daily, computes the workload management decisions to better cope with demand dynamics under normal operations. We use $\alpha_{ij} \in [0, 1]$ to denote the proportion of requests distributed to stub datacenter $j$ from client $i$; $\alpha_{ij}$ is our decision variable. We assume the provider employs statistical machine learning techniques [81, 85] to predict the traffic demand $D_i$ of each client before each optimization interval. Such an assumption is commonly made in the literature [76, 110, 114].

3.1.2 Performance

Latency is arguably the most important performance metric for most cloud services. A small increase in the user-perceived latency can cause substantial revenue loss for the provider [59]. In this work we focus on the end-to-end propagation latency between users and datacenters, which largely accounts for the user-perceived latency compared to other factors such as request processing times at datacenters [38, 76].

The provider obtains the propagation latency $L_{ij}$ between client $i$ and stub datacenter $j$ through active measurements [67] or other means. A client's performance depends on the average propagation latency its requests receive, $\sum_j \alpha_{ij} L_{ij}$, through a generic utility function $U$. $U$ can take various forms depending on the performance goals the provider pursues. We only require that $U$ is a decreasing, differentiable, and concave function. This utility notion allows us a considerable amount of expressiveness. For example, it can incorporate fairness among clients by using the canonical alpha-fair utility functions [71].

3.1.3 Costs

Two kinds of operating costs, electricity and bandwidth, are involved in serving client requests, both of which scale with the total volume of the workload. The electricity price exhibits significant location diversity, which has been exploited to save costs for datacenters. We use $P^E_j$ to denote the power price of stub datacenter $j$, which is determined by the location of the corresponding datacenter. The bandwidth price varies across ISPs and also exhibits location diversity in practice [98, 103]; it is denoted by $P^B_j$, depending on the corresponding ISP and the location of the stub datacenter. In reality many ISPs adopt the 95th-percentile charging scheme. However, we assume the bandwidth cost is linear in the traffic volume: optimizing a linear cost on each interval can reduce the monthly 95th-percentile bill [120]. For more elaborate schemes that optimize for the 95th-percentile charging model, one may refer to [43].

3.1.4 Problem Formulation

We are now in a position to formally formulate the workload management problem as an optimization that maximizes the total utility of serving the requests, minus the electricity and bandwidth costs incurred.

\begin{align}
\min_{\alpha} \quad & \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}} D_i \alpha_{ij} \left( P^E_j + P^B_j \right) - \sum_{i \in \mathcal{I}} D_i \, U\!\left( \sum_{j \in \mathcal{J}} \alpha_{ij} L_{ij} \right) \tag{3.1} \\
\text{s.t.} \quad & \sum_{j \in \mathcal{J}} \alpha_{ij} = 1, \quad \forall i \in \mathcal{I}, \tag{3.2} \\
& \sum_{i \in \mathcal{I}} \alpha_{ij} D_i \le C_j, \quad \forall j \in \mathcal{J}, \tag{3.3} \\
& \alpha_{ij} \ge 0, \quad \forall i \in \mathcal{I},\ j \in \mathcal{J}. \tag{3.4}
\end{align}

(3.1) is the objective function, which poses the maximization problem in an equivalent minimization form. Note that by adding a scalar weight in front of the utility function, any desired trade-off point between performance and cost can be achieved; for simplicity we assume the weight is 1. (3.2) is the workload conservation constraint, which dictates that each client's demand has to be satisfied. (3.3) is the capacity constraint, which prevents the ISP link of a stub datacenter from overflowing. (3.4) is simply the non-negativity constraint on the variables.

The optimization (3.1) is a very large-scale problem. For a rough idea of the scale, the number of clients, represented by the number of unique IP prefixes, is $O(10^5)$, and the number of datacenters and ISP links is around $O(10^2)$ in some production clouds [61, 76]. This implies that the problem can have $O(10^7)$ variables and $O(10^5)$ constraints for a production system.
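For intuition on how (3.1)–(3.4) look in code, the following sketch builds a toy instance with CVXPY and solves it centrally. The problem sizes and price scales are illustrative assumptions only; the production scale above would be far beyond a single solver, which is precisely the point of this chapter.

    import cvxpy as cp
    import numpy as np

    I, J = 50, 6                           # toy scale; production is ~1e5 x 1e2
    rng = np.random.default_rng(0)
    D = rng.uniform(1e3, 1e4, I)           # predicted demand D_i (requests)
    L = rng.uniform(10.0, 200.0, (I, J))   # propagation latency L_ij (ms)
    PE = rng.uniform(1.0, 2.0, J) * 1e-4   # electricity price P^E_j ($/request)
    PB = rng.uniform(5.0, 12.0, J) * 1e-4  # bandwidth price P^B_j ($/request)
    C = np.full(J, 1.2 * D.sum() / J)      # link capacities C_j
    a = 1e-4                               # affine utility U(l) = -a*l, cf. (3.16)

    alpha = cp.Variable((I, J), nonneg=True)                 # (3.4)
    cost = cp.sum(cp.multiply(alpha, np.outer(D, PE + PB)))
    latency = cp.sum(cp.multiply(D, cp.sum(cp.multiply(alpha, L), axis=1)))
    prob = cp.Problem(cp.Minimize(cost + a * latency),       # (3.1), affine U
                      [cp.sum(alpha, axis=1) == 1,           # (3.2)
                       D @ alpha <= C])                      # (3.3)
    prob.solve()
    print(prob.value, alpha.value.round(3))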

3.1.5 Existing Approaches

As argued at the beginning of this chapter, the lack of efficiency and robustness in centralized algorithms motivates our design of a distributed solution amenable to parallel implementation in the cloud. The common approach to developing distributed algorithms is to relax the constraints and employ dual decomposition to decompose the problem into many independent sub-problems [23]. Subgradient methods can then be used to update the dual variables towards the optimality of the dual problem [12].

Yet, these approaches are not applicable here. First of all, dual decomposition requires the utility function to be strictly convex, for an affine function will make the Lagrangian unbounded below in $\alpha$. However, for workload management in cloud computing, an affine utility function is in fact one of the most popular and commonly studied choices [76, 114]. Second, subgradient methods suffer from the curse of step size: for the output to be close to the optimum, we need to strategically pick the step size at each iteration, leading to the well-known problems of slow convergence and performance oscillation when the problem scale is large. Even if $U$ were indeed strictly convex, subgradient methods would not be well suited to our problem.

Summarizing the discussion, we need a scalable and practical distributed algorithm that converges fast to modest accuracy and is not sensitive to step sizes. In the following, we present such an algorithm based on the alternating direction method of multipliers (ADMM) [13].

3.2 A Primer on ADMM

The alternating direction method of multipliers (ADMM) is a simple yet powerful algorithm well suited to large-scale distributed convex optimization. It is able to overcome the drawbacks of traditional approaches, such as dual decomposition with subgradient methods, in terms of slow convergence. Though developed in the 1970s [11], ADMM has recently received renewed interest, and has found practical use in many large-scale distributed convex optimization problems in statistics, machine learning, etc., mainly due to the availability of large-scale distributed systems (such as datacenters) [13]. In this section, we introduce the basics of ADMM, which is the cornerstone of our algorithm designs in this chapter and the next.


ADMM solves problems of the form

\begin{align}
\min \quad & f_1(x_1) + f_2(x_2) \tag{3.5} \\
\text{s.t.} \quad & A_1 x_1 + A_2 x_2 = b, \notag \\
& x_1 \in \mathcal{C}_1,\ x_2 \in \mathcal{C}_2, \notag
\end{align}

with variables $x_\ell \in \mathbb{R}^{n_\ell}$, where $A_\ell \in \mathbb{R}^{p \times n_\ell}$, $b \in \mathbb{R}^p$, the $f_\ell$'s are convex functions, and the $\mathcal{C}_\ell$'s are non-empty polyhedral sets. Thus, the objective function is separable over two sets of variables, which are coupled through an equality constraint.

We can form the augmented Lagrangian [51] by introducing an extra $\ell_2$-norm term $\|A_1 x_1 + A_2 x_2 - b\|_2^2$ into the objective:

\[
L_\rho(x_1, x_2; y) = f_1(x_1) + f_2(x_2) + y^T (A_1 x_1 + A_2 x_2 - b) + (\rho/2) \|A_1 x_1 + A_2 x_2 - b\|_2^2.
\]

Here, $\rho > 0$ is the penalty parameter ($L_0$ is the standard Lagrangian for the problem). The augmented Lagrangian can be viewed as the unaugmented Lagrangian associated with the problem

\begin{align}
\min \quad & f_1(x_1) + f_2(x_2) + (\rho/2) \|A_1 x_1 + A_2 x_2 - b\|_2^2 \notag \\
\text{s.t.} \quad & A_1 x_1 + A_2 x_2 = b, \notag \\
& x_1 \in \mathcal{C}_1,\ x_2 \in \mathcal{C}_2. \notag
\end{align}

Clearly the problem with the penalty term is equivalent to the original problem (3.5), since for any feasible $x_\ell$ the penalty term added to the objective is zero. The benefits of introducing the penalty term are improved numerical stability and faster convergence in practice [13]. Moreover, $L_\rho$ is strictly convex even when $f_1$ and $f_2$ are affine, so we can work on the dual problem without strong assumptions on $f_1$ and $f_2$.


ADMM solves the dual problem with the iterations:

\begin{align}
x_1^{t+1} &:= \operatorname*{argmin}_{x_1 \in \mathcal{C}_1} L_\rho(x_1, x_2^t; y^t), \tag{3.6} \\
x_2^{t+1} &:= \operatorname*{argmin}_{x_2 \in \mathcal{C}_2} L_\rho(x_1^{t+1}, x_2; y^t), \tag{3.7} \\
y^{t+1} &:= y^t + \rho \left( A_1 x_1^{t+1} + A_2 x_2^{t+1} - b \right). \tag{3.8}
\end{align}

It consists of an $x_1$-minimization step (3.6), an $x_2$-minimization step (3.7), and a dual variable update (3.8). Note that the step size for the dual update is simply the penalty parameter $\rho$. Thus, $x_1$ and $x_2$ are updated in an alternating or sequential fashion, which accounts for the term alternating direction. Since two sets of variables are involved, the algorithm is also referred to as 2-block ADMM. Separating the minimization over the two sets of variables is precisely what allows for decomposition when $f_\ell$ is separable, which will prove useful in our algorithm design.

The optimality and convergence of 2-block ADMM can be guaranteed under mild assumptions [11]. In practice, ADMM often converges to modest accuracy within a few tens of iterations [13], which makes it attractive for practical use.

Theorem 3.1. [11] Assume that the optimal solution set of problem (3.5) is non-empty, and that either $\mathcal{C}_1$ is bounded or the matrix $A_2^T A_2$ is invertible. Then the sequence $\{x_1^t, x_2^t, y^t\}$ generated by (3.6)–(3.8) is bounded, and every limit point of $\{x_1^t, x_2^t\}$ is an optimal solution of (3.5).
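To make iterations (3.6)–(3.8) concrete, the following minimal Python sketch runs 2-block ADMM on a toy scalar instance of (3.5): $\min (x_1 - c_1)^2 + (x_2 - c_2)^2$ subject to $x_1 - x_2 = 0$, so that $A_1 = 1$, $A_2 = -1$, $b = 0$. All numbers are illustrative.

    # Toy instance of (3.5): f1(x1) = (x1-c1)^2, f2(x2) = (x2-c2)^2,
    # coupled by x1 - x2 = 0. The optimum is x1 = x2 = (c1+c2)/2 = 3.
    c1, c2, rho = 1.0, 5.0, 1.0
    x1 = x2 = y = 0.0
    for t in range(50):
        # (3.6): minimize (x1-c1)^2 + y*(x1-x2) + (rho/2)*(x1-x2)^2 over x1
        x1 = (2 * c1 - y + rho * x2) / (2 + rho)
        # (3.7): the same augmented Lagrangian minimized over x2, x1 fixed
        x2 = (2 * c2 + y + rho * x1) / (2 + rho)
        # (3.8): dual update with step size rho
        y += rho * (x1 - x2)
    print(x1, x2)  # both approach 3.0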

3.3 Algorithm Design

3.3.1 Our Algorithm

Our problem (3.1) cannot be readily solved using ADMM. The constraints (3.2) and (3.3) couple all variables together, whereas in ADMM problems the constraints are separable for each set of variables. The coupling is especially difficult because it happens on two orthogonal dimensions simultaneously: the per-client workload conservation constraint (3.2) couples $\alpha$ across stub datacenters, and the per-stub datacenter capacity constraint (3.3) couples $\alpha$ across clients.

To address this challenge, we introduce a new set of auxiliary variables $\beta = \alpha$, and re-formulate the optimization:

\begin{align}
\min_{\alpha, \beta} \quad & \sum_{i \in \mathcal{I}} D_i \left( \sum_{j \in \mathcal{J}} \alpha_{ij} P^E_j - U\!\left( \sum_{j \in \mathcal{J}} \alpha_{ij} L_{ij} \right) \right) + \sum_{j \in \mathcal{J}} \sum_{i \in \mathcal{I}} D_i \beta_{ij} P^B_j \notag \\
\text{s.t.} \quad & \sum_{j \in \mathcal{J}} \alpha_{ij} = 1, \quad \forall i \in \mathcal{I}; \notag \\
& \sum_{i \in \mathcal{I}} \beta_{ij} D_i \le C_j, \quad \forall j \in \mathcal{J}; \notag \\
& \alpha_{ij} = \beta_{ij} \ge 0, \quad \forall i \in \mathcal{I},\ j \in \mathcal{J}. \tag{3.9}
\end{align}

This problem (3.9) is clearly equivalent to the original problem (3.1). We observe that the new formulation is in the ADMM form (3.5): the objective function is now separable over the two sets of variables $\alpha$ and $\beta$. $\alpha$ controls the net utility gain of processing the requests, i.e., utility minus electricity cost, while $\beta$ determines the bandwidth cost of transmitting the response packets. $\alpha$ and $\beta$ are connected through an equality constraint. Overall, they control the provider's total utility gain of running the cloud service.

The use of auxiliary variables also enables the separation of the per-client and per-stub datacenter constraint sets, which is the key step towards decomposing the problem, as we demonstrate now. The augmented Lagrangian of (3.9) is

\begin{align}
L_\rho(\alpha, \beta; \lambda) = {} & \sum_i D_i \left( \sum_j \alpha_{ij} P^E_j - U\!\left( \sum_j \alpha_{ij} L_{ij} \right) \right) + \sum_j \sum_i D_i \beta_{ij} P^B_j \notag \\
& + \sum_i \sum_j \left( \lambda_{ij} \left( \alpha_{ij} - \beta_{ij} \right) + \frac{\rho}{2} \left( \alpha_{ij} - \beta_{ij} \right)^2 \right). \tag{3.10}
\end{align}

The dual problem is solved by updating $\alpha$ and $\beta$ sequentially. At the $(t+1)$-th iteration, the $\alpha$-minimization step involves solving the following problem, according to (3.6):

\begin{align}
\min_{\alpha} \quad & \sum_i \left( \sum_j \alpha_{ij} \left( D_i P^E_j + \lambda^t_{ij} + \frac{\rho}{2} \left( \alpha_{ij} - 2\beta^t_{ij} \right) \right) - D_i U(\alpha_i) \right) \notag \\
\text{s.t.} \quad & \sum_j \alpha_{ij} = 1, \quad U(\alpha_i) = U\!\left( \sum_j \alpha_{ij} L_{ij} \right), \quad \alpha_i \ge 0, \quad \forall i, \tag{3.11}
\end{align}

where $\alpha_i$ is the vector of $\alpha_{ij}$ for client $i$, and $U(\alpha_i)$ is a shorthand for $i$'s utility function. This problem is decomposable over clients, since the objective function and constraints are separable over $i$. Effectively, each client needs to independently solve the following sub-problem:

sub-problem:

min↵i

X

j

↵ij

Di

PE

j

+ �t

ij

+⇢

2

↵ij

� 2�t

ij

�Di

U(↵i

)

s.t.X

j

↵ij

= 1, U(↵i

) = U⇣

X

j

↵ij

Lij

,↵i

� 0. (3.12)

The per-client sub-problem is of a much smaller scale, with $|\mathcal{J}|$ variables and $|\mathcal{J}| + 1$ constraints, and can be efficiently solved by a standard optimization solver. In reality, the number of stub datacenters $|\mathcal{J}| = O(10^2)$ is much smaller than the number of clients $|\mathcal{I}|$. Depending on the exact shape of the utility function, in some cases we can even provide an analytical solution, as we shall see in Sec. 3.3.3.

We have thus solved the $\alpha$-minimization step distributively across all clients, by decomposing problem (3.11) into $|\mathcal{I}|$ per-client sub-problems (3.12). After obtaining $\alpha^{t+1}$, the $\beta$-minimization step can be attacked similarly, as we show now.

According to (3.7), the $\beta$-minimization step consists of solving:

\begin{align}
\min_{\beta} \quad & \sum_j \sum_i \beta_{ij} \left( D_i P^B_j - \lambda^t_{ij} + \frac{\rho}{2} \left( \beta_{ij} - 2\alpha^{t+1}_{ij} \right) \right) \notag \\
\text{s.t.} \quad & \sum_i \beta_{ij} D_i \le C_j, \ \forall j, \quad \beta_{ij} \ge 0, \ \forall i, j. \tag{3.13}
\end{align}

This problem is also decomposable over the set of stub datacenters $\mathcal{J}$ into $|\mathcal{J}|$ sub-problems. Specifically, each stub datacenter needs to solve

\begin{align}
\min_{\beta_{1j}, \beta_{2j}, \dots} \quad & \sum_i \beta_{ij} \left( D_i P^B_j - \lambda^t_{ij} + \frac{\rho}{2} \left( \beta_{ij} - 2\alpha^{t+1}_{ij} \right) \right) \notag \\
\text{s.t.} \quad & \sum_i \beta_{ij} D_i \le C_j, \quad \beta_{ij} \ge 0, \ \forall i. \tag{3.14}
\end{align}

The per-stub datacenter problem is a quadratic program, whose solutions can be provided

in analytical form as follows.

Lemma 3.1. At the $(t+1)$-th iteration, for all $i \in \mathcal{I}$ such that $\lambda^t_{ij} - D_i P^B_j + \rho \alpha^{t+1}_{ij} \le 0$, we have $\beta^{t+1}_{ij} = 0$. Denote the remaining set $\{ i \in \mathcal{I} \mid \lambda^t_{ij} - D_i P^B_j + \rho \alpha^{t+1}_{ij} > 0 \}$ as $\mathcal{I}^{t+1}_j$. Then $\beta^{t+1}_{ij}$ for $i \in \mathcal{I}^{t+1}_j$ is given as follows. If $\sum_{i \in \mathcal{I}^{t+1}_j} \left( \lambda^t_{ij} - D_i P^B_j + \rho \alpha^{t+1}_{ij} \right) D_i \le \rho C_j$,

\[
\beta^{t+1}_{ij} = \frac{\lambda^t_{ij} - D_i P^B_j}{\rho} + \alpha^{t+1}_{ij};
\]

otherwise,

\[
\beta^{t+1}_{ij} = \max\left\{ \frac{\lambda^t_{ij} - D_i \left( P^B_j + \nu^{t+1}_j \right)}{\rho} + \alpha^{t+1}_{ij},\ 0 \right\},
\]

where $\nu^{t+1}_j \ge 0$ is determined by

\[
\sum_{i \in \mathcal{I}^{t+1}_j} \beta^{t+1}_{ij} D_i = C_j.
\]

The proof can be found in Appendix 7.1.
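As an illustration of Lemma 3.1, the following NumPy sketch computes the per-stub update: it first tries the capacity-unconstrained closed form, and otherwise finds $\nu^{t+1}_j$ by bisection, exploiting the fact that the allocated demand is non-increasing in $\nu$. All variable names are local to this sketch.

    import numpy as np

    def per_stub_update(D, PBj, Cj, alpha_j, lam_j, rho, iters=100):
        """Solve (3.14) for one stub datacenter j via Lemma 3.1.
        D, alpha_j, lam_j are length-|I| arrays; returns beta_j."""
        g = lam_j - D * PBj + rho * alpha_j   # g <= 0  =>  beta_ij = 0
        beta = np.maximum(g, 0.0) / rho       # unconstrained minimizer
        if beta @ D <= Cj:                    # capacity not binding
            return beta
        # Otherwise bisect on nu >= 0 so that sum_i beta_ij(nu) * D_i = C_j.
        demand = lambda nu: np.maximum(g - D * nu, 0.0) @ D / rho
        lo, hi = 0.0, 1.0
        while demand(hi) > Cj:                # widen the bracket
            hi *= 2.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if demand(mid) > Cj else (lo, mid)
        return np.maximum(g - D * hi, 0.0) / rho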


Having obtained the optimal $\alpha^{t+1}$ and $\beta^{t+1}$, the final step is to perform the dual variable update:

\[
\lambda^{t+1}_{ij} = \lambda^t_{ij} + \rho \left( \alpha^{t+1}_{ij} - \beta^{t+1}_{ij} \right). \tag{3.15}
\]

The entire procedure is summarized in Algorithm 1. Since the constraint set $\mathcal{C}_1$ for $\alpha$ is clearly bounded in our problem (3.9), according to Theorem 3.1 the algorithm converges to the optimal solution.

Lemma 3.2. Our ADMM-based algorithm converges to the optimal solution $\alpha^*$ and $\beta^*$ of (3.9), and equivalently of (3.1).

Algorithm 1 Optimal Distributed Solution for (3.1)

1. Each stub datacenter $j$ initializes $\beta^0_{ij} = 0$, $\lambda^0_{ij} = 0$, and broadcasts its electricity price $P^E_j$ to each client.
2. Given $\beta^t_i = [\beta^t_{i1}, \beta^t_{i2}, \dots]$ and $\lambda^t_i = [\lambda^t_{i1}, \lambda^t_{i2}, \dots]$, each client $i$ solves the per-client sub-problem (3.12), and sends the optimal solution $\alpha^{t+1}_{ij}$ to the corresponding stub datacenter $j$.
3. Given $\alpha^{t+1}_j = [\alpha^{t+1}_{1j}, \alpha^{t+1}_{2j}, \dots]$, each stub datacenter solves the sub-problem (3.14) as in Lemma 3.1, with local information $P^B_j$ and $\lambda^t_j = [\lambda^t_{1j}, \lambda^t_{2j}, \dots]$.
4. Each stub datacenter $j$ updates the dual variables $\lambda^{t+1}_j = [\lambda^{t+1}_{1j}, \lambda^{t+1}_{2j}, \dots]$ as in (3.15). It then sends the optimal solution $\beta^{t+1}_{ij}$ and the updated dual variable $\lambda^{t+1}_{ij}$ to the corresponding client $i$.
5. Return to step 2 until convergence.

Intuitively, our algorithm follows a divide-and-conquer paradigm. Recall that $\alpha$ controls the net utility gain of processing the requests, while $\beta$ determines the bandwidth cost of transmitting the response packets. Our algorithm first optimizes $\alpha$ for the mapping aspect of the problem, given the response routing solution $\beta^t$. It then optimizes $\beta$ for the response routing aspect, given the newly computed mapping solution $\alpha^{t+1}$. The dual update ensures that the two sets of solutions converge to the same workload management solution, which is also optimal.
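The following self-contained Python sketch mirrors this alternation on a single machine, using CVXPY to solve each small sub-problem. It is a minimal sketch assuming the affine utility of Sec. 3.3.3 and toy data, not the parallel deployment discussed in the next subsection.

    import cvxpy as cp
    import numpy as np

    def per_client(D_i, L_i, PE, beta_i, lam_i, rho, a):
        # (3.12) with affine utility: a linear term plus a proximal term,
        # since (rho/2)*(al - 2*beta)*al equals (rho/2)*||al - beta||^2 + const.
        al = cp.Variable(len(PE), nonneg=True)
        lin = al @ (D_i * (a * L_i + PE) + lam_i)
        cp.Problem(cp.Minimize(lin + (rho / 2) * cp.sum_squares(al - beta_i)),
                   [cp.sum(al) == 1]).solve()
        return al.value

    def per_stub(D, PBj, Cj, alpha_j, lam_j, rho):
        # (3.14): linear term plus proximal term, capped by the capacity C_j.
        be = cp.Variable(len(D), nonneg=True)
        lin = be @ (D * PBj - lam_j)
        cp.Problem(cp.Minimize(lin + (rho / 2) * cp.sum_squares(be - alpha_j)),
                   [be @ D <= Cj]).solve()
        return be.value

    I, J, rho, a = 20, 4, 1.0, 1e-4
    rng = np.random.default_rng(1)
    D = rng.uniform(1e3, 1e4, I)
    L = rng.uniform(10.0, 200.0, (I, J))
    PE = rng.uniform(1.0, 2.0, J) * 1e-4
    PB = rng.uniform(5.0, 12.0, J) * 1e-4
    C = np.full(J, 1.2 * D.sum() / J)
    alpha = np.full((I, J), 1.0 / J)
    beta, lam = np.zeros((I, J)), np.zeros((I, J))

    for t in range(20):
        for i in range(I):  # step 2: clients solve (3.12), in parallel in reality
            alpha[i] = per_client(D[i], L[i], PE, beta[i], lam[i], rho, a)
        for j in range(J):  # step 3: stubs solve (3.14), in parallel in reality
            beta[:, j] = per_stub(D, PB[j], C[j], alpha[:, j], lam[:, j], rho)
        lam += rho * (alpha - beta)  # step 4: dual update (3.15)
    print(np.abs(alpha - beta).mean())  # consensus residual shrinks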


3.3.2 A Parallel Implementation in the Cloud

The distributed nature of Algorithm 1 allows for an efficient parallel implementation in the cloud, which has abundant server resources. Here we discuss several issues pertaining to such an implementation in reality.

First, at each iteration, each client solves the per-client sub-problem in step 2. This can readily be implemented in a parallel fashion on the servers of one of the datacenters the provider owns, which we call the designated datacenter. A production datacenter typically has $O(10^4)$–$O(10^5)$ servers [45], so each server only needs to solve $O(10)$–$O(1)$ per-client sub-problems at each iteration. Since the per-client sub-problem (3.12) is a small-scale convex optimization, the computational complexity is low. A multi-threaded implementation can further speed up the algorithm on multi-core hardware. The penalty parameter $\rho$ and the utility function $U$ can be configured across all servers before the algorithm starts.

Similarly, step 3 of Algorithm 1, which solves the per-stub datacenter sub-problem, also admits a parallel implementation in the designated datacenter. Only $|\mathcal{J}|$ servers are required, each responsible for solving one instance of (3.14) according to the solution in Lemma 3.1. It can even be implemented on the same servers that run step 2 for the per-client sub-problems. The parallel implementation of our algorithm thus makes it well suited to the cloud environment.
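As a minimal single-machine stand-in for this parallel pattern, the sketch below fans the per-client step out to worker processes with Python's concurrent.futures; the placeholder solver simply returns a uniform split, and would be replaced by a real solver for (3.12) in any actual deployment.

    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    J = 30  # number of stub datacenters

    def solve_client_subproblem(i):
        # Placeholder for the per-client step (3.12); a real worker would run
        # the small convex program (or the closed form of Lemma 3.3) here.
        return np.full(J, 1.0 / J)

    if __name__ == "__main__":
        clients = range(100_000)
        # Each worker plays the role of one server handling a batch of clients.
        with ProcessPoolExecutor(max_workers=8) as pool:
            rows = list(pool.map(solve_client_subproblem, clients, chunksize=4096))
        alpha = np.vstack(rows)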

Second, our algorithm can be terminated before convergence is reached. ADMM is not sensitive to the step size $\rho$, and usually finds a solution with modest accuracy within tens of iterations [13]. As argued at the beginning of this chapter, a solution with modest accuracy is sufficient in situations such as flash crowds and failure recovery. A provider can apply an early-braking mechanism in these scenarios to terminate the algorithm after several tens of iterations without worrying about performance issues.

We finally comment that the message passing overhead of our algorithm is also low. As a prerequisite, the electricity and bandwidth prices of each datacenter and ISP need to be gathered at the designated datacenter. The final output of the algorithm, $\alpha_{ij}$, needs to be disseminated to the mapping nodes and datacenters (recall Fig. 1.1). All the other message passing, for exchanging $\alpha$, $\beta$, and $\lambda$ amongst servers, happens in the internal network of the designated datacenter, which in many cases is specifically designed to handle the broadcast and shuffle transmission patterns of HPC applications such as MapReduce [4, 46]. Note that the amount of intermediate data our algorithm produces is much smaller than the bulky data of HPC applications [104]. Thus the message passing overhead incurred in the datacenter network is low.

3.3.3 Case Study: Affine Utility Functions

Before concluding this section, we provide a case study of the workload management problem with an affine utility function. Affine utility functions are the de facto choice widely used in the literature [76], though some studies have argued for more complicated utility functions with fairness considerations [114].

An affine utility function has the following form:

\[
U\!\left( \sum_j \alpha_{ij} L_{ij} \right) = -a \sum_j \alpha_{ij} L_{ij}, \tag{3.16}
\]

where $a > 0$ is a conversion factor that translates user-perceived latency into utility (e.g., revenue). With an affine utility function, the per-client sub-problem (3.12) becomes a quadratic program of the following form:

\begin{align}
\min_{\alpha_i} \quad & \sum_j \alpha_{ij} \left( D_i \left( a L_{ij} + P^E_j \right) + \lambda^t_{ij} + \frac{\rho}{2} \left( \alpha_{ij} - 2\beta^t_{ij} \right) \right) \notag \\
\text{s.t.} \quad & \sum_j \alpha_{ij} = 1, \quad \alpha_i \ge 0. \tag{3.17}
\end{align}

Optimal solutions can then be derived in analytical form through the KKT conditions.

Lemma 3.3. At the $(t+1)$-th iteration, the optimal solution of the per-client sub-problem (3.17) with an affine utility function for a given client $i$ is

\[
\alpha^{t+1}_{ij} = \max\left\{ \beta^t_{ij} - \frac{D_i \left( a L_{ij} + P^E_j \right) + \lambda^t_{ij} + \mu^{t+1}_i}{\rho},\ 0 \right\},
\]

where the multiplier $\mu^{t+1}_i$ is determined by

\[
\sum_{j \in \mathcal{J}} \alpha^{t+1}_{ij} = 1.
\]

The proof can be found in Appendix 7.2. Essentially, this is a system of $|\mathcal{J}| + 1$ equations with $|\mathcal{J}| + 1$ variables, whose solution can be efficiently computed. Thus, in the case of an affine utility function, the per-client sub-problem reduces to a quadratic program that is particularly easy to solve.
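Since the left-hand side of the normalization condition is non-increasing in $\mu^{t+1}_i$, the multiplier can be found by bisection. The following NumPy sketch (names are local to this sketch) implements Lemma 3.3 this way.

    import numpy as np

    def per_client_affine(D_i, L_i, PE, beta_i, lam_i, rho, a, iters=100):
        """Closed form of Lemma 3.3: alpha_ij(mu) = max(v_j - mu/rho, 0) with
        v_j = beta_j - (D_i*(a*L_j + PE_j) + lam_j)/rho; bisect on mu so
        that sum_j alpha_ij = 1."""
        v = beta_i - (D_i * (a * L_i + PE) + lam_i) / rho
        total = lambda mu: np.maximum(v - mu / rho, 0.0).sum()
        lo, hi = -rho * (np.abs(v).max() + 1.0), rho * (np.abs(v).max() + 1.0)
        while total(lo) < 1.0:   # widen the bracket until total(lo) >= 1
            lo *= 2.0
        while total(hi) > 1.0:   # ... and total(hi) <= 1
            hi *= 2.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
        return np.maximum(v - 0.5 * (lo + hi) / rho, 0.0)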

3.4 Evaluation

To realistically evaluate the performance of our algorithm, we conduct trace-driven simulations in this section.

3.4.1 Setup

We use the Wikipedia request traces [101] to represent the request traffic of a cloud service. The dataset we use contains, among other things, 10% of all user requests issued to Wikipedia from 3:56PM, January 1, 2008 GMT to 4:57PM, January 2, 2008 GMT. Workload prediction can be done accurately, as demonstrated by previous work [80, 81], and in the simulation we simply adopt the measured request traffic as the total demand. We assume the optimization is done hourly; Fig. 3.1 plots the hourly request traffic of the traces for 24 hours of the measurement period.


Figure 3.1: Total request traffic of the Wikipedia traces [101].

Figure 3.2: The U.S. electricity market and our datacenter map. Source: [37].

We simulate a cloud that deploys ten datacenters across the continental U.S. According to the Federal Energy Regulatory Commission (FERC), the U.S. electricity market consists of multiple regional markets, as shown in Fig. 3.2 [37]. Each regional market has several hubs with their own pricing. For ease of exploration, we assume that one datacenter is deployed at a randomly chosen hub in each of the ten regional markets, as shown in Fig. 3.2. We use the 2011 annual average day-ahead on-peak price as the electricity price $P^E$ for each datacenter, as summarized in Table 3.1. In the simulations we calculate the cost by assuming that serving one request draws 10 W on average, including the server, network, and cooling power.

Table 3.1: 2011 annual average day-ahead on-peak price ($/MWh) in different regional markets. Source: [37].

    Region        Hub                              Price
    California    NP15                             $35.83
    Midwest       Michigan Hub                     $42.73
    New England   Mass Hub                         $52.64
    New York      NY Zone J                        $62.71
    Northwest     California-Oregon Border (COB)   $32.57
    PJM           PJM West                         $51.99
    Southeast     VACAR                            $44.44
    Southwest     Four Corners                     $36.36
    SPP           SPP North                        $36.41
    Texas         ERCOT North                      $61.55

Table 3.2: Tiered bandwidth prices. Source: Amazon EC2.

    Link capacity (requests/hour)   Pricing ($/request)
    < 1.4×10^5                      0.0012
    1.4×10^5 – 5.6×10^5             0.0009
    5.6×10^5 – 1.4×10^6             0.0007
    > 1.4×10^6                      0.0005

Each datacenter has 3 ISP links; thus the number of stub datacenters is $|\mathcal{J}| = 30$. The prices of the ISP links are estimated in two steps. First, the capacity of each ISP link is set randomly such that the total capacity across the 30 links is $1.2 \times 10^7$ requests per hour. Then, the price of an ISP link is determined from a tiered structure based on the link capacity, where a link with larger capacity has a lower price. We assume a request's response packets contain 1 MB of data on average, and use Amazon EC2 bandwidth prices in the U.S. East region to determine the exact price per request, presented in Table 3.2. This setup resembles the volume discount strategy commonly used in the industry.
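For concreteness, the tiered structure of Table 3.2 can be encoded as a simple lookup (a sketch; capacities are in requests/hour as in the table):

    def bandwidth_price(link_capacity):
        """Per-request bandwidth price ($) from the tiers of Table 3.2."""
        if link_capacity < 1.4e5:
            return 0.0012
        if link_capacity < 5.6e5:
            return 0.0009
        if link_capacity < 1.4e6:
            return 0.0007
        return 0.0005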

We rely on iPlane [67], a system that collects wide-area network statistics from PlanetLab vantage points, to obtain the latency information. We set the number of clients $|\mathcal{I}| = 10^5$, and choose $10^5$ IP prefixes from a RouteViews [1] dump. We then extract the corresponding round-trip latency information from the iPlane logs, which contain traceroutes made to a large number of IP addresses from PlanetLab nodes. We only use latency measurements from PlanetLab nodes that are close to our datacenter locations. Therefore, the propagation latency depends on the datacenter location but not on the specific ISP link used. We believe this is a reasonable approximation when geographical distance, rather than link condition, dominates the propagation delay.

Since the Wikipedia traces do not contain any client information, to emulate the geographical distribution of requests we split the total request traffic among the clients following a normal distribution. The utility function is the simple affine function in (3.16) with $a = 10^{-4}$; that is, a request with 100 ms latency corresponds to a \$0.01 reduction in revenue for the provider. Finally, the penalty parameter $\rho$ is set to 1 in all our simulations.

3.4.2 Performance

Figure 3.3: Optimal average utility gain.

We evaluate two variants of our algorithm in the simulations. The first variant, referred to as ADMM in the figures, runs Algorithm 1 until convergence is reached. The second variant, referred to as ADMM-20, applies the early-braking method and runs Algorithm 1 for only 20 iterations. Fig. 3.3 plots the average utility gain per request for the two variants. Throughout the day, we observe that ADMM-20 achieves utility gains within \$0.0008 of the optimum, while the regular ADMM converges within 56 iterations in all cases (more on convergence in Sec. 3.4.3). The average value of $|\alpha_{ij} - \beta_{ij}|$ after 20 iterations is merely $2.7133 \times 10^{-5}$. Therefore, our algorithm converges quickly to near optimum.

Figure 3.4: Optimal average latency performance.

Fig. 3.4 and Fig. 3.5 further plot the average latency and serving costs per request. Observe that the average client latency stays below 80 ms most of the time, and never exceeds 120 ms. The average serving cost is approximately \$0.0015 per request throughout the day. Both metrics fluctuate closely with the total traffic shown in Fig. 3.1.

Figure 3.5: Optimal average costs per request.


To understand the performance of our algorithm at a microscopic level, we plot the CDFs of the request latency and serving costs across all clients and all hours for the ADMM-20 variant in Fig. 3.6 and 3.7. Most of the requests, more than 90%, are served with latency less than 100 ms. The CDF of costs is more skewed, implying that the per-request costs vary significantly across clients. This is because the (bandwidth) cost difference amongst the ISP links of the same datacenter is clearly larger than the latency difference, which is assumed to be zero.

Figure 3.6: CDF of per request latency.

Figure 3.7: CDF of per request costs.


One may wonder at this point whether our algorithm directs requests only to the best stub datacenter for each client, which would be undesirable for diversity and resilience purposes. Fig. 3.8 shows that this is not the case. We plot the CDF of the number of stub datacenters a client's requests are directed to (for hour-0 data and the ADMM-20 variant). The figure shows that for more than 80% of the clients, the requests are directed to 2–5 stub datacenters. On average, each client has 3.6 stub datacenters serving its requests. This leads us to believe that our algorithm distributes the workload in a balanced way.

Figure 3.8: CDF of number of stub datacenters per client.

3.4.3 Convergence

We now investigate the convergence and running time of our algorithm. For comparison, we use the subgradient method [12] to solve the dual problem of the transformed optimization (3.9) with the augmented Lagrangian (3.10). Specifically, the primal variables $\alpha$ and $\beta$ are jointly optimized, instead of sequentially updated as in our ADMM algorithm, to speed up the convergence, and the dual variables $\lambda$ are updated by the subgradient method. The step size has to be carefully chosen, since too large a value will make the final output far from the true optimum, and too small a value will slow down the convergence. We choose the step sizes according to the diminishing step size rule [12].


Figure 3.9: CDF of the number of iterations to achieve convergence for our ADMM algorithm and the subgradient method.

Fig. 3.9 plots the CDF of the number of iterations the two algorithms take to achieve convergence over the 24 runs on the traces. We can clearly see that our ADMM algorithm converges much faster than the subgradient method. Our algorithm takes at most 56 iterations to converge, while the subgradient method takes at least 72. For 80% of the runs our algorithm converges within 40 iterations, whereas the subgradient method takes 110. This demonstrates the fast convergence of our algorithm compared to conventional methods.

We finally study the running time of our algorithm. Note that since we do not have enough hardware resources to experiment with a parallel implementation, our algorithm is implemented on a single server, where each per-client and per-stub datacenter sub-problem is solved sequentially. We observe that one iteration takes on average 1500.6447 seconds on a server with two dual-core Intel Xeon 3.0 GHz (64-bit) processors. Since $|\mathcal{I}| = 10^5$ and $|\mathcal{J}| = 30$, solving each sub-problem takes around 0.015 seconds. Thus, a parallel implementation on 1,000 servers would take only about 1.5 seconds per iteration, which demonstrates the efficiency of our algorithm for large-scale problems.


3.5 Summary

In this chapter, we studied the joint request mapping and response routing problem for geographically distributed cloud services. We formulated the problem as a general convex optimization, where the location diversity of performance and costs is modeled. We developed an efficient distributed algorithm based on ADMM to decompose the large-scale global problem into many sub-problems, each of which can be quickly solved. We discussed a parallel implementation of the algorithm that is well suited to a cloud environment with abundant server resources. Trace-driven simulations were conducted to evaluate the algorithm's performance.


Chapter 4

Temperature Aware Workload Management

In the previous chapter, we examined the workload management problem for the interactive workloads of geo-distributed datacenters. In this chapter, we study the problem of holistically managing both interactive and batch workloads, which has been largely ignored in previous work.

4.1 Motivation

As discussed in Sec. 1.1.2, datacenters run two categories of workloads: interactive and batch jobs [91]. Previous work has only studied the interactive workload with request mapping. The mixed nature of datacenter workloads actually provides more opportunities to exploit the cost diversity of energy. The key observation is that batch workloads are elastic to resource allocations, whereas interactive workloads are highly sensitive to latency and have a more profound impact on revenue [59]. Thus, at times when one location is comparatively cost efficient, we can increase its capacity for interactive workloads by reducing the resources reserved for batch jobs. More requests can then be routed to and processed at this location, and the cost savings can be more substantial. We are thus motivated to advocate a holistic workload management approach, where capacity allocation between interactive and batch workloads is dynamically optimized together with request routing. Such dynamic capacity allocation is also technically feasible, because jobs run on highly scalable systems such as MapReduce and Spanner [26].

Moreover, we realize that one key aspect of cost in geo-distributed datacenters has not been fully understood in the literature: cooling energy. Cooling systems, which consume 30% to 50% of the total energy [87, 121], are often modeled with a constant and location-independent energy efficiency factor in existing efforts. This tends to be an over-simplification in reality. Through our study of a state-of-the-art production cooling system in Sec. 4.2, we find that temperature has a direct and profound impact on cooling energy efficiency. This is especially true with outside air cooling technology, which has seen increasing adoption in mission-critical datacenters [25, 29, 47]. As we will show, its partial PUE (power usage effectiveness), defined as the sum of server power and cooling overhead divided by server power, varies from 1.30 to 1.05 as the temperature drops from 35°C (90°F) to -3.9°C (25°F).

Through an extensive empirical analysis of daily and hourly climate data for 13 Google datacenters, we further find that temperature varies significantly across both time and location, which is intuitive to understand. The short-term volatilities are not well correlated across locations. These observations suggest that datacenters at different locations have distinct and time-varying cooling energy efficiency. This establishes a strong case for making workload management temperature aware, where such temperature diversity can be used along with price diversity in making request routing decisions, to reduce the overall cooling energy overhead for geo-distributed datacenters.

Towards temperature aware workload management, we propose a general framework in Sec. 4.3 to capture the important trade-offs involved. We model both the energy cost and the utility loss, which corresponds to performance-related revenue reduction. We develop an empirical cooling efficiency model based on a production system with both outside air and mechanical cooling capabilities. The problem is formulated as a joint optimization of request routing and capacity allocation. The technical challenge is then to develop a distributed algorithm to solve the large-scale optimization, with tens of millions of variables, for a production geo-distributed cloud. Dual decomposition with subgradient methods is often used to develop distributed optimization algorithms. However, it requires delicate adjustment of step sizes, which makes convergence difficult to achieve for large-scale problems. The method of multipliers [51] achieves fast convergence, at the cost of introducing tight coupling among the variables.

We again rely on the alternating direction method of multipliers (ADMM), introduced in Sec. 3.2, for our algorithm design. The classical ADMM algorithm works for problems with two blocks of variables. Our formulation has three blocks of variables, yet little is known about the convergence of $m$-block ($m \ge 3$) ADMM algorithms, with two very recent exceptions [48, 53]. [48] establishes the convergence of $m$-block ADMM for strongly convex objective functions, but not linear convergence; [53] shows the linear convergence of $m$-block ADMM under the assumption that the relation matrix has full column rank, which is, however, not the case in our formulation. This motivates us to refine the framework in [53] so that it can be applied to our setup. In particular, in Sec. 4.4 we show that by replacing the full-rank assumption with some mild assumptions on the objective functions, we are not only able to obtain the same convergence and rate-of-convergence results, but also to simplify the proof of [53]. The $m$-block ADMM algorithm is general and can be applied in other problem domains. For our case, we further develop a distributed algorithm in Sec. 4.5, which is amenable to a parallel implementation in datacenters.

We conduct extensive trace-driven simulations with real-world electricity prices, historical temperature data, and an empirical cooling efficiency model to realistically assess the potential of our approach in Sec. 4.6. We find that temperature aware workload management consistently delivers a 15%–20% cooling energy reduction and a 5%–20% overall cost reduction for geo-distributed datacenters. The distributed ADMM algorithm converges quickly, within 70 iterations, while a dual decomposition approach with subgradient methods fails to converge within 200 iterations. We thus believe our algorithm is practical for large-scale real-world problems.

4.2 Background and Empirical Studies

Before we make a case for temperature aware workload management, it is necessary to introduce some background on datacenter cooling, and to empirically assess the geographical diversity of temperature.

4.2.1 Datacenter Cooling

Datacenter cooling is provided by computer room air conditioners (CRACs) placed on the raised floor of the facility. The CRACs cool the hot air exhausted from server racks by forcing it through a cooling coil. The cool air is then pumped into the plenum under the raised floor, and recirculated back to the racks through vented tiles. Fans draw the cool air in and through the servers, and hot air exits from the rear of the servers, resulting in alternating cold aisles and hot aisles [10].

Heat is often extracted by chilled water in the cooling coil, and the returned hot water is continuously cooled through mechanical refrigeration cycles in an outside chiller plant. The compressor of a chiller consumes a massive amount of energy, and accounts for the majority of the overall cooling cost [121]. The result is an energy-gobbling cooling system that typically consumes a significant portion (~30%) of the total datacenter power [121].

4.2.2 Outside Air Cooling

To improve energy efficiency, various so-called free cooling technologies that operate without mechanical chillers have recently been adopted. In this work, we focus on an economically viable technology called outside air cooling, which uses an air-side economizer to direct cold outside air into the datacenter to cool the servers. The hot exhaust air is simply rejected rather than being cooled and recirculated. The advantage of outside air cooling can be significant: Intel ran a 10-month experiment using 900 blade servers, and reported that 67% of the cooling energy can be saved with only slightly increased hardware failure rates [55]. Companies like Google [29], Facebook [47], and HP [25] have been operating their datacenters with up to 100% outside air cooling, which brings the average PUE below 1.2 with millions of dollars in savings annually.

The energy efficiency of outside air cooling heavily depends on the ambient temperature, among other factors. When the temperature is lower, less air is needed for heat exchange, and the air handler fan speed can be reduced to save energy. Thus, a CRAC with an air-side economizer usually operates in three modes. When the ambient temperature is high, outside air cooling cannot be used, and the CRAC falls back to mechanical cooling with chillers. When the temperature falls below a certain threshold, outside air cooling is utilized to provide partial or entire cooling capacity. When the temperature is too low, outside air is mixed with exhaust air to maintain a suitable supply air temperature; in this mode, CRAC energy efficiency cannot be further improved, since fans need to operate at a minimum speed to maintain airflow. Table 4.1 shows the empirical COP¹ and partial PUE (pPUE)² data of a state-of-the-art CRAC with an air-side economizer. Clearly, as the outdoor temperature drops, the CRAC switches operating modes to use more outside air cooling. As a result the COP improves six-fold from 3.3 to 19.5, and the pPUE decreases dramatically from 1.30 to 1.05. Due to the sheer amount of energy a datacenter draws, these numbers imply huge monetary savings on the energy bill.

Table 4.1: Efficiency of Emerson's DSE™ cooling system with an EconoPhase air-side economizer [33]. Return air is set at 29.4°C (85°F).

    Outdoor ambient   Cooling mode   COP    pPUE
    35°C (90°F)       Mechanical     3.3    1.30
    21.1°C (70°F)     Mechanical     4.7    1.21
    15.6°C (60°F)     Mixed          5.9    1.17
    10°C (50°F)       Outside air    10.4   1.10
    -3.9°C (25°F)     Outside air    19.5   1.05

¹COP, coefficient of performance, is defined for a cooling device as the ratio between cooling capacity and power.
²pPUE is defined as the sum of cooling capacity and cooling power divided by cooling capacity. Nearly all the power delivered to servers translates to heat, which matches the CRAC cooling capacity.

With the increasing use of outside air cooling, this finding motivates our proposal to make workload management temperature aware. Intuitively, datacenters at colder

and thus more energy-efficient locations should be better utilized to reduce the overall energy consumption and cost simultaneously. Our idea also applies to datacenters using mechanical cooling because, contrary to the assumption in previous work [64] and as shown in Table 4.1, chiller energy efficiency also depends on the outside temperature, albeit more mildly.
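Where a continuous model of this dependence is needed, one simple option is a piecewise-linear interpolation of the operating points in Table 4.1. The sketch below is a modeling convenience under that assumption, not part of the vendor data sheet.

    import numpy as np

    # Operating points from Table 4.1: outdoor temperature (C) -> pPUE.
    temps = np.array([-3.9, 10.0, 15.6, 21.1, 35.0])
    ppue = np.array([1.05, 1.10, 1.17, 1.21, 1.30])

    def cooling_ppue(t_celsius):
        """Interpolated pPUE; clamped to the endpoints outside the range."""
        return float(np.interp(t_celsius, temps, ppue))

    # Total (server + cooling) power for s watts of server power at t degrees
    # is s * cooling_ppue(t); the cooling overhead alone is s * (pPUE - 1).
    print(cooling_ppue(0.0))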

4.2.3 An Empirical Climate Study

Our idea hinges upon a key assumption: temperatures at different locations are diverse and not well correlated. In this section, we make the case concrete with an empirical analysis of historical climate data, which provides supporting evidence for this assumption.

We use Google’s datacenter locations for our study, as they represent a global pro-

duction infrastructure and the location information is publicly available [2]. Google has

6 datacenters in the U.S., 1 in South America, 3 in Europe, and 3 in Asia. We acquire

historical temperature data from various data repositories of the National Climate Data

Center [78] for all 13 locations, covering the entire one-year period of 2011.

It is useful to first understand the climate profiles at individual locations. Figure 4.1 plots the daily average temperatures for three selected locations in North America, Europe, and South America, respectively. Geographical diversity exists despite the clear seasonal pattern shared among all locations. For example, Finland appears especially favorable for cooling during the winter months. Diversity is more salient for locations in different hemispheres (e.g., Chile). We also observe a significant amount of day-to-day volatility, suggesting that the availability and capability of outside air cooling constantly vary across regions, and that there is no single location that is always cooling-efficient.


[Figure: three stacked panels of daily average temperature (°C, −20 to 40) over Jan–Dec for The Dalles, OR; Hamina, Finland; and Quilicura, Chile.]

Figure 4.1: Daily average temperature at three Google datacenter locations. Data from the Global Daily Weather Data of the National Climatic Data Center (NCDC) [78]. Time is in UTC. Notice that Chile is in the Southern Hemisphere and experiences a different seasonal trend.


We then examine short-term temperature volatility. As shown in Figure 4.2, hourly temperature variations are more dramatic and highly correlated with time-of-day, which is intuitive. Further, the highs and lows do not occur at the same time in different regions due to time differences.

Our approach would fail if hourly temperatures were well correlated across locations. However, we find that this is not the case for datacenters, which are usually far apart from each other. Figure 4.3 shows a scatter plot of the pairwise temperature correlation coefficients for all 13 locations. A few pairs are negatively correlated, and most lie between the 0.6 and −0.6 correlation lines.

The analysis above reveals that for globally deployed datacenters, the local temperature at individual locations exhibits both time and geographical diversity. A carefully designed workload management scheme is therefore needed to dynamically adjust datacenter operations to the ambient conditions and to save on the overall energy cost.


[Figure: three stacked panels of hourly temperature (°C) from Apr 16 to Apr 22 for Council Bluffs, IA; Dublin, Ireland; and Tseung Kwan, Hong Kong.]

Figure 4.2: Hourly temperature variations at three Google datacenter locations. Data from the Hourly Global Surface Data of NCDC [78]. Time is in UTC.

[Figure: scatter plot of pairwise correlation coefficients (−1 to 1) of hourly temperatures against latitude difference (0 to 100).]

Figure 4.3: The relationship between correlation coefficients of hourly temperatures and latitude difference for the 13 Google datacenter locations.



We finally assess the extent to which outside air cooling can be exploited in different regions. Based on the technical data of the representative air-side economizer studied above [33], we assume that outside air cooling can be utilized, at least partially, when the outside temperature is below 21°C (70°F). We then calculate, for all 13 locations, the percentage of hours in which outside air cooling can be used. Table 4.2 shows that, since datacenters run 24/7, outside air cooling can be utilized very extensively in most locations, except Singapore, which is located close to the equator. Even locations with hot climates, such as Hong Kong and Taiwan, are able to use outside air cooling more than 30% of the time, mostly during night hours and winter seasons. This observation further demonstrates the need for temperature aware workload management to fully reap the potential of outside air cooling.

Council Bluffs, IA    73.63     Berkeley County, SC     53.70
The Dalles, OR        85.19     Lenoir, NC              73.26
Mayes County, OK      65.65     Douglas County, GA      67.00
Quilicura, Chile      79.05     St. Ghislain, Belgium   94.75
Hamina, Finland       95.98     Dublin, Ireland         99.84
Hong Kong             36.96     Taiwan                  40.02
Singapore              0

Table 4.2: Percentage of hours in which outside air cooling can be used, i.e., when the temperature is below 21°C (70°F).

4.3 Model

In this section, we introduce our model first and then formulate the temperature aware

workload management problem of joint request routing and capacity allocation.

4.3.1 System Model

We consider a discrete time model where the length of a time slot matches the time scale

at which request routing and capacity allocation decisions are made, e.g., hourly. The


joint optimization is periodically solved at each time slot. We therefore focus only on a

single time slot.

We consider a provider that runs a set of datacenters $\mathcal{J}$ in distinct geographical regions. Each datacenter $j \in \mathcal{J}$ has a fixed capacity $C_j$ in terms of the number of servers. To model datacenter operating costs, we consider both the energy cost and the utility loss of request routing and capacity allocation, which are detailed below. Before we proceed, we summarize the important notation used throughout the chapter in Table 4.3.

Notation                     Meaning
$i \in \mathcal{I}$          user of front-end services
$j \in \mathcal{J}$          datacenter
$C_j$                        capacity of $j$
$P_j(T_j)$                   power price of $j$ at $T_j$
$L_{ij}$                     network latency between $i$ and $j$
$U_i(\cdot)$, $V_j(\cdot)$   utility loss functions of $i$ and $j$
$E_j(\cdot)$                 power consumption function of $j$
$\alpha_{ij}$                interactive workload of $i$ directed to $j$
$\beta_j$                    batch workload in $j$

Table 4.3: Table of notations

4.3.2 Energy Cost and Cooling Efficiency

We focus on servers and the cooling system in our energy model. Other energy consumers, such as network switches and power distribution systems, have a constant power draw independent of workloads [36], and are not relevant to our problem.

For servers, we adopt the empirical model from [36] that calculates individual server power consumption as an affine function of CPU utilization, $P_{\text{idle}} + (P_{\text{peak}} - P_{\text{idle}})u$, where $P_{\text{idle}}$ is the server power when idle, $P_{\text{peak}}$ is the server power when fully utilized, and $u$ is the CPU load. This model is especially accurate for calculating the aggregate power of a large number of servers [36]. Thus, assuming that workloads are perfectly dispatched so that servers have uniform utilization, the server power of datacenter $j$ can be modeled as $C_j P_{\text{idle}} + (P_{\text{peak}} - P_{\text{idle}})W_j$, where $W_j$ denotes the total workload in terms of the number of servers required.


For the cooling system, we take an empirical approach based on energy efficiency data of production CRACs. We choose not to rely on simplified models of the individual components of a CRAC and their interactions [121], because of the difficulty involved in, and the inaccuracy resulting from, such a process, especially for hybrid CRACs with both outside air and mechanical cooling. As noted in Sec. 4.2.2, a production CRAC automatically switches between mechanical, mixed, and outside air cooling modes according to the ambient condition. We therefore study a CRAC as a black box, with the outside temperature as the input and its overall energy efficiency as the output.

Specifically, we use partial PUE (pPUE) to measure CRAC energy efficiency³. As in Sec. 4.2.2, pPUE is defined as

\[ \text{pPUE} = \frac{\text{Server power} + \text{Cooling power}}{\text{Server power}}. \]

A smaller value indicates a more efficient system with less overhead. We apply regression techniques to the empirical pPUE data of the Emerson CRAC [33] introduced in Table 4.1. We find that the best-fitting model describes pPUE as a quadratic function of the outside temperature, as plotted below.

\[ \text{pPUE}(T) = 7.1705\times 10^{-5}\,T^2 + 0.0041\,T + 1.0743 \]

[Figure: the fitted quadratic plotted against outside temperature from −25°C to 45°C, with pPUE ranging from 1 to 1.5.]

Figure 4.4: Model fitting of pPUE as a function of the outside temperature $T$ for Emerson's DSE™ CRAC [33]. Small circles denote empirical data points.

³The conventional PUE metric reflects the energy efficiency of the entire facility.
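As a sanity check, the shape of this fit can be roughly reproduced from just the five (temperature, pPUE) pairs of Table 4.1. A minimal sketch in Python (the reported coefficients come from the full Emerson dataset [33], so an exact match is not expected; ppue_model is our own name for the fitted curve):

    import numpy as np

    # The five empirical (outdoor temperature, pPUE) points from Table 4.1.
    temps = np.array([35.0, 21.1, 15.6, 10.0, -3.9])  # degrees Celsius
    ppue = np.array([1.30, 1.21, 1.17, 1.10, 1.05])

    # Least-squares quadratic fit: pPUE(T) = c2*T^2 + c1*T + c0.
    c2, c1, c0 = np.polyfit(temps, ppue, deg=2)
    print(c2, c1, c0)  # close to the reported 7.1705e-5, 0.0041, 1.0743

    def ppue_model(T):
        """Quadratic pPUE model of Figure 4.4, used in Eq. (4.1)."""
        return 7.1705e-5 * T**2 + 0.0041 * T + 1.0743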


The curve can be calibrated given access to more measurement data. For the purpose of this work, our approach yields a tractable model that captures the overall CRAC efficiency across the entire spectrum of its operating modes. Our model is also useful for future studies on datacenter cooling energy.

Given the outside temperature $T_j$, which captures the geographical diversity of temperature, the total datacenter energy as a function of the workload $W_j$ can be expressed as

\[ E_j(W_j) = \bigl(C_j P_{\text{idle}} + (P_{\text{peak}} - P_{\text{idle}}) W_j\bigr)\cdot \text{pPUE}(T_j). \tag{4.1} \]

Here we implicitly assume that $T_j$ is known a priori and do not include it as a function variable. This is valid since short-term weather forecasts are fairly accurate and accessible.

A datacenter's electricity price is denoted as $P_j$. The price may additionally incorporate the environmental cost of generating electricity [40], which we do not consider here. In reality, electricity can be purchased from local day-ahead or hour-ahead forward markets at a pre-determined price [88]. Thus, we assume that $P_j$ is known a priori and remains fixed for the duration of a time slot. The total energy cost, including server and cooling power, is simply $P_j E_j(W_j)$.
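As a concrete illustration, a minimal sketch of this cost model in Python, reusing ppue_model from the earlier sketch and the server parameters of Sec. 4.6 ($P_{\text{peak}} = 200$ W, 50% of peak at idle); the function names are ours, not part of the chapter:

    P_PEAK = 200.0          # watts at full utilization (Sec. 4.6)
    P_IDLE = 0.5 * P_PEAK   # servers draw 50% of peak power when idle

    def datacenter_energy(workload, capacity, temp_c):
        """E_j(W_j) per Eq. (4.1): server power scaled by pPUE(T_j)."""
        server_power = capacity * P_IDLE + (P_PEAK - P_IDLE) * workload
        return server_power * ppue_model(temp_c)

    def energy_cost(workload, capacity, temp_c, price):
        """Energy cost P_j * E_j(W_j) for one time slot."""
        return price * datacenter_energy(workload, capacity, temp_c)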

4.3.3 Utility Loss

Request routing. The concept of utility loss captures the revenue lost due to user-perceived latency under given request routing decisions. Latency is arguably the most important performance metric for most cloud services. A small increase in user-perceived latency can cause substantial revenue loss for the provider [59]. We focus on the end-to-end propagation latency, which largely accounts for the user-perceived latency compared to other factors such as request processing times at datacenters [76]. The provider obtains the propagation latency $L_{ij}$ between user $i$ and datacenter $j$ through active measurements [67] or other means.


We use $\alpha_{ij}$ to denote the volume of requests routed from user $i \in \mathcal{I}$ to datacenter $j$, and $D_i$ to denote the demand of each user, which can be predicted using machine learning techniques [64, 81]. Here, a user is an aggregated group of customers from a common geographical region, which may be identified by a unique IP prefix. The lost revenue from user $i$ then depends on the average propagation latency $\sum_j \alpha_{ij} L_{ij} / D_i$ through a generic delay utility loss function $U_i$, which can take various forms depending on the cloud service. Our algorithm and proof work for general utility loss functions as long as $U_i$ is increasing, differentiable, and convex.

As a case study, here we use a quadratic function to model a user's increased tendency to leave the service as latency increases:

\[ U_i(\alpha_i) = q D_i \Bigl(\sum_{j\in\mathcal{J}} \alpha_{ij} L_{ij} / D_i\Bigr)^2, \tag{4.2} \]

where $q$ is the delay price that translates latency into monetary terms, and $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{i|\mathcal{J}|})^T$. Utility loss is clearly zero when the latency between the user and the datacenter is zero.

Capacity allocation. We denote the utility loss of allocating $\beta_j$ servers to batch workloads by a differentiable, decreasing, and convex function $V_j(\beta_j)$, since allocating more resources improves the performance of batch jobs. Unlike interactive services, batch jobs are delay tolerant and resource elastic. Utility functions such as the log function are often used to capture such elasticity. However, utility functions model the benefit of resource allocation. To model the utility loss of resource allocation, noting that the loss is zero when the capacity is fully allocated to batch jobs, an intuitive definition is of the following form:

\[ V_j(\beta_j) = r(\log C_j - \log \beta_j), \tag{4.3} \]

where $r$ is the utility price that converts the loss into monetary terms. Observe that it captures the intuition that increasing resources yields a decreasing marginal reduction of utility loss.
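Both loss functions are one-liners in code; a sketch mirroring Eqs. (4.2) and (4.3), with function names of our own choosing:

    import numpy as np

    def request_routing_loss(alpha_i, latencies, demand, q):
        """U_i(alpha_i) in Eq. (4.2): quadratic in the average latency."""
        avg_latency = np.dot(alpha_i, latencies) / demand
        return q * demand * avg_latency**2

    def capacity_allocation_loss(beta_j, capacity, r):
        """V_j(beta_j) in Eq. (4.3): zero when all capacity goes to batch."""
        return r * (np.log(capacity) - np.log(beta_j))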


4.3.4 Problem Formulation

We now formulate the temperature aware workload management problem. For a given request routing scheme $\alpha$, the total cost associated with interactive workloads can be written as

\[ \sum_{j\in\mathcal{J}} E_j\Bigl(\sum_{i\in\mathcal{I}} \alpha_{ij}\Bigr) P_j + \sum_{i\in\mathcal{I}} U_i(\alpha_i). \tag{4.4} \]

For a given capacity allocation decision $\beta$, the total cost associated with batch workloads is

\[ \sum_{j\in\mathcal{J}} E_j(\beta_j) P_j + \sum_{j\in\mathcal{J}} V_j(\beta_j). \tag{4.5} \]

Putting everything together, the optimization can be formulated as:

\begin{align}
\text{minimize}\quad & (4.4) + (4.5) \tag{4.6}\\
\text{subject to:}\quad & \forall i: \sum_{j\in\mathcal{J}} \alpha_{ij} = D_i, \tag{4.7}\\
& \forall j: \sum_{i\in\mathcal{I}} \alpha_{ij} \le C_j - \beta_j, \tag{4.8}\\
& \alpha, \beta \succeq 0, \tag{4.9}\\
\text{variables:}\quad & \alpha \in \mathbb{R}^{|\mathcal{I}|\times|\mathcal{J}|},\ \beta \in \mathbb{R}^{|\mathcal{J}|}.
\end{align}

(4.6) is the objective function that jointly considers the costs of request routing and capacity allocation. (4.7) is the workload conservation constraint, ensuring that user demand is satisfied. (4.8) is the datacenter capacity constraint, and (4.9) is the nonnegativity constraint.
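At small scale, the whole program (4.6)–(4.9) can be handed directly to an off-the-shelf convex solver. A minimal sketch with cvxpy on synthetic inputs (all names and numbers here are illustrative; ppue_model is the fitted curve from the earlier sketch), mainly to make the structure concrete; the distributed ADMM algorithm developed next is what makes production scales tractable:

    import cvxpy as cp
    import numpy as np

    I, J = 50, 13                        # toy scale; production has |I| ~ 1e5-1e6
    rng = np.random.default_rng(0)
    D = rng.uniform(10.0, 100.0, I)      # user demands D_i
    C = rng.uniform(1e3, 2e3, J)         # datacenter capacities C_j (servers)
    L = rng.uniform(10.0, 200.0, (I, J)) # latency matrix L_ij (ms)
    P = rng.uniform(30.0, 80.0, J)       # power prices P_j
    ppue = ppue_model(rng.uniform(-5.0, 35.0, J))  # pPUE at current temperatures
    q, r = 4e-6, 500.0                   # delay and utility loss prices (Sec. 4.6)
    P_PEAK, P_IDLE = 200.0, 100.0

    alpha = cp.Variable((I, J), nonneg=True)  # request routing alpha_ij
    beta = cp.Variable(J, nonneg=True)        # batch capacity allocation beta_j

    def cost_of(W):
        # P_j * E_j(W_j), with E_j from Eq. (4.1)
        return P @ cp.multiply(ppue, C * P_IDLE + (P_PEAK - P_IDLE) * W)

    lat = cp.sum(cp.multiply(alpha, L), axis=1)           # sum_j alpha_ij L_ij
    U = q * cp.sum(cp.multiply(1.0 / D, cp.square(lat)))  # Eq. (4.2) over all i
    V = r * cp.sum(np.log(C) - cp.log(beta))              # Eq. (4.3) over all j

    objective = cost_of(cp.sum(alpha, axis=0)) + U + cost_of(beta) + V
    constraints = [cp.sum(alpha, axis=1) == D,            # (4.7) demand
                   cp.sum(alpha, axis=0) + beta <= C]     # (4.8) capacity
    cp.Problem(cp.Minimize(objective), constraints).solve()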

4.3.5 Transforming to the ADMM Form

Problem (4.6) is a large-scale convex optimization problem. The number of users, i.e., unique IP prefixes, is typically $O(10^5)$–$O(10^6)$ for production systems.


Hence, our problem can have tens of millions of variables and millions of constraints. In such a setting, a distributed algorithm is preferable, in order to fully utilize the computing resources of datacenters. Traditionally, dual decomposition with subgradient methods [12] is used to develop distributed optimization algorithms. However, subgradient methods suffer from the curse of step sizes: for the final output to be close to the optimum, the step size must be strategically picked at each iteration, leading to the well-known problems of slow convergence and performance oscillation on large-scale problems. Therefore, we are motivated to develop a novel distributed algorithm based on ADMM, which is well suited to solving large-scale distributed convex optimization problems, as introduced in Chapter 3.2.

Our formulation (4.6) has a separable objective function due to the joint nature of the workload management problem. However, the request routing decision $\alpha$ and the capacity allocation decision $\beta$ are coupled by an inequality constraint rather than an equality constraint as in ADMM problems. Thus we introduce a slack variable $\gamma \in \mathbb{R}^{|\mathcal{J}|}$, and transform (4.6) into the following:

\begin{align}
\text{minimize}\quad & (4.4) + (4.5) + I_{\mathbb{R}_+^{|\mathcal{J}|}}(\gamma) \tag{4.10}\\
\text{subject to:}\quad & (4.7), (4.9),\\
& \forall j: \sum_i \alpha_{ij} + \beta_j + \gamma_j = C_j, \tag{4.11}\\
\text{variables:}\quad & \alpha \in \mathbb{R}^{|\mathcal{I}|\times|\mathcal{J}|},\ \beta \in \mathbb{R}^{|\mathcal{J}|},\ \gamma \in \mathbb{R}^{|\mathcal{J}|}.
\end{align}

Here, $I_{\mathbb{R}_+^{|\mathcal{J}|}}(\gamma)$ is an indicator function defined as

\[ I_{\mathbb{R}_+^{|\mathcal{J}|}}(\gamma) = \begin{cases} 0, & \gamma \succeq 0,\\ +\infty, & \text{otherwise.} \end{cases} \tag{4.12} \]

The new formulation (4.10) is equivalent to (4.6), since for any feasible $\alpha$ and $\beta$, $\gamma \succeq 0$ holds and the indicator function in the objective evaluates to zero. Clearly, it is in


the ADMM form, with the key difference that it has three sets of variables in the objective function and the equality constraint (4.11). The convergence of the generalized m-block ADMM, where $m \ge 3$, has long remained an open question. Though it seems natural to directly extend the classical 2-block algorithm to the m-block case, such an algorithm may not converge unless an additional back-substitution step is taken [49]. Recently, progress has been made in [48, 53], which prove the convergence of m-block ADMM for strongly convex objective functions, and its linear convergence under a full-column-rank relation matrix. However, the relation matrix in our setup is not full column rank. Thus, we need a new proof of linear convergence under a general relation matrix, together with a distributed algorithm inspired by the proof.

4.4 Theory

This section first introduces a generalized m-block ADMM algorithm inspired by [48, 53]. Then a new convergence proof is presented, which replaces the full-column-rank assumption with mild assumptions on the objective function and further simplifies the proof in [53]. The notation and discussion in this section are intentionally kept independent of the other parts of the chapter, in order to present the proof in a mathematically general way.

4.4.1 Algorithm

We consider a convex optimization problem of the form

\begin{align}
\min\quad & \sum_{i=1}^{m} f_i(x_i) \tag{4.13}\\
\text{s.t.}\quad & \sum_{i=1}^{m} A_i x_i = b
\end{align}


with variables $x_i \in \mathbb{R}^{n_i}$ $(i = 1,\ldots,m)$, where $f_i: \mathbb{R}^{n_i} \to \mathbb{R}$ $(i = 1,\ldots,m)$ are closed proper convex functions, $A_i \in \mathbb{R}^{l\times n_i}$ $(i = 1,\ldots,m)$ are given matrices, and $b \in \mathbb{R}^l$ is a given vector.

We form the augmented Lagrangian

\[ L_\rho(x_1,\ldots,x_m;\ y) = \sum_{i=1}^m f_i(x_i) + y^T\Bigl(\sum_{i=1}^m A_i x_i - b\Bigr) + \frac{\rho}{2}\Bigl\|\sum_{i=1}^m A_i x_i - b\Bigr\|_2^2. \tag{4.14} \]
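Numerically, (4.14) is straightforward to evaluate; a small helper (our own naming), with f and A given as lists of callables and matrices:

    import numpy as np

    def augmented_lagrangian(f, A, b, x, y, rho):
        """L_rho(x_1,...,x_m; y) of Eq. (4.14)."""
        residual = sum(Ai @ xi for Ai, xi in zip(A, x)) - b
        return (sum(fi(xi) for fi, xi in zip(f, x))
                + y @ residual
                + 0.5 * rho * residual @ residual)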

As in [53], the generalized ADMM algorithm iterates as follows:

\begin{align*}
x_i^{k+1} &= \arg\min_{x_i} L_\rho\bigl(x_1^{k+1},\ldots,x_{i-1}^{k+1}, x_i, x_{i+1}^k,\ldots,x_m^k;\ y^k\bigr), \quad i = 1,\ldots,m,\\
y^{k+1} &= y^k + \varrho\Bigl(\sum_{i=1}^m A_i x_i^{k+1} - b\Bigr),
\end{align*}

where $\varrho > 0$ is the step size for the dual update. Note that the step size $\varrho$ is different from the penalty parameter $\rho$ in the generalized m-block ADMM algorithm, for otherwise it may not converge [49].
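In code, the Gauss-Seidel sweep over blocks is a short loop; a generic sketch (our own naming; the per-block argmin is supplied by the caller as a solver for its sub-problem):

    import numpy as np

    def generalized_admm(argmin_block, A, b, m, x0, rho, varrho, iters=100):
        """Generic m-block ADMM: sequentially re-minimize the augmented
        Lagrangian over each block x_i, then take a dual step of size varrho.

        argmin_block(i, x, y) must return the minimizer over x_i of
        L_rho(x_1..x_{i-1}, x_i, x_{i+1}..x_m; y), given the other blocks."""
        x, y = [xi.copy() for xi in x0], np.zeros(b.shape)
        for _ in range(iters):
            for i in range(m):                     # Gauss-Seidel block sweep
                x[i] = argmin_block(i, x, y)
            residual = sum(A[i] @ x[i] for i in range(m)) - b
            y = y + varrho * residual              # dual update, step size varrho
        return x, y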

For comparison, the method of multipliers for (4.13) has the form

\begin{align}
(x_1^{k+1},\ldots,x_m^{k+1}) &= \arg\min_{x_1,\ldots,x_m} L_\rho(x_1,\ldots,x_m;\ y^k), \tag{4.15}\\
y^{k+1} &= y^k + \varrho\Bigl(\sum_{i=1}^m A_i x_i^{k+1} - b\Bigr).
\end{align}

Here, the augmented Lagrangian is minimized jointly rather than sequentially.

4.4.2 Assumptions

We discuss several assumptions and their implications, which are needed for the convergence analysis. For convenience, we write

\[ x = (x_1 \cdots x_m)^T, \quad f(x) = \sum_{i=1}^m f_i(x_i), \quad A = [A_1 \ldots A_m]. \]


Clearly, $x \in \mathbb{R}^n$ and $A \in \mathbb{R}^{l\times n}$, where $n = \sum_{i=1}^m n_i$. Problem (4.13) can be rewritten as

\[ \min\ f(x) \quad \text{s.t.}\ Ax = b, \]

with the optimal value denoted by

\[ p^* = \inf\{f(x) \mid Ax = b\}. \]

Similarly, the augmented Lagrangian can be rewritten as

\[ L_\rho(x; y) = f(x) + y^T(Ax - b) + (\rho/2)\|Ax - b\|_2^2, \]

with the associated dual function

\[ d(y) = \inf_x L_\rho(x; y). \]

The dual problem is $\max\ d(y)$, with the optimal value $d^* = \sup\{d(y)\}$.

Assumption 4.1. The unaugmented Lagrangian $L_0$ has a saddle point. Explicitly, there exists $(x^*, y^*)$, not necessarily unique, for which

\[ L_0(x^*; y) \le L_0(x^*; y^*) \le L_0(x; y^*) \]

holds for all $x, y$.

Assumption 4.1 implies that $x^*$ is primal optimal, $y^*$ is dual optimal, and the optimal duality gap is zero, i.e., $p^* = d^*$.

When Assumption 4.1 fails to hold, some subproblems in the generalized ADMM algorithm are either unsolvable or unbounded, or the sequence $\{y^k\}$ in the algorithm diverges.

Assumption 4.2. The functions $f_i$ $(i = 1,\ldots,m)$ are strongly convex.

A function $f: \mathbb{R}^n \to \mathbb{R}$ is strongly convex with constant $\nu > 0$ if for all $x_1, x_2 \in \mathbb{R}^n$ and all $\theta \in [0,1]$,

\[ f(\theta x_1 + (1-\theta)x_2) \le \theta f(x_1) + (1-\theta)f(x_2) - \tfrac{1}{2}\nu\theta(1-\theta)\|x_1 - x_2\|_2^2. \]

When $f$ is differentiable, $f$ is strongly convex with constant $\nu$ if and only if

\[ f(x_1) \ge f(x_2) + \nabla f(x_2)^T(x_1 - x_2) + \tfrac{\nu}{2}\|x_1 - x_2\|_2^2 \tag{4.16} \]

holds for all $x_1, x_2 \in \mathbb{R}^n$. Thus, (4.16) can be viewed as a first-order condition for strong convexity.

When $f$ is twice continuously differentiable, a second-order condition for strong convexity is

\[ \nabla^2 f(x) \succeq \nu I_n, \quad \forall x \in \mathbb{R}^n. \]

Here, $\nabla^2 f$ denotes the Hessian matrix of $f$, $I_n$ denotes the $n\times n$ identity matrix, and the inequality $\succeq$ means that $\nabla^2 f(x) - \nu I_n$ is positive semidefinite.

Note that strong convexity is not a strong assumption in engineering practice, because a convex function $f(x)$ can always be well approximated by a strongly convex function $\tilde f(x)$. For instance, if we choose $\tilde f(x) = f(x) + \epsilon\|x\|_2^2$ for some sufficiently small $\epsilon > 0$, then $\tilde f(x)$ is strongly convex with constant $2\epsilon$.

Assumption 4.3. The gradients $\nabla f_i$ $(i = 1,\ldots,m)$ are Lipschitz continuous.


That is, for each $i$ there exists some constant $\kappa_i > 0$ such that for all $x_1, x_2 \in \mathbb{R}^{n_i}$,

\[ \|\nabla f_i(x_1) - \nabla f_i(x_2)\|_2 \le \kappa_i \|x_1 - x_2\|_2. \]

4.4.3 Convergence

In this section, we prove the convergence of the generalized ADMM algorithm. We outline the main idea of the analysis here; some technical details are deferred to the Appendix for better readability.

Define the primal and dual optimality gaps as

\[ \Delta_p^k = L_\rho(x^{k+1}; y^k) - d(y^k), \qquad \Delta_d^k = d^* - d(y^k), \]

respectively. By the definition of $d(y)$, $\Delta_p^k \ge 0$; similarly, $\Delta_d^k \ge 0$. Define

\[ V^k = \Delta_p^k + \Delta_d^k. \]

We will see that $V^k$ is a Lyapunov function for the algorithm, i.e., a nonnegative quantity that decreases in each iteration.

Our proof relies on three key inequalities. The first is

\[ V^k \le V^{k-1} - \varrho\|Ax^{k+1} - b\|_2^2 - \vartheta\|x^{k+1} - x^k\|_2^2 \tag{4.17} \]

for some constant $\vartheta > 0$, where $x^{k+1}$ is defined in (4.15). This states that $V^k$ decreases in each iteration by an amount that depends on the norm of $Ax^{k+1} - b$ and on the change in each $x_i$ over one iteration. Iterating the inequality


above gives

\[ \sum_{k=0}^{\infty} \bigl(\varrho\|Ax^{k+1} - b\|_2^2 + \vartheta\|x^{k+1} - x^k\|_2^2\bigr) \le V^0, \]

which implies that

\[ \|Ax^{k+1} - b\|_2^2 \to 0 \quad\text{and}\quad \|x^{k+1} - x^k\|_2^2 \to 0 \]

as $k \to \infty$.

Suppose that the level set of $\Delta_p + \Delta_d$ is bounded, i.e.,

\[ \bar\delta = \sup\bigl\{\|x\| + \|y\| \ \big|\ (L_\rho(x; y) - d(y)) + (d^* - d(y)) \le V^0\bigr\} < \infty. \tag{4.18} \]

Since $V^k \le V^0$ for all $k$, it follows that the sequence $\{x^{k+1}, y^k\}$ generated by the generalized ADMM algorithm is bounded: $\|x^{k+1}\| + \|y^k\| \le \bar\delta$ for all $k$. In particular, this implies that $\|x^k\|, \|y^k\| \le \bar\delta$ for all $k$. By the Bolzano-Weierstrass theorem, the sequence $\{x^k, y^k\}$ has a convergent subsequence, i.e.,

\[ \lim_{k\in\mathcal{R},\,k\to\infty} (x^k, y^k) = (\bar x, \bar y) \]

for some subsequence $\mathcal{R}$, where $(\bar x, \bar y)$ denotes the limit point. It is natural to expect that $\bar x$ is primal optimal and $\bar y$ is dual optimal; we will see that this is indeed the case.

The second key inequality says that there exists a constant $\tau > 0$ such that for any $(x, y)$ satisfying $\|x\| + \|y\| \le 2\bar\delta$, the following Hoffman-like error bound holds:

\[ \|x - x(y)\| \le \tau\|\nabla_x L_\rho(x; y)\|, \tag{4.19} \]

where $x(y) = \arg\min_x L_\rho(x; y)$.


In particular, inequality (4.19) implies that

\[ \|x^k - x^{k+1}\|_2^2 \le \tau^2\|\nabla_x L_\rho(x^k; y^k)\|_2^2, \tag{4.20} \]

because $\|x^k\|, \|y^k\| \le \bar\delta$ for all $k$ and $x^{k+1}$ minimizes $L_\rho(x; y^k)$.

The third inequality is

\[ \|\nabla_x L_\rho(x^k; y^k)\|_2 \le \eta\|x^k - x^{k+1}\|_2. \tag{4.21} \]

Thus, we have

\[ \|x^k - x^{k+1}\|_2^2 \le \tau^2\eta^2\|x^k - x^{k+1}\|_2^2. \]

It follows that

\[ \lim_{k\in\mathcal{R},\,k\to\infty} \|x^k - x^{k+1}\| = 0. \]

This further implies that the subsequence $\{x^{k+1}\}_{k\in\mathcal{R}}$ converges to $\bar x$ as $k \to \infty$. Since $\|Ax^{k+1} - b\|_2 \to 0$, we have $\|A\bar x - b\| = 0$, or equivalently, $A\bar x - b = 0$.

Now we are ready to show that $\bar x$ is primal optimal and $\bar y$ is dual optimal. Note that

\begin{align}
d^* - d(y^k) &= p^* - L_\rho(x^{k+1}; y^k) \tag{4.22}\\
&= p^* - f(x^{k+1}) - (y^k)^T(Ax^{k+1} - b) - \frac{\rho}{2}\|Ax^{k+1} - b\|_2^2, \tag{4.23}
\end{align}

where (4.22) follows from Assumption 4.1 and the fact that $x^{k+1}$ minimizes $L_\rho(x; y^k)$. Taking the limit in (4.23) along the subsequence $\mathcal{R}$, we obtain

\begin{align*}
d^* - d(\bar y) &= p^* - f(\bar x) - \bar y^T(A\bar x - b) - \frac{\rho}{2}\|A\bar x - b\|_2^2\\
&= p^* - f(\bar x),
\end{align*}


where the last equality follows from the fact that $A\bar x - b = 0$.

Recall that $d^* \ge d(\bar y)$, and $p^* \le f(\bar x)$ when $A\bar x - b = 0$. Thus, we have $d^* = d(\bar y)$ and $p^* = f(\bar x)$. That is, for each convergent subsequence $\{x^k, y^k\}_{k\in\mathcal{R}}$, the associated limit point $(\bar x, \bar y)$ is an optimal primal-dual solution.

Next, we show that $V^k = \Delta_p^k + \Delta_d^k \to 0$ as $k \to \infty$. On the one hand, we have

\[ \lim_{k\in\mathcal{R},\,k\to\infty} \Delta_d^k = d^* - d(\bar y) = 0. \]

On the other hand, we have

\begin{align*}
\lim_{k\in\mathcal{R},\,k\to\infty} \Delta_p^k &\le \lim_{k\in\mathcal{R},\,k\to\infty} L_\rho(x^k; y^k) - d(y^k)\\
&= L_\rho(\bar x; \bar y) - d(\bar y)\\
&= p^* - d^* = 0.
\end{align*}

Since $\Delta_p^k \ge 0$ for all $k$, we conclude that

\[ \lim_{k\in\mathcal{R},\,k\to\infty} \Delta_p^k = 0, \]

and thus

\[ \lim_{k\in\mathcal{R},\,k\to\infty} \Delta_p^k + \Delta_d^k = 0. \]

Recall that $V^k$ decreases in each iteration, i.e., $V^{k+1} \le V^k$ for all $k$. Thus, the convergence of a subsequence of $V^k$ implies the convergence of $V^k$, and we have

\[ \lim_{k\to\infty} \Delta_p^k + \Delta_d^k = 0. \]


This further implies that both $\Delta_p^k$ and $\Delta_d^k$ converge to 0, i.e.,

\[ \lim_{k\to\infty} \Delta_p^k = \lim_{k\to\infty} \Delta_d^k = 0. \]

To sum up, we have the following theorem describing the convergence of the generalized ADMM algorithm.

Theorem 4.1. Suppose Assumptions 4.1, 4.2, and 4.3 hold and that the level set of $\Delta_p + \Delta_d$ is bounded. Then both the primal gap $\Delta_p^k$ and the dual gap $\Delta_d^k$ converge to 0. Moreover, the sequence $\{x^k, y^k\}$ has a convergent subsequence, and any convergent subsequence of $\{x^k, y^k\}$ converges to an optimal primal-dual solution of problem (4.13).

Remark 4.1. Although $\Delta_p^k + \Delta_d^k$ decreases in each iteration, there is no guarantee that the primal gap $\Delta_p^k$ is reduced in each iteration. The same applies to the dual gap $\Delta_d^k$.

Remark 4.2. With some additional argument, we can show that the sequence $\{\Delta_p^k + \Delta_d^k\}$ converges to zero Q-linearly, and that both $\Delta_p^k$ and $\Delta_d^k$ converge to zero R-linearly⁴. The key step is to show that $\{\Delta_p^k + \Delta_d^k\}$ contracts geometrically, i.e.,

\[ \Delta_p^{k+1} + \Delta_d^{k+1} \le \mu(\Delta_p^k + \Delta_d^k) \]

for some $\mu \in (0,1)$. Since the key step can be established using an argument similar to the proof of Theorem 3.1 in [53], we omit the proof here due to space constraints.

Remark 4.3. The linear convergence of m-block ADMM is difficult, if not impossible, to show under the framework of [48]. Thus, our result extends the result in [48] from convergence to linear convergence.

⁴Suppose a sequence $\{u^k\}$ converges to $\bar u$. We say the convergence is (in some norm $\|\cdot\|$) Q-linear if there exists $\mu \in (0,1)$ such that $\|u^{k+1} - \bar u\| \le \mu\|u^k - \bar u\|$, and R-linear if there exists a sequence $\{\sigma^k\}$ such that $\|u^k - \bar u\| \le \sigma^k$ and $\sigma^k \to 0$ Q-linearly.


The detailed proofs of the three key inequalities are presented in Sec. 7 for readability.

4.5 A Distributed Algorithm

We now develop a distributed solution algorithm based on the generalized ADMM algorithm of Sec. 4.4.1. Directly applying that algorithm to our problem (4.10) would lead to a centralized algorithm. The reason is that when the augmented Lagrangian is minimized over $\alpha$, the penalty term $\sum_j\bigl(\sum_i \alpha_{ij} + \beta_j + \gamma_j - C_j\bigr)^2$ couples the $\alpha_{ij}$'s across $i$, while the utility loss $\sum_i U_i(\alpha_i)$ couples the $\alpha_{ij}$'s across $j$. The joint optimization of the utility loss and the quadratic penalty is particularly difficult to solve, especially when the number of users is large, since $U_i(\alpha_i)$ can take any general form. If they can be separated, then we have a distributed algorithm in which each $U_i(\alpha_i)$ is optimized in parallel and the quadratic penalty term is optimized efficiently with existing methods.

Towards this end, we introduce a new set of auxiliary variables $a_{ij} = \alpha_{ij}$, and reformulate problem (4.10) as follows:

\begin{align}
\text{minimize}\quad & \sum_j E_j\Bigl(\sum_i a_{ij}\Bigr) P_j + \sum_i U_i(\alpha_i) + (4.5) + I_{\mathbb{R}_+^{|\mathcal{J}|}}(\gamma) \nonumber\\
\text{subject to:}\quad & (4.7), (4.9), \nonumber\\
& \forall j: \sum_i a_{ij} + \beta_j + \gamma_j = C_j, \nonumber\\
& \forall i, j: a_{ij} = \alpha_{ij}, \nonumber\\
\text{variables:}\quad & a, \alpha \in \mathbb{R}^{|\mathcal{I}|\times|\mathcal{J}|},\ \beta, \gamma \in \mathbb{R}^{|\mathcal{J}|}. \tag{4.24}
\end{align}

This is a 4-block ADMM problem, where $a_{ij}$ replaces $\alpha_{ij}$ in the objective function and in constraint (4.11), wherever the coupling happens across users $i$. This is the key step that enables the decomposition of the $\alpha$-minimization problem.


The augmented Lagrangian can then be readily obtained from (4.14). Omitting the irrelevant terms, we can see that at each iteration $k+1$ the $\alpha$-minimization problem is

\begin{align}
\min\quad & \sum_i U_i(\alpha_i) - \sum_j \sum_i \Bigl(\varphi_{ij}\alpha_{ij} - \frac{\rho}{2}\bigl(\alpha_{ij}^2 - 2\alpha_{ij}a_{ij}^k\bigr)\Bigr) \nonumber\\
\text{s.t.}\quad & \forall i: \sum_j \alpha_{ij} = D_i,\ \alpha_i \succeq 0, \tag{4.25}
\end{align}

where $\varphi_{ij}$ is the dual variable for the equality constraint $a_{ij} = \alpha_{ij}$. This is clearly decomposable over $i$ into $|\mathcal{I}|$ per-user sub-problems, since the objective function and the constraints are separable over $i$. Each per-user sub-problem is of a much smaller scale, with only $|\mathcal{J}|$ variables and $|\mathcal{J}|+1$ constraints, and is easy to solve even though it is a non-linear problem for a general $U_i$.

Some may now wonder whether the auxiliary variable $a$ is hard to solve for. The $a$-minimization problem is

\begin{align}
\min\quad & \sum_j \Bigl( E_j\Bigl(\sum_i a_{ij}\Bigr) P_j + \sum_i a_{ij}(\lambda_j^k + \varphi_{ij}^k) \nonumber\\
&\qquad + \frac{\rho}{2}\Bigl(\sum_i a_{ij}\Bigr)^2 + \rho\sum_i a_{ij}\bigl(\beta_j^k + \gamma_j^k - C_j + 0.5\,a_{ij} - \alpha_{ij}^{k+1}\bigr)\Bigr) \nonumber\\
\text{s.t.}\quad & a \succeq 0, \tag{4.26}
\end{align}

where $\lambda_j$ is the dual variable for the capacity constraint (4.8). This is decomposable over $j$ into $|\mathcal{J}|$ per-datacenter sub-problems. Moreover, each per-datacenter sub-problem is a standard quadratic program. Though large-scale, it can be transformed into a second-order cone program and solved efficiently (see Appendix 7.6).

second-order cone program and solved e�ciently (see Appendix 7.6).

The $\beta$- and $\gamma$-minimization steps are clearly decomposable over $j$. The entire procedure is summarized below.

Distributed 4-block ADMM. Initialize $a, \alpha, \beta, \gamma, \lambda, \varphi$ to 0. For $k = 0, 1, \ldots$, repeat:


1. $\alpha$-minimization: Each user solves the following sub-problem for $\alpha_i^{k+1}$:

\begin{align}
\min\quad & U_i(\alpha_i) - \sum_j \Bigl(\varphi_{ij}\alpha_{ij} - \frac{\rho}{2}\bigl(\alpha_{ij}^2 - 2\alpha_{ij}a_{ij}^k\bigr)\Bigr) \nonumber\\
\text{s.t.}\quad & \sum_j \alpha_{ij} = D_i, \quad \alpha_i \succeq 0. \tag{4.27}
\end{align}

2. $a$-minimization: Each datacenter solves the following sub-problem for $a_j^{k+1} = (a_{1j}^{k+1}, \ldots, a_{|\mathcal{I}|j}^{k+1})^T$:

\begin{align}
\min\quad & E_j\Bigl(\sum_i a_{ij}\Bigr) P_j + \sum_i a_{ij}(\lambda_j^k + \varphi_{ij}^k) + \frac{\rho}{2}\Bigl(\sum_i a_{ij}\Bigr)^2 \nonumber\\
& + \rho\sum_i a_{ij}\bigl(\beta_j^k + \gamma_j^k - C_j + 0.5\,a_{ij} - \alpha_{ij}^{k+1}\bigr) \nonumber\\
\text{s.t.}\quad & a_j \succeq 0. \tag{4.28}
\end{align}

3. $\beta$-minimization: Each datacenter solves the following sub-problem for $\beta_j^{k+1}$:

\begin{align*}
\min\quad & E_j(\beta_j) P_j + V_j(\beta_j) + \lambda_j^k\beta_j + \frac{\rho}{2}\Bigl(\sum_i a_{ij}^{k+1} + \beta_j + \gamma_j^k - C_j\Bigr)^2\\
\text{s.t.}\quad & \beta_j \ge 0.
\end{align*}

4. $\gamma$-minimization:

\[ \gamma_j^{k+1} = \max\Bigl\{0,\ C_j - \frac{\lambda_j^k}{\rho} - \sum_i a_{ij}^{k+1} - \beta_j^{k+1}\Bigr\}, \quad \forall j, \]

which is the unconstrained minimizer of the augmented Lagrangian over $\gamma_j$, projected onto the nonnegative orthant.

5. Dual update: Each datacenter updates $\lambda_j$ for the capacity constraint (4.8):

\[ \lambda_j^{k+1} = \lambda_j^k + \varrho\Bigl(\sum_i a_{ij}^{k+1} + \beta_j^{k+1} + \gamma_j^{k+1} - C_j\Bigr). \]


Each user updates $\varphi_{ij}$ for the equality constraint $a_{ij} = \alpha_{ij}$:

\[ \varphi_{ij}^{k+1} = \varphi_{ij}^k + \varrho\bigl(a_{ij}^{k+1} - \alpha_{ij}^{k+1}\bigr), \quad \forall j. \]
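Putting the five steps together, one iteration has the following skeleton (a sketch with our own structure and names; each solve_* callable stands in for a per-user or per-datacenter sub-problem solver, which would run in parallel across servers):

    import numpy as np

    def admm_iteration(state, C, rho, varrho,
                       solve_alpha_i, solve_a_j, solve_beta_j):
        """One sweep of the distributed 4-block ADMM (steps 1-5 above)."""
        a, alpha, beta, gamma, lam, phi = state
        num_users, num_dcs = alpha.shape
        for i in range(num_users):        # step 1: parallel over users
            alpha[i] = solve_alpha_i(i, a[i], phi[i])
        for j in range(num_dcs):          # step 2: parallel over datacenters
            a[:, j] = solve_a_j(j, alpha[:, j], beta[j], gamma[j],
                                lam[j], phi[:, j])
        for j in range(num_dcs):          # step 3: parallel over datacenters
            beta[j] = solve_beta_j(j, a[:, j].sum(), gamma[j], lam[j])
        # Step 4: closed-form projection onto the nonnegative orthant.
        gamma = np.maximum(0.0, C - lam / rho - a.sum(axis=0) - beta)
        # Step 5: dual updates with step size varrho.
        lam = lam + varrho * (a.sum(axis=0) + beta + gamma - C)
        phi = phi + varrho * (a - alpha)
        return a, alpha, beta, gamma, lam, phi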

The distributed nature of our algorithm allows for an efficient parallel implementation in datacenters with a large number of servers. In step 1, the per-user sub-problems (4.27) can be solved in parallel on individual servers; see the sketch below for a concrete instantiation. Since (4.27) is a small-scale convex optimization, as discussed above, its complexity is low. A multi-threaded implementation can further speed up the algorithm on multi-core hardware. The penalty parameter $\rho$ and the utility loss functions $U_i$ can be configured at each server before the algorithm runs. Steps 2 and 3 each involve solving $|\mathcal{J}|$ per-datacenter sub-problems, which can also be solved in parallel with only $|\mathcal{J}|$ servers.
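For instance, with the quadratic loss (4.2), the per-user sub-problem (4.27) reduces, up to an additive constant, to a proximal step that a few lines of cvxpy can solve; a sketch with our own names, which could serve as the solve_alpha_i callable in the skeleton above:

    import cvxpy as cp

    def solve_alpha_subproblem(latencies, D_i, a_i, phi_i, q, rho):
        """Per-user sub-problem (4.27) with U_i from Eq. (4.2). Its penalty
        terms equal (rho/2)*||alpha_i - a_i||^2 - phi_i'alpha_i + const."""
        alpha_i = cp.Variable(len(a_i), nonneg=True)
        U_i = (q / D_i) * cp.square(latencies @ alpha_i)
        obj = U_i - phi_i @ alpha_i + 0.5 * rho * cp.sum_squares(alpha_i - a_i)
        cp.Problem(cp.Minimize(obj), [cp.sum(alpha_i) == D_i]).solve()
        return alpha_i.value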

For further speed-up, our algorithm can be terminated before convergence is achieved. This is a feature of ADMM: it is not sensitive to step sizes and usually finds a solution of modest accuracy within tens of iterations [13]. An early-braking mechanism may therefore be safely applied to terminate the algorithm after a certain number of iterations without unpredictable performance loss, which further highlights the practicality of the algorithm. The message passing overhead is low for datacenter interconnects, which are designed to handle bulky data transfers [4].

Convergence of our algorithm. It only remains to show that the convergence result of Sec. 4.4 holds for our problem. Based on Theorem 4.1, it suffices to check whether the three assumptions hold. Assumptions 4.1 and 4.3 clearly hold for convex and differentiable utility loss functions. Assumption 4.2 requires the objective function to be strongly convex, which may or may not hold depending on the specific form of the utility loss functions. When the utility loss of capacity allocation $V_j(\beta_j)$ is modeled by a log function as in (4.3), $V(\beta) = \sum_j V_j(\beta_j)$ is strongly convex, because its


Hessian matrix

\[ \nabla^2 V(\beta) = r \begin{pmatrix} 1/\beta_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1/\beta_{|\mathcal{J}|}^2 \end{pmatrix} \]

satisfies $\nabla^2 V(\beta) \succeq \nu I_{|\mathcal{J}|} \succ 0$, where $\nu = \min_j\{r/C_j^2\}$. When the utility loss of request routing $U_i(\alpha_i)$ is modeled by a quadratic function of the average latency as in (4.2), $U_i(\alpha_i)$ is not strongly convex. Nevertheless, $U_i(\alpha_i)$ can be well approximated by

\[ \tilde U_i(\alpha_i) = U_i(\alpha_i) - \epsilon \sum_j \frac{\alpha_{ij}}{D_i} \log\frac{D_i}{\alpha_{ij}} \]

for some sufficiently small $\epsilon > 0$. It is easily verified that $\tilde U_i(\alpha_i)$ is strongly convex with constant $\epsilon/D_i$. Moreover, there is an engineering advantage to this approximation. Note that the entropy of a request routing scheme $\alpha_i$ is given by $\sum_j \frac{\alpha_{ij}}{D_i}\log\bigl(\frac{D_i}{\alpha_{ij}}\bigr)$: the larger the entropy, the better the request routing in terms of diversity. Thus, the second term in $\tilde U_i(\alpha_i)$ tends to increase the entropy, which in turn improves diversity.

4.6 Evaluation

We perform trace-driven simulations to realistically assess the potential of temperature

aware workload management.

4.6.1 Setup

We rely on the Wikipedia request traces [101] to represent the interactive workloads of a cloud service. The dataset we use contains, among other things, 10% of all user requests issued to Wikipedia during the 24-hour period from January 1, 2008 UTC to January 2, 2008 UTC. The workloads are normalized to a number of servers, assuming that each request requires 10% of a server's CPU. The traces reflect the diurnal pattern


of real-world interactive workloads. Workload prediction can be done accurately, as demonstrated by previous work [64, 81], and we do not consider the effect of prediction error here. The optimization is solved hourly.

We consider Google's infrastructure [2] to represent a geo-distributed cloud, as discussed in Sec. 4.2.3. Each datacenter's capacity $C_j$ is uniformly distributed in $[1, 2]\times 10^5$ servers. The empirical CRAC efficiency model developed in Sec. 4.3.2 is used to derive the total energy consumption at all 13 locations under different temperatures. We use the 2011 annual average day-ahead on-peak prices [37] at the local markets as the power prices $P_j$ for the 6 U.S. locations⁵. For non-U.S. locations, the power price is calculated based on the retail industrial power price available on the local utility company websites, with a 50% wholesale discount. Table 4.4 lists the power prices at each location. The servers have peak power $P_{\text{peak}} = 200$ W and consume 50% of peak power at idle. These numbers represent state-of-the-art datacenter hardware [36, 88].

Council Bluffs, IA    42.73     Berkeley County, SC     44.44
The Dalles, OR        32.57     Lenoir, NC              40.68
Mayes County, OK      36.41     Douglas County, GA      39.97
Quilicura, Chile      75.69     St. Ghislain, Belgium   50.50
Hamina, Finland       43.84     Dublin, Ireland         50.62
Hong Kong             36.12     Taiwan                  31.00
Singapore             66.72

Table 4.4: Power prices (USD/MWh) at different locations.

To calculate the utility loss of interactive workloads, we rely on iPlane [67], a system that collects wide-area network statistics from PlanetLab vantage points, to obtain the latency matrix $L$. Since the Wikipedia traces do not contain client-side information, we emulate the geographical diversity of user requests by splitting the total interactive workloads among users following a normal distribution. We set the number of users $|\mathcal{I}| = 10^5$, and choose $10^5$ IP prefixes from a RouteViews [1] dump. We then extract

⁵The U.S. electricity market consists of multiple regional markets, each with several hubs that have their own pricing. We thus use the price of the specific hub in which each U.S. datacenter is located.


the corresponding round-trip times from iPlane logs, which contain traceroutes made to IP addresses from PlanetLab nodes. We only use latency measurements from PlanetLab nodes that are close to our datacenter locations, so as to resemble the user-datacenter latency. We use the utility loss functions defined in (4.2) and (4.3), with delay price $q = 4\times 10^{-6}$ and utility loss price for batch jobs $r = 500$.

4.6.2 Temperature Aware Workload Management

We first investigate the performance of temperature aware workload management. We benchmark our ADMM algorithm, referred to as Optimal, against three baseline strategies, which use different amounts of information in managing workloads.

The first benchmark, called Baseline, is a temperature-agnostic strategy that considers the capacity allocation and request routing aspects of the workload management problem separately. It first allocates capacity to batch jobs by minimizing the back-end total cost with (4.5) as the objective. The remaining capacity is used to solve the request routing optimization with (4.4) as the objective. Only the electricity price diversity is used, and cooling energy is calculated with a constant pPUE of 1.2 in the two cost minimization problems. Though naive, such an approach is widely used in current Internet-scale cloud services. It also allows an implicit comparison with prior work [40, 63, 65, 88, 90].

The second benchmark, called Capacity optimized, improves upon Baseline by jointly solving capacity allocation and request routing, but still ignores the cooling energy efficiency diversity. It demonstrates the impact of capacity allocation on datacenter workload management.

The third benchmark, called Cooling optimized, improves upon Baseline by exploiting the temperature and cooling efficiency diversity in minimizing cost, but does not jointly manage the interactive and batch workloads. It demonstrates the impact of being temperature aware.

We run the four strategies above on our 24-hour traces for each day of January 2011, using the empirical hourly temperature data collected in Sec. 4.2.3. The distributed ADMM algorithm is used to solve them until convergence is achieved; a detailed discussion of convergence is deferred to Sec. 4.6.3. The results are thus averaged over 31 runs.

Cooling energy savings

The central idea of this chapter is to save datacenter cost through temperature aware workload management that exploits the cooling efficiency diversity via capacity allocation. We examine the effectiveness of our approach by first comparing the cooling energy consumption. Figure 4.5 shows the results.

In particular, Figure 4.5a shows that, overall, Optimal saves 15%–20% of cooling energy compared to Baseline. A breakdown of the saving in the same figure reveals that dynamic capacity allocation provides 10%–15% and cooling efficiency diversity provides 5%–10% of the saving, respectively. Note that this saving is achieved with cutting-edge CRACs whose efficiency is already substantially improved by outside air cooling capability. The results confirm that our temperature aware workload management is able to further optimize the cooling efficiency and cost of geo-distributed datacenters.

Figures 4.5b and 4.5c show a detailed breakdown of the cooling energy cost. The cooling cost attributed to interactive workloads, as in Figure 4.5b, exhibits a diurnal pattern and peaks between 2:00 and 8:00 UTC (21:00 to 3:00 EST, 18:00 to 0:00 PST), implying that most of the Wikipedia traffic originates from the U.S. The four strategies perform fairly closely, while Baseline and Capacity optimized consistently incur more cooling energy cost due to their cooling-agnostic nature, which underestimates the overall energy cost.

The cooling cost attributed to batch workloads is shown in Figure 4.5c. Baseline incurs the highest cost, since it underestimates the energy cost and runs more batch workloads than necessary. Cooling optimized improves upon Baseline by taking the cooling efficiency diversity into account and reducing batch workloads as a result. Both strategies fail to


[Figure: (a) overall cooling energy saving (0 to 0.25) relative to Baseline for Optimal, Capacity optimized, and Cooling optimized; (b) cooling cost of interactive workloads ($10^3$); (c) cooling cost of batch workloads ($10^3$); all over 24 hours.]

Figure 4.5: Cooling energy savings. Time is in UTC.


[Figure: (a) overall utility loss reduction (0 to 0.2) relative to Baseline for Optimal, Capacity optimized, and Cooling optimized; (b) utility loss of interactive workloads ($10^3$); (c) utility loss of batch workloads ($10^3$); all over 24 hours.]

Figure 4.6: Utility loss reductions. Time is in UTC.


exploit the trade-off with interactive workloads. Thus their cooling cost closely follows the daily temperature trend: it gradually decreases from 0:00 to 12:00 UTC (19:00 to 7:00 EST) and then slowly increases from 12:00 to 20:00 UTC (7:00 to 15:00 EST). Capacity optimized adjusts capacity allocation jointly with request routing, and further reduces batch workloads in order to allocate more resources to interactive workloads. Optimal combines temperature aware cooling optimization with holistic workload management, and has the lowest cooling cost with the least batch workload. Though this increases the back-end utility loss, the overall effect is a net reduction of total cost, since interactive workloads enjoy lower latency, as will be observed shortly.

Utility loss reductions

The other component of datacenter cost is utility loss. From Figure 4.6a, the relative reduction follows the interactive workloads and also has a visible diurnal pattern. Optimal and Capacity optimized provide the most significant utility loss reductions, from 5% to 25%, while Cooling optimized provides a modest 5% reduction compared to Baseline.

To study the reasons for the varying degrees of reduction, Figures 4.6b and 4.6c show the respective utility losses of interactive and batch workloads. We observe that interactive workloads incur most of the utility loss, reflecting their importance relative to batch workloads. Baseline and Cooling optimized have much larger utility loss from interactive workloads, as shown in Figure 4.6b, because of their separate management of the two workloads. The average latency under these two strategies is also worse, as demonstrated in Figure 4.7.

On the other hand, Capacity optimized and Optimal outperform the other two by allocating more capacity to interactive workloads at cost-efficient locations while reducing batch workloads (recall Figure 4.5c). This is especially effective during peak hours, as shown in Figure 4.6b. Capacity optimized and Optimal do have larger utility loss from batch workloads, as seen in Figure 4.6c. However, since interactive workloads contribute


[Figure: mean latency (ms, 60 to 80) over 24 hours for the four strategies.]

Figure 4.7: Baseline and Cooling optimized induce larger average latency.

[Figure: overall cost saving (0 to 0.25) over 24 hours in January, May, and August.]

Figure 4.8: The overall cost saving is insensitive to seasonal changes of the climate.

to the majority of the provider's utility and revenue, the overall effect of joint workload management is positive.

Sensitivity to seasonal changes

A natural question is whether the benefits are sensitive to seasonal changes: since the results above are obtained in winter (January), would they be less significant in summer, when cooling is more expensive? We thus run Optimal against Baseline for each day of May, which represents typical spring/fall weather, and of August, which represents typical summer weather. Figure 4.8 shows the average overall cost savings achieved in the different seasons. We observe that the cost savings, ranging from 5% to 20%, are consistent and insensitive to seasonal changes. The reason is that our approach depends on 1) the geographical diversity of temperature and cooling efficiency, and 2) the mixed nature of datacenter workloads, both of which exist at all times of the year no matter which cooling method is used. Temperature aware workload management is thus expected to offer consistent cost benefits.

4.6.3 Algorithm Convergence

We evaluate the convergence of our distributed ADMM algorithm in this section. As a benchmark, a dual decomposition approach is used to tackle the original optimization


(3.1), with the standard Lagrangian

\begin{align*}
L(\alpha, \beta; \lambda) = {} & \sum_j E\Bigl(\sum_i \alpha_{ij}\Bigr) P_j + \sum_i U_i(\alpha_i)\\
& + \sum_j \bigl(E(\beta_j) P_j + V_j(\beta_j)\bigr) + \sum_j \lambda_j\Bigl(\sum_i \alpha_{ij} + \beta_j - C_j\Bigr).
\end{align*}

It can readily be seen that minimizing $L(\alpha, \beta; \lambda)$ can be done separately over $\alpha$ and $\beta$, since the energy cost function $E$ is linear. The $\alpha$- and $\beta$-minimization problems can also be decomposed into per-user and per-datacenter sub-problems. The dual variable $\lambda$ is updated with subgradient methods [12] as follows:

\[ \lambda_j^{k+1} = \max\Bigl\{0,\ \lambda_j^k + \varrho^k\Bigl(\sum_i \alpha_{ij} + \beta_j - C_j\Bigr)\Bigr\}, \]

where $\varrho^k$ is the step size. We set the step size to $\varrho^k = 10^{-6}/\sqrt{k}$ according to the diminishing step size rule [12]. For our distributed ADMM algorithm, the penalty parameter is $\rho = 3\times 10^{-7}$ and the step size is $\varrho = 10^{-6}$.
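The benchmark's dual update is a one-liner; a sketch (our own naming) with the diminishing step size used here:

    import numpy as np

    def dual_subgradient_step(lam, alpha, beta, C, k, rho0=1e-6):
        """Projected subgradient ascent on lambda, step size rho0/sqrt(k)."""
        violation = alpha.sum(axis=0) + beta - C  # capacity constraint violation
        return np.maximum(0.0, lam + rho0 / np.sqrt(k) * violation)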

The stopping rules of the algorithms are set as follows. ADMM algorithms are usually stopped when the primal and dual residuals are smaller than certain tolerance thresholds [13]. The calculation of the primal and dual residuals and of the tolerance thresholds is identical to that in [13], and we omit the details here. Dual decomposition with subgradient methods is terminated when $\|\alpha^{k+1} - \alpha^k\|_2^2 < 10^{-2}\|\alpha^k\|_2^2$, or when the number of iterations exceeds 200. The other parameters, including the scale of the problems, are the same as in the previous simulations.

Figure 4.9a plots the empirical CDF of the number of iterations to convergence for the 24 runs using our traces. We find that the dual decomposition approach with subgradient methods fails to converge within 200 iterations in all runs; we therefore only show the CDF for our algorithm. Observe that distributed ADMM converges within 73 iterations in all runs, and the fastest run uses only 53 iterations. The convergence of distributed ADMM


[Figure: (a) empirical CDF of the number of iterations to convergence for distributed ADMM (between 50 and 75); (b) primal residuals versus iteration (0 to 200) for distributed ADMM (scale $10^5$) and the subgradient method (scale $10^3$); (c) optimality gap (scale $10^{-3}$) over 24 hours with early braking after 50 and 40 iterations.]

Figure 4.9: Convergence results of our distributed ADMM algorithm compared against subgradient methods.


thus significantly outperforms the traditional subgradient method. Figure 4.9b depicts a sample path of the convergence of the primal residual for the two algorithms. We point out that the scales of the primal residuals of the two algorithms differ, since distributed ADMM solves the 4-block formulation (4.24) while the subgradient method solves the original formulation (3.1). We can see that the curve of distributed ADMM decreases smoothly and drops below the tolerance threshold after 61 iterations. The subgradient method suffers from oscillations after 70 iterations, and fails to decrease below the threshold.

Finally, Figure 4.9c shows the performance of early braking for our distributed ADMM algorithm. We plot the solutions of the algorithm after 50 and 40 iterations, respectively. Clearly, the optimality gap when stopping after 40 iterations is larger than when stopping after 50. A more interesting observation is that the optimality gap is strikingly small: only about $10^{-3}$ relative to the optimum. At times the optimality gap becomes negative; this is caused by (primal or dual) infeasible solutions produced by early braking during demand peaks. The feasibility gap is rather small, though, and can be readily fixed.

The results demonstrate that the distributed ADMM algorithm converges quickly, and is well suited to large-scale convex optimization problems. The early-braking mechanism can further improve convergence in practice with negligible performance loss.

4.7 Summary

We propose temperature aware workload management, which explores two key aspects of geo-distributed datacenters that have not been well understood in the past. First, as we show empirically, the energy efficiency of cooling systems, especially outside air cooling, varies widely with the outside temperature, which has significant geographical diversity. This


diversity is utilized to aid workload management decisions and reduce cooling energy consumption. Second, the elastic nature of batch workloads is further capitalized on by dynamically adjusting capacity allocation along with the widely studied request routing of interactive workloads. We formulate the joint optimization under a general framework with an empirical cooling efficiency model. To solve large-scale problems for production systems, we rely on ADMM, which has recently found practical use in large-scale distributed convex optimization. We provide a new convergence proof for a generalized m-block ADMM algorithm, and further develop a novel distributed ADMM algorithm for our problem. Extensive simulations highlight that temperature aware workload management saves 15%–20% of cooling energy and 5%–20% of the overall energy cost, and that the distributed ADMM algorithm is practical, solving large-scale workload management problems within only tens of iterations.


Chapter 5

Resource Management in Virtualized Datacenters

Having discussed workload management, this chapter concerns resource management in virtualized datacenters.

5.1 Motivation

Typically, cloud operators have a wide variety of distinct resource management objectives

to achieve, such as workload consolidation, cost minimization, and load balancing. For

example, a public operator such as Amazon may wish to use a workload consolidation

policy to minimize its operating costs, while a private enterprise cloud may wish to adopt

a load balancing policy to provide the best quality of service. Further, VMs in a cloud

also impose diverse resource requirements that need to be accommodated, as they run

completely different applications owned by individual clients.

On the other hand, the infrastructure is usually managed as a whole by the operator,

who relies on a single resource management substrate. Therefore, the resource manage-

ment substrate must be general and expressive to accommodate a wide range of possible

policies for different use cases, and be easily customized and extended. It also needs


to be fair in orchestrating the needs and interests of both operators and clients. This is especially important for private and federated clouds [17], where the use of money may not be appropriate to share resources fairly. Last but not least, the resource management algorithm needs to be efficient so that large-scale problems can be handled.

Existing solutions fall short of the requirements we outlined. First, they tightly couple

policies with mechanisms. Resource management tools developed by the industry such as

VMware vSphere [107] and Eucalyptus [34], and by the open source community such as

Nimbus [79] and CloudStack [24], do not provide support for configurable policies for VM

placement. Existing papers on cloud resource management develop solutions for specific

scenarios and purposes, such as consolidation based on CPU usage [21, 70, 102], energy

consumption [19, 77, 106], bandwidth multiplexing [16, 56, 69], and storage dependence

[60]. Moreover, these solutions are developed for the operator without considering the

interest of clients.

We design Anchor, a new architecture that decouples policies from mechanisms for

cloud resource management. This is analogous to the design of BGP [92], where ISPs are

given the freedom to express their policies, and the routing mechanism is able to efficiently

accommodate them. Anchor consists of three components: a resource monitor, a policy

manager, and a matching engine, as shown in Fig. 5.1. Both the operator and its clients

are able to configure their resource management policies, based on performance, cost,

etc., as they deem fit via the policy manager. When VM placement requests arrive, the

policy manager polls information from the resource monitor, and feeds it, together with the policies, to the matching engine. The matching mechanism resolves conflicts of interest among

stakeholders, and outputs a matching between VMs and servers.

The challenge of Anchor is then to design an expressive, fair, and efficient match-

ing mechanism as we discussed. Our second major contribution is a novel matching

mechanism based on the stable matching framework [94] from economics, which elegantly

achieves all the design objectives. Specifically, the concept of preferences is used to enable stakeholders to express various policies with simple rank-ordered lists, fulfilling the requirement of generality and expressiveness. Rather than optimality, stability is used as the central solution concept to address the conflicts of interest among stakeholders, fulfilling the fairness requirement. Finally, its algorithmic implementations based on the classical deferred acceptance algorithm have been demonstrated to be practical in many real-world applications [94], fulfilling the efficiency requirement.

Figure 5.1: The Anchor architecture. [The Anchor control plane, comprising the resource monitor, policy manager, and matching engine, connects through a management API to servers Server1, ..., Serverm, whose hypervisors (e.g., VirtualBox, Xen) host VMs such as VM1 (Apache), VM2 (Hadoop), and VMn (DB).]

It may be tempting to formulate the matching problem as an optimization over certain

utility functions, each reflecting a policy goal. However, optimization suffers from two important deficiencies in this case. First, as system-wide objectives are optimized, the solutions may not be appealing to clients, whose interests do not necessarily align well with

the operator’s. In this regard, a cloud resembles a resource market in which clients and

the operator are autonomous selfish agents. Individual rationality needs to be respected

for the matching to be acceptable to all participants. Second, optimization solvers are

computationally expensive due to their combinatorial nature, and do not scale well.

The novelty of our stable matching mechanism lies in a rigorous treatment of size

heterogeneity in Sec. 5.4. Specifically, classical stable matching theory cannot be directly

applied here. Each VM has a different “size,” corresponding to its demand for CPU,

memory, and storage resources. Yet the economics literature assumes that each agent is


uniform in size. Size heterogeneity makes the problem much more difficult, because even

the very definition of stability becomes unclear in this case. We formulate a general job-

machine stable matching problem with size heterogeneous jobs. We clarify the ambiguity

of the conventional stability definition in our model, propose a new stability concept,

develop algorithms to efficiently find stable matchings with respect to the new definition,

and prove convergence and optimality results.

The performance of Anchor is evaluated realistically. We design a simple policy inter-

face, and showcase several common policy examples in Sec. 5.5. We present a prototype

implementation of Anchor on a 20-node server cluster, and conduct detailed performance

evaluation using both experiments and large-scale simulations based on real-world work-

load traces in Sec. 5.6.

5.2 Background and Model

5.2.1 A Primer on Stable Matching

We start by introducing the classical theory of stable matching in the basic one-to-one marriage model [39]. There are two disjoint sets of agents, M = {m1, m2, . . . , mn} and W = {w1, w2, . . . , wp}, men and women. Each agent has a transitive preference over individuals on the other side, and the possibility of being unmatched [94]. Preferences can be represented as rank-ordered lists of the form p(m1) = w4, w2, . . . , wi, meaning that man m1's first choice of partner is w4, his second choice is w2, and so on, until at some point he prefers to be unmatched (i.e., matched to the empty set). We use ≻_i to denote the ordering relationship of agent i (on either side of the market). If i prefers to remain unmatched instead of being matched to agent j, i.e., ∅ ≻_i j, then j is said to be unacceptable to i, and preferences can be represented just by the list of acceptable partners.

Definition 5.1. An outcome is a matching µ : M × W × ∅ → M × W × ∅ such that w = µ(m) if and only if µ(w) = m, and µ(m) ∈ W ∪ ∅, µ(w) ∈ M ∪ ∅, ∀m, w.


It is clear that we need further criteria to distill a “good” set of matchings from all

the possible outcomes. The first obvious criterion is individual rationality.

Definition 5.2. A matching is individually rational to all agents if and only if there does not exist an agent i who prefers being unmatched to being matched with µ(i), i.e., ∅ ≻_i µ(i).

This implies that for a matched agent, its assigned partner should rank higher than the

empty set in its preference; equivalently, two matched agents must be acceptable to each other.

The second natural criterion is that a blocking pair should not occur in a good matching:

Definition 5.3. A matching µ is blocked by a pair of agents (m, w) if they each prefer each other to the partner they receive at µ; that is, w ≻_m µ(m) and m ≻_w µ(w). Such a pair is called a blocking pair in general.

When a blocking pair exists, the agents involved have a natural incentive to break up

and form a new marriage. Therefore such an “unstable” matching is undesirable.

Definition 5.4. A matching µ is stable if and only if it is individually rational, and not

blocked by any pair of agents.

Theorem 5.1. A stable matching exists for every marriage market.

This can be readily proved by the classic deferred acceptance algorithm (DA), or the

Gale-Shapley algorithm [39]. It works by having agents on one side of the market, say

men, propose to the other side, in order of their preferences. As long as there exists a

man who is free and has not yet proposed to every woman in his preference, he proposes

to the most preferred woman who has not yet rejected him. The woman, if free, “holds”

the proposal instead of directly accepting it. In case she already has a proposal at hand,

she rejects the less preferred. This continues until no proposal can be made, at which

point the algorithm stops and matches each woman to the man (if any) whose proposal


she is holding. The woman-proposing version works in the same way by swapping the

roles of man and woman. It can be readily seen that the order in which men propose is

immaterial to the outcome.
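To make the procedure concrete, below is a minimal Python sketch of the man-proposing deferred acceptance algorithm (we use Python throughout, matching the prototype of Sec. 5.6); the preference inputs and helper names are our own illustration, not code from the thesis.

```python
def deferred_acceptance(men_pref, women_pref):
    # rank[w][m]: w's rank of m; lower is better. Absence means unacceptable.
    rank = {w: {m: r for r, m in enumerate(p)} for w, p in women_pref.items()}
    next_choice = {m: 0 for m in men_pref}   # index of the next woman to propose to
    engaged = {}                             # woman -> man whose proposal she holds
    free = list(men_pref)
    while free:
        m = free.pop()
        if next_choice[m] >= len(men_pref[m]):
            continue                         # m exhausted his list; stays unmatched
        w = men_pref[m][next_choice[m]]
        next_choice[m] += 1
        if m not in rank[w]:
            free.append(m)                   # m is unacceptable to w; try the next one
        elif w not in engaged:
            engaged[w] = m                   # w is free and holds the proposal
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])          # w trades up and rejects her current man
            engaged[w] = m
        else:
            free.append(m)                   # w rejects the less preferred proposal
    return engaged

# Example: any proposing order yields {'w1': 'm2', 'w2': 'm1'}.
men = {'m1': ['w1', 'w2'], 'm2': ['w1']}
women = {'w1': ['m2', 'm1'], 'w2': ['m1']}
print(deferred_acceptance(men, women))
```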

5.2.2 Models and Assumptions

In a cloud, each VM is allocated a slice of resources from its hosting server. In this work,

we assume that the size of a slice is a multiple of an atomic VM. For instance, if the

atomic VM has one CPU core equivalent to a 2007 Intel Xeon 1 GHz core, one memory

unit equivalent to 512 MB PC-10600 DDR3 memory, and one storage unit equivalent to

10 GB 5400 RPM HDD, a VM of size 2 means it e↵ectively has a 2 GHz 2007 Xeon CPU

core, 1 GB PC-10600 DDR3 memory and 20 GB 5400 RPM hard disk. Note that the

actual amount of resources is relative to the heterogeneous server hardware. Two VMs

have the same size as long as performance is equivalent for all resources.
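As a toy illustration of atomic sizing, using the example numbers above (the constants follow the text; the helper is our own):

```python
# One atomic VM, per the example: 1 GHz 2007 Xeon core, 512 MB memory, 10 GB disk.
ATOM = {'cpu_ghz': 1.0, 'mem_mb': 512, 'disk_gb': 10}

def vm_resources(size):
    """Resources of a VM that spans `size` atomic units."""
    return {r: size * v for r, v in ATOM.items()}

print(vm_resources(2))   # {'cpu_ghz': 2.0, 'mem_mb': 1024, 'disk_gb': 20}
```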

This may seem an oversimplification and raise concerns about its validity in reality.

We comment that, in practice, such atomic sizing is common among large-scale public

clouds to reduce the overhead of managing hundreds of thousands of VMs. It is also

valid in production computer clusters [100], and widely adopted in related work [60,111]

to reduce the dimensionality of the problem.

Readers may notice that Amazon also has some VM configurations that do not fol-

low the atomic sizing assumption. We comment that those are mainly high-memory

and high-CPU instances specialized for certain applications, and it is highly likely that

they are managed separately with a different infrastructure. Measurement results [108] corroborate our arguments, while public information is yet to be made available on the specifics of the Amazon infrastructure. We thus take the liberty to adopt this assumption, and believe that it closely resembles reality while keeping the problem analytically tractable. (Without the atomic sizing or any other assumption to reduce the dimensionality of resources, the problem is essentially a multiple knapsack problem, which is known to be NP-hard [60]. Multi-resource allocation is an open problem for which some recent progress has been made in understanding the problem [57].)


We design Anchor for a setting where the workloads and resource demands of VMs

are relatively stable. Resource management in the cloud can be naturally cast as a

stable matching problem, where the overall pattern of common and conflicting interests

between stakeholders can be resolved by confining our attention to outcomes that are

stable. Broadly, it can be modelled as a many-to-one problem [39] where one server can

enroll multiple VMs but one VM can only be assigned to one server. Preferences are

used as an abstraction of policies no matter how they are defined.

In traditional many-to-one problems such as college admissions [39], each college has

a quota of the number of students it can take. This cannot be directly applied to our

scenario, as each VM has a di↵erent “size” corresponding to its demand for resources.

We cannot simply define the quota of a server as the number of VMs it can take.

We formulate VM placement as a job-machine stable matching problem with size

heterogeneous jobs. Each job has a size, and each machine has a capacity. A machine

can host multiple jobs as long as the total job size does not exceed its capacity. Each

job has a preference over all the acceptable machines that have sufficient capacities.

Similarly, each machine has a preference over all the acceptable jobs whose size is smaller

than its capacity. This is a more general many-to-one matching model in that the college

admissions problem is a special case with uni-size jobs (students).

5.3 Theoretical Challenges of Job-Machine Stable Matching

We present theoretical challenges introduced by size heterogeneous jobs in this section.

Following convention, we can naturally define a blocking pair in job-machine stable

matching based on the following intuition. In a matching µ, whenever a job j prefers a



machine m to its assigned machine µ(j) (which can be ∅, meaning it is unassigned), and m has

vacant capacity to admit j, or when m does not have enough capacity, but by rejecting

some or all of the accepted jobs that rank lower than j it is able to admit j, then j and

m have a strong incentive to deviate from µ and form a new matching. Therefore,

Definition 5.5. A job-machine pair (j, m) is a blocking pair if either of the two conditions holds:

(a): c(m) ≥ s(j), j ≻_m ∅, and m ≻_j µ(j),   (5.1)

(b): c(m) < s(j), c(m) + Σ_{j′} s(j′) ≥ s(j), where j′ ≺_m j, j′ ∈ µ(m), and m ≻_j µ(j).   (5.2)

Here c(m) denotes the capacity of machine m, and s(j) denotes the size of job j.

Figure 5.2: A simple example where there is no strongly stable matching. Recall that p(·) denotes the preference of an agent. [Jobs: a with s(a) = 2 and p(a) = A; b with s(b) = 1 and p(b) = A, B; c with s(c) = 1 and p(c) = B, A. Machines: A with c(A) = 2 and p(A) = c, a, b; B with c(B) = 1 and p(B) = b, c.]

Depending on whether a blocking pair satisfies condition (5.1) or (5.2), we say it

is a type-1 or type-2 blocking pair. For example, in a setting shown in Fig. 5.2, the

matching A−(a), B−∅ contains two type-1 blocking pairs (b, B) and (c, B), and one

type-2 blocking pair (c, A).

Definition 5.6. A job-machine matching is strongly stable if it does not contain any

blocking pair.
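To make the two conditions concrete, the following Python sketch classifies a candidate pair under Definition 5.5; the data structures, helper names, and the small test (which encodes Fig. 5.2 as reconstructed above) are our own illustration.

```python
def blocking_type(j, m, mu, cap, size, jrank, mrank):
    """Return 1, 2, or None for the blocking type of the pair (j, m).

    mu maps each job to its machine (or None); cap[m] is m's *remaining*
    capacity; size[j] is j's size; jrank[j][m] and mrank[m][j] are preference
    ranks (lower is better), defined only for acceptable partners.
    """
    if m not in jrank[j] or j not in mrank[m]:
        return None                                  # mutually unacceptable pair
    cur = mu.get(j)
    if cur is not None and jrank[j][cur] <= jrank[j][m]:
        return None                                  # j does not prefer m to mu(j)
    if cap[m] >= size[j]:
        return 1                                     # type-1: vacant capacity admits j
    worse = [x for x, mx in mu.items() if mx == m and mrank[m][x] > mrank[m][j]]
    if cap[m] + sum(size[x] for x in worse) >= size[j]:
        return 2                                     # type-2: rejecting worse jobs admits j
    return None

# The matching A-(a) of Fig. 5.2, with b and c unmatched:
size = {'a': 2, 'b': 1, 'c': 1}
jrank = {'a': {'A': 0}, 'b': {'A': 0, 'B': 1}, 'c': {'B': 0, 'A': 1}}
mrank = {'A': {'c': 0, 'a': 1, 'b': 2}, 'B': {'b': 0, 'c': 1}}
mu, cap = {'a': 'A', 'b': None, 'c': None}, {'A': 0, 'B': 1}
print(blocking_type('b', 'B', mu, cap, size, jrank, mrank),   # 1
      blocking_type('c', 'A', mu, cap, size, jrank, mrank))   # 2
```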


5.3.1 Non-existence of Strongly Stable Matchings

It is clear that both types of blocking pairs are undesirable, and we ought to find a strongly

stable matching. However, such a matching may not exist in some cases. Fig. 5.2 shows

one such example with three jobs and two machines. It can be verified that every possible

matching contains either type-1 or type-2 blocking pairs.

Proposition 5.1. A strongly stable matching does not always exist.

Note that the definitions of type-1 and type-2 blocking pair coincide in classical prob-

lems with uni-size jobs. The reason why they do not remain so in our model is the

size complementarity among jobs. In our problem, the concept of capacity denotes the amount of resources a machine can provide, which may be used to support different numbers of jobs. A machine's more preferred job, which is more likely to be admitted in order to avoid type-2 blocking pairs, may consume fewer resources, which creates a higher likelihood for type-1 blocking pairs to occur on the same machine.

The non-existence result demonstrates the theoretical difficulty of the problem. We find that it is hard to even determine necessary or sufficient conditions for the existence of strongly stable matchings in a given problem instance, although the definition seems natural. Therefore, for mathematical tractability, in the subsequent parts of the chapter,

we work with the following relaxed definition:

Definition 5.7. A matching is weakly stable if it does not contain any type-2 blocking

pair.

For example, in Fig. 5.2, A−(c), B−(b) is a weakly but not strongly stable matching, because it has a type-1 blocking pair (b, A). Thus, the weakly stable matchings form a superset of the strongly stable matchings. A matching is called unstable if it is not weakly stable.


5.3.2 Failure of the DA Algorithm

With the new stability concept, the first theoretical challenge is how to find a weakly stable matching, and whether one always exists. If we can devise an algorithm that produces

a weakly stable solution for any given instance, then its existence is clear. One may

think that the deferred acceptance (DA) algorithm can be applied for this purpose. Jobs

propose to machines following the order in their preferences. We randomly pick any free

job that has not proposed to every machine on its preference to propose to its favorite

machine that has not yet rejected it. That machine accepts the most favorable offers made so far up to its capacity, and rejects the rest. Unfortunately, we show that this may fail to be effective.

Fig. 5.3 shows an example similar to Fig. 5.2. Say we first let jobs a, b, c propose until

they cannot. The result is A−(c), B−(b), and a is rejected by A. At this point, only d can propose to A, and A holds the offer. The final result is A−(c, d), B−(b). This is clearly type-2 blocked by (b, A), as A prefers b to d and b prefers A to B.

Figure 5.3: An example where a possible execution of the DA algorithm produces a type-2 blocking pair (b, A). [Jobs: a with s(a) = 2 and p(a) = A; b with s(b) = 1 and p(b) = A, B; c with s(c) = 1 and p(c) = B, A; d with s(d) = 1 and p(d) = A. Machines: A with c(A) = 2 and p(A) = c, a, b, d; B with c(B) = 1 and p(B) = b, c.]

On the other hand, if we let d propose to A first, before a and b, and keep the rest of the execution sequence unchanged, the result becomes A−(c), B−(b), which is

weakly stable. This example demonstrates two problems when applying the classical DA


algorithm here. First, the execution sequence is no longer immaterial to the outcome.

Second, it may yield an unstable matching. This creates considerable difficulties since

we cannot determine which proposing sequence yields a weakly stable matching for an

arbitrary problem.

Examined more closely, the DA algorithm fails precisely due to the size heterogeneity

of jobs. Recall that a machine will reject offers only when its capacity is used up. In the traditional setting with jobs of the same size, this ensures that whenever an offer is rejected, the machine's capacity must be used up, and thus any offer made by a less preferred job will never be accepted, i.e., the outcome is stable.

However, rejection due to capacity is problematic in our case, since a machine's remaining capacity may later increase, and a previously rejected job may become favorable again.

5.3.3 Optimal Weakly Stable Matching

There may be many weakly stable matchings for a problem instance. The next natural

question to ask is then, which one should we choose to operate the system with? Based

on the philosophy that a cloud exists for companies to ease the pain of IT investment

and management, rather than the other way around, it is desirable to find a job-optimal weakly stable matching, in the sense that every job is assigned the best machine it can possibly obtain in any weakly stable matching.

The original DA algorithm is again not applicable in this regard, because it may

produce type-1 blocking pairs even when the problem admits strongly stable matchings.

Thus, our second challenge is to devise an algorithm that yields the job-optimal weakly

stable matching. This quest is also theoretically important in its own right.

However, as we will show in Sec. 5.4.2, the complexity of solving this challenge is

high, which may prevent its use in large-scale problems. Thus in many cases, a weakly

stable matching is suitable for practical purposes.


5.4 A New Theory of Job-Machine Stable Matching

In this section we present our new theory of job-machine stable matching that addresses

the above challenges.

5.4.1 A Revised DA Algorithm

We first propose a revised DA algorithm, shown in Table 5.1, that is guaranteed to find

a weakly stable matching for a given problem. The key idea is to ensure that, whenever

a job is rejected, any less preferable jobs will not be accepted by a machine, even if it

has enough capacity to do so.

Table 5.1: Revised DA
 1: Input: c(m), p(m), ∀m ∈ M; s(j), p(j), ∀j ∈ J
 2: Initialize all j ∈ J and m ∈ M to free
 3: while ∃ a free j with p(j) ≠ ∅ do
 4:   m = j's highest ranked machine in p(j)
 5:   if c(m) ≥ s(j) then
 6:     j and m become matched, c(m) = c(m) − s(j)
 7:   else
 8:     Find all j′ matched to m so far such that j ≻_m j′
 9:     repeat
10:       m sequentially rejects each such j′ by setting it to free, in the order of p(m)
11:       c(m) = c(m) + s(j′), best_rejected = j′
12:     until c(m) ≥ s(j) or all j′ are rejected
13:     if c(m) ≥ s(j) then
14:       j and m become matched, c(m) = c(m) − s(j)
15:     else
16:       j becomes free, best_rejected = j
17:     end if
18:     for all j″ ∈ p(m) with best_rejected ⪰_m j″ do
19:       Remove m from p(j″), and j″ from p(m)
20:     end for
21:   end if
22: end while
23: Return: the final matching, and the remaining capacities c(m), ∀m ∈ M

The algorithm starts with a set of jobs J and a set of machines M. Each job and

machine is initialized to be free. Then the algorithm enters a propose-reject procedure.

Whenever there are free jobs that have machines to propose to, we randomly pick one, say


j, to propose to its current favorite machine m in p(j), which contains all the machines that have not yet rejected it. If m has sufficient capacity, it holds the offer. Otherwise, it sequentially rejects offers from less preferable jobs j′, in the order of its preference, until it can take the offer. If it still cannot do so even after rejecting all such j′, then j is rejected. Whenever a machine rejects a job, it updates the best_rejected variable, and at the end all jobs ranked no higher than best_rejected are removed from its preference. The machine is also removed from the preferences of all these jobs, as it will never accept their offers.

A pseudo-code implementation is shown in Table 5.1. We can see that the order in

which jobs propose is immaterial, similar to the original DA algorithm. Moreover, we

can prove that the algorithm guarantees that type-2 blocking pairs do not exist in the

result.
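For concreteness, here is a minimal Python sketch of Revised DA. It follows Table 5.1 as reconstructed above, assumes mutually consistent preference lists (a job lists a machine only if that machine lists the job), and reads the pruning in steps 18–20 as inclusive of best_rejected; the data structures are our own.

```python
def revised_da(cap, mpref, size, jpref):
    """Sketch of Revised DA (Table 5.1). cap: machine -> remaining capacity;
    mpref/jpref: rank-ordered preference lists; size: job -> size."""
    rank = {m: {j: r for r, j in enumerate(p)} for m, p in mpref.items()}
    jpref = {j: list(p) for j, p in jpref.items()}     # mutable working copies
    matched = {m: [] for m in cap}                     # machine -> held jobs
    free = list(jpref)
    while free:
        j = free.pop()
        if not jpref[j]:
            continue                                   # rejected everywhere; unmatched
        m = jpref[j][0]                                # j's favorite remaining machine
        best_rejected = None
        if cap[m] < size[j]:
            # Reject held jobs less preferred than j, worst first, until j fits.
            for jr in sorted((x for x in matched[m] if rank[m][x] > rank[m][j]),
                             key=lambda x: rank[m][x], reverse=True):
                if cap[m] >= size[j]:
                    break
                matched[m].remove(jr)
                cap[m] += size[jr]
                free.append(jr)
                best_rejected = jr
        if cap[m] >= size[j]:
            matched[m].append(j)
            cap[m] -= size[j]
        else:
            free.append(j)
            best_rejected = j
        if best_rejected is not None:
            cut = rank[m][best_rejected]               # steps 18-20: prune preferences
            for jj in mpref[m]:
                if rank[m][jj] >= cut and m in jpref[jj]:
                    jpref[jj].remove(m)                # jj can never propose to m again
    return matched, cap
```

In our trace on the instance of Fig. 5.3, this sketch returns A−(c), B−(b) with a and d unmatched under any proposing order; the type-1 pair (b, A) remains, which is exactly what weak stability permits.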

Theorem 5.2. The order in which jobs propose is of no consequence to the outcome in

Revised DA.

Theorem 5.3. Revised DA, in any execution order, produces a unique weakly stable

matching.

Proof. The proof of uniqueness is essentially the same as that for the classical DA algo-

rithm in the seminal paper [39]. We prove the weak stability of the outcome by contra-

diction. Suppose that Revised DA produces a matching µ with a type-2 blocking pair

(j,m), i.e. there is at least one job j0 worse than j to m in µ(m). Since m �j

µ(j), j must

have proposed to m and been rejected. When j was rejected, j0 was either rejected before

j, or was made unable to propose to m because m is removed from the preferences of all

the jobs ranked lower than j. Thus j0 = ;, which contradicts with the assumption.

Theorem 5.3 also proves the existence of weakly stable matchings, as Revised DA terminates within O(|J|²) iterations in the worst case.

Theorem 5.4. A weakly stable matching exists for every job-machine matching problem.


The significance of Revised DA is manifold. It solves our first technical challenge

in Sec. 5.3.2, and is appealing for practical use. The complexity is low compared to

optimization algorithms. Further, it serves as a basic building block, upon which we

develop an iterative algorithm to find the job-optimal weakly stable matching as we shall

demonstrate soon. Lastly, it bears the desirable property of being insensitive to the order

of proposing, which largely reduces the complexity of algorithm design.

Revised DA may still produce type-1 blocking pairs, and the result may not be job-

optimal as defined in Sec. 5.3.3. In order to find the job-optimal matching, an intuitive

idea is to run Revised DA multiple times, each time with type-1 blocking jobs proposing

to machines that form blocking pairs with them. The intuition is that, type-1 blocking

jobs can possibly be improved at no expense to others. However, simply doing so may make the matching unstable: when a machine has type-1 blocking jobs both leaving from and proposing to it, the departures may free enough capacity for it to admit jobs better than those it accepted based on its capacity before the jobs left.

Figure 5.4: An example where simply running the revised DA algorithm multiple times will produce a new type-2 blocking pair (c, C). [Jobs a, b, c, d, e have sizes 1, 2, 3, 4, 6, respectively; machines A, B, C have capacities 5, 6, 3, with machine preferences p(A) = d, c, b, a; p(B) = e, d, c, b, a; p(C) = a, c, b.]

To give an example, let us take a look at the problem in Fig. 5.4. We now run Revised

DA over this example. The result will then be A−(d), B−(e), C−(a). Clearly there are


two type-1 blocking pairs, (a,A) and (b, C). Say we fix this by letting a propose to A

and b propose to C. Then we have a new type-2 blocking pair (c, C) due to the removal

of job a from C, where c prefers C to being unassigned, and C prefers c to b and by

rejecting b it has enough capacity to admit c. This is a result of C wrongly accepting b

when it actually has more capacity after a leaves.

5.4.2 A Multi-stage DA Algorithm

We now design a multi-stage DA algorithm to iteratively find a better weakly stable

matching with respect to jobs. The algorithm proceeds in stages. Whenever there is a

type-1 blocking pair (j, m) in the result of the previous stage µ_{t−1}, the algorithm enters the next stage, where the blocking machine m will accept new offers. The blocking job j is removed from its previous machine µ_{t−1}(j), so that it can make new offers to machines that have rejected it before. µ_{t−1}(j)'s capacity is also updated accordingly. Moreover, to account for the effect of the job removal, all jobs that can potentially form type-1 blocking pairs with µ_{t−1}(j) once j leaves (there may also be other machines that j forms type-1 blocking pairs with) are removed from their machines and allowed to propose in the next stage (corresponding to the while loop in step 8). This ensures that the algorithm does not produce new type-2 blocking pairs during the course, as we shall prove soon. At each stage, we run Revised DA with the selected set of proposing jobs J′, and the entire set of machines with updated capacities c_t^pre(m). The entire procedure is shown in Table 5.2.

We now prove important properties of Multi-stage DA, namely its correctness, con-

vergence, and job-optimality.

Correctness

First we establish the correctness of Multi-stage DA.

Theorem 5.5. There is no type-2 blocking pair in the matchings produced at any stage

in Multi-stage DA.


Table 5.2: Multi-stage DA
 1: Input: c(m), p(m), ∀m ∈ M; s(j), p(j), ∀j ∈ J
 2: µ_0 = ∅, t = 0, stop = false
 3: while stop == false do
 4:   t = t + 1, µ′ = µ_{t−1}, J′ = ∅
 5:   for m ∈ M do
 6:     c_t^pre(m) = c_{t−1}(m)
 7:   end for
 8:   while Ω ≠ ∅, where Ω is the set of jobs not yet in J′ that form type-1 blocking pairs with µ′ under c_t^pre(m) do
 9:     for j ∈ Ω do
10:       Add j to J′
11:       if µ′(j) ≠ ∅ then
12:         c_t^pre(µ′(j)) = c_t^pre(µ′(j)) + s(j)
13:         j becomes free and is removed from the matching µ′
14:       end if
15:     end for
16:   end while
17:   if J′ == ∅ then
18:     break
19:   end if
20:   (µ_t, c_t(m)) = Revised DA(c_t^pre(m), p(m), s(j), p(j), µ′, J′)
21:   if µ_t == µ_{t−1} then
22:     stop = true
23:   end if
24: end while
25: Return µ_t

Proof. This can be proved by induction. As the base case, we already proved that there

is no type-2 blocking pair after the first stage in Theorem 5.3.

Given that there is no type-2 blocking pair after stage t, we need to show that after stage t+1 there is still no type-2 blocking pair. Suppose after t+1 there is a type-2 blocking pair (j, m), i.e., c_{t+1}(m) < s(j) and c_{t+1}(m) + Σ_{j′} s(j′) ≥ s(j), where j′ ≺_m j, j′ ∈ µ_{t+1}(m), and m ≻_j µ_{t+1}(j). If c^pre_{t+1}(m) ≥ s(j), then j must have proposed to m and been rejected according to the algorithm. Thus it is impossible for m to accept any job j′ less preferable than j in t+1.

If c^pre_{t+1}(m) < s(j), then j did not propose to m in t+1. Since there is no type-2 blocking pair after t, j′ must have been accepted in t+1. Now since c^pre_{t+1}(m) < s(j), the sum of the remaining capacity and the total size of the newly accepted jobs after t+1 is at most c^pre_{t+1}(m), i.e., c_{t+1}(m) + Σ_{j″} s(j″) ≤ c^pre_{t+1}(m) < s(j), where j″ denotes the newly accepted jobs in t+1. This contradicts the assumption that c_{t+1}(m) + Σ_{j′} s(j′) ≥ s(j), since {j′} ⊆ {j″}.

If c^pre_{t+1}(m) = 0, then m only has jobs leaving from it. Since there is no type-2 blocking pair after t, clearly there cannot be any type-2 blocking pair in t+1.

Therefore, type-2 blocking pairs do not exist at any stage of the algorithm. The

uniqueness of the matching result at each stage is readily implied by Theorem 5.3.

Convergence

Next we prove the convergence of Multi-stage DA. The key observation is that it pro-

duces a weakly stable matching at least as good as that in the previous stage from the

job’s perspective.

Lemma 5.1. At any consecutive stages t and t+1 of Multi-stage DA, µ_{t+1}(j) ⪰_j µ_t(j), ∀j ∈ J.

Proof. This is a direct result of the algorithm design, since in stage t+1 every proposing job proposes to machines that have previously rejected it. If any of these machines accepts it, µ_{t+1}(j) ≻_j µ_t(j). If none of these machines accepts it, it will for sure be able to propose to its previous machine µ_t(j), since c^pre_{t+1}(µ_t(j)) must be no smaller than s(j). µ_t(j) will for sure accept j at t+1, because it will only receive offers from jobs that it previously rejected, and possibly from other jobs that it previously accepted if they propose to other machines and are rejected in t+1. j remains favorable to µ_t(j), even when all of µ_t(j)'s accepted jobs in t propose to it again in t+1.

Therefore, the algorithm always tries to improve the weakly stable matching it found

in the previous stage, whenever there is such a possibility suggested by the existence of

type-1 blocking pairs. However, Lemma 5.1 also implies that a job’s machine at t + 1


may remain the same as in the previous stage. In fact, it is possible that the entire

matching is the same as the one in the previous stage, i.e., µ_{t+1} = µ_t. This can be easily

verified using the example of Fig. 5.2. After the first stage, the weakly stable matching is

A−(c), B−(b). First b wishes to propose to A in the second stage. Then we assign b to ∅ and B has capacity 1 again. c then wishes to propose to B too. After we remove c

from A and update A’s capacity, a now wishes to propose to A. Thus at the next stage,

the same set of jobs a, b, c will propose to the same set of machines with the same capacities,

and the result will be the same matching as in the first stage. In this case, Multi-stage

DA will terminate with the final matching that it cannot improve upon as its output (step

17-18 of Table 5.2). We thus have:

Theorem 5.6. Multi-stage DA terminates in finite time.

Note that in each stage, Multi-stage DA may result in new type-1 blocking pairs,

and the number of type-1 blocking pairs is not monotonically decreasing. Thus its worst-case run-time complexity is difficult to derive analytically.

Job-Optimality

We now prove the most important result regarding Multi-stage DA:

Theorem 5.7. Multi-stage DA always produces the job-optimal weakly stable matching

when it terminates, in the sense that every job is at least as good in the weakly stable

matching produced by the algorithm as it would be in any other weakly stable matching.

Proof. We provide a proof sketch here. A detailed proof can be found in Sec. 7.7. The

algorithm terminates at stage t when either there is no type-1 blocking pair, or there are type-1 blocking pairs but µ_t = µ_{t−1}. For the former case, we show that our algorithm only permanently rejects a job from a machine when that machine cannot possibly accept it in any weakly stable matching, which is the only way a job stops participating. The outcome is


therefore optimal. For the latter case, we can also show that it is impossible for jobs that

participated in t to obtain a better machine.

Finally, we present another fact regarding the outcome of our algorithm.

Theorem 5.8. Multi-stage DA produces a unique job-optimal strongly stable matching

when it terminates with no job proposing.

The proof can be found in Sec. 7.8.

5.4.3 An Online Algorithm

We have thus far assumed a static setting with a fixed set of jobs and machines. In

practice, requests for job (VM) placement arrive dynamically, and we need to make

decisions on the fly. It may not be feasible to re-run the matching algorithm from scratch

every time a new job arrives. We further develop an online algorithm based on Revised DA that handles the dynamic case efficiently. (Presumably, the set of machines is fixed for a long period of time.)

We observe that our Revised DA algorithm can be readily used in the online setting.

The high-level idea is to fix the previous matching and improve upon it by applying

Revised DA only on the set of new jobs. The reason to fix the previous matching is to

avoid the potential overhead and downtime of VM migration that may occur if

we allow the previous jobs to participate with the new jobs. Since most resource manage-

ment policies for clients, as we show in Sec. 5.5, depend on system state variables that

change during the placement process, the preferences of new VMs need to be configured

according to the most up-to-date state variables.

Let us now present the details of the online version of Revised DA called Online

DA as shown in Table 5.3. First of all, the execution never stops. After a matching is

found, it waits for new requests. When a new set of jobs J′ arrives, their preferences are

configured according to the current system state variables. The server preferences are



configured with regard to the new jobs only. Then the Revised DA algorithm runs with the new jobs J′ and all the servers M. Note that in step 7, c(m) is the current server capacity, since it is always updated during the running of Revised DA.

Table 5.3: Online DA
1: Input: c(m), ∀m ∈ M
2: Initialize all m ∈ M to free
3: while true do
4:   Wait until new jobs J′ arrive
5:   Configure the preferences p(j) of the new jobs j ∈ J′ according to the current system state
6:   Configure the preferences p(m) of all servers with regard to the new jobs J′
7:   Run Revised DA with c(m), p(m), ∀m ∈ M, and s(j), p(j), ∀j ∈ J′
8:   Output the current matching
9: end while

Readers may wonder if our Multi-stage DA algorithm can also be revised into an online algorithm. We choose not to use Multi-stage DA because it is iterative and takes much longer to converge when the problem scale is large, which may not be acceptable in practice. This is experimentally demonstrated in Sec. 6.3. Thus Multi-stage DA is only to be used as an offline algorithm for small to medium scale static problems, while the simple Revised DA and Online DA can be used in offline and online fashion, even for large-scale problems. We thus consider Revised DA and Online DA better suited for the practical use of large-scale dynamic cloud systems, while Multi-stage DA is more of theoretical value in exploring the intricacy of the job-machine matching problem with size heterogeneity.

5.5 Showcases of Resource Management Policies

We have presented the underlying mechanism of Anchor that produces a weakly sta-

ble matching between VMs of various sizes, as jobs, and physical servers, as machines.

We now introduce Anchor’s policy engine which constructs preference lists according to

various resource management policies. The cloud operator and clients interact with the

policy engine through an API as shown in Table 5.4.


Table 5.4: Anchor's policy interface.

Functionality                  Anchor API call
create a policy group          g = create()
add/delete server              add/delete(g_o, s)
add/delete VMs                 add/delete(g_c, v)
set ranking factors            conf(g, factor1, ...)
set placement constraints      limit(g_c, servers)
colocation/anti-colocation     colocate(tenants, i, g_c)

In order to reduce management overhead, we use policy groups that can be created

with the create() call. Each policy group contains a set of servers or VMs that are

entitled to a common policy. In fact some of the recent industry products have adopted

similar ideas to help companies manage their servers in the cloud [42]. The policy is configured by the conf() call, which informs the policy engine which factors are to be considered

for ranking the potential partners in a descending order of importance. The exact defini-

tion of ranking factors varies depending on the specific policy as we demonstrate in the

following. With policy groups, only one common preference list is needed for all mem-

bers of the group. Membership is maintained by add() and delete() calls. colocate()

and limit() are used to set colocation/anti-colocation and placement constraints as we

discuss in Appendix E.

It is also possible for the operator to configure policies on behalf of its clients if they

do not explicitly specify any. This is done by enrolling them in the default policy group.

5.5.1 Cloud Operators

We begin our discussion from the operator’s perspective.

Server consolidation/packing. Operators usually wish to consolidate the workload

by packing VMs onto a small number of highly occupied servers, so that idle servers can be

powered down to save operational costs. To realize this policy, servers can be configured

to prefer a VM with a larger size. This can be done using conf(g_o, 1/vm_size), where g_o is the operator's policy group. For VMs in the default policy group, their preference


is ranked in the descending order of server load. One may use the total size of active VMs as the metric of load (conf(g_c, 1/server_load)), where g_c is the client's policy group. Alternatively, the number of active VMs can also serve as a heuristic metric (conf(g_c, 1/num_vm)).

Notice that consolidation is closely related to packing, and the above configuration

resembles the first fit decreasing heuristic widely used to solve packing problems by

iteratively assigning the largest item to the first bin that fits.

Load balancing. Another popular resource management policy is load balancing,

which distributes VMs across all servers to mitigate performance degradation due to

application dynamics over time. This can be seen as the opposite of consolidation. In

this case, we can configure the server preference with conf(g_o, vm_size), implying that servers prefer VMs of smaller size. VMs in the default policy group are configured with conf(g_c, server_load), such that they prefer less utilized servers. This can be

seen as a worst fit increasing heuristic.
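As an illustration, the two operator policies above could be configured through the interface of Table 5.4 roughly as follows; the anchor client object and the exact argument spellings are our own sketch, not the prototype's literal code.

```python
g_o = anchor.create()                       # operator policy group
anchor.add(g_o, ['server1', 'server2'])     # enroll servers

g_c = anchor.create()                       # default client policy group
anchor.add(g_c, ['vm1', 'vm2'])             # enroll VMs

# Consolidation: servers prefer larger VMs; VMs prefer more loaded servers.
anchor.conf(g_o, '1/vm_size')
anchor.conf(g_c, '1/server_load')

# Load balancing instead: invert both ranking factors.
anchor.conf(g_o, 'vm_size')
anchor.conf(g_c, 'server_load')
```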

5.5.2 Cloud Clients

From the perspective of cloud clients, other than choosing to join the default policy

group and follow the operator’s configuration, they can also express their unique policies.

Resource hunting. Depending on the resource demand of applications, VMs can

be CPU, memory, or bandwidth-bound, or even resource-intensive in terms of multiple

resources. Though resources are sliced into fixed slivers, most modern hypervisors support

dynamic resizing of VMs. For example, the hypervisor may allow a temporary burst of CPU usage for a VM, provided that doing so does not affect colocated VMs. For memory,

with a technique known as memory ballooning, the hypervisor is able to dynamically

reduce the memory footprints of idle VMs, so that memory allocation of heavily loaded

VMs can be increased.

Thus, clients may configure their policies according to the resource usage pattern of


their VMs, which is unknown to the operator. CPU-bound VMs can be added to a CPU-bound policy group, which is configured with a call to conf(g_c, 1/server_freecpu). Their preferences are then ranked in the descending order of the server's time-averaged available CPU cycles. Similarly, memory-bound and bandwidth-bound policy groups may be configured with the calls conf(g_c, 1/server_freemem) and conf(g_c, 1/server_freebw), respectively.

Colocation/anti-colocation. In practice, some clients want, or do not want, their VMs to be placed with the VMs of certain other tenants. Anchor supports this policy using

the API call colocate(tenants, i, g_c), where tenants is a list of tenants, i is a

boolean variable to indicate whether the caller wishes to colocate or anti-colocate itself

with clients in tenants, and g_c is an optional argument indicating the common policy

group the colocating tenants belong to.

In case a client A wishes to colocate with a set of clients B, it makes a call to

colocate(B, true, g). Since colocation is a mutual agreement, client A and clients in

B must share a common policy group g which is assumed to be agreed upon beforehand.

The provider then bundles client A’s VM with VMs of clients in B as virtual VMs, where

each virtual VM consists of an individual VM of A and an individual VM of each client

in B. Such a virtual VM is then treated as a single VM to join the policy group g and

participate in Anchor’s stable matching algorithms.

In case a client A wishes to anti-colocate with a set of clients B, it can indicate so

with colocate(B, false). The provider partitions the entire set of servers into two

non-overlapping sets X and Y. It creates for A a new policy group g_A containing servers in X only, and for each client in B a new policy group g_b containing servers in Y only.

Client A and clients in B can then configure their own policy groups for their preferences.

It is guaranteed that client A and clients in B propose to non-overlapping sets of servers

and will not be colocated. Note that anti-colocation is not a mutual agreement, and thus

clients involved can have distinct preferences and policy groups.
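The colocation bundling admits a short sketch: one VM from the requesting client is paired with one VM from each colocating tenant, and the bundle is treated as a single job whose size is the sum of its parts. The representation below is our own illustration.

```python
def bundle_virtual_vms(vm_groups, sizes):
    """vm_groups: one list of VM IDs per tenant; zip them into virtual VMs."""
    return [{'parts': parts, 'size': sum(sizes[v] for v in parts)}
            for parts in zip(*vm_groups)]

sizes = {'a1': 1, 'a2': 2, 'b1': 1, 'b2': 1}
print(bundle_virtual_vms([['a1', 'a2'], ['b1', 'b2']], sizes))
# [{'parts': ('a1', 'b1'), 'size': 2}, {'parts': ('a2', 'b2'), 'size': 3}]
```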


5.5.3 Additional Commonly Used Policies

Anchor supports additional policies that are common to both the operator and clients.

Tiered service. It is a common practice to implement tiered services in an opera-

tional cloud, by associating each VM with a priority class. This can be done in Anchor

with a call to conf(g_o, priority).

Incomplete preferences. It is possible that some VMs can only be placed on a

subset of servers, due to hardware constraints for instance. Such placement limitations

can be accommodated in Anchor, as a preference list does not need to contain

the entire set of servers. The policy engine in Anchor supports an additional API call

limit(g_c, servers) so that VM preferences in a client policy group g_c only include

servers specified in servers.

Combination of policies. Besides individual policies, Anchor also supports combi-

nations of policies, which is also common in practice. The API call conf(g, factor1,

factor2, ...) is designed for this purpose, where different policies are input in decreasing order of priority. A multi-pass sorting procedure is then performed, first based

on the least important factor, then the second least, and so on, to produce preferences

that adhere to this combination of policies.
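This multi-pass procedure relies on the fact that a stable sort preserves the relative order of equal keys, so sorting by the least important factor first yields a ranking that respects the priority order. A small sketch, with illustrative server records of our own:

```python
def build_preference(candidates, factors):
    """candidates: (id, attrs) pairs; factors: key functions, most important first."""
    order = list(candidates)
    for factor in reversed(factors):             # least important pass first
        order.sort(key=lambda c: factor(c[1]))   # Python's sort is stable
    return [cid for cid, _ in order]

servers = [('s1', {'load': 3, 'free_cpu': 2}),
           ('s2', {'load': 3, 'free_cpu': 5}),
           ('s3', {'load': 1, 'free_cpu': 4})]
# Most important: lighter load; tie-break: more free CPU (negated for descending).
print(build_preference(servers, [lambda a: a['load'], lambda a: -a['free_cpu']]))
# ['s3', 's2', 's1']
```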

5.5.4 Discussion

Most preference examples shown above for clients are based on state variables that change

during the placement process. However, we emphasize that the preferences do not change

during the placement process, for otherwise the concept of stability cannot be well defined

with respect to the constantly changing preferences. Our algorithms thus produce a stable

matching with respect to the preferences and system state given before the placement

process. This results in sub-optimal client performance compared to other algorithms

that keep track of state variables during the process. This is an inherent limitation of

the stable matching framework when applied to cloud resource management, because


many practical policies rely critically on system state variables. We believe that it is an

important and difficult problem, and would like to study it further as our future work.

5.6 Implementation and Evaluation

We investigate the performance of Anchor with both testbed implementation and large-

scale simulations based on real-world workload traces.

5.6.1 Setup

Prototype implementation. Our prototype consists of about 1500 LOC written in

Python. It is based on Oracle VirtualBox 3.2.10 [84]. The resource monitor is imple-

mented as a Python application that maintains resource statistics of servers and VMs

using SQLite, a lightweight database engine. The sqlite3 Python bundle is utilized

to update records in the database. The resource monitor periodically listens for usage reports (once per second) from a daemon in each server, which we have implemented

for the sole purpose of collecting measurements of CPU and memory usage, via the Virtu-

alBox management API (VBoxManage metrics). Our daemon utilizes the iptraf tool to

detect the available bandwidth of each server, since VirtualBox does not provide suitable

APIs for this purpose.

The policy manager also utilizes SQLite to manage policy groups for both the operator

and cloud clients, and updates its databases upon receiving an Anchor API call. For

efficiency, it maintains all the preferences in memory. When a placement request arrives, it obtains the necessary information from the resource monitor's database, and sends sorted

preferences to the matching engine.

The matching engine implements the Revised DA and Multi-stage DA in the offline case, as well as the Online DA in the online case, in Python. We pre-process each server preference list (with a complexity of O(n)) to obtain an inverse of the list indexed by VM ID.


Each subsequent rank comparison can thus be performed in constant time. For maximum efficiency, we use numpy in Python to implement the data structure of preferences.
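The inverse-index trick is standard; a minimal numpy sketch, with a hypothetical 8-VM preference list, looks like this:

```python
import numpy as np

pref = np.array([7, 2, 5, 0, 1, 3, 4, 6])   # a server's list: VM 7 is ranked first
rank = np.empty_like(pref)
rank[pref] = np.arange(len(pref))           # one O(n) pass: rank[vm_id] = position

def prefers(vm_a, vm_b):
    return rank[vm_a] < rank[vm_b]          # each comparison is now O(1)

print(prefers(2, 5))   # True: VM 2 appears before VM 5 in the list
```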

Our evaluation of Anchor is based on a prototype data center consisting of 20 Dual

Dual-Core Intel Xeon 3.0 GHz machines connected over gigabit ethernet. Each machine

has 2 GB memory. Thus, we define the atomic VM to have 1.5 GHz CPU and 256 MB

memory. Each server has a capacity of 7 in terms of atomic VM (since the hypervisor also

consumes server resources). All machines run Ubuntu 8.04.4 LTS with Linux 2.6.24-28

server. A cluster of dual Intel Xeon 2.4 GHz servers is used to generate workload for

some of the experiments. One node in the cluster is designated to run the Anchor control

plane, while others host VMs. Our VMs, if not otherwise noted, run Ubuntu 8.10 server

with Apache 2.2.9, PHP 5.2.6, and MySQL 5.0.67.

Trace-driven simulation. To evaluate Anchor at scale, we conduct large-scale

simulation based on real-world workload traces from RICC (RIKEN Integrated Cluster

of Clusters) [100] in Japan. RICC is composed of 4 clusters, and was put into operation

in August 2009. The data provided in the trace is from the “massively parallel cluster,”

which has 1024 Fujitsu RX200S5 Cluster nodes, each with 12 GB memory and two 4-core

CPUs, for a total of 12 TB memory and 8192 cores. The trace file contains workload

during the period of Sat May 01 00:04:55 JST 2010, to Thu Sep 30 23:58:08 JST 2010.

5.6.2 Efficiency of Resource Allocation

We evaluate the efficiency of Anchor's resource allocation by allowing clients to use the

resource hunting policy in Sec. 5.5.2. We enable memory ballooning in VirtualBox to

allow the temporary burst of memory use. CPU-bound VMs are configured to run a 20

newsgroups Bayesian classification job with 20,000 newsgroups documents, based on the

Apache Mahout machine learning library [8]. Memory-bound VMs run a Web application

called Olio that allows users to add and edit social events and share with others [83].

Its MySQL database is loaded with a large amount of data so that performance is memory critical. We use Faban, a benchmarking tool for tiered web applications, to inject

workload and measure Olio’s performance [35].

Our experiment comprises 2 servers (S1, S2) and 2 VMs (VM1 and VM2). S1

runs a memory-bound VM of size 5, and S2 runs a CPU-bound VM of the same size

before allocation. VM1 is CPU-bound with size 1 while VM2 is memory-bound with size

2. Assuming servers adopt a consolidation policy, we run Anchor first with the resource

hunting policy, followed by another run with the default consolidation policy for the two

VMs. In the first run, Anchor matches VM1 to S1 and VM2 to S2, since VM1 prefers S1

with more available CPU and VM2 prefers S2 with more memory. Other VM placement

schemes that consider the resource usage pattern of VMs will yield the same matching.

In the second run, Anchor matches VM2 to S1 and VM1 to S2 for consolidation.

Figure 5.5: VM1 CPU usage on S1 when using the resource hunting policy.

Figure 5.6: VM1 CPU usage on S2 when using the consolidation policy.

[Both figures plot CPU utilization (%) over time (sec), showing VM1's used CPU against the server's idle CPU.]

We now compare CPU utilization of VM1 in these two matchings as shown in Fig. 5.5

and 5.6, respectively. From Fig. 5.5, we can see that as VM1 starts the learning task

at around 20 seconds, it quickly hogs its allocated CPU share of 12.5%, and bursts to

approximately 40% on S1 (80%-40%). Some may wonder why it does not saturate S1’s

CPU. We conjecture that the reason may be related to VirtualBox’s implementation that

limits the CPU allocated to a single VM. In the case it is matched to S2, it can only

consume up to about 30% CPU, while the rest is taken by S2’s pre-existing VM as seen

in Fig. 5.6. We also observe that the learning task takes about 600 seconds to complete


on S2, compared to only 460 seconds on S1, which implies a performance penalty of 30%.

We next look at the memory-bound VM2. Fig. 5.7 shows a time-series comparison of memory allocation between the two matchings. Recall that VM2 has size 2, and

should be allocated 512 MB memory. By the resource hunting policy, it is matched to

S2, and obtains its fair share as soon as it is started at around 10 seconds. When we start

the Faban workload generator at 50 seconds, its memory allocation steadily increases as

an e↵ect of memory ballooning to cope with the increasing workload. At steady state it

utilizes about 900 MB. On the other hand, when it is matched to S1 by the consolidation

policy, it only has 400 MB memory after startup. The deficit of 112 MB is allocated to

the other memory hungry VM that S1 is running. VM2 gradually reclaims its fair share

as the workload of Olio database rises, but cannot get any extra resource beyond that

point.

[Figure 5.7: VM2 memory usage on S1 and S2. Axes: time (sec) vs. used memory (MB); series: VM2 on S2, VM2 on S1.]

[Figure 5.8: S1 CPU and memory usage. Axes: time (sec) vs. utilization (%); series: S1 CPU w. VM1, S1 mem w. VM1, S1 CPU w. VM2, S1 mem w. VM2.]

The client resource hunting policy also works to the benefit of the operator and its servers. Fig. 5.8 shows S1's resource utilization. When the resource hunting policy is used, i.e., when S1 is assigned VM1, its total CPU and memory utilization are aligned at around 60%, because VM1's CPU-bound nature is complementary to the memory-bound nature of S1's existing VM. However, when S1 is assigned the memory-bound VM2 by the consolidation policy, its memory utilization surges to nearly 100% while its CPU utilization lags at only 50%. A similar observation can be made for S2.

Result: Anchor enables efficient resource utilization of the infrastructure and improves the performance of its VMs, by allowing individual clients to express policies specific to their resource needs.

5.6.3 Anchor’s Performance at Scale

Now we evaluate the performance and scalability of Anchor using both experiments and

trace-driven simulations.

We first conduct a small-scale experiment that places 10 VMs on 10 servers using both the consolidation and load balancing policies, to demonstrate the effectiveness of Anchor in realizing resource management policies. We assume that clients follow the operator's default policy in this experiment, so there is no conflicting interest involved.

[Figure 5.9: Consolidation. Bar chart: server (1-10) vs. occupancy (0-7); bars distinguish existing VMs from the ten newly placed VMs.]

[Figure 5.10: Load Balancing. Bar chart: server (1-10) vs. occupancy (0-7); bars distinguish existing VMs from the ten newly placed VMs.]

The experiment is to allocate 10 VMs to 10 servers, the first four of which are configured with an occupancy of 2, 1, 1, 2, respectively. Fig. 5.9 shows the result of using the consolidation policy, where VM preference is ranked in descending order of server occupancy, and server preference is ranked in descending order of VM size. We observe that all the VMs are packed into the first five servers, whose occupancy is thus maximized. On the other hand, the load balancing policy distributes VMs across the idle servers, resulting in a more balanced server load as shown in Fig. 5.10. Multi-stage DA takes 3 iterations to converge to the strongly stable matching of Fig. 5.9 in the former case, and 2 iterations in the latter.
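To make the policy concrete, the following is a minimal sketch (in Python; the function and data layout are illustrative, not Anchor's actual implementation) of how these two preference rules can be generated. Since every VM shares the same ranking under this policy, a single list per side suffices:

    def consolidation_preferences(occupancy, size):
        """occupancy: server name -> current occupancy; size: VM name -> size.
        VMs rank servers by descending occupancy; servers rank VMs by
        descending size, as in the consolidation policy above."""
        vm_pref = sorted(occupancy, key=occupancy.get, reverse=True)
        server_pref = sorted(size, key=size.get, reverse=True)
        return vm_pref, server_pref

    # The first four servers of the experiment have occupancy 2, 1, 1, 2.
    occupancy = {"s1": 2, "s2": 1, "s3": 1, "s4": 2, "s5": 0}
    size = {"vm1": 3, "vm2": 1, "vm3": 2}
    vm_pref, server_pref = consolidation_preferences(occupancy, size)
    print(vm_pref)      # ['s1', 's4', 's2', 's3', 's5']
    print(server_pref)  # ['vm1', 'vm3', 'vm2']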

We then conduct a medium-scale experiment involving all of our 20 machines. We orchestrate a complex scenario with 4 batches of VMs, each with 20 VMs whose sizes are drawn uniformly at random from [1, 4]. Servers are initially empty with a capacity of 7. VMs are uniformly randomly chosen to use either the consolidation, CPU-bound, or memory-bound resource hunting policy, and servers adopt a consolidation policy for placement.

Since the stakeholders have different objectives, we use the rank percentile of the assigned partner as the performance metric that reflects one's "happiness" with the matching. A happiness of 90% means that the partner ranks better than 90% of the total population. For servers, happiness is averaged over their matched VMs. From the experiment we find that VMs obtain their top 10% partners on average, while servers only get their top 50% VMs. The reason is that the number of VMs is too small compared to the servers' total capacity, so most of the VMs' proposals can be directly accepted.
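As an illustration, the metric can be computed as follows (a minimal sketch; the linear rank-to-percentile mapping is one natural reading of the definition above, not necessarily the exact formula of our implementation):

    def happiness(pref, partner):
        """pref: preference list, best first; returns the rank percentile (%)
        of the assigned partner, so the top choice scores 100%."""
        rank = pref.index(partner)              # 0 for the top choice
        return 100.0 * (1 - rank / len(pref))

    def server_happiness(pref, matched_vms):
        """A server's happiness is averaged over its matched VMs."""
        return sum(happiness(pref, v) for v in matched_vms) / len(matched_vms)

    print(happiness(["s3", "s1", "s4", "s2"], "s3"))            # 100.0
    print(server_happiness(["v2", "v1", "v3"], ["v2", "v3"]))   # ~66.7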

The scale of the previous experiments is limited by the hardware constraints of our testbed. To verify Anchor's effectiveness in a practical cloud scenario with large numbers of VMs and servers, we perform large-scale trace-driven simulations using the RICC workload traces as the input to our Revised DA and Multi-stage DA algorithms. According to [100], the allocation of CPU and memory in this cluster is done with a fixed ratio of 1.2 GB per core, which coincides well with our atomic VM assumption. We thus define an atomic VM to be of 1 core with 1.2 GB memory. Each RICC server, with 8 cores and 12 GB memory as introduced in Sec. 5.6.1, has a capacity of 8. The number of servers is fixed at 1024.

We assume that tasks in the trace run in VMs, and that they arrive offline before the algorithms run. For large tasks that require more than one server, we break them down into multiple smaller tasks, each of size 8, that can run on a single server. We then process each task scheduling request in the trace as VM placement request(s) of various sizes. We use the first 200 tasks in the trace, which amounts to more than 1000 VM requests.
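The translation of one trace record is straightforward; a sketch follows (remainder handling is our assumption for tasks whose core counts are not multiples of 8):

    def to_vm_requests(cores, server_size=8):
        """Split a task needing `cores` cores into per-server VM requests."""
        full, rem = divmod(cores, server_size)
        return [server_size] * full + ([rem] if rem else [])

    print(to_vm_requests(20))  # [8, 8, 4]
    print(to_vm_requests(16))  # [8, 8]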

However, the trace does not contain detailed information about the resource usage history of servers and tasks, making it difficult for us to generate the various preferences needed for stable matching. To emulate a typical operational cloud with a few policy groups, we synthesize 8 policy groups for servers and 10 for VMs, the preference of each group being a random permutation of the members of the other side. The results are averaged over 100 runs.
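A sketch of this synthesis step (names and the fixed seed are illustrative):

    import random

    def synthesize_prefs(n_groups, members, other_side, rng):
        """Each policy group shares one preference list: a random permutation
        of the other side. Every member joins a uniformly random group."""
        group_pref = [rng.sample(other_side, k=len(other_side))
                      for _ in range(n_groups)]
        return {m: group_pref[rng.randrange(n_groups)] for m in members}

    rng = random.Random(42)
    servers = ["s%d" % i for i in range(1024)]
    vms = ["v%d" % i for i in range(1000)]
    server_prefs = synthesize_prefs(8, servers, vms, rng)   # 8 server groups
    vm_prefs = synthesize_prefs(10, vms, servers, rng)      # 10 VM groups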

[Figure 5.11: VM happiness in a static setting. Axes: number of VMs (100) vs. VM happiness (%); series: Revised DA, Multi-stage DA, First fit.]

[Figure 5.12: Server happiness in a static setting. Axes: number of VMs (100) vs. server happiness (%); series: Revised DA, Multi-stage DA, First fit.]

As a benchmark, we implement a First fit algorithm widely used to solve large-scale VM placement problems in the literature [19, 21, 70]. Since the servers have different preferences but First fit assumes a uniform ranking of VMs, the algorithm first sorts the VMs according to the preference of the most popular policy group, and then places each VM on the best server, according to the VM's preference, that has enough capacity.
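For reference, a sketch of this benchmark (hypothetical names; capacities are in atomic VM units):

    def first_fit(vms, uniform_rank, vm_pref, size, capacity):
        """uniform_rank: VM ranking of the most popular policy group;
        vm_pref[v]: v's server ranking, best first;
        capacity: server -> free units; size: VM -> size."""
        placement = {}
        for v in sorted(vms, key=uniform_rank.index):
            for s in vm_pref[v]:                 # best server first
                if capacity[s] >= size[v]:       # place at first feasible server
                    capacity[s] -= size[v]
                    placement[v] = s
                    break
        return placement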

Fig. 5.11 and 5.12 show the results, with error bars, for both Revised DA and Multi-stage DA at different scales. As expected, we observe that, as the problem scales up, VMs are allocated to lower-ranked servers and their happiness decreases, while servers are allocated higher-ranked VMs, due to the increased competition amongst VMs. Also note that Multi-stage DA is only able to improve the matching from the VM perspective by 15% on average, as shown in Fig. 5.11, at the cost of decreased server happiness as shown in Fig. 5.12. The performance difference between Revised DA and Multi-stage DA for VMs is thus small.

Compared to the benchmark First fit, our algorithms provide a significant performance improvement for servers. Both Revised DA and Multi-stage DA consistently improve server happiness by 60% for all problem sizes. This demonstrates the advantage of our algorithms in coordinating the conflicting interests between the operator and the clients using stable matching. Specifically, First fit only uses a single uniform ranking of VMs for all servers, while our stable matching algorithms allow servers to express their own preferences. Further, First fit will not match a VM to a server whose capacity is insufficient, i.e., there will be no rejections from servers, while our DA algorithms allow a server to reject VMs during execution if a newly proposing VM is preferable to some of its accepted VMs. Clearly this improves the happiness of both VMs and servers.

[Figure 5.13: Running time in the static setting. Axes: number of VMs (100) vs. time (s); series: Revised DA, Multi-stage DA, First fit.]

[Figure 5.14: Number of iterations in the static setting. Axes: number of VMs (100) vs. iterations (1000); series: Revised DA, Multi-stage DA.]

Fig. 5.13 and 5.14 show the time complexity of the algorithms. It is clear that the running time of Multi-stage DA is much worse than that of the simple Revised DA, and grows more rapidly. The same observation holds for the number of iterations: with 1000 VMs, Multi-stage DA takes more than 95,000 iterations to finish while Revised DA takes only 11,824. Another observation we emphasize here is that the average-case complexity of Revised DA is much lower than its worst-case complexity O(|J|²) in Sec. 4.1, while Multi-stage DA exhibits O(|J|²) complexity on average. Thus Revised DA scales well in practice, while Multi-stage DA may only be used for small or medium scale problems.

Revised DA takes 10 seconds to solve problems with 1000 VMs and 1024 servers,

which is acceptable for practical use. As expected, both algorithms are slower than the

simple First fit algorithm, whose running time is negligible (0.01s–0.06s). First fit

is not iterative, so we do not include it in the iteration count comparison in Fig. 5.14.

Result: Revised DA is effective and practical for large-scale problems with thousands of VMs, and offers very close-to-optimal performance for VMs.

5.6.4 Evaluation of Online DA

We conduct large-scale trace-driven simulations to evaluate Anchor in a dynamic environment where VM placement requests arrive dynamically. We again use the RICC workload trace as the input to our Online DA algorithm. The trace contains each task's arrival time, finish time, and amount of resources requested. In our simulation, we process each task scheduling request in the trace according to its arrival time as VM placement request(s) of various sizes, and use Online DA to determine a matching for them. When a task finishes, we remove its VMs from the corresponding servers, so capacity is freed to accommodate future requests.

In our simulations, the servers are configured to use the consolidation policy. The VMs are again randomly chosen to join the default, CPU-bound, or memory-bound policy group, which use the consolidation policy, the CPU-bound resource hunting policy, and the memory-bound resource hunting policy, respectively. The free CPU and memory of a server are simply calculated as its total CPU and memory minus the amounts allocated to its active VMs. We emphasize that VM preferences in this case are configured upon arrival using the most current system state, as discussed above. Large tasks that require more than one server are translated into multiple smaller requests.

We simulate the exact birth-death history of the first 2000 tasks in the RICC trace, amounting to 16,600 VM placement requests that cover a time period of 475,529 seconds, or roughly 5.5 days. We compare Online DA with an online version of the first fit algorithm, which first sorts the VMs of a request according to their sizes (recall that servers use a consolidation policy), and then tries to place each VM, upon its arrival, on the best server according to its preference that has sufficient capacity. The number of servers is fixed at 1024.
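The simulation itself is a simple event-driven loop; a sketch follows, with `online_da` standing in for the Online DA matching step (assumed here rather than shown):

    import heapq

    def simulate(tasks, capacity, online_da):
        """tasks: iterable of (arrival, finish, vm_sizes); capacity: server ->
        free units. online_da(vm_sizes, capacity) -> {vm_index: server}."""
        events, seq = [], 0          # seq breaks ties so payloads never compare
        for arrival, finish, vm_sizes in tasks:
            heapq.heappush(events, (arrival, seq, "arrive", (finish, vm_sizes)))
            seq += 1
        while events:
            t, _, kind, payload = heapq.heappop(events)
            if kind == "arrive":
                finish, vm_sizes = payload
                match = online_da(vm_sizes, capacity)   # place the new VMs
                for i, s in match.items():
                    capacity[s] -= vm_sizes[i]          # occupy capacity
                heapq.heappush(events, (finish, seq, "finish", (vm_sizes, match)))
                seq += 1
            else:                                       # task departs
                vm_sizes, match = payload
                for i, s in match.items():
                    capacity[s] += vm_sizes[i]          # free capacity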

[Figure 5.15: VM happiness in a dynamic setting. Axes: VM request arrival (1000) vs. VM happiness (%); series: Online DA, Online first fit.]

[Figure 5.16: Server happiness in a dynamic setting. Axes: VM request arrival (1000) vs. server happiness (%); series: Online DA, Online first fit.]

The results are shown in Fig. 5.15 and 5.16. In terms of VM happiness, Online DA only slightly outperforms Online first fit, by about 7%. In terms of server happiness, however, Online DA enjoys a significant improvement: servers obtain their top 1% VMs, while Online first fit is only able to match servers with their top 60% VMs.

The reason for this performance discrepancy is again the propose-reject design of stable matching algorithms. The first fit algorithm will not match a VM to a server whose capacity is insufficient, while Online DA will, during its execution, if the VM is preferable to some of the server's VMs. This improves the happiness of both VMs and servers, since VMs can get better servers even when those servers are occupied, and servers can also get their favored VMs.

We also evaluate the complexity of Online DA. Fig. 5.17 shows the running time of Online DA, including the time for processing preferences, over the entire course of the simulation. Observe that it takes Online DA less than 3 seconds in any case to find a matching for new VMs, making it responsive to dynamic situations. Fig. 5.18 further shows that the algorithm usually terminates in less than 50 iterations.

[Figure 5.17: Running times of Online DA. Axes: VM request arrival (1000) vs. time (s).]

[Figure 5.18: Convergence of Online DA. Axes: VM request arrival (1000) vs. iterations.]

Result: The Online DA algorithm is responsive and efficient in handling dynamic VM placement requests. Its performance is significantly better than existing methods based on the first fit algorithm.

5.7 Summary

We presented Anchor as a unifying fabric for resource management in the cloud, where policies are decoupled from the management mechanisms by the stable matching framework. We developed a new theory of job-machine stable matching with size-heterogeneous jobs as the underlying mechanism to resolve conflicts of interest between the operator and clients. We then showcased the versatility of the preference abstraction for a wide spectrum of resource management policies for VM placement with a simple API. Finally, the efficiency and scalability of Anchor were demonstrated using a prototype implementation and large-scale trace-driven simulations.

Many other problems can be cast into our model, for instance job scheduling in distributed computing platforms such as MapReduce, where jobs have different sizes and share a common infrastructure. Our theoretical results are thus potentially applicable to scenarios beyond those described in this chapter. As future work, we plan to extend Anchor to the case where resource demands vary and VMs may need to be re-placed, which requires specific considerations for VM live migration [111].


Chapter 6

Concluding Remarks

6.1 Contributions

Datacenters have become an indispensable part of the Internet and our society, powering many web services that are essential to people's everyday lives. Overall, this dissertation makes contributions in designing practical algorithms and systems to improve the efficiency of datacenters running applications used by millions of people daily. Two aspects of datacenter operations, workload management and resource management, are examined in detail.

On the former, we augmented the design space of workload management in geo-distributed datacenters with several new dimensions. We first proposed to consider response routing, which controls the routing of response packets from datacenters to users. The joint optimization of request mapping and response routing allows the operator to achieve a trade-off between latency performance and the total operating cost, including both wide-area bandwidth and electricity costs. We then considered temperature aware workload management with batch job capacity allocation. Through empirical studies of production cooling systems and temperature data, we found that cooling energy efficiency critically depends on the ambient temperature, which exhibits significant geographical diversity. By making workload management temperature aware, cooling energy consumption and cost can be reduced. Further, the elastic nature of batch jobs provides more cost saving opportunities through dynamic capacity allocation. We provided distributed algorithms based on the alternating direction method of multipliers (ADMM) to solve the resulting large-scale distributed convex optimization problems efficiently. Trace-driven simulations show that our approaches are effective in reducing the operating cost of geo-distributed datacenters.

On the latter, resource management, we advocated a design principle of separating policies from mechanisms for resource management in virtualized datacenters, so that distinct policies can be supported on a uniform resource management mechanism. We presented the design, implementation, and evaluation of Anchor, a scalable and versatile resource management substrate based on this design principle. Anchor is built upon a novel one-to-many stable matching theory that we developed to resolve the conflicting goals of tenants and the operator, and to handle the heterogeneous resource requirements of applications running on virtual machines. We showed that Anchor supports a variety of resource management policies with good performance, and efficiently solves problems at scale.

Our contributions in this dissertation are not limited to the problems or settings we considered. In fact, the algorithms are generally applicable to many problems in other domains. The generalized m-block ADMM algorithm developed in Chapter 4 solves general convex optimization problems with separable objective functions and equality-constrained variables. These problems commonly arise not only in networking, but also in machine learning, data mining, image processing, etc. With the availability of distributed data processing and storage systems such as MapReduce [31] and Spanner [26], our algorithm can potentially be utilized to develop efficient distributed and parallel methods for solving these problems. The stable matching algorithms developed in Chapter 5 can also be applied to general matching problems between jobs and machines, where jobs can have different sizes. Consider, for example, MapReduce job scheduling, where multiple jobs need to be matched to machines. Each job requires a different amount of computing resources (the number of slots, as in the Apache Hadoop implementation [7]). Machines obviously have different numbers of slots. Jobs prefer to run on machines that contain their input data, or that at least are close to the data sources [118]. These considerations can be captured using preferences, as the sketch below illustrates.
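For example, a data-locality preference can be generated along the following lines (a toy sketch; the types and names are illustrative, not tied to any Hadoop API):

    from collections import namedtuple

    Block = namedtuple("Block", ["replicas"])   # machines holding a replica

    def locality_preference(job_blocks, machines):
        """Rank machines by how many of the job's input blocks they store
        locally, so local machines are proposed to first."""
        def local_blocks(m):
            return sum(1 for b in job_blocks if m in b.replicas)
        return sorted(machines, key=local_blocks, reverse=True)

    machines = ["m1", "m2", "m3"]
    blocks = [Block({"m1", "m2"}), Block({"m2"})]
    print(locality_preference(blocks, machines))  # ['m2', 'm1', 'm3']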

The next section illustrates some of these opportunities to extend the work done in this dissertation.

6.2 Future Directions

Our work is by no means complete, and a number of open questions remain. There is much potential for future work in the area of improving datacenter efficiency. Some promising avenues are:

Real-world Deployment. We believe the algorithms and systems developed for both workload and resource management can be deployed in production datacenters. The evaluation results we provided, using trace-driven simulations and prototype experiments, offer enough motivation to attempt a real-world deployment as the next step.

Datacenter Demand Response. Demand response is an important aspect of the future smart grid that uses advanced pricing signals to balance the supply and demand of electricity in the grid. Datacenters are an effective and promising target for automated demand response, for they consume an enormous amount of electricity, and their management can be readily automated. Moreover, datacenters are designed to absorb load spikes with extra capacity, and they are often lightly utilized in practice. The integration of datacenters and demand response will make the grid even smarter and greener. We did not consider dynamic electricity pricing and its impact on datacenter workload management, nor various possible strategies, other than workload management, that can increase or decrease energy consumption according to the price signal. These are interesting directions to look into.

Specifically, to realize the vision of integrating datacenters into the demand response of the smart grid, we need to understand the workload and energy consumption characteristics of datacenters to identify potential areas for demand response. Meanwhile, we also need to understand the fundamental trade-off between the benefit of participating in demand response and the loss in reliability and availability of services. For example, datacenters may temporarily reduce power consumption by raising the set-point temperature of the cooling systems when the power price spikes. In doing so, the operator bears an increased risk of hardware faults and downtime.

We also need to develop efficient power management solutions to adjust the operations of servers, switches and routers, and cooling systems, in order to respond to the incentives of the smart grid and optimize the trade-off. The optimization framework and distributed algorithms developed in this thesis will be helpful in modeling and solving the problem. Another interesting direction is the design of more efficient pricing strategies or incentive schemes for demand response.

Pricing in Public Clouds. Many Internet services run on geo-distributed public clouds such as Amazon EC2. Our workload management approaches assumed a private cloud context where the request mapping is centrally managed by the operator. For public clouds, the developers control the request mapping of their applications and services. Further, public clouds use fixed pricing to charge their tenants. These features create new challenges for efficient workload management in public clouds, and suggest that the operator may utilize pricing to incentivize tenants such that requests are properly distributed across the datacenters according to the cost differentials.

The pricing of resources in cloud computing in general is an important but not well-explored area. The current practices adopted by public clouds, such as Amazon EC2 and Windows Azure, vary in terms of what to price and how to update the price. For example, Amazon EC2 sells computing resources in fixed bundles, called "instances," and uses a per-instance price; in other words, bundled pricing is adopted. However, Windows Azure, Rackspace, and other public clouds choose to price computing resources, including CPU, memory, and storage, individually. Also, Amazon EC2 uses both static pricing and dynamic pricing (called "spot pricing" in Amazon's terms), while other public clouds only use static pricing.

Here, some fundamental questions need to be answered: What is the right pricing strategy for cloud computing, bundled pricing or individual resource pricing? And should we use simple static pricing, or dynamically update the price as the supply and demand relationship changes? To answer these questions, perhaps we first need to ask ourselves what factors we should consider when studying pricing in cloud computing. Operator revenue is obviously important. The benefit tenants obtain by purchasing and using cloud resources is equally important for the entire economy of cloud computing to work out. We could leverage the vast literature on Internet pricing to help us answer these questions. However, we do have to ponder another key issue: What are the differences between pricing Internet access and pricing computing resources in cloud computing? These are fundamental and important questions that have significant implications not only for academics, but also for practitioners and government regulatory bodies.

Multi-resource Allocation. In designing Anchor, we assumed that all VMs have a fixed bundle of multiple resources. This simplified the multi-dimensional resource allocation by consolidating it into a one-dimensional problem, which, as we showed, is already quite difficult when developing a stable matching theory. The general case of multi-dimensional resource allocation has received much attention recently. How to apply stable matching theory, instead of optimization, to solve multi-dimensional resource allocation remains untouched thus far.

There are several possible directions to tackle this question. One direction is to follow the Dominant Resource Fairness (DRF) concept, a generalization of max-min fairness to multiple resource types proposed in [41]. Using DRF, the size of a VM is determined by the amount of dominant resource this VM requires. If the VM is CPU-bound, its size is determined by the amount of CPU resources required; if, on the other hand, the VM is memory-bound, its size is determined by its memory requirement. The new stable matching algorithms we developed in this thesis can then be applied to solve the VM placement problem. An interesting question to study here is the utilization of multiple resources that results from our stable matching algorithms, in comparison with other optimization-based approaches.
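A minimal sketch of this sizing rule (the capacities and demands below are illustrative, using the 8-core, 12 GB server of our RICC setup as the reference):

    def dominant_size(demand, capacity):
        """demand, capacity: dicts of resource -> amount. Returns the dominant
        resource and its share, following the DRF idea of [41]."""
        shares = {r: demand[r] / capacity[r] for r in demand}
        dominant = max(shares, key=shares.get)
        return dominant, shares[dominant]

    capacity = {"cpu": 8, "mem_gb": 12}
    print(dominant_size({"cpu": 2, "mem_gb": 1.2}, capacity))  # ('cpu', 0.25)
    print(dominant_size({"cpu": 1, "mem_gb": 6.0}, capacity))  # ('mem_gb', 0.5)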

Big Data Driven Network Optimization in Datacenters. How to extract value from data in an efficient manner is the central thesis of big data. For datacenters, traffic engineering of the internal interconnect can particularly benefit from big data analytics. Datacenter networks comprise hundreds of thousands of links and switches, and carry a massive amount of traffic that changes constantly. Studies have found that existing traffic engineering techniques perform sub-optimally in datacenter networks, mainly due to the lack of global knowledge of the flows and of coordination for scheduling flows. A big data approach is promising for tackling this problem.

Specifically, with the increasing use of software defined routers in datacenter networks, it is feasible to collect real-time port-level and even flow-level traffic statistics. Stochastic learning algorithms can then be applied to characterize and predict the short-term traffic demand of individual servers and links based on the big data of traffic statistics. These predictive algorithms provide an accurate characterization of all the flows. Then, one can design efficient traffic engineering algorithms by drawing upon the distributed ADMM algorithm developed in this thesis, which is well suited to solving large-scale distributed convex optimization problems. For example, routing and flow scheduling can be dynamically optimized with software defined routers.

Finally, our approaches showed promising potential in improving the efficiency of operating datacenters. Given the current trend in datacenter prevalence and demand growth, we believe that their utility will increase significantly in the future.


Chapter 7

Proofs

7.1 Proof of Lemma 3.1

The KKT conditions [14] of the per-stub datacenter problem (3.14) constitute the following system of equations:
\[ \rho(\lambda_{ij}^{t+1} - \alpha_{ij}^{t+1}) + D_i(P_j^B + \nu_j^{t+1}) - \sigma_{ij}^t - \tau_{ij}^{t+1} = 0, \quad \forall i, \tag{7.1} \]
\[ \nu_j^{t+1}\Big(C_j - \sum_i \lambda_{ij}^{t+1} D_i\Big) = 0, \quad \lambda_{ij}^{t+1}\tau_{ij}^{t+1} = 0, \quad \forall i, \tag{7.2} \]
\[ C_j - \sum_i \lambda_{ij}^{t+1} D_i \ge 0, \quad \nu_j^{t+1} \ge 0, \quad \lambda_{ij}^{t+1} \ge 0, \quad \tau_{ij}^{t+1} \ge 0, \quad \forall i. \tag{7.3} \]
Here $\lambda_{ij}^{t+1}$ is the optimal solution, and $\nu_j^{t+1}$ is the KKT multiplier of the capacity constraint. (7.1) is the first-order optimality condition, (7.2) comprises the complementary slackness conditions, and (7.3) contains the primal and dual feasibility conditions.

For all $i \in \mathcal{I}$ that satisfy $\sigma_{ij}^t - D_i P_j^B + \rho\alpha_{ij}^{t+1} \le 0$, assume $\lambda_{ij}^{t+1} > 0$. Then, according to the complementary slackness condition (7.2), $\tau_{ij}^{t+1} = 0$. The left-hand side (LHS) of (7.1) is then always positive, which contradicts the optimality condition. Thus $\lambda_{ij}^{t+1} = 0$.

As in Lemma 3.1, denote the rest of the stub datacenters as the set $\mathcal{I}_j^{t+1}$, so that $\sigma_{ij}^t - D_i P_j^B + \rho\alpha_{ij}^{t+1} > 0$ holds for all $i \in \mathcal{I}_j^{t+1}$. If $\sum_{i \in \mathcal{I}_j^{t+1}} (\sigma_{ij}^t - D_i P_j^B + \rho\alpha_{ij}^{t+1}) D_i \le \rho C_j$, then according to (7.2), $\nu_j^{t+1} = 0$. This is so because for those $i \in \mathcal{I}_j^{t+1}$ such that $\lambda_{ij}^{t+1} > 0$, we have $\rho\lambda_{ij}^{t+1} \le \sigma_{ij}^t - D_i P_j^B + \rho\alpha_{ij}^{t+1}$ since $\nu_j^{t+1} \ge 0$ in (7.1); thus $\sum_{i \in \mathcal{I}_j^{t+1}} \lambda_{ij}^{t+1} D_i \le C_j$, and $\nu_j^{t+1} = 0$. Then $\tau_{ij}^{t+1} = 0$ must hold for all $i \in \mathcal{I}_j^{t+1}$, for otherwise $\lambda_{ij}^{t+1} = 0$ and the LHS of (7.1) would always be negative. Substituting $\nu_j^{t+1} = 0$ and $\tau_{ij}^{t+1} = 0$ into (7.1) yields
\[ \lambda_{ij}^{t+1} = \frac{\sigma_{ij}^t - D_i P_j^B}{\rho} + \alpha_{ij}^{t+1}. \]

If $\sum_{i \in \mathcal{I}_j^{t+1}} (\sigma_{ij}^t - D_i P_j^B + \rho\alpha_{ij}^{t+1}) D_i > \rho C_j$, note that, when the capacity constraint is absent, the objective of (3.14) is minimized at $\frac{\sigma_{ij}^t - D_i(P_j^B + \nu_j^{t+1})}{\rho} + \alpha_{ij}^{t+1} > 0$; we must therefore have $\lambda_{ij}^{t+1} < \frac{\sigma_{ij}^t - D_i(P_j^B + \nu_j^{t+1})}{\rho} + \alpha_{ij}^{t+1}$ to conform to the capacity constraint. Since the objective function of (3.14) is convex in $\lambda_{ij}$, it is decreasing for $\lambda_{ij} \in \Big[0, \frac{\sigma_{ij}^t - D_i(P_j^B + \nu_j^{t+1})}{\rho} + \alpha_{ij}^{t+1}\Big]$. Thus the optimal $\lambda_{ij}^{t+1}$ must satisfy the capacity constraint with equality, and equals
\[ \max\bigg\{ \frac{\sigma_{ij}^t - D_i(P_j^B + \nu_j^{t+1})}{\rho} + \alpha_{ij}^{t+1},\; 0 \bigg\}. \]

7.2 Proof of Lemma 3.3

The KKT conditions for the per-client sub-problem with an affine utility function (3.17) are
\[ \rho(\alpha_{ij}^{t+1} - \lambda_{ij}^t) + D_i(aL_{ij} + P_j^E) + \sigma_{ij}^t + \mu_i^{t+1} - \eta_{ij}^{t+1} = 0, \quad \forall j, \tag{7.4} \]
\[ \sum_j \alpha_{ij}^{t+1} - 1 = 0, \tag{7.5} \]
\[ \eta_{ij}^{t+1}\alpha_{ij}^{t+1} = 0, \quad \alpha_{ij}^{t+1} \ge 0, \quad \eta_{ij}^{t+1} \ge 0, \quad \forall j, \tag{7.6} \]
where $\alpha_{ij}^{t+1}$ is the optimal solution as in (3.6), and $\mu_i^{t+1}$ (unrestricted in sign) and $\eta_{ij}^{t+1}$ are the KKT multipliers for the equality and inequality constraints of (3.17), respectively. (7.4) corresponds to the first-order optimality condition, (7.5) is one of the primal feasibility conditions, and (7.6) captures the other primal feasibility condition, the dual feasibility, and the complementary slackness conditions. Essentially, since $\alpha_{ij}^{t+1}$ and $\eta_{ij}^{t+1}$ are never simultaneously positive in (7.4),
\[ \alpha_{ij}^{t+1} = \max\Big\{ \lambda_{ij}^t - \big(D_i(aL_{ij} + P_j^E) + \sigma_{ij}^t + \mu_i^{t+1}\big)/\rho,\; 0 \Big\}, \]
and it must satisfy (7.5). Thus the proof.


7.3 Proof of Inequality (4.21)
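Throughout this and the following sections, the iterates under analysis are the Gauss-Seidel ADMM updates, restated here for ease of reference (the formal definitions appear in Chapter 4): each block minimizes the augmented Lagrangian given the latest values of the other blocks, followed by a dual update with stepsize $\varrho$,
\[ x_i^{k+1} = \arg\min_{x_i} L_\rho(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \dots, x_m^k;\, y^k), \quad i = 1, \dots, m, \]
\[ y^{k+1} = y^k + \varrho\Big(\sum_{i=1}^m A_i x_i^{k+1} - b\Big). \]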

Taking $\nabla_{x_i}$ on both sides of (4.14), we get
\[ \nabla_{x_i} L_\rho(x^k; y^k) = \nabla f_i(x_i^k) + A_i^T y^k + \rho A_i^T \Big( \sum_{j=1}^m A_j x_j^k - b \Big). \]

Recall that $x_i^{k+1}$ minimizes $L_\rho(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \dots, x_m^k; y^k)$, so we have
\[ 0 = \nabla f_i(x_i^{k+1}) + A_i^T y^k + \rho A_i^T \Big( \sum_{j=1}^i A_j x_j^{k+1} + \sum_{j=i+1}^m A_j x_j^k - b \Big). \]

Combining the two equalities above, we obtain
\[ \nabla_{x_i} L_\rho(x^k; y^k) = \nabla f_i(x_i^k) - \nabla f_i(x_i^{k+1}) + \rho A_i^T \Big( \sum_{j=1}^i A_j (x_j^k - x_j^{k+1}) \Big). \]

This implies that
\[ \|\nabla_{x_i} L_\rho(x^k; y^k)\|_2 \le \|\nabla f_i(x_i^{k+1}) - \nabla f_i(x_i^k)\|_2 + \sum_{j=1}^i \|\rho A_i^T A_j (x_j^k - x_j^{k+1})\|_2 \le \|\nabla f_i(x_i^{k+1}) - \nabla f_i(x_i^k)\|_2 + \sum_{j=1}^i \|\rho A_i^T A_j\|_2 \|x_j^k - x_j^{k+1}\|_2, \tag{7.7} \]
where the last inequality follows from the definition of the matrix norm.

Recall that, by Assumption 3, we have
\[ \|\nabla f_i(x_i^{k+1}) - \nabla f_i(x_i^k)\|_2 \le L_i \|x_i^k - x_i^{k+1}\|_2, \]
where $L_i$ is the Lipschitz constant of $\nabla f_i$. Substituting this in (7.7) gives
\[ \|\nabla_{x_i} L_\rho(x^k; y^k)\|_2 \le L_i \|x_i^k - x_i^{k+1}\|_2 + \sum_{j=1}^i \|\rho A_i^T A_j\|_2 \|x_j^k - x_j^{k+1}\|_2. \]

Since $\|x_j^k - x_j^{k+1}\|_2 \le \|x^k - x^{k+1}\|_2$ for all $j$, we have
\[ \|\nabla_{x_i} L_\rho(x^k; y^k)\|_2 \le \theta \|x^k - x^{k+1}\|_2 \]
for some $\theta > 0$. In particular, if we choose
\[ \theta = \max_i \Big\{ L_i + \sum_{j=1}^i \|\rho A_i^T A_j\|_2 \Big\}, \]
then $\|\nabla_{x_i} L_\rho(x^k; y^k)\|_2 \le \theta\|x^k - x^{k+1}\|_2$ for all $i$, which implies that
\[ \|\nabla_x L_\rho(x^k; y^k)\|_2^2 = \sum_{i=1}^m \|\nabla_{x_i} L_\rho(x^k; y^k)\|_2^2 \le m\theta^2 \|x^k - x^{k+1}\|_2^2. \]

7.4 Proof of Inequality (4.19)

This inequality is proved as Lemma 2.2, under three assumptions, on p. 5 of [53]. Since these assumptions are valid in our case as well, we omit the detailed proof here.

7.5 Proof of Inequality (4.17)

We first introduce two lemmas that bound the changes in $\Delta_d^k$ and $\Delta_p^k$ over one iteration.

Lemma 7.1.
\[ \Delta_d^k - \Delta_d^{k-1} \le -\varrho (Ax^k - b)^T (Ax^{k+1} - b). \tag{7.8} \]

Proof. By definition, $x^{k+1}$ minimizes $L_\rho(x; y^k)$, i.e., $L_\rho(x^{k+1}; y^k) = d(y^k)$. Thus, we have
\begin{align*}
\Delta_d^k - \Delta_d^{k-1} &= \big(d^* - d(y^k)\big) - \big(d^* - d(y^{k-1})\big) = d(y^{k-1}) - d(y^k) \\
&= L_\rho(x^k; y^{k-1}) - L_\rho(x^{k+1}; y^k) \\
&= \big(L_\rho(x^{k+1}; y^{k-1}) - L_\rho(x^{k+1}; y^k)\big) + \big(L_\rho(x^k; y^{k-1}) - L_\rho(x^{k+1}; y^{k-1})\big) \\
&= (y^{k-1} - y^k)^T (Ax^{k+1} - b) + \big(L_\rho(x^k; y^{k-1}) - L_\rho(x^{k+1}; y^{k-1})\big) \\
&= -\varrho(Ax^k - b)^T (Ax^{k+1} - b) + \big(L_\rho(x^k; y^{k-1}) - L_\rho(x^{k+1}; y^{k-1})\big) \\
&\le -\varrho(Ax^k - b)^T (Ax^{k+1} - b),
\end{align*}
where the last inequality follows from (4.15).

Lemma 7.2.
\[ \Delta_p^k - \Delta_p^{k-1} \le \varrho\|Ax^k - b\|_2^2 - \gamma\|x^{k+1} - x^k\|_2^2 - \varrho(Ax^k - b)^T(Ax^{k+1} - b). \tag{7.9} \]

Proof. First, we have
\begin{align*}
\Delta_p^k - \Delta_p^{k-1} &= \big(L_\rho(x^{k+1}; y^k) - d(y^k)\big) - \big(L_\rho(x^k; y^{k-1}) - d(y^{k-1})\big) \\
&= \big(L_\rho(x^{k+1}; y^k) - L_\rho(x^k; y^{k-1})\big) + \big(d(y^{k-1}) - d(y^k)\big) \\
&\le \big(L_\rho(x^{k+1}; y^k) - L_\rho(x^k; y^{k-1})\big) - \varrho(Ax^k - b)^T(Ax^{k+1} - b), \tag{7.10}
\end{align*}
where the last inequality follows from (7.8).

We next bound the term $L_\rho(x^{k+1}; y^k) - L_\rho(x^k; y^{k-1})$. Note that
\[ L_\rho(x^k; y^k) - L_\rho(x^k; y^{k-1}) = (y^k - y^{k-1})^T(Ax^k - b) = \big(\varrho(Ax^k - b)\big)^T(Ax^k - b) = \varrho\|Ax^k - b\|_2^2. \tag{7.11} \]

Note also that, by Assumption 2, the augmented Lagrangian
\[ L_\rho(x_1, \dots, x_m; y) = \sum_{i=1}^m f_i(x_i) + y^T\Big(\sum_{i=1}^m A_i x_i - b\Big) + \frac{\rho}{2}\Big\|\sum_{i=1}^m A_i x_i - b\Big\|_2^2 \]
is strongly convex in each variable $x_i$, as the sum of a strongly convex function and a convex function is strongly convex. Thus, we have
\begin{align*}
&L_\rho(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, x_{i+1}^k, \dots, x_m^k; y^k) - L_\rho(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^{k+1}, x_{i+1}^k, \dots, x_m^k; y^k) \\
&\ge \nabla_{x_i} L_\rho(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^{k+1}, x_{i+1}^k, \dots, x_m^k; y^k)^T (x_i^k - x_i^{k+1}) + \frac{\nu_i}{2}\|x_i^k - x_i^{k+1}\|_2^2 \\
&= \frac{\nu_i}{2}\|x_i^k - x_i^{k+1}\|_2^2,
\end{align*}
for $i = 1, \dots, m$, where the last equality follows from the fact that $x_i^{k+1}$ minimizes $L_\rho(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \dots, x_m^k; y^k)$.

Adding all the inequalities above together, we obtain
\[ L_\rho(x^k; y^k) - L_\rho(x^{k+1}; y^k) \ge \sum_{i=1}^m \frac{\nu_i}{2}\|x_i^k - x_i^{k+1}\|_2^2. \]

If we choose $\gamma = \min_i\{\nu_i/2\}$, then we have
\[ L_\rho(x^k; y^k) - L_\rho(x^{k+1}; y^k) \ge \gamma\|x^k - x^{k+1}\|_2^2, \]
or equivalently,
\[ L_\rho(x^{k+1}; y^k) - L_\rho(x^k; y^k) \le -\gamma\|x^{k+1} - x^k\|_2^2. \tag{7.12} \]

Adding (7.11) and (7.12), we obtain
\[ L_\rho(x^{k+1}; y^k) - L_\rho(x^k; y^{k-1}) \le \varrho\|Ax^k - b\|_2^2 - \gamma\|x^{k+1} - x^k\|_2^2. \]

Substituting this in the first term of (7.10), we get (7.9).

Now we are ready to prove inequality (4.17). Adding (7.8) and (7.9) gives
\[ V^k - V^{k-1} \le \varrho\|Ax^k - b\|_2^2 - \gamma\|x^{k+1} - x^k\|_2^2 - 2\varrho(Ax^k - b)^T(Ax^{k+1} - b). \tag{7.13} \]

Since $Ax^k - Ax^{k+1} = (Ax^k - b) - (Ax^{k+1} - b)$, we have
\[ \|Ax^k - Ax^{k+1}\|_2^2 = \|(Ax^k - b) - (Ax^{k+1} - b)\|_2^2 = \|Ax^k - b\|_2^2 - 2(Ax^k - b)^T(Ax^{k+1} - b) + \|Ax^{k+1} - b\|_2^2. \]

Substituting this in (7.13) yields
\[ V^k - V^{k-1} \le \varrho\|Ax^k - Ax^{k+1}\|_2^2 - \varrho\|Ax^{k+1} - b\|_2^2 - \gamma\|x^{k+1} - x^k\|_2^2. \tag{7.14} \]

Note that
\[ \|Ax^k - Ax^{k+1}\|_2^2 \le \|A\|_2^2\|x^k - x^{k+1}\|_2^2 \tag{7.15} \]
\[ \le \tau^2\|A\|_2^2\|\nabla_x L_\rho(x^k; y^k)\|_2^2 \tag{7.16} \]
\[ \le \tau^2\eta^2\|A\|_2^2\|x^k - x^{k+1}\|_2^2, \tag{7.17} \]
where (7.15) follows from the definition of the matrix norm, (7.16) follows from (4.20), and (7.17) follows from (4.21).

Substituting this in the first term of (7.14), we obtain
\[ V^k - V^{k-1} \le (\varrho\tau^2\eta^2\|A\|_2^2 - \gamma)\|x^{k+1} - x^k\|_2^2 - \varrho\|Ax^{k+1} - b\|_2^2 = -\varrho\|Ax^{k+1} - b\|_2^2 - \vartheta\|x^{k+1} - x^k\|_2^2, \]
where $\vartheta = \gamma - \varrho\tau^2\eta^2\|A\|_2^2$. Note that, when the stepsize $\varrho$ is small enough, we have $\vartheta > 0$, which gives (4.17).

7.6 Per-datacenter Sub-problem (4.28) is a SOCP

Note that (4.28) can be rewritten as the following quadratic program:
\[ \min_{a_j} \; a_j^T F_j a_j + h_j^T a_j \quad \text{s.t.} \quad a_j \succeq 0, \]

where $F_j = (\rho/2)\big(I_{|\mathcal{I}|} + e^T e\big)$ with $e = (1, \dots, 1)$, and $h_j$ is a column vector collecting the linear coefficients of (4.28), with $h_{ji} = E_j P_j + \beta_j^k + \varphi_{ij}^k + \rho(\delta_j^k + \lambda_j^k - C_j - \alpha_{ij}^{k+1})$. Clearly, $F_j$ captures the quadratic terms and $h_j$ captures the linear terms in the objective function.

Since $F_j$ is symmetric and positive definite, it can be decomposed as $F_j = G_j^T G_j$ (the Cholesky decomposition), where $G_j$ is an upper triangular matrix with positive diagonal entries. In particular, $G_j$ is invertible. Let $g_j = (1/2)(G_j^{-1})^T h_j$. Then the objective function can be expressed as
\[ \|G_j a_j + g_j\|_2^2 - g_j^T g_j. \]

Thus, the quadratic program is equivalent to the following second-order cone program (SOCP):
\[ \min_{a_j} \; \|G_j a_j + g_j\|_2 \quad \text{s.t.} \quad a_j \succeq 0. \]

Since SOCP can be efficiently solved (using, for example, interior point methods), our per-datacenter sub-problem admits fast algorithms.
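This equivalence is easy to sanity-check numerically; a minimal sketch follows (the entries of $h_j$ are filled with random values here, since their exact composition depends on the ADMM iterates):

    import numpy as np

    rng = np.random.default_rng(0)
    n, rho = 5, 2.0                                  # n plays the role of |I|
    F = (rho / 2) * (np.eye(n) + np.ones((n, n)))    # F_j = (rho/2)(I + e^T e)
    h = rng.normal(size=n)                           # stand-in for h_j

    G = np.linalg.cholesky(F).T                      # F = G^T G, G upper triangular
    g = 0.5 * np.linalg.inv(G).T @ h                 # g_j = (1/2)(G_j^{-1})^T h_j

    a = np.abs(rng.normal(size=n))                   # any feasible point a_j >= 0
    qp = a @ F @ a + h @ a
    socp = np.linalg.norm(G @ a + g) ** 2 - g @ g
    assert np.isclose(qp, socp)  # the two objectives differ only by the constant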

7.7 Proof of Theorem 5.7

To prove Theorem 5.7 we need the following lemmas.

Lemma 7.3. If a particular job j participates in stage t + 1, it must have participated

in stage t.

Proof. This is a direct result of our algorithm design: a job stops participating at stage t + 1 only when it does not form any blocking pair with any of the machines, no matter how the rest of the jobs that participated in stage t are matched.

Lemma 7.4. If a set of jobs does not participate in a particular stage t of Multi-stage DA, then these jobs are assigned their best possible machines in all weakly stable matchings after t.

Proof. Let us call a machine "possible" for a given job if there is a weakly stable matching that assigns the job there. We prove this lemma by induction. It is clear that identifying all the jobs that can participate is equivalent to identifying those that definitely cannot. We can easily do so by first finding the jobs that cannot participate unless some jobs that did not participate in the previous stage do, marking them, and then finding the rest that cannot participate unless some jobs marked in this or previous stages do, until there is no such job.

Assume that, up to a certain point of Multi-stage DA, the lemma holds, i.e., no marked job has been permanently rejected by a possible machine (since it cannot propose any more). Now suppose we mark job j, which is (permanently) rejected by a possible machine m in a hypothetical matching µ′. Since j is marked, m must have accepted a set of jobs J_m that are preferable to j and were marked before. Then, in µ′, at least one of the jobs from J_m must be sent to a less desirable machine, since all machines preferable to m are impossible for it by assumption. This clearly forms a type-2 blocking pair in µ′, and contradicts the assumption. Thus the proof.

This shows that our algorithm permanently rejects a job from a machine only when that machine cannot accept the job in any weakly stable matching, once the job can no longer participate. The resulting assignment is therefore optimal.

Lemma 7.5. If $\mu^t = \mu^{t-1}$ in Multi-stage DA, then the jobs participating at stage t are assigned their best possible machines in all weakly stable matchings.

Proof. Assume that, at a given point of the algorithm's execution at stage t, every job matched to its previous machine $\mu^{t-1}(j)$ is given its best possible machine. Suppose now that job j is rejected by all machines better than $\mu^{t-1}(j)$, while there is a hypothetical weakly stable matching µ′ that sends j to a better machine m. Then j must have proposed to and been rejected by m. Machine m rejected j because it accepted jobs $j_1, j_2, \dots$, each of which is preferable to j. If $j_i$ did not participate in stage t, then by Lemma 7.4, m is its best possible machine. If $j_i$ participated in t, then by assumption m is impossible for it. Thus in µ′, for m to take j, at least one of the $j_i$ has to be sent to a less desirable machine, which creates a type-2 blocking pair $(j_i, m)$ in µ′ and contradicts the assumption.

Now we can prove Theorem 5.7.

Proof of Theorem 5.7. There are two possible cases when Multi-stage DA terminates. If the algorithm terminates when there is no type-1 blocking pair, i.e., no job can participate (or wishes to, if it can) by proposing to a better machine, then by Lemma 7.4 all these jobs are assigned their best possible machines, and the resulting matching is job-optimal. If the algorithm terminates at stage t where $\mu^t = \mu^{t-1}$, then by Lemma 7.4 all jobs that did not participate in t are assigned their best machines, and by Lemma 7.5 all jobs that participated in t are also sent to their best machines. Hence the matching is again the job-optimal weakly stable matching.

7.8 Proof of Theorem 5.8

Proof. This can easily be proved by contradiction. Assume that the matching produced when the algorithm terminates with no job proposing is not strongly stable. Then we must be able to find a type-1 blocking pair, say (j, m), as implied by Theorem 5. m will participate in the next stage, j will be willing to propose to m, and our algorithm will continue to run rather than terminate. This contradicts our assumption.

We conjecture that when the algorithm terminates with type-1 blocking pairs, the

problem does not admit a strongly stable solution. The proof is, however, not immediate

and left for future work.


Bibliography

[1] http://www.routeviews.org.

[2] https://www.google.com/about/datacenters/inside/locations/.

[3] S. Agarwal, J. Dunagan, N. Jain, S. Saroiu, A. Wolman, and H. Bhogan, “Volley:

Automated data placement for geo-distributed cloud services,” in Proc. USENIX

NSDI, 2010.

[4] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center

network architecture,” in Proc. ACM SIGCOMM, 2008.

[5] Amazon EC2, http://aws.amazon.com/ec2/.

[6] L. L. Andrew, M. Lin, and A. Wierman, “Optimality, fairness, and robustness in

speed scaling designs,” in Proc. ACM Sigmetrics, 2010.

[7] Apache Hadoop, http://hadoop.apache.org.

[8] Apache Mahout, http://mahout.apache.org.

[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer,

I. Pratt, and A. Warfield, “Xen and the art of virtualization,” in Proc. ACM SOSP,

2003.


BIBLIOGRAPHY 137

[10] C. Bash and G. Forman, "Cool job allocation: Measuring the power savings of placing jobs at cooling-efficient locations in the data center," in Proc. USENIX ATC, 2007.

[11] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.

[12] S. Boyd and A. Mutapcic, "Subgradient methods," Lecture notes of EE364b, Stanford University, Winter Quarter 2006-2007. http://www.stanford.edu/class/ee364b/notes/subgrad_method_notes.pdf.

[13] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.

[14] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[15] T. D. Braun, H. J. Siegel, N. Beck, L. L. Boloni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, "A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems," Elsevier Journal of Parallel and Distributed Computing, vol. 61, no. 6, pp. 810–837, June 2001.

[16] D. Breitgand and A. Epstein, "Improving consolidation of virtual machines with risk-aware bandwidth oversubscription in compute clouds," in Proc. IEEE INFOCOM, 2012.

[17] R. Campbell, I. Gupta, M. Heath, S. Y. Ko, M. Kozuch, M. Kunze, T. Kwan, K. Lai, H. Y. Lee, M. Lyons, D. Milojicic, D. O'Hallaron, and Y. C. Soh, "Open Cirrus™ cloud computing testbed: Federated data centers for open source systems and services research," in Proc. USENIX HotCloud, 2009.

[18] Z. Cao, Z. Wang, and E. Zegura, "Performance of hashing-based schemes for Internet load balancing," in Proc. IEEE INFOCOM, 2000.

[19] M. Cardosa, M. R. Korupolu, and A. Singh, "Shares and utilities based power consolidation in virtualized server environments," in Proc. IFIP/IEEE Intl. Symp. Integrated Netw. Manag. (IM), 2009.

[20] F. Chen, K. Guo, J. Lin, and T. La Porta, "Intra-cloud lightning: Building CDNs in the cloud," in Proc. IEEE INFOCOM, 2012.

[21] M. Chen, H. Zhang, Y.-Y. Su, X. Wang, G. Jiang, and K. Yoshihira, "Effective VM sizing in virtualized data centers," in Proc. IFIP/IEEE Intl. Symp. Integrated Netw. Manag. (IM), 2011.

[22] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam, "Managing server energy and operational costs in hosting centers," in Proc. ACM Sigmetrics, 2005.

[23] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle, "Layering as optimization decomposition: A mathematical theory of network architectures," Proc. IEEE, vol. 95, no. 1, pp. 255–312, January 2007.

[24] CloudStack, http://www.cloudstack.org.

[25] ComputerWeekly.com, "Fresh air cooling makes HP datacentre super-efficient," http://tinyurl.com/bpqv6tl.

[26] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford, "Spanner: Google's globally-distributed database," in Proc. USENIX OSDI, 2012.

[27] Datacenter Dynamics, "DCD industry census: Growth into 2013," http://tinyurl.com/a2yktdg, January 2013.

[28] Datacenterknowledge.com, "Estimate: Facebook running 180,000 servers," http://tinyurl.com/b4g452h, August 2012.

[29] Datacenterknowledge.com, "Too hot for humans, but Google servers keep humming," http://tinyurl.com/89ros64, March 2012.

[30] J. Dean, "Underneath the covers at Google: Current systems and future directions," in Google I/O, 2008.

[31] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. USENIX OSDI, 2004.

[32] I. Drago, M. Mellia, M. Munafo, A. Sperotto, R. Sadre, and A. Pras, "Inside Dropbox: Understanding personal cloud storage services," in Proc. ACM IMC, 2012.

[33] Emerson Network Power, "Liebert® DSE™ precision cooling system sales brochure," http://tinyurl.com/c7e8qxz, 2012.

[34] Eucalyptus, http://www.eucalyptus.com.

[35] Faban, http://www.opensparc.net/sunsource/faban/www.

[36] X. Fan, W.-D. Weber, and L. A. Barroso, "Power provisioning for a warehouse-sized computer," in Proc. ACM/IEEE Intl. Symp. Computer Architecture (ISCA), 2007.

[37] Federal Energy Regulatory Commission, "U.S. electric power markets," http://www.ferc.gov/market-oversight/mkt-electric/overview.asp, 2011.

[38] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely, and S. Diot, "Packet-level traffic measurements from the Sprint IP backbone," IEEE Netw., vol. 17, no. 6, pp. 6–16, November 2003.

[39] D. Gale and L. S. Shapley, "College admissions and the stability of marriage," Amer. Math. Mon., vol. 69, no. 1, pp. 9–15, 1962.

[40] P. X. Gao, A. R. Curtis, B. Wong, and S. Keshav, "It's not easy being green," in Proc. ACM SIGCOMM, 2012.

[41] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in Proc. USENIX NSDI, 2011.

[42] Gigaom.com, "Tier 3 spiffs up cloud management," http://gigaom.com/cloud/tier-3-spiffs-up-cloud-management/, August 2012.

[43] D. K. Goldenberg, L. Qiu, H. Xie, Y. R. Yang, and Y. Zhang, "Optimizing cost and performance for multihoming," in Proc. ACM SIGCOMM, 2004.

[44] Z. Gong, X. Gu, and J. Wilkes, "PRESS: PRedictive Elastic ReSource Scaling for cloud systems," in Proc. IEEE Intl. Conf. Netw. Serv. Manag. (CNSM), 2010.

[45] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel, "The cost of a cloud: Research problems in data center networks," SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, pp. 68–73, 2009.

[46] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, P. Patel, and S. Sengupta, "VL2: A scalable and flexible data center network," in Proc. ACM SIGCOMM, 2009.

[47] J. Hamilton, "Open compute mechanical system design," http://tinyurl.com/8ulxfzp, April 2011.

[48] D. Han and X. Yuan, "A note on the alternating direction method of multipliers," J. Optim. Theory Appl., vol. 155, pp. 227–238, 2012.

[49] B. S. He, M. Tao, and X. M. Yuan, "Alternating direction method with Gaussian back substitution for separable convex programming," SIAM J. Optim., vol. 22, pp. 313–340, 2012.

[50] P. He, Y. Li, Z. Nie, and N. E. Shawwa, "Review of linear programming software," http://www.cas.mcmaster.ca/~cs777/presentations/Solvers1.pdf, January 2007.

[51] M. R. Hestenes, "Multiplier and gradient methods," J. Optim. Theory Appl., vol. 4, no. 5, pp. 303–320, 1969.

[52] U. Holzle, "Cloud computing can use energy efficiently," New York Times, http://tinyurl.com/9qdka7g, September 2012.

[53] M. Hong and Z.-Q. Luo, "On the linear convergence of the alternating direction method of multipliers," http://arxiv.org/abs/1208.3922, August 2012.

[54] O. H. Ibarra and C. E. Kim, "Heuristic algorithms for scheduling independent tasks on nonidentical processors," J. ACM, vol. 24, no. 2, pp. 280–289, April 1977.

[55] Intel Inc., "Reducing data center cost with an air economizer," http://tinyurl.com/bmyuotx, August 2008.

[56] J. W. Jiang, T. Lan, S. Ha, M. Chen, and M. Chiang, "Joint VM placement and routing for data center traffic engineering," in Proc. IEEE INFOCOM, 2012.

[57] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang, "Multi-resource allocation: Fairness-efficiency tradeoffs in a unifying framework," in Proc. IEEE INFOCOM, 2012.

[58] A. Kalbasi, D. Krishnamurthy, J. Rolia, and M. Richter, "MODE: Mix driven online resource demand estimation," in Proc. IEEE Intl. Conf. Netw. Serv. Manag. (CNSM), 2011.

[59] R. Kohavi, R. M. Henne, and D. Sommerfield, “Practical guide to controlled experiments on the web: Listen to your customers not to the HiPPO,” in Proc. ACM SIGKDD, 2007.

[60] M. Korupolu, A. Singh, and B. Bamba, “Coupled placement in modern data centers,” in Proc. IEEE Intl. Symp. Parallel & Distr. Processing (IPDPS), 2009.

[61] R. Krishnan, H. V. Madhyastha, S. Srinivasan, S. Jain, A. Krishnamurthy, T. Anderson, and J. Gao, “Moving beyond end-to-end path information to optimize CDN performance,” in Proc. ACM IMC, 2009.

[62] W. Leinberger, G. Karypis, and V. Kumar, “Job scheduling in the presence of multiple resource requirements,” in Proc. ACM/IEEE Conference on Supercomputing, 1999.

[63] M. Lin, A. Wierman, L. L. H. Andrew, and E. Thereska, “Dynamic right-sizing for power-proportional data centers,” in Proc. IEEE INFOCOM, 2011.

[64] Z. Liu, Y. Chen, C. Bash, A. Wierman, D. Gmach, Z. Wang, M. Marwah, and C. Hyser, “Renewable and cooling aware workload management for sustainable data centers,” in Proc. ACM Sigmetrics, 2012.

[65] Z. Liu, M. Lin, A. Wierman, S. H. Low, and L. L. Andrew, “Greening geographical load balancing,” in Proc. ACM Sigmetrics, 2011.

[66] J. R. Lorch and A. J. Smith, “Improving dynamic voltage scaling algorithms with PACE,” in Proc. ACM Sigmetrics, 2001.

[67] H. V. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, and A. Venkataramani, “iPlane: An information plane for distributed services,” in Proc. USENIX OSDI, 2006.

[68] V. Mathew, R. K. Sitaraman, and P. Shenoy, “Energy-aware load balancing in content delivery networks,” in Proc. IEEE INFOCOM, 2012.

[69] X. Meng, V. Pappas, and L. Zhang, “Improving the scalability of data center networks with traffic-aware virtual machine placement,” in Proc. IEEE INFOCOM, 2010.

[70] X. Meng, C. Isci, J. Kephart, L. Zhang, E. Bouillet, and D. Pendarakis, “Efficient resource provisioning in compute clouds via VM multiplexing,” in Proc. Intl. Conf. Autonomic Computing (ICAC), 2010.

[71] J. Mo and J. Walrand, “Fair end-to-end window-based congestion control,” IEEE/ACM Trans. Netw., vol. 8, no. 5, pp. 556–567, October 2000.

[72] A.-H. Mohsenian-Rad and A. Leon-Garcia, “Coordination of cloud computing and smart power grids,” in Proc. IEEE SmartGridComm, 2010.

[73] J. Moore, J. Chase, P. Ranganathan, and R. Sharma, “Making scheduling ‘cool’: Temperature-aware workload placement in data centers,” in Proc. USENIX ATC, 2005.

[74] S. Muthukrishnan, R. Rajaraman, A. Shaheen, and J. Gehrke, “Online scheduling to minimize average stretch,” in Proc. IEEE FOCS, 1999.

[75] V. K. Naik, M. S. Squillante, and S. K. Setia, “Performance analysis of job scheduling policies in parallel supercomputing environments,” in Proc. ACM/IEEE Conference on Supercomputing, 1993.

[76] S. Narayana, J. W. Jiang, J. Rexford, and M. Chiang, “To coordinate or not to coordinate? Wide-area traffic management for data centers,” http://www.cs.princeton.edu/~narayana/conext12.pdf, Princeton University, Tech. Rep., 2012.

[77] R. Nathuji and K. Schwan, “VirtualPower: Coordinated power management in virtualized enterprise systems,” in Proc. ACM SOSP, 2007.

[78] National Climate Data Center (NCDC), http://www.ncdc.noaa.gov.

[79] Nimbus, http://www.nimbusproject.org.

[80] D. Niu, H. Xu, B. Li, and S. Zhao, “Risk management for video-on-demand servers leveraging demand forecast,” in Proc. ACM Multimedia, 2011.

[81] ——, “Quality-assured cloud bandwidth auto-scaling for video-on-demand applications,” in Proc. IEEE INFOCOM, 2012.

[82] E. Nygren, R. K. Sitaraman, and J. Sun, “The Akamai network: A platform for high-performance Internet applications,” SIGOPS Oper. Syst. Rev., vol. 44, no. 3, pp. 2–19, August 2010.

[83] Olio, http://incubator.apache.org/olio.

[84] Oracle VirtualBox, http://www.virtualbox.org.

[85] K. Papagiannaki, N. Taft, Z.-L. Zhang, and C. Diot, “Long-term forecasting of Internet backbone traffic: Observations and initial models,” in Proc. IEEE INFOCOM, 2003.

[86] C. D. Patel, R. Sharma, and C. E. Bash, “Smart cooling of data centers,” in Proc. ASME International Electronic Packaging Technical Conference and Exhibition (InterPACK), 2003.

[87] S. Pelley, D. Meisner, T. F. Wenisch, and J. W. VanGilder, “Understanding and abstracting total data center power,” in Proc. Workshop on Energy Efficient Design (WEED), 2009.

[88] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, “Cutting the electricity bill for Internet-scale systems,” in Proc. ACM SIGCOMM, 2009.

[89] K. Rajamani and C. Lefurgy, “On evaluating request-distribution schemes for saving energy in server clusters,” in Proc. IEEE Intl. Symp. Performance Analysis of Systems and Software, 2003.

[90] L. Rao, X. Liu, L. Xie, and W. Liu, “Minimizing electricity cost: Optimization of distributed Internet data centers in a multi-electricity-market environment,” in Proc. IEEE INFOCOM, 2010.

[91] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, “Heterogeneity and dynamicity of clouds at scale: Google trace analysis,” in Proc. ACM SoCC, 2012.

[92] Y. Rekhter, T. Li, and S. Hares, “A Border Gateway Protocol 4 (BGP-4),” RFC 4271, http://datatracker.ietf.org/doc/rfc4271/, January 2006.

[93] S. Ren, Y. He, and F. Xu, “Provably-efficient job scheduling for energy and fairness in geographically distributed data centers,” in Proc. IEEE ICDCS, 2012.

[94] A. E. Roth, “Deferred acceptance algorithms: History, theory, practice, and open questions,” Int. J. Game Theory, vol. 36, pp. 537–569, 2008.

[95] S. K. Sahni, “Algorithms for scheduling independent tasks,” J. ACM, vol. 23, no. 1, pp. 116–127, January 1976.

[96] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes, “CloudScale: Elastic resource scaling for multi-tenant cloud systems,” in Proc. ACM SoCC, 2011.

[97] B. Sotomayor, R. Montero, I. Llorente, and I. Foster, “Virtual infrastructure management in private and hybrid clouds,” IEEE Internet Comput., vol. 13, no. 5, pp. 14–22, September 2009.

[98] R. Stanojevic, I. Castro, and S. Gorinsky, “CIPT: Using Tuangou to reduce IP transit costs,” in Proc. ACM CoNEXT, 2011.

[99] R. F. Sullivan, “Alternating cold and hot aisles provides more reliable cooling for server farms,” Uptime Institute, 2000.

[100] The RICC Log, http://www.cs.huji.ac.il/labs/parallel/workload/l_ricc/index.html.

[101] G. Urdaneta, G. Pierre, and M. van Steen, “Wikipedia workload analysis for decentralized hosting,” Elsevier Computer Networks, vol. 53, no. 11, pp. 1830–1845, July 2009.

[102] B. Urgaonkar, P. Shenoy, and T. Roscoe, “Resource overbooking and application profiling in shared hosting platforms,” SIGOPS Oper. Syst. Rev., vol. 36, no. SI, pp. 239–254, December 2002.

[103] V. Valancius, C. Lumezanu, N. Feamster, R. Johari, and V. V. Vazirani, “How many tiers? Pricing in the Internet transit market,” in Proc. ACM SIGCOMM, 2011.

[104] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller, “Safe and effective fine-grained TCP retransmissions for datacenter communication,” in Proc. ACM SIGCOMM, 2009.

[105] A. Verma, G. Dasgupta, T. K. Nayak, P. De, and R. Kothari, “Server workload analysis for power minimization using consolidation,” in Proc. USENIX ATC, 2009.

[106] A. Verma, P. Ahuja, and A. Neogi, “pMapper: Power and migration cost aware application placement in virtualized systems,” in Proc. ACM Middleware, 2008.

[107] VMware vSphere™, http://www.vmware.com/products/drs/overview.html.

[108] G. Wang and T. Ng, “The impact of virtualization on network performance of Amazon EC2 data center,” in Proc. IEEE INFOCOM, 2010.

[109] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for reduced CPU energy,” in Proc. USENIX OSDI, 1994.

[110] P. Wendell, J. W. Jiang, M. J. Freedman, and J. Rexford, “DONAR: Decentralized server selection for cloud services,” in Proc. ACM SIGCOMM, 2010.

[111] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif, “Black-box and gray-box strategies for virtual machine migration,” in Proc. USENIX NSDI, 2007.

[112] D. Xu and X. Liu, “Geographic trough filling for Internet datacenters,” in Proc. IEEE INFOCOM, 2012.

[113] D. Xu, X. Liu, and Z. Niu, “Joint resource provisioning for distributed Internet datacenters with diverse and dynamic traffic,” 2012, under submission.

[114] H. Xu and B. Li, “A general and practical datacenter selection framework for cloud services,” in Proc. IEEE CLOUD, 2012.

[115] H. Xu, C. Feng, and B. Li, “Temperature aware workload management in geo-distributed datacenters,” 2013, under submission.

[116] H. Xu and B. Li, “Joint request mapping and response routing for geo-distributed cloud services,” in Proc. IEEE INFOCOM, 2013.

[117] ——, “Anchor: A versatile and efficient framework for resource management in the cloud,” IEEE Trans. Parallel Distrib. Syst., to appear, 2013. © 2013 IEEE, reprinted with permission.

[118] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling,” in Proc. ACM EuroSys, 2010.

[119] Zenoss Inc., “Virtualization and cloud computing survey,” 2010.

[120] Z. Zhang, M. Zhang, A. Greenberg, Y. C. Hu, R. Mahajan, and B. Christian, “Optimizing cost and performance in online service provider networks,” in Proc. USENIX NSDI, 2010.

[121] R. Zhou, Z. Wang, A. McReynolds, C. Bash, T. Christian, and R. Shih, “Optimization and control of cooling microgrids for data centers,” in Proc. IEEE ITherm, 2012.

[122] Y. Zhu, B. Helsley, J. Rexford, A. Siganporia, and S. Srinivasan, “LatLong: Diagnosing wide-area latency changes for CDNs,” IEEE Trans. Netw. Service Manag., vol. 9, no. 3, pp. 333–345, September 2012.