
Fair Scheduling in Cloud Datacenters with Multiple Resource Types

by

Wei Wang

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2015 by Wei Wang


Abstract

Fair Scheduling in Cloud Datacenters with Multiple Resource Types

Wei Wang

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2015

This dissertation focuses on algorithm design and prototype implementation of fair sharing policies in cloud datacenters with multiple resource types. Specifically, it seeks to address two fundamental resource management problems.

First, how should the computing resources of a large-scale cluster – such as CPU cores, memory, and storage – be fairly shared among running applications? This problem has become even more complicated in cloud datacenters, mainly due to their unprecedented heterogeneity and complexity. Cloud computing clusters are likely to be constructed from a large number of commodity servers spanning multiple generations, with different computing capabilities, bandwidth, and storage capacities. Meanwhile, depending on the underlying applications, computing tasks may require vastly different amounts of resources: some are CPU-intensive; some are memory- or bandwidth-bound. Despite these complexities, existing resource sharing policies either result in significant resource fragmentation or require a homogeneous cluster where servers are of the same specification.

This dissertation proposes Dominant Resource Fairness in Heterogeneous systems (DRFH), a general multi-resource sharing policy that preserves many of the axiomatic, highly desirable fairness properties provided by weighted fair sharing in the single-resource setting. DRFH eliminates an application's incentive to cheat the cluster scheduler for a larger allocation share by misreporting its resource requirements, a strategic behaviour commonly observed in production clusters. Prototype implementation and trace-driven simulations show that DRFH can be easily enforced in cluster systems, with higher resource utilization and shorter job completion times than existing resource sharing policies.

Second, how should active flows fairly share the network resources of middleboxes, software routers, and other appliances that are widely deployed in datacenters? In middleboxes and software routers, flows usually undergo deep packet inspection, which requires the support of multiple resource types and may bottleneck on CPU, memory bandwidth, or link bandwidth. While there is a rich literature on fair queueing for a single resource type (i.e., link bandwidth), it remains unclear how to schedule multiple resources in middleboxes to achieve fair sharing among flows. A similar problem also arises in virtual machine (VM) scheduling inside a hypervisor, where different VMs may consume different amounts of resources, and it is desirable to fairly multiplex their access to physical resources.

To answer these challenges, this dissertation proposes Multi-Resource Round Robin (MR3), which serves flows in rounds and achieves near-perfect fairness in O(1) time. MR3 serves as a foundation for a more general fair scheduler, called Group Multi-Resource Round Robin (GMR3). GMR3 also runs in O(1) time, yet provides weight-proportional packet latency when flows are assigned uneven weights.

This dissertation also identifies a new challenge that is unique to multi-resource scheduling: the general tradeoff between fairness and efficiency. Such a tradeoff has never been a problem for traditional fair sharing of link bandwidth: as long as the queueing algorithm is work conserving, the bandwidth is always fully utilized, and fairness is the only concern. In the presence of multiple resource types, however, fairness and efficiency are conflicting objectives that cannot be achieved at the same time. Motivated by this problem, a new queueing algorithm is proposed and prototyped that allows a network operator to flexibly specify a tradeoff preference and implements the specified tradeoff by determining the right packet scheduling order.


To my family


Acknowledgments

First and foremost, I am deeply indebted to my thesis advisors, Professor Baochun Li and Professor Ben Liang, who have inspired my mind and guided me relentlessly throughout my PhD. They are both brilliant researchers who always shared sharp visions with me and provided invaluable feedback on virtually every piece of work I have done. Without their guidance and training, I cannot imagine that I would have come this far. I feel extremely privileged to have worked with both of them and to have benefited from both their perspectives.

There are also a number of people to whom I am especially thankful. The first one is easy: Dr. Chen Feng. A good old friend and a close collaborator, Chen was always available and ready to help in various ways, both technical and non-technical. Chapter 4 was joint work with him and my advisors, in which he helped simplify the analyses and provided the key ideas for the design of an O(log n) tracking algorithm (Sec. 4.5). Also, I would like to express my sincere gratitude to Professor Jörg Liebeherr and Professor Ding Yuan, who generously served on my thesis proposal committee and offered fantastic advice on the preliminary work of this dissertation. I am grateful to Jun Li as well, for his great help with the experiments of Chapter 2. Finally, I am greatly thankful to Professor Frank Kschischang for his comments on writing.

Beyond direct contributors to my thesis project, many other people have generously offered help in my graduate work. In particular, the close collaboration with Dr. Di Niu led to joint work in cloud economics [1]. Dr. Zimu Liu provided numerous great suggestions on prototype implementations. Dr. Henry Xu and Zhifei Zhu were always available for inspiring and fruitful discussions.

I was also very fortunate to work with talented members of two fantastic research groups at the University of Toronto: iQua and WCL. The list is in the order of seniority: Dr. Di Niu, Dr. Chen Feng, Dr. Mahdi Hajiaghayi, Dr. Seyed Hossein Seyedmehdi, Dr. Henry Xu, Dr. Zimu Liu, Dr. Yuan Feng, Dr. Boyang Wang, Dr. Zhi Wang, Yuefei Zhu, Yiwei Pu, Qiang Xiao, Juyang Xia, Honghao Ju, Wei Bao, Sun Sun, Jun Li, Jaya Prakash Champati, Ali Ramezani-Kebrya, Meng-Hsi Chen, Yong Wang, Li Chen, Liyao Xiang, Shuhao Liu, Yujie Xu, Sowndarya Sundar, Yinan Liu, and Mohammadmoein Soltanizadeh.


I also would like to thank friends in Bahen for their company and encouragement: Huiyuan Xiong, Siyu Liu, Yuhan Zhou, Alice Gao, Dr. Yicheng Lin, Dr. Qi Zhang, Chu Pang, Weiwei Li, and Binbin Dai.

Finally, my deepest gratitude goes to my family. My work would never have been possible without the unwavering support and unflagging love of my parents and my wife. This dissertation is dedicated to them.


Contents

Acknowledgments v

1 Introduction 1
  1.1 Resource Sharing Policies in Cloud Datacenters 1
  1.2 Multi-Resource Scheduling 3
  1.3 Summary of Contribution 4
    1.3.1 Multi-Resource Fair Sharing for Cloud Systems with Heterogeneous Servers 5
    1.3.2 Multi-Resource Fair Queueing for Network Flows 5
    1.3.3 Fairness-Efficiency Tradeoff for Multi-Resource Scheduling 6
  1.4 Thesis Organization 7

2 Multi-Resource Fair Scheduler for Datacenter Jobs 8
  2.1 Motivation 8
  2.2 DRF Scheduler and Its Limitations 9
  2.3 System Model and Allocation Properties 11
    2.3.1 Basic Settings 11
    2.3.2 Resource Allocation 12
    2.3.3 Allocation Mechanism and Desirable Properties 13
    2.3.4 Naive DRF Extension and Its Inefficiency 15
  2.4 DRFH Allocation and Its Properties 15
    2.4.1 DRFH Allocation 16
    2.4.2 Envy-Freeness 19
    2.4.3 Pareto Optimality 20
    2.4.4 Group Strategyproofness 21
    2.4.5 Sharing Incentive 23
    2.4.6 Other Important Properties 27
  2.5 Practical Considerations 28
    2.5.1 Weighted Users with a Finite Number of Tasks 28
    2.5.2 Scheduling Tasks as Entities 29
  2.6 Evaluation 30
    2.6.1 Prototype Implementation 30
    2.6.2 Trace-Driven Simulation 33
  2.7 Related Work 36
  2.8 Summary 38

3 Multi-Resource Fair Queueing for Network Flows 40
  3.1 Motivation 40
  3.2 Related Work 42
  3.3 Preliminaries and Design Objectives 43
    3.3.1 Packet Processing Time 43
    3.3.2 Dominant Resource Fairness (DRF) over Time 44
    3.3.3 Scheduling Delay 45
    3.3.4 Scheduling Complexity 46
  3.4 Multi-Resource Round Robin 46
    3.4.1 Challenges of Round-Robin Extension 46
    3.4.2 Basic Intuition 48
    3.4.3 Algorithm Design 49
    3.4.4 Performance Analysis 54
    3.4.5 Simulation Results 58
    3.4.6 MR3 for the Weighted Flows 62
  3.5 Group Multi-Resource Round Robin 65
    3.5.1 Basic Intuition 65
    3.5.2 Flow Grouping 66
    3.5.3 Inter-Group Scheduling 67
    3.5.4 Intra-Group Scheduling 70
    3.5.5 Handling New Packet Arrivals 71
    3.5.6 Implementation and Complexity 72
    3.5.7 Performance Analysis 74
    3.5.8 Evaluation 80
  3.6 Discussion and Future Work 82
  3.7 Summary 83
  3.8 Proofs 83
    3.8.1 Fairness Analysis for MR3 83
    3.8.2 Analysis of Startup Latency of MR3 84
    3.8.3 Analysis of Scheduling Delay of MR3 87
    3.8.4 Fairness Analysis of Weighted MR3 89
    3.8.5 Delay Analysis of Weighted MR3 90

4 Fairness-Efficiency Tradeoff for Multi-Resource Scheduling 93
  4.1 Motivation 93
  4.2 Fairness and Efficiency 95
    4.2.1 Dominant Resource Fairness 95
    4.2.2 The Efficiency Measure 96
    4.2.3 Tradeoff between Fairness and Efficiency 97
    4.2.4 Challenges 99
  4.3 Fairness, Efficiency, and Their Tradeoff in the Fluid Model 99
    4.3.1 Fluid Relaxation 100
    4.3.2 Fluid Schedule with Perfect Fairness 101
    4.3.3 Fluid Schedule with Optimal Efficiency 102
    4.3.4 Tradeoff between Fairness and Efficiency 104
  4.4 Packet-by-Packet Tracking 107
    4.4.1 Start-Time Tracking vs. Finish-Time Tracking 107
    4.4.2 Performance Analysis 108
  4.5 An O(log n) Implementation 109
    4.5.1 Packet Profiling 109
    4.5.2 Direct Implementation of Fluid Scheduling 109
    4.5.3 Virtual Time Implementation of Fluid Scheduling 110
    4.5.4 Start-Time Tracking and Complexity 113
  4.6 Evaluation 113
    4.6.1 Experimental Results 113
    4.6.2 Trace-Driven Simulation 117
  4.7 Related Work 120
  4.8 Summary and Future Work 121
  4.9 Proofs 121
    4.9.1 Proof of Lemma 11 121
    4.9.2 Proof of Lemma 12 124
    4.9.3 Proof of Theorem 12 124
    4.9.4 Proof of Theorem 13 125
    4.9.5 Preliminaries for the Proofs of Theorems 14 and 15 126
    4.9.6 Proof of Theorem 14 128
    4.9.7 Proof of Theorem 15 129

5 Concluding Remarks 133
  5.1 Conclusions 133
  5.2 Future Directions 134

Bibliography 136

List of Figures

2.1 An example of a system consisting of two heterogeneous servers, in which user 1 can schedule at most two tasks each demanding 1 CPU and 4 GB memory. The resources required to execute the two tasks are also highlighted in the figure. 10
2.2 An example of a system containing two heterogeneous servers shared by two users. Each computing task of user 1 requires 0.2 CPU time and 1 GB memory, while each computing task of user 2 requires 1 CPU time and 0.2 GB memory. 12
2.3 DRF allocation for the example shown in Fig. 2.2, where user 1 is allocated 5 tasks in server 1 and 1 in server 2, while user 2 is allocated 1 task in server 1 and 5 in server 2. 15
2.4 An alternative allocation with higher system utilization for the example of Fig. 2.2. Servers 1 and 2 are exclusively assigned to users 1 and 2, respectively. Both users schedule 10 tasks. 19
2.5 CPU, memory, and global dominant share of four applications running in a 50-node Amazon EC2 cluster. 32
2.6 Time series of CPU and memory utilization. 34
2.7 DRFH improvements on job completion times over the Slots scheduler. 35
2.8 Task completion ratio of users using the Best-Fit DRFH and Slots schedulers, respectively. Each bubble's size is logarithmic to the number of tasks the user submitted. 36
2.9 Comparison of task completion ratios under DRFH and those obtained in dedicated clouds (DCs). Each circle's radius is logarithmic to the number of tasks submitted. 36
3.1 Illustration of a scheduling discipline that achieves DRF. 44
3.2 Illustration of a direct DRR extension. Each packet of flow 1 has processing times 〈7, 6.9〉, while each packet of flow 2 has processing times 〈1, 7〉. 47
3.3 Naive fix of the DRR extension shown in Fig. 3.2a by withholding the scheduling opportunity of every packet until its previous packet is completely processed on all resources. 48
3.4 Illustration of a schedule by MR3. 49
3.5 Illustration of the round-robin service and the sequence number. 51
3.6 Dominant services and packet throughput received by different flows under FCFS and MR3. Flows 1, 11 and 21 are ill-behaving. 59
3.7 Latency comparison between DRFQ and MR3. 60
3.8 MR3 can quickly adapt to traffic dynamics and achieve DRF across all 3 flows. 62
3.9 Fairness and delay sensitivity of MR3 in response to mixed packet sizes and arrival distributions. 63
3.10 MR3 fails to offer weight-proportional delay when flows are assigned uneven weights. P^i_k denotes the kth packet of flow i. 64
3.11 An improved schedule over MR3 in the example of Fig. 3.10, where the scheduling delay is significantly reduced. P^i_k denotes the kth packet of flow i. 66
3.12 An illustration of the scheduling rounds of flow groups, where R^k_l denotes scheduling round l of flow group G_k. 67
3.13 An illustration of the inter-group scheduler selecting flow groups in the example of Fig. 3.10. Within a group, the flow is determined in a round-robin manner by the intra-group scheduler. 68
3.14 The schedule determined by GMR3 in the example of Fig. 3.10, where f^l_i denotes the packet processing for flow i ∈ G_k in scheduling round l of its flow group G_k. The slot axis is only for the accounting mechanism, while the time axis shows real time elapsed. 69
3.15 Illustration of the maximum number of scheduling rounds of flow i on the last resource in (t1, t2). 75
3.16 Illustration of a scenario where the scheduling delay D_i(p) reaches its maximum. Here, f^l_i denotes the processing of flow i in scheduling round l of its flow group. 78
3.17 Simulation results of the fairness and delay performance of GMR3, as compared to DRFQ and MR3. Figure (a) is dedicated to the fairness evaluation, while (b), (c), and (d) compare the scheduling delay of the three schedulers. 81
3.18 Illustration of the startup latency. In the figure, flow i in round k is denoted as i_k. Flow n+1 becomes active when flow n has been served on resource 1 in round k−1, and will be served right after flow n in round k. 85
3.19 Illustration of the packet latency, where flow i in round k is denoted as i_k. The figure shows the scenario under which the latency reaches its maximal value: a packet p is pushed to the top of the input queue in one round but is scheduled in the next round because of the account deficit. 88
4.1 An example showing the tradeoff between fairness and efficiency for multi-resource packet scheduling. Packets that finish CPU processing are placed into a buffer in front of the output link. Flow 1 sends packets p1, p2, ..., each having a processing time vector 〈2, 3〉; flow 2 sends packets q1, q2, ..., each having a processing time vector 〈9, 1〉. Schedule (a) achieves DRF but is inefficient; schedule (b) is efficient but unfair. 94
4.2 The DRGPS fluid that implements perfect fairness in the example of Fig. 4.1. Flow 1 sends packets p1, p2, ..., and receives 〈3/5 CPU, 1/15 bandwidth〉; flow 2 sends packets q1, q2, ..., and receives 〈3/5 CPU, 1/15 bandwidth〉. Only 2/3 of the link bandwidth is utilized. 102
4.3 Overall resource utilization observed in Click. No packet drops. 115
4.4 Dominant share each flow receives per second in Click. No packet drops. The strict fair share is 2%. 115
4.5 Average resource utilization and dominant share each flow receives in Click at different fairness levels. The queue capacity is 200 packets. The measurement of resource utilization and dominant share is conducted every second over the entire schedule. 116
4.6 Mean per-packet latency of elephant (sending 20,000 pkts/s) and mice flows (sending 2 pkts/s) in Click. The error bar shows the standard deviation. 117
4.7 Resource utilization achieved at different fairness levels in the simulation, averaged over 10 runs. 118
4.8 The improvement of per-packet latency and packet drop rate due to the fairness tradeoff. 118
4.9 Per-packet latency against flow sizes in the first 20 s of simulation, feeding CPU-bound traffic, with and without the fairness tradeoff. 119

List of Tables

1.1 Configurations of servers in one of Google's datacenters [2, 3]. CPU and memory units are normalized to the maximum server (highlighted below). 2
2.1 Resource configurations of the 50 nodes in the launched Amazon EC2 cluster. 31
2.2 Details of tasks submitted by four artificial applications. 31
2.3 Resource utilization of the Slots scheduler with different slot sizes. 33
3.1 Performance comparison between MR3 and DRFQ, where L is the maximum packet processing time, m is the number of resources, and n is the number of active flows. 58
3.2 Linear model for CPU processing time in 3 middlebox modules. Model parameters are based on the measurement results reported in [4]. 59
3.3 Summary of performance of GMR3 and existing schemes, where n is the number of flows, and m is the number of resources. 80
4.1 Main notations used in the fluid model. The superscript t is dropped when time can be clearly inferred from the context. 101
4.2 Schedule makespan observed in Click at different fairness levels. The queue capacity is infinite. 114

Chapter 1

Introduction

Cloud computing realizes the long-held ambition of "big data" processing. By pooling a large number of commodity servers into a cluster – known as a datacenter – cloud computing allows "big data" applications, such as web search, machine learning, and social networks, to scale up to hundreds of servers, accommodating their ever-growing demand for computing cycles. Building a computer system at such a large scale poses major technical challenges in resource management. Even a medium-sized production datacenter consists of more than 10K servers and can host over 2,000 applications from hundreds of users [2, 3]. A fundamental system design problem facing a datacenter operator is: how should the cluster resources, such as CPU, memory, and link bandwidth, be shared among applications fairly and efficiently? Motivated by this question, my doctoral research has been on modeling, algorithms, analyses, and prototype implementations that incorporate the following two themes:

(1) designing fundamental resource sharing policies for large-scale computer clusters and data analytic systems, and

(2) analyzing fundamental scheduling problems for devices and systems, such as middleboxes, software routers, and hypervisors, that are widely deployed in cloud datacenters.

1.1 Resource Sharing Policies in Cloud Datacenters

Unlike traditional application-specific clusters and grids, cloud computing systems distinguish themselves with unprecedented server and workload heterogeneity. Modern datacenters are likely to be constructed from a variety of server classes, with different configurations in terms of processing capabilities, memory sizes, and storage spaces [5]. Asynchronous hardware upgrades, such as adding new servers and phasing out existing ones, further aggravate such diversity, leading to a wide range of server specifications in a cloud computing system [2, 6–9]. Table 1.1 illustrates the heterogeneity of servers in one of Google's clusters [2, 3]. Similar server heterogeneity has also been observed in public clouds, such as Amazon EC2 and Rackspace [6, 7].

Table 1.1: Configurations of servers in one of Google's datacenters [2, 3]. CPU and memory units are normalized to the maximum server (marked *).

    Number of servers   CPUs   Memory
    6732                0.50   0.50
    3863                0.50   0.25
    1001                0.50   0.75
    795                 1.00   1.00   *
    126                 0.25   0.25
    52                  0.50   0.12
    5                   0.50   0.03
    5                   0.50   0.97
    3                   1.00   0.50
    1                   0.50   0.06

In addition to server heterogeneity, cloud computing systems also exhibit much higher diversity in resource demand profiles. Depending on the underlying applications, the workload spanning multiple cloud users may require vastly different amounts of resources (e.g., CPU, memory, and storage). For example, numerical computing tasks are usually CPU-intensive, while database operations typically require high-memory support. The heterogeneity of both servers and workload demands poses significant technical challenges for the resource allocation mechanism, giving rise to many delicate issues – notably fairness and efficiency – that must be carefully addressed.

Despite the unprecedented heterogeneity in cloud computing systems, state-of-the-art computing frameworks employ a rather simple abstraction that falls short. For example, Hadoop [10] and Dryad [11], the two most widely deployed cloud computing frameworks, partition a server's resources into bundles – known as slots – that contain fixed amounts of different resources. The system then allocates resources to users at the granularity of these slots. Such a single-resource abstraction ignores the heterogeneity of both server specifications and demand profiles, inevitably leading to a fairly inefficient allocation [12].

Towards addressing the inefficiency of the current allocation module, many recent works focus on multi-resource allocation mechanisms. Notably, Ghodsi et al. [12] suggest a compelling alternative known as the Dominant Resource Fairness (DRF) allocation, in which each user's dominant share – the maximum ratio of any resource that the user has been allocated – is equalized. The DRF allocation possesses a set of highly desirable fairness properties, and has quickly received significant attention in the literature [13–16]. While DRF and its subsequent works address the demand heterogeneity of multiple resources, they all limit the discussion to a simplified model where resources are pooled in one place and the entire resource pool is abstracted as one big server.¹ Such an all-in-one resource model not only contrasts with the prevalent datacenter infrastructure – where resources are distributed across a large number of servers – but also ignores server heterogeneity: the allocations depend only on the total amount of resources pooled in the system, irrespective of the underlying resource distribution across servers. In fact, when servers are heterogeneous, even the definition of the dominant resource is not so clear. Depending on the underlying server configurations, a computing task may bottleneck on different resources in different servers. We shall see in Sec. 2.3.4 that naive extensions, such as applying the DRF allocation to each server separately, may lead to a highly inefficient allocation.

¹While [12] briefly touches on the case where resources are distributed to small servers (known as the discrete scenario), its coverage is rather informal.

1.2 Multi-Resource Scheduling

The need for multi-resource management goes beyond cloud clusters and extends to appliances and

systems that are widely deployed in datacenters, such as middleboxes, software routers, and hypervisors.

Recent studies report that the number of middleboxes deployed in enterprise and datacenter networks

is on par with the traditional switches and routers [17, 18]. These middleboxes do more than just

packet forwarding. In addition, they perform filtering (e.g., firewalls), optimization (e.g., HTTP caching

and WAN optimization), and transformation (e.g., dynamic request routing) based on the underlying

traffic contents [18–20], which requires the support of multiple hardware resources such as CPU, memory

bandwidth, and link bandwidth [4,17]. A scheduling algorithm specifically designed for multiple resource

types is therefore needed for sharing these resources fairly and efficiently. Similar problems also arise

in virtual machine (VM) scheduling inside a hypervisor, where different VMs may consume different

amounts of resources, and it is desirable to fairly multiplex their access to physical resources.

While single-resource fair queueing for bandwidth sharing has been extensively studied for switches and routers [21–26], multi-resource fair queueing imposes new scheduling challenges, as flows compete for multiple resources and may have vastly different resource requirements. For example, flows that require forwarding a large number of small packets congest the memory bandwidth of a software router [27], while those that require IP security encryption (IPsec) need more CPU processing time [28]. Despite their heterogeneous resource requirements, flows are expected to receive predictable service isolation to meet their Quality of Service (QoS) requirements. This calls for a multi-resource packet scheduler with the following desirable properties.

Fairness. The packet scheduler should provide some measure of service isolation across flows, so that the damaging behaviour of rogue traffic will not affect the QoS of other regular flows. In particular, each flow should receive service (i.e., throughput) at least at the level attained when every resource is allocated in proportion to the flow's weight, irrespective of the behaviour of other traffic.

Bounded scheduling delay. Interactive Internet applications such as video streaming and online games have stringent end-to-end delay requirements. It is hence important for a packet scheduler to offer bounded scheduling delay. Such a delay bound should be a small constant, independent of the number of flows.

Low complexity. As the volume of traffic through middleboxes surges [29, 30], it is important to make scheduling decisions at high speed. Ideally, a packet scheduler should have time complexity that is a small constant, independent of the number of flows. In addition, the scheduling algorithm should be amenable to practical implementation.

While all three properties have been extensively studied for bandwidth sharing in traditional switches and routers [21–23, 25, 31], multi-resource fair queueing remains a largely uncharted territory. The recent work of Ghodsi et al. [4] suggests a promising alternative, known as Dominant Resource Fair Queueing (DRFQ), which implements Dominant Resource Fairness (DRF) [12, 16] in the time domain. While DRFQ provides nearly perfect service isolation, it is expensive to implement. Specifically, DRFQ needs to sort packet timestamps [4] and requires O(log n) time per packet, where n is the number of backlogged flows. With a large number of flows, it is hard to implement DRFQ at high speeds. This problem is further aggravated by recent middlebox innovations, where software-defined middleboxes deployed as VMs and processes are replacing traditional network appliances built on dedicated hardware [20, 32]. As more software-defined middleboxes are consolidated onto commodity and cloud servers [17, 18], a device will see an increasing number of flows competing for multiple resources.
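To make the cost concrete, the toy sketch below (a Python illustration of ours, not DRFQ itself and not code from this dissertation) shows the pattern common to timestamp-sorted fair queueing schedulers: flows are kept in a priority queue ordered by a virtual timestamp, so every scheduling decision involves O(log n) heap operations in the number of backlogged flows n.

```python
# Toy illustration with hypothetical timestamps: any timestamp-sorted
# scheduler pays O(log n) per enqueue/dequeue on the flow heap.
import heapq

backlog = []                                # heap of (timestamp, flow_id)
heapq.heappush(backlog, (3.2, "flow-a"))    # O(log n)
heapq.heappush(backlog, (1.7, "flow-b"))    # O(log n)
ts, flow = heapq.heappop(backlog)           # O(log n): earliest timestamp
print(flow)                                 # "flow-b" is served first
```

Round-robin schedulers, by contrast, avoid the sorted structure altogether, which is what makes the O(1) designs proposed in this dissertation possible.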

1.3 Summary of Contribution

This dissertation focuses on algorithm design and prototype implementation for the two multi-resource sharing problems described in the previous two sections. In particular, it makes the following contributions.

1.3.1 Multi-Resource Fair Sharing for Cloud Systems with Heterogeneous Servers

To address the problems of existing resource schedulers in computer clusters, the first contribution of this dissertation is a rigorous study of a solution with provable operational benefits that bridges the gap between existing multi-resource allocation models and the state-of-the-art datacenter infrastructure. We propose DRFH [33, 34], a DRF generalization for heterogeneous environments where resources are pooled by a large number of heterogeneous servers, representing different points in the configuration space of resources such as processing, memory, and storage. DRFH generalizes the intuition of DRF by seeking an allocation that equalizes every user's global dominant share, which is the maximum ratio of any resource the user has been allocated in the entire resource pool. We systematically analyze DRFH and show that it retains most of the desirable properties that the all-in-one DRF model provides for a single server [12]. Specifically, DRFH is Pareto optimal: no user is able to increase its allocation without decreasing other users' allocations. Meanwhile, DRFH is envy-free in that no user prefers the allocation of another. More importantly, DRFH is group strategyproof in that whenever a coalition of users colludes to misreport their resource demands, some member of the coalition cannot strictly gain; as a result, the coalition is better off not forming. In addition, DRFH offers some level of service isolation by ensuring the sharing incentive property in a weak sense: it allows users to execute more tasks than they could under some "equal partition" where the entire resource pool is evenly allocated among all users. DRFH also satisfies a set of other important properties, namely single-server DRF, single-resource fairness, bottleneck fairness, and population monotonicity (details in Sec. 2.3.3).

As a direct application, we design a heuristic scheduling algorithm that implements DRFH in real-world systems. We evaluate DRFH via both prototype implementation and trace-driven simulations. Both implementation and simulation results show that, compared with the traditional slot schedulers adopted in prevalent cloud computing frameworks, the DRFH algorithm suitably matches demand heterogeneity to server heterogeneity, significantly improving the system's resource utilization, with a substantial reduction in job completion times.

1.3.2 Multi-Resource Fair Queueing for Network Flows

To address the complexity problem of existing multi-resource fair queueing algorithms, this dissertation proposes two multi-resource fair schedulers that are complementary to each other.

Multi-Resource Round Robin. The second contribution of this dissertation is the design of a low-complexity multi-resource fair scheduler, called Multi-Resource Round Robin (MR3), that serves flows in rounds and achieves near-perfect fairness in O(1) time [35]. MR3 can also provide bounded scheduling delay for unweighted flows. The design of MR3 overcomes a series of newly emerged technical challenges due to the presence of multiple resource types, and is very easy to implement in middleboxes and software routers. We describe its detailed design in Sec. 3.4.

Group Multi-Resource Round Robin. Despite its many desirable properties, MR3 fails to provide weight-proportional delays when flows are assigned uneven weights. The third contribution of this dissertation is an improved scheduling algorithm, presented in Sec. 3.5 and referred to as Group Multi-Resource Round Robin (GMR3), that achieves all three desirable properties [36]. GMR3 groups flows with similar weights into a small number of groups, each associated with a timestamp. Scheduling decisions are made in a two-level hierarchy: at the higher level, GMR3 makes inter-group scheduling decisions by choosing the group with the earliest timestamp, while at the lower level, the intra-group scheduler serves flows within a group in a round-robin fashion. GMR3 is highly efficient, requiring only O(1) time per packet in almost all practical scenarios. In addition, we show that GMR3 achieves near-perfect fairness across flows, with its scheduling delay bounded by a small constant. These desirable properties are proven analytically and validated by experiments. To our knowledge, GMR3 is the first fair queueing algorithm that offers near-perfect fairness with O(1) time complexity and a constant scheduling delay bound.
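As a rough illustration of this two-level structure, consider the following sketch (a simplification of ours, not the actual GMR3 algorithm: flow grouping, the timestamp update rule, and the multi-resource accounting are all elided, and all names are hypothetical). Because flows are collapsed into a small number of weight groups, the inter-group heap stays tiny, which is what allows the per-packet cost to remain effectively constant.

```python
# Two-level selection: earliest-timestamp group first, then round robin
# within the chosen group. The timestamp advance below is a placeholder.
import heapq
from collections import deque

groups = {0: deque(["f1", "f2"]), 1: deque(["f3"])}  # flows by weight class
top = [(0.0, 0), (0.5, 1)]                           # (timestamp, group id)
heapq.heapify(top)

def next_flow():
    ts, gid = heapq.heappop(top)            # inter-group: earliest timestamp
    flow = groups[gid].popleft()            # intra-group: round robin
    groups[gid].append(flow)
    heapq.heappush(top, (ts + 1.0, gid))    # placeholder timestamp update
    return flow

print([next_flow() for _ in range(4)])      # ['f1', 'f3', 'f2', 'f3']
```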

1.3.3 Fairness-Efficiency Tradeoff for Multi-Resource Scheduling

The last contribution of this dissertation is to identify a new challenge that is unique to multi-resource scheduling: the general tradeoff between fairness and efficiency. Such a tradeoff has never been a problem for traditional fair sharing of link bandwidth: as long as the queueing algorithm is work conserving, the bandwidth is always fully utilized given a non-empty system, leaving fairness the only concern. However, this is no longer the case for multi-resource scheduling, where different queueing algorithms, even work-conserving ones, may result in vastly different resource utilization. In general, fairness and efficiency are conflicting objectives that cannot be achieved at the same time. For applications with loose fairness requirements, trading off some degree of fairness for higher efficiency and higher throughput is well justified.

Driven by this practical need, this dissertation proposes a new queueing algorithm that allows a network operator to flexibly specify a tradeoff preference and implements the specified tradeoff by determining the right packet scheduling order [37]. Both trace-driven simulation and prototype implementation show that in many cases, trading off only a small degree of fairness is sufficient to improve the overall traffic throughput to the point where the system capacity is almost saturated. We present the detailed discussion in Chapter 4.

1.4 Thesis Organization

This dissertation is organized as follows. Chapter 2 presents our study of multi-resource fair sharing for datacenter jobs. Chapter 3 presents our study of multi-resource fair scheduling for packet processing in middleboxes and software routers that are widely deployed in cloud datacenters. Chapter 4 discusses the tradeoff between fairness and efficiency in multi-resource scheduling. Chapter 5 offers concluding remarks and possible future directions.


Chapter 2

Multi-Resource Fair Scheduler for Datacenter Jobs

2.1 Motivation

As we have explained in Sec. 1.1, one fundamental problem facing computer clusters is to fairly and efficiently share resources among different applications. This problem has become even more complicated in cloud computing systems, mainly due to the unprecedented heterogeneity in terms of both server specifications and computing requirements. Cloud computing clusters are likely to be constructed from a large number of commodity servers spanning multiple generations, with different configurations in terms of processing capabilities, memory sizes, outgoing bandwidth, and storage spaces. On the other hand, depending on the underlying applications, computing tasks may require vastly different amounts of resources. For example, scientific computations are typically CPU-intensive; database applications usually require high-memory support; graph-parallel computations, such as PageRank, are bandwidth-bound.

Despite these complexities, state-of-the-art cluster computing frameworks employ rather simple resource sharing policies that fall short. For example, many widely deployed data analytic frameworks, such as Hadoop [10], Dryad [11], and Spark [38], schedule tasks at the granularity of slots, defined as bundles of resources that contain fixed amounts of memory and CPU cores. This inevitably results in resource fragmentation, the magnitude of which increases with the number of tasks scheduled. Twitter's Mesos [39] adopts the recently proposed DRF scheduler [12] to overcome this drawback. However, DRF is unable to handle server heterogeneity and may lead to low resource utilization [40].


In this chapter, we design a multi-resource allocation mechanism, called DRFH, that generalizes the notion of Dominant Resource Fairness (DRF) from a single server to multiple heterogeneous servers. DRFH provides a number of highly desirable properties. With DRFH, no user prefers the allocation of another user; no one can improve its allocation without decreasing that of others; and, more importantly, no coalition behaviour of misreporting resource demands can benefit all its members. DRFH also ensures some level of service isolation among users. As a direct application, we design a simple heuristic that implements DRFH in real-world systems. Both prototype implementation and large-scale trace-driven simulations show that DRFH significantly outperforms the traditional slot-based scheduler, leading to much higher resource utilization with substantially shorter job completion times.

The remainder of this chapter is organized as follows. We briefly revisit the DRF allocation and point out its limitations in heterogeneous environments in Sec. 2.2. We then formulate the allocation problem with heterogeneous servers in Sec. 2.3, where a set of desirable allocation properties is also defined. In Sec. 2.4, we propose DRFH and analyze its properties. Sec. 2.5 is dedicated to practical issues in implementing DRFH. We evaluate the performance of DRFH via prototype implementation and trace-driven simulations in Sec. 2.6. We survey related work in Sec. 2.7 and summarize the chapter in Sec. 2.8.

2.2 DRF Scheduler and Its Limitations

In this section, we briefly review the DRF allocation [12] and show that it may lead to an infeasible allocation when a cloud system is composed of multiple heterogeneous servers.

In DRF, the dominant resource is defined for each user as the resource that requires the largest fraction of the total availability. Consider an example given in [12]: suppose that a computing system has 9 CPUs and 18 GB of memory, and is shared by two users. User 1 wishes to launch a set of (divisible) tasks each requiring 〈1 CPU, 4 GB〉, and user 2 has a set of (divisible) tasks each requiring 〈3 CPUs, 1 GB〉. In this example, the dominant resource of user 1 is memory, as each of its tasks demands 1/9 of the total CPU and 2/9 of the total memory. On the other hand, the dominant resource of user 2 is CPU, as each of its tasks requires 1/3 of the total CPU and 1/18 of the total memory.

We now define the dominant share of each user as the fraction of the dominant resource that the user has been allocated. DRF is then defined as max-min fairness¹ in terms of the dominant share. In other words, the DRF mechanism seeks a maximum allocation that equalizes each user's dominant share. Returning to the aforementioned example, the DRF mechanism allocates 〈3 CPUs, 12 GB〉 to user 1 and 〈6 CPUs, 2 GB〉 to user 2, where user 1 launches three tasks and user 2 launches two. It is easy to verify that both users receive the same dominant share (i.e., 2/3) and that neither can launch more tasks with additional resources, although 2 GB of memory is left unused.

¹An allocation is said to be max-min fair if any attempt to increase the allocation of a user would decrease the allocation of another user with a smaller or equal share of resources.
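The example allocation can be reproduced with a simple progressive-filling loop. Below is a minimal sketch under the divisible-task model (an illustration of ours, not the scheduler implemented in this dissertation): it repeatedly grants a small increment of tasks to the user with the smallest dominant share until some resource is exhausted.

```python
def drf_allocate(capacity, demands, step=1e-3):
    """Water-fill task counts, keeping dominant shares equalized."""
    m, n = len(capacity), len(demands)
    # Dominant share contributed by one task of each user.
    dom = [max(d[r] / capacity[r] for r in range(m)) for d in demands]
    tasks, used = [0.0] * n, [0.0] * m
    while True:
        # Max-min fairness: grow the user with the smallest dominant share.
        i = min(range(n), key=lambda u: tasks[u] * dom[u])
        if any(used[r] + step * demands[i][r] > capacity[r] for r in range(m)):
            return tasks                    # some resource is exhausted
        tasks[i] += step
        for r in range(m):
            used[r] += step * demands[i][r]

cap = [9.0, 18.0]                  # <CPUs, memory (GB)>
dem = [[1.0, 4.0], [3.0, 1.0]]     # per-task demands of users 1 and 2
print(drf_allocate(cap, dem))      # approximately [3.0, 2.0] tasks
```

At the returned point both dominant shares are about 2/3 and the CPU is fully used, matching the allocation described above.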

[Figure 2.1: An example of a system consisting of two heterogeneous servers, in which user 1 can schedule at most two tasks each demanding 1 CPU and 4 GB memory. Server 1 has 〈1 CPU, 14 GB〉 and server 2 has 〈8 CPUs, 4 GB〉; the resources required to execute the two tasks are highlighted in the figure.]

The DRF allocation above is based on a simplified all-in-one resource model, where the entire system is modeled as one big server. The allocation hence depends only on the total amount of resources pooled in the system. In the example above, no matter how many servers the system has and what each server's specification is, as long as the system has 9 CPUs and 18 GB of memory in total, the DRF allocation will always schedule three tasks for user 1 and two for user 2. However, this allocation may not be possible to implement, especially when the system consists of heterogeneous servers. For example, suppose that the resource pool is provided by two servers: server 1 has 1 CPU and 14 GB of memory, and server 2 has 8 CPUs and 4 GB of memory. As shown in Fig. 2.1, even when both servers are allocated exclusively to user 1, at most two tasks can be scheduled, one in each server. Moreover, even for server specifications where the DRF allocation is feasible, the mechanism gives only the total amount of resources each user should receive; it remains unclear how many resources a user should be allocated in each server. These problems significantly limit the applicability of the DRF mechanism. In general, the allocation is valid only when the system contains a single server or multiple homogeneous servers, which is rarely the case under the prevalent datacenter infrastructure.

Despite the limitation of the all-in-one resource model, DRF has been shown to possess a set of highly desirable allocation properties for cloud computing systems [12, 16]. A natural question is: how should the DRF intuition be generalized to a heterogeneous environment to achieve similar properties? This is not an easy question to answer. In fact, with heterogeneous servers, even the definition of the dominant resource is not so clear. Depending on the server specifications, a resource most demanded in one server (in terms of the fraction of the server's availability) might be the least demanded in another. For instance, in the example of Fig. 2.1, user 1 demands CPU the most in server 1, but in server 2 it demands memory the most. Should the dominant resource be defined separately for each server, or should it be defined for the entire resource pool? How should the allocation be conducted? And what properties does the resulting allocation preserve? We answer these questions in the following sections.

2.3 System Model and Allocation Properties

In this section, we model multi-resource allocation in a cloud computing system with heterogeneous servers. We formalize a number of desirable properties that are deemed the most important for allocation mechanisms in cloud computing environments.

2.3.1 Basic Settings

Let S = {1, . . . , k} be the set of heterogeneous servers a cloud computing system has in its resource

pool. Let R = {1, . . . ,m} be the set of m hardware resources provided by each server, e.g., CPU,

memory, storage, etc. Let cl = (cl1, . . . , clm)T be the resource capacity vector of server l ∈ S, where each

component clr denotes the total amount of resource r available in this server. Without loss of generality,

we normalize the total availability of every resource to 1, i.e.,

∑l∈S

clr = 1, ∀r ∈ R .

Let $U = \{1, \dots, n\}$ be the set of cloud users sharing the entire system. For every user $i$, let $D_i = (D_{i1}, \dots, D_{im})^T$ be its resource demand vector, where $D_{ir}$ is the amount of resource $r$ required by each instance of the task of user $i$. For simplicity, we assume positive demands, i.e., $D_{ir} > 0$ for all users $i$ and resources $r$. We say resource $r_i^*$ is the global dominant resource of user $i$ if
\[ r_i^* \in \arg\max_{r \in R} D_{ir}. \]

In other words, resource $r_i^*$ is the most heavily demanded resource required by each instance of the task of user $i$, relative to the entire resource pool. For each user $i$ and resource $r$, we define

\[ d_{ir} = D_{ir} / D_{i r_i^*} \]
as the normalized demand, and denote by $d_i = (d_{i1}, \dots, d_{im})^T$ the normalized demand vector of user $i$.

As a concrete example, consider Fig. 2.2 where the system consists of two heterogeneous servers.

Server 1 is high-memory with 2 CPUs and 12 GB memory, while server 2 is high-CPU with 12 CPUs

and 2 GB memory. Since the system has 14 CPUs and 14 GB memory in total, the normalized capacity


[Figure 2.2] An example of a system containing two heterogeneous servers shared by two users. Each computing task of user 1 requires 0.2 CPU time and 1 GB memory, while each computing task of user 2 requires 1 CPU time and 0.2 GB memory.

vectors of servers 1 and 2 are $c_1 = (\text{CPU share}, \text{memory share})^T = (1/7, 6/7)^T$ and $c_2 = (6/7, 1/7)^T$, respectively. Now suppose that there are two users. User 1 has memory-intensive tasks each requiring 0.2 CPU time and 1 GB memory, while user 2 has CPU-heavy tasks each requiring 1 CPU time and 0.2 GB memory. The demand vector of user 1 is $D_1 = (1/70, 1/14)^T$ and the normalized vector is $d_1 = (1/5, 1)^T$, where memory is the global dominant resource. Similarly, user 2 has $D_2 = (1/14, 1/70)^T$ and $d_2 = (1, 1/5)^T$, and CPU is its global dominant resource.
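To make these definitions concrete, the following short Python sketch (purely illustrative, using NumPy; not part of the thesis artifacts) reproduces the normalization above for the example of Fig. 2.2:

    import numpy as np

    # Normalization of Sec. 2.3.1 on the two-server example of Fig. 2.2;
    # resources are ordered as [CPU, memory].
    servers = np.array([[2.0, 12.0],    # server 1: 2 CPUs, 12 GB
                        [12.0, 2.0]])   # server 2: 12 CPUs, 2 GB
    total = servers.sum(axis=0)         # 14 CPUs, 14 GB in total
    c = servers / total                 # normalized capacities: (1/7, 6/7), (6/7, 1/7)

    tasks = np.array([[0.2, 1.0],       # user 1: 0.2 CPU, 1 GB per task
                      [1.0, 0.2]])      # user 2: 1 CPU, 0.2 GB per task
    D = tasks / total                   # demand vectors D_i
    r_star = D.argmax(axis=1)           # global dominant resources r_i^*
    d = D / D[np.arange(2), r_star][:, None]  # normalized demands d_i
    # d[0] = [0.2, 1.0] (memory-dominant); d[1] = [1.0, 0.2] (CPU-dominant)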

For now, we assume users have an infinite number of tasks to be scheduled, and all tasks are divisible

[12,14–16,41]. We shall discuss how these assumptions can be relaxed in Sec. 2.5.

2.3.2 Resource Allocation

For every user $i$ and server $l$, let $A_{il} = (A_{il1}, \dots, A_{ilm})^T$ be the resource allocation vector, where $A_{ilr}$ is the amount of resource $r$ allocated to user $i$ in server $l$. Let $A_i = (A_{i1}, \dots, A_{ik})$ be the allocation matrix of user $i$, and $A = (A_1, \dots, A_n)$ the overall allocation for all users. We say an allocation $A$ is feasible if no server is required to supply more than any of its total resources, i.e.,
\[ \sum_{i \in U} A_{ilr} \le c_{lr}, \quad \forall l \in S, r \in R. \]
For each user $i$, given allocation $A_{il}$ in server $l$, let $N_{il}(A_{il})$ be the maximum number of tasks (possibly fractional) it can schedule. We have
\[ N_{il}(A_{il}) D_{ir} \le A_{ilr}, \quad \forall r \in R. \]
As a result,
\[ N_{il}(A_{il}) = \min_{r \in R} \{ A_{ilr} / D_{ir} \}. \]


The total number of tasks user i can schedule under allocation Ai is hence

\[ N_i(A_i) = \sum_{l \in S} N_{il}(A_{il}). \tag{2.1} \]

Intuitively, a user prefers an allocation that allows it to schedule more tasks.
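As a quick illustration of Eq. (2.1), the sketch below (again illustrative Python, with names of our own choosing) computes the per-server and total task counts for user 1 of Fig. 2.2 when it is given all of server 1 and nothing in server 2:

    import numpy as np

    def tasks_per_server(A_i, D_i):
        """N_il(A_il) = min_r A_ilr / D_ir, computed for every server l."""
        return (A_i / D_i).min(axis=1)

    def total_tasks(A_i, D_i):
        """Eq. (2.1): N_i(A_i) = sum over servers of N_il(A_il)."""
        return tasks_per_server(A_i, D_i).sum()

    D_1 = np.array([1/70, 1/14])        # user 1's demand vector
    A_1 = np.array([[1/7, 6/7],         # all of server 1
                    [0.0, 0.0]])        # nothing in server 2
    print(total_tasks(A_1, D_1))        # 10.0 tasks (CPU-limited in server 1)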

A well-justified allocation should never give a user more resources than it can actually use in a server.

Following the terminology used in the economics literature [42], we call such an allocation non-wasteful:

Definition 1. For user $i$ and server $l$, an allocation $A_{il}$ is non-wasteful if reducing any resource decreases the number of tasks scheduled, i.e., for all $A'_{il} \prec A_{il}$, we have
\[ N_{il}(A'_{il}) < N_{il}(A_{il}), \]
where, for any two vectors $x$ and $y$, we write $x \prec y$ if $x_r \le y_r$ for all $r$, with strict inequality $x_r < y_r$ for at least one component.
Further, user $i$'s allocation $A_i = (A_{il})$ is non-wasteful if $A_{il}$ is non-wasteful for all servers $l$, and allocation $A = (A_i)$ is non-wasteful if $A_i$ is non-wasteful for all users $i$.

Note that one can always convert an allocation into a non-wasteful one by revoking those resources that are allocated but never actually used, without changing the number of tasks scheduled for any user. Unless otherwise specified, we limit our discussion to non-wasteful allocations.

2.3.3 Allocation Mechanism and Desirable Properties

A resource allocation mechanism takes user demands as input and outputs the allocation result. In

general, an allocation mechanism should provide the following essential properties that are widely recognized as the most important fairness and efficiency measures in both cloud computing systems [12, 13, 43]

and the economics literature [42,44].

Envy-freeness. An allocation mechanism is envy-free if no user prefers another user's allocation to its own, i.e.,
\[ N_i(A_i) \ge N_i(A_j), \quad \forall i, j \in U. \]

This property essentially embodies the notion of fairness.

Pareto optimality. An allocation mechanism is Pareto optimal if it returns an allocation $A$ such that for all feasible allocations $A'$, if $N_i(A'_i) > N_i(A_i)$ for some user $i$, then there exists a user $j \ne i$ such that $N_j(A'_j) < N_j(A_j)$. In other words, allocation $A$ cannot be further improved such that all users

are at least as well off and at least one user is strictly better off. This property ensures the allocation



efficiency and is critical to achieve high resource utilization.

Group strategyproofness. An allocation mechanism is group strategyproof if, whenever a coalition of users misreports their resource demands (assuming a user's demand is its private information), there is a member of the coalition who would schedule fewer tasks and hence has no incentive to join the coalition. Specifically, let $M \subset U$ be the coalition of manipulators in which user $i \in M$ misreports its demand as $D'_i \ne D_i$. Let $A'$ be the allocation returned. Also, let $A$ be the allocation returned when all users truthfully report their demands. The allocation mechanism is group strategyproof if there exists a manipulator $i \in M$ who cannot schedule more tasks than by being truthful, i.e.,
\[ N_i(A'_i) \le N_i(A_i). \]

In other words, user $i$ is better off quitting the coalition. Group strategyproofness is of special importance for a cloud computing system, as it is common to observe in a real-world system that users

try to manipulate the scheduler for more allocations by lying about their resource demands [12,43].

Sharing incentive is another critical property that has been frequently mentioned in the literature [12–14, 16]. It ensures that every user's allocation is no worse than that obtained by evenly

dividing the entire resource pool. While this property is well defined for a single server, it is not for

a system containing multiple heterogeneous servers, as there is an infinite number of ways to evenly

divide the resource pool among users, and it is unclear which one should be selected as a benchmark to

compare with. We defer a detailed discussion to Sec. 2.4.5, where we weigh two reasonable alternatives.

In addition to the four essential allocation properties above, we also consider four other important

properties as follows:

Single-server DRF. If the system contains only one server, then the resulting allocation should be

reduced to the DRF allocation.

Single-resource fairness. If there is a single resource in the system, then the resulting allocation

should be reduced to a max-min fair allocation.

Bottleneck fairness. If all users bottleneck on the same resource (i.e., having the same global

dominant resource), then the resulting allocation should be reduced to a max-min fair allocation for

that resource.

Population monotonicity. If a user leaves the system and relinquishes all its allocations, then the

remaining users will not see any reduction in the number of tasks scheduled.

To summarize, our objective is to design an allocation mechanism that guarantees all the properties


[Figure 2.3] DRF allocation for the example shown in Fig. 2.2, where user 1 is allocated 5 tasks in server 1 and 1 in server 2, while user 2 is allocated 1 task in server 1 and 5 in server 2.

defined above.

2.3.4 Naive DRF Extension and Its Inefficiency

It has been shown in [12, 16] that the DRF allocation satisfies all the desirable properties mentioned

above when the entire resource pool is modeled as one server. When resources are distributed to multiple

heterogeneous servers, a naive generalization is to separately apply the DRF allocation per server. For

instance, consider the example of Fig. 2.2. We first apply DRF in server 1. Because CPU is the

dominant resource of both users, it is equally divided for both of them, each receiving 1. As a result,

user 1 schedules 5 tasks onto server 1, while user 2 schedules one. Similarly, in server 2, memory is the

dominant resource of both users and is evenly allocated, leading to one task scheduled for user 1 and

five for user 2. The resulting allocations in the two servers are illustrated in Fig. 2.3, where both users

schedule 6 tasks.

Unfortunately, this allocation violates Pareto optimality and is highly inefficient. If we instead allocate server 1 exclusively to user 1, and server 2 exclusively to user 2, then both users schedule 10

tasks, almost twice the number of tasks scheduled under the DRF allocation. In fact, a similar example

can be constructed to show that the per-server DRF may lead to arbitrarily low resource utilization.
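The 6-task outcome above can be reproduced numerically. The following rough sketch (illustrative Python; the progressive-filling loop is our own simplification, not the thesis's scheduler) approximates per-server DRF at task granularity:

    import numpy as np

    servers = np.array([[2.0, 12.0], [12.0, 2.0]])   # [CPU, memory]
    tasks = np.array([[0.2, 1.0], [1.0, 0.2]])       # per-task demands

    def per_server_drf(cap, demands, eps=1e-9):
        """Progressive filling within one server: repeatedly give one task
        to the user with the lowest per-server dominant share."""
        free = cap.copy()
        n_tasks = np.zeros(len(demands))
        shares = np.zeros(len(demands))
        dom = (demands / cap).max(axis=1)    # dominant share of one task
        while True:
            fits = [(shares[i], i) for i in range(len(demands))
                    if np.all(demands[i] <= free + eps)]
            if not fits:
                return n_tasks
            _, i = min(fits)
            free -= demands[i]
            n_tasks[i] += 1
            shares[i] += dom[i]

    counts = sum(per_server_drf(c, tasks) for c in servers)
    print(counts)   # [6. 6.]: 5 + 1 tasks for user 1, 1 + 5 for user 2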

The failure of the naive DRF extension to the heterogeneous environment necessitates an alternative

allocation mechanism, which is the main theme of the next section.

2.4 DRFH Allocation and Its Properties

In this section, we present DRFH, a generalization of DRF in a heterogeneous cloud computing system

where resources are distributed in a number of heterogeneous servers. We analyze DRFH and show that


it provides all the desirable properties defined in Sec. 2.3.

2.4.1 DRFH Allocation

Instead of allocating separately in each server, DRFH jointly considers resource allocation across all

heterogeneous servers. The key intuition is to achieve the max-min fair allocation for the global dominant

resources. Specifically, given allocation $A_{il}$, let

\[ G_{il}(A_{il}) = N_{il}(A_{il}) \, D_{i r_i^*} = \min_{r \in R} \{ A_{ilr} / d_{ir} \} \tag{2.2} \]

be the amount of the global dominant resource user $i$ is allocated in server $l$. Since the total availability of every resource is normalized to 1, we also refer to $G_{il}(A_{il})$ as the global dominant share user $i$ receives in server $l$. Simply adding up $G_{il}(A_{il})$ over all servers gives the global dominant share user $i$ receives under allocation $A_i$, i.e.,
\[ G_i(A_i) = \sum_{l \in S} G_{il}(A_{il}) = \sum_{l \in S} \min_{r \in R} \{ A_{ilr} / d_{ir} \}. \tag{2.3} \]

DRFH simply applies max-min fairness to users' global dominant shares: it seeks an allocation that maximizes the minimum global dominant share among all users, subject to the per-server resource constraints, i.e.,
\[ \max_{A} \ \min_{i \in U} \ G_i(A_i) \quad \text{s.t.} \quad \sum_{i \in U} A_{ilr} \le c_{lr}, \ \forall l \in S, r \in R. \tag{2.4} \]

Recall that, without loss of generality, we assume a non-wasteful allocation A (see Sec. 2.3.2). We have

the following structural result.

Lemma 1. For user $i$ and server $l$, an allocation $A_{il}$ is non-wasteful if and only if there exists some $g_{il}$ such that
\[ A_{il} = g_{il} d_i. \]
In particular, $g_{il}$ is the global dominant share user $i$ receives in server $l$ under allocation $A_{il}$, i.e.,
\[ g_{il} = G_{il}(A_{il}). \]


Proof: (⇐) We first show that the condition is sufficient for non-wastefulness. Since $A_{il} = g_{il} d_i$, for all resources $r \in R$, we have
\[ A_{ilr} / D_{ir} = g_{il} d_{ir} / D_{ir} = g_{il} / D_{i r_i^*}. \]
As a result,
\[ N_{il}(A_{il}) = \min_{r \in R} \{ A_{ilr} / D_{ir} \} = g_{il} / D_{i r_i^*}. \]
Now for any $A'_{il} \prec A_{il}$, suppose that $A'_{ilr_0} < A_{ilr_0}$ for some resource $r_0$. We have
\[ N_{il}(A'_{il}) = \min_{r \in R} \{ A'_{ilr} / D_{ir} \} \le A'_{ilr_0} / D_{ir_0} < A_{ilr_0} / D_{ir_0} = N_{il}(A_{il}). \]
Hence by definition, allocation $A_{il}$ is non-wasteful.

(⇒) We next show that the condition is necessary. Since $A_{il}$ is non-wasteful, for any two resources $r_1, r_2 \in R$, we must have
\[ A_{ilr_1} / D_{ir_1} = A_{ilr_2} / D_{ir_2}. \]
Otherwise, without loss of generality, suppose that $A_{ilr_1} / D_{ir_1} > A_{ilr_2} / D_{ir_2}$. There must exist some $\varepsilon > 0$ such that
\[ (A_{ilr_1} - \varepsilon) / D_{ir_1} > A_{ilr_2} / D_{ir_2}. \]
Now construct an allocation $A'_{il}$ such that
\[ A'_{ilr} = \begin{cases} A_{ilr_1} - \varepsilon, & r = r_1; \\ A_{ilr}, & \text{otherwise}. \end{cases} \tag{2.5} \]


Clearly, $A'_{il} \prec A_{il}$. However, it is easy to see that
\[ N_{il}(A'_{il}) = \min_{r \in R} \{ A'_{ilr} / D_{ir} \} = \min_{r \ne r_1} \{ A'_{ilr} / D_{ir} \} = \min_{r \ne r_1} \{ A_{ilr} / D_{ir} \} = \min_{r \in R} \{ A_{ilr} / D_{ir} \} = N_{il}(A_{il}), \]
which contradicts the fact that $A_{il}$ is non-wasteful. As a result, there exists some $n_{il}$ such that for all resources $r \in R$, we have
\[ A_{ilr} = n_{il} D_{ir} = n_{il} D_{i r_i^*} d_{ir}. \]
Now letting $g_{il} = n_{il} D_{i r_i^*}$, we see $A_{il} = g_{il} d_i$. □

Intuitively, Lemma 1 indicates that under a non-wasteful allocation, resources are allocated in proportion to the user's demand. Lemma 1 immediately suggests the following relationship for every user $i$ and its non-wasteful allocation $A_i$:
\[ G_i(A_i) = \sum_{l \in S} G_{il}(A_{il}) = \sum_{l \in S} g_{il}. \]

Problem (2.4) can hence be equivalently written as
\[ \max_{\{g_{il}\}} \ \min_{i \in U} \ \sum_{l \in S} g_{il} \quad \text{s.t.} \quad \sum_{i \in U} g_{il} d_{ir} \le c_{lr}, \ \forall l \in S, r \in R, \tag{2.6} \]

where the constraints are derived from Lemma 1. Now let $g = \min_i \sum_{l \in S} g_{il}$. Via straightforward algebraic operations, we see that (2.6) is equivalent to the following problem:
\[ \begin{aligned} \max_{\{g_{il}\},\, g} \quad & g \\ \text{s.t.} \quad & \sum_{i \in U} g_{il} d_{ir} \le c_{lr}, \ \forall l \in S, r \in R, \\ & \sum_{l \in S} g_{il} = g, \ \forall i \in U. \end{aligned} \tag{2.7} \]

Note that the second constraint embodies the fairness in terms of equalized global dominant share g. By


[Figure 2.4] An alternative allocation with higher system utilization for the example of Fig. 2.2. Servers 1 and 2 are exclusively assigned to users 1 and 2, respectively. Both users schedule 10 tasks.

solving (2.7), DRFH allocates each user the maximum global dominant share $g$, under the constraints of both server capacity and fairness. The allocation received by each user $i$ in server $l$ is simply $A_{il} = g_{il} d_i$. For example, Fig. 2.4 illustrates the resulting DRFH allocation in the example of Fig. 2.2. By solving (2.7), DRFH allocates server 1 exclusively to user 1 and server 2 exclusively to user 2, allowing each user to schedule 10 tasks with the maximum global dominant share $g = 5/7$.
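Since (2.7) is a linear program, it can be handed to any off-the-shelf LP solver. The following sketch (in Python with SciPy, purely illustrative; the variable layout and solver choice are ours, not the thesis's implementation) encodes the two-server example and recovers $g = 5/7$:

    import numpy as np
    from scipy.optimize import linprog

    # Variables: x = [g_11, g_12, g_21, g_22, g]; we minimize -g.
    d = np.array([[1/5, 1.0],     # d_1: user 1 (memory-dominant)
                  [1.0, 1/5]])    # d_2: user 2 (CPU-dominant)
    cap = np.array([[1/7, 6/7],   # c_1: server 1 (normalized)
                    [6/7, 1/7]])  # c_2: server 2

    # Capacity constraints: sum_i g_il * d_ir <= c_lr for every (l, r).
    A_ub, b_ub = [], []
    for l in range(2):
        for r in range(2):
            row = np.zeros(5)
            row[0 + l] = d[0, r]  # coefficient of g_1l
            row[2 + l] = d[1, r]  # coefficient of g_2l
            A_ub.append(row)
            b_ub.append(cap[l, r])

    # Fairness constraints: sum_l g_il = g for each user i.
    A_eq = [[1, 1, 0, 0, -1],
            [0, 0, 1, 1, -1]]

    res = linprog([0, 0, 0, 0, -1], A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=[0, 0], bounds=[(0, None)] * 5)
    print(res.x[-1])   # maximum equalized dominant share g = 5/7 ~ 0.714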

We next analyze the properties of the DRFH allocation obtained by solving (2.7). Our analyses of DRFH

start with the four essential resource allocation properties, namely, envy-freeness, Pareto optimality,

group strategyproofness, and sharing incentive.

2.4.2 Envy-Freeness

We first show by the following proposition that under the DRFH allocation, no user prefers another user's allocation to its own.

Proposition 1 (Envy-freeness). The DRFH allocation obtained by solving (2.7) is envy-free.

Proof: Let $\{g_{il}\}, g$ be the solution to problem (2.7). For every user $i$, its DRFH allocation in server $l$ is $A_{il} = g_{il} d_i$. To show $N_i(A_j) \le N_i(A_i)$ for any two users $i$ and $j$, it is equivalent to prove $G_i(A_j) \le G_i(A_i)$. We have
\[ G_i(A_j) = \sum_l G_{il}(A_{jl}) = \sum_l \min_r \{ g_{jl} d_{jr} / d_{ir} \} \le \sum_l g_{jl} = g = G_i(A_i), \]


where the inequality holds because
\[ \min_r \{ d_{jr} / d_{ir} \} \le d_{j r_i^*} / d_{i r_i^*} = d_{j r_i^*} \le 1, \]
where $r_i^*$ is user $i$'s global dominant resource. □

2.4.3 Pareto Optimality

We next show that DRFH leads to an efficient allocation under which no user can improve its allocation

without decreasing that of the others.

Proposition 2 (Pareto optimality). The DRFH allocation obtained by solving (2.7) is Pareto optimal.

Proof: Let $\{g_{il}\}, g$ be the solution to problem (2.7). For every user $i$, its DRFH allocation in server $l$ is $A_{il} = g_{il} d_i$. Since (2.6) and (2.7) are equivalent, $\{g_{il}\}$ is also the solution to (2.6), and $g$ is the maximum value of the objective of (2.6).

Assume, by way of contradiction, that allocation $A$ is not Pareto optimal, i.e., there exists some allocation $A'$ such that $N_i(A'_i) \ge N_i(A_i)$ for all users $i$, and for some user $j$ we have strict inequality: $N_j(A'_j) > N_j(A_j)$. Equivalently, this implies $G_i(A'_i) \ge G_i(A_i)$ for all users $i$, and $G_j(A'_j) > G_j(A_j)$ for user $j$. Without loss of generality, let $A'$ be non-wasteful. By Lemma 1, for every user $i$ and server $l$, there exists some $g'_{il}$ such that $A'_{il} = g'_{il} d_i$. We show that, based on $\{g'_{il}\}$, one can construct some $\{\hat g_{il}\}$ that is a feasible solution to (2.6) yet leads to a higher objective value than $g$, contradicting the fact that $\{g_{il}\}, g$ optimally solve (2.6).

To see this, consider user $j$. We have
\[ G_j(A_j) = \sum_l g_{jl} = g < G_j(A'_j) = \sum_l g'_{jl}. \]
Hence, for user $j$, there exist a server $l_0$ and some $\varepsilon > 0$ such that, after reducing $g'_{jl_0}$ to $g'_{jl_0} - \varepsilon$, the resulting global dominant share remains at least $g$, i.e., $\sum_l g'_{jl} - \varepsilon \ge g$. This reduction leaves at least $\varepsilon d_j$ idle resources in server $l_0$. We construct $\{\hat g_{il}\}$ by redistributing these idle resources to all users to increase their global dominant shares, therefore strictly improving the objective of (2.6).
Denote by $\{g''_{il}\}$ the dominant shares after reducing $g'_{jl_0}$ to $g'_{jl_0} - \varepsilon$, i.e.,
\[ g''_{il} = \begin{cases} g'_{jl_0} - \varepsilon, & i = j,\ l = l_0; \\ g'_{il}, & \text{otherwise}. \end{cases} \]


The corresponding non-wasteful allocation is $A''_{il} = g''_{il} d_i$ for every user $i$ and server $l$. Note that allocation $A''$ is weakly preferred to the original allocation $A$ by all users, i.e., for every user $i$, we have
\[ G_i(A''_i) = \sum_l g''_{il} = \begin{cases} \sum_l g'_{jl} - \varepsilon \ge g = G_j(A_j), & i = j; \\ \sum_l g'_{il} = G_i(A'_i) \ge G_i(A_i), & \text{otherwise}. \end{cases} \]
We now construct $\{\hat g_{il}\}$ by redistributing the $\varepsilon d_j$ idle resources in server $l_0$ to all users, increasing each user's global dominant share $g''_{il_0}$ by $\delta = \min_r \{ \varepsilon d_{jr} / \sum_i d_{ir} \}$, i.e.,
\[ \hat g_{il} = \begin{cases} g''_{il_0} + \delta, & l = l_0; \\ g''_{il}, & \text{otherwise}. \end{cases} \]

It is easy to check that $\{\hat g_{il}\}$ remains a feasible allocation. To see this, it suffices to check server $l_0$. For each of its resources $r$, we have
\[ \sum_i \hat g_{il_0} d_{ir} = \sum_i (g''_{il_0} + \delta) d_{ir} = \sum_i g'_{il_0} d_{ir} - \varepsilon d_{jr} + \delta \sum_i d_{ir} \le c_{l_0 r} - \Big( \varepsilon d_{jr} - \delta \sum_i d_{ir} \Big) \le c_{l_0 r}, \]
where the first inequality holds because $A'$ is a feasible allocation.

On the other hand, for every user $i \in U$, we have
\[ \sum_l \hat g_{il} = \sum_l g''_{il} + \delta = G_i(A''_i) + \delta \ge G_i(A_i) + \delta > g. \]
This contradicts the premise that $g$ is optimal for (2.6). □

2.4.4 Group Strategyproofness

So far, all our discussions have been based on a critical assumption that all users truthfully report their resource demands. However, in a real-world system, it is common to observe users attempting to manipulate the scheduler by misreporting their resource demands, so as to receive more allocation [12, 43]. More often than not, these strategic behaviours significantly hurt honest users and reduce the number of their tasks scheduled, inevitably leading to a fairly inefficient allocation outcome. Fortunately, we show by the following proposition that DRFH is immune to these strategic behaviours, as reporting


the true demand is always the dominant strategy for all users, even if they form a coalition to misreport

together with others.

Proposition 3 (Group strategyproofness). The DRFH allocation obtained by solving (2.7) is group strategyproof, in the sense that misreporting demands as a coalition cannot strictly benefit every member.

Proof: Let $M \subset U$ be the set of strategic users forming a coalition to misreport their normalized demand vectors $d'_M = (d'_i)_{i \in M}$, where $d'_i \ne d_i$ for all $i \in M$. Let $d'$ be the collection of normalized demand vectors submitted by all users, where $d'_i = d_i$ for all $i \in U \setminus M$. Let $A'$ be the resulting allocation obtained by solving (2.7). In particular, $A'_{il} = g'_{il} d'_i$ for each user $i$ and server $l$, and $g' = \sum_l g'_{il}$, where $\{g'_{il}\}, g'$ solve (2.7). On the other hand, let $A$ be the allocation returned when all users truthfully report their demands, and $\{g_{il}\}, g$ the solution to (2.7) with the truthful $d$. Similarly, for each user $i$ and server $l$, we have $A_{il} = g_{il} d_i$ and $g = \sum_l g_{il}$. We check the following two cases and show that there exists a user $i \in M$ such that $G_i(A'_i) \le G_i(A_i)$, which is equivalent to $N_i(A'_i) \le N_i(A_i)$.

Case 1: $g' \le g$. In this case, define $\rho_i = \min_r \{ d'_{ir} / d_{ir} \}$ for every user $i \in M$. Clearly,
\[ \rho_i = \min_r \{ d'_{ir} / d_{ir} \} \le d'_{i r_i^*} / d_{i r_i^*} = d'_{i r_i^*} \le 1, \]
where $r_i^*$ is the dominant resource of user $i$. We then have
\[ G_i(A'_i) = \sum_l G_{il}(A'_{il}) = \sum_l G_{il}(g'_{il} d'_i) = \sum_l \min_r \{ g'_{il} d'_{ir} / d_{ir} \} = \rho_i g' \le g = G_i(A_i). \]

Case 2: $g' > g$. We first consider users that are not manipulators. Since they truthfully report their demands, we have
\[ G_j(A'_j) = g' > g = G_j(A_j), \quad \forall j \in U \setminus M. \tag{2.8} \]
Now for the manipulators, there must be a user $i \in M$ such that $G_i(A'_i) < G_i(A_i)$. Otherwise, allocation $A'$ would be weakly preferred to allocation $A$ by all users and, by (2.8), strictly preferred by those in $U \setminus M$, contradicting the facts that $A$ is a Pareto optimal allocation and $A'$ is a feasible allocation. □


2.4.5 Sharing Incentive

In addition to the aforementioned three properties, sharing incentive is another critical allocation property that has been frequently mentioned in the literature, e.g., [12–14, 16, 43]. The property ensures that every user can execute at least the number of tasks it could schedule if the entire resource pool were evenly partitioned, and thus provides service isolation among the users.

While the sharing incentive property is well defined in the all-in-one resource model, it is not for

the system with multiple heterogeneous servers. In the former case, since the entire resource pool is

abstracted as a single server, evenly dividing every resource of this big server would lead to a unique

allocation. However, when the system consists of multiple heterogeneous servers, there are many different

ways to evenly divide these servers, and it is unclear which one should be used as a benchmark for

comparison. For instance, in the example of Fig. 2.2, two users share a system with 14 CPUs and 14 GB

memory in total. The following two allocations both allocate each user 7 CPUs and 7 GB memory: (a)

User 1 is allocated 1/2 resources of server 1 and 1/2 resources of server 2, while user 2 is allocated the

rest; (b) user 1 is allocated (1.5 CPUs, 5.5 GB) in server 1 and (5.5 CPUs, 1.5 GB) in server 2, while

user 2 is allocated the rest. It is easy to verify that the two allocations lead to different numbers of tasks scheduled for the same user, and can be used as two different allocation benchmarks. In fact, one can

construct many other allocations that evenly divide all resources among the users.

Despite the general ambiguity explained above, in the next two subsections, we consider two definitions of the sharing incentive property, strong and weak, depending on the choice of the benchmark for

equal partitioning of resources.

Strong Sharing Incentive

Among various allocations that evenly divide all servers, perhaps the most straightforward approach is

to evenly partition each server's availability $c_l$ among all $n$ users. The strong sharing incentive property

is defined by using this per-server partitioning as a benchmark.

Definition 2 (Strong sharing incentive). Allocation $A$ satisfies the strong sharing incentive property if each user schedules at least as many tasks as it would by evenly partitioning every server, i.e.,
\[ N_i(A_i) = \sum_{l \in S} N_{il}(A_{il}) \ge \sum_{l \in S} N_{il}(c_l / n), \quad \forall i \in U. \]

Before we proceed, it is worth mentioning that the per-server partitioning above cannot be directly

implemented in practice. With a large number of users, in each server, everyone will be allocated a very


small fraction of the server's availability. In practice, such a small slice of resources usually cannot be used to run any computing task. However, per-server partitioning may be interpreted as follows. Since a cloud system is constructed by pooling hundreds of thousands of servers [2, 5], the number of users is typically far smaller than the number of servers [12, 43], i.e., $k \gg n$. An equal partition could randomly allocate $k/n$ servers to each user, which is equivalent to randomly allocating each server to each user with probability $1/n$. It is easy to see that the mean number of tasks scheduled for each user under this random allocation is $\sum_l N_{il}(c_l / n)$, the same as that obtained under the per-server partitioning.

Unfortunately, the following proposition shows that DRFH may violate the sharing incentive property

in the strong sense. The proof gives a counterexample.

Proposition 4. DRFH does not satisfy the property of strong sharing incentive.

Proof: Consider a system consisting of two servers. Server 1 has 1 CPU and 2 GB memory; server 2 has 4 CPUs and 3 GB memory. There are two users. Each instance of the task of user 1 demands 1 CPU and 1 GB memory; each of user 2's tasks demands 3 CPUs and 2 GB memory. In this case, we have $c_1 = (1/5, 2/5)^T$, $c_2 = (4/5, 3/5)^T$, $D_1 = (1/5, 1/5)^T$, $D_2 = (3/5, 2/5)^T$, $d_1 = (1, 1)^T$, and $d_2 = (1, 2/3)^T$. It is easy to verify that under DRFH, the global dominant share both users receive is $12/25$. On the other hand, under the per-server partitioning, the global dominant share that user 2 receives is $1/2$, higher than that received under DRFH. □
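The numbers in this counterexample can be double-checked numerically. The sketch below (illustrative Python with SciPy, mirroring the LP layout of the earlier sketch) recovers both the DRFH share of 12/25 and user 2's share of 1/2 under the per-server partition:

    import numpy as np
    from scipy.optimize import linprog

    cap = np.array([[1/5, 2/5], [4/5, 3/5]])   # normalized server capacities
    D = np.array([[1/5, 1/5],                  # user 1: (1 CPU, 1 GB) per task
                  [3/5, 2/5]])                 # user 2: (3 CPUs, 2 GB) per task
    d = D / D.max(axis=1, keepdims=True)       # normalized demands

    # Problem (2.7) with variables [g_11, g_12, g_21, g_22, g].
    A_ub, b_ub = [], []
    for l in range(2):
        for r in range(2):
            row = np.zeros(5)
            row[0 + l], row[2 + l] = d[0, r], d[1, r]
            A_ub.append(row)
            b_ub.append(cap[l, r])
    A_eq = [[1, 1, 0, 0, -1], [0, 0, 1, 1, -1]]
    res = linprog([0, 0, 0, 0, -1], A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=[0, 0])
    print(res.x[-1])                 # DRFH dominant share: 12/25 = 0.48

    # Per-server partition: user 2 receives cap_l / 2 in every server.
    n2 = ((cap / 2) / D[1]).min(axis=1).sum()  # 1/6 + 2/3 = 5/6 tasks
    print(n2 * D[1].max())           # dominant share: 1/2 > 12/25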

While DRFH may violate the strong sharing incentive property, we shall show via trace-driven

simulations in Sec. 2.6 that this only happens in rare cases.

Weak Sharing Incentive

The strong sharing incentive property is defined by choosing the per-server partitioning as a benchmark,

which is only one of many different ways to evenly divide the total availability. In general, any equal

partition that allocates an equal share of every resource can be used as a benchmark. This allows us to

relax the sharing incentive definition. We first define an equal partition as follows.

Definition 3 (Equal partition). Allocation A is an equal partition if it divides every resource evenly

among all users, i.e.,
\[ \sum_{l \in S} A_{ilr} = 1/n, \quad \forall r \in R,\ i \in U. \]

It is easy to verify that the aforementioned per-server partition is an equal partition. We are now

ready to define the weak sharing incentive property as follows.


Definition 4 (Weak sharing incentive). Allocation $A$ satisfies the weak sharing incentive property if there exists an equal partition $A'$ under which each user schedules no more tasks than it does under $A$, i.e.,
\[ N_i(A_i) \ge N_i(A'_i), \quad \forall i \in U. \]

In other words, the property of weak sharing incentive only requires the allocation to be at least as good as some equal partition, without specifying its specific form. It is hence a more relaxed requirement than the strong sharing incentive property.

The following proposition shows that DRFH satisfies the sharing incentive property in the weak

sense. The proof is constructive.

Proposition 5 (Weak sharing incentive). DRFH satisfies the property of weak sharing incentive.

Proof: Let $g$ be the global dominant share each user receives under a DRFH allocation $A$, and $g_{il}$ the global dominant share user $i$ receives in server $l$. We construct an equal partition $A'$ under which users schedule no more tasks than they do under $A$.

Case 1: $g \ge 1/n$. In this case, let $A'$ be any equal partition. We show that each user schedules no more tasks under $A'$ than under $A$. To see this, consider the DRFH allocation $A$. Since it is non-wasteful, the number of tasks user $i$ schedules is
\[ N_i(A_i) = g / D_{i r_i^*} \ge \frac{1}{n D_{i r_i^*}}. \]
On the other hand, the number of tasks user $i$ schedules under $A'$ is at most
\[ N_i(A'_i) = \sum_{l \in S} \min_r \{ A'_{ilr} / D_{ir} \} \le \sum_{l \in S} A'_{il r_i^*} / D_{i r_i^*} = \frac{1}{n D_{i r_i^*}} \le N_i(A_i). \]

Case 2: $g < 1/n$. In this case, no resource has been fully allocated under $A$, i.e.,
\[ \sum_{i \in U} \sum_{l \in S} A_{ilr} = \sum_{i \in U} \sum_{l \in S} g_{il} d_{ir} \le \sum_{i \in U} \sum_{l \in S} g_{il} = \sum_{i \in U} g < 1 \]
for every resource $r \in R$. Let
\[ L_{lr} = c_{lr} - \sum_{i \in U} A_{ilr} \]


be the amount of resource $r$ left unallocated in server $l$. Further, let
\[ L_r = \sum_{l \in S} L_{lr} = 1 - \sum_{i \in U} \sum_{l \in S} A_{ilr} \]
be the total amount of resource $r$ left unallocated.

We are now ready to construct an equal partition $A'$ based on $A$. Since $A'$ should allocate each user $1/n$ of the total availability of every resource $r$, the additional amount of resource $r$ user $i$ needs to obtain is
\[ u_{ir} = 1/n - \sum_{l \in S} A_{ilr}. \]
Note that $u_{ir} > 0$ for all $i \in U, r \in R$, since $\sum_l A_{ilr} \le g < 1/n$. The fraction of the unallocated resource $r$ demanded by user $i$ is
\[ f_{ir} = u_{ir} / L_r. \]
As a result, we can construct $A'$ by reallocating the leftover resources in each server to the users, in proportion to their unmet shares, i.e.,
\[ A'_{ilr} = A_{ilr} + L_{lr} f_{ir}, \quad \forall i \in U, l \in S, r \in R. \]

It is easy to verify that $A'$ is an equal partition, i.e.,
\[ \sum_{l \in S} A'_{ilr} = \sum_{l \in S} L_{lr} f_{ir} + \sum_{l \in S} A_{ilr} = (u_{ir} / L_r) \sum_{l \in S} L_{lr} + \sum_{l \in S} A_{ilr} = u_{ir} + \sum_{l \in S} A_{ilr} = 1/n, \quad \forall i \in U, r \in R. \]

We now compare the number of tasks scheduled for each user under the two allocations $A$ and $A'$. Because $A'$ allocates to each user at least as many resources as $A$ does, we have $N_i(A'_i) \ge N_i(A_i)$ for all $i$. On the other hand, by the Pareto optimality of allocation $A$, no user can schedule more tasks without decreasing the number of tasks scheduled for others. Therefore, we must have $N_i(A'_i) = N_i(A_i)$ for all $i$. □


Discussion

Strong sharing incentive provides more predictable service isolation than weak sharing incentive does. It assures a user a priori that it can schedule at least the number of tasks it could schedule if every server were evenly allocated. This gives users a concrete idea of the worst Quality of Service (QoS) they may receive, allowing them to predict their computing performance. While weak sharing incentive also provides some degree of service isolation, a user cannot infer a priori, from this weaker property, the guaranteed number of tasks it can schedule, and therefore cannot predict its computing performance.

We note that the root cause of such degradation of service isolation is the heterogeneity among servers. When all servers have the same specification of hardware resources, DRFH reduces to DRF, and strong sharing incentive is guaranteed. This is also the case for schedulers adopting the single-resource abstraction. For example, in Hadoop, each server is divided into several slots (e.g., for mappers and reducers), and the Hadoop Fair Scheduler [45] allocates these slots evenly to all users. We see that predictable service isolation is achieved: each user receives at least $k_s/n$ slots, where $k_s$ is the total number of slots and $n$ is the number of users.

In general, one can view weak sharing incentive as the price paid by DRFH to achieve high resource

utilization. In fact, naively applying DRF allocation separately to each server retains strong sharing

incentive: in each server, the DRF allocation ensures that a user can schedule at least the number of tasks it could schedule when resources are evenly allocated [12, 16]. However, as we have seen in Sec. 2.3.4, such a naive

DRF extension may lead to unacceptably low resource utilization. A similar problem also exists for traditional schedulers adopting single-resource abstractions. By artificially dividing servers

into slots, these schedulers cannot match computing demands to available resources at a fine granularity,

resulting in poor resource utilization in practice [12]. For these reasons, we believe that slightly trading

off the degree of service isolation for much higher resource utilization is well justified. We shall use

trace-driven simulation to show in Sec. 2.6.2 that DRFH only violates strong sharing incentive in rare

cases in the Google cluster.

2.4.6 Other Important Properties

In addition to the four essential properties shown in the previous subsections, DRFH also provides a number of other important properties. First, since DRFH generalizes DRF to heterogeneous environments,

it naturally reduces to the DRF allocation when there is only one server contained in the system, where

the global dominant resource defined in DRFH is exactly the same as the dominant resource defined in

DRF.


Proposition 6 (Single-server DRF). DRFH leads to the same allocation as DRF when all resources are concentrated in one server.

Next, by definition, we see that both single-resource fairness and bottleneck fairness trivially hold

for the DRFH allocation. We hence omit the proofs of the following two propositions.

Proposition 7 (Single-resource fairness). The DRFH allocation satisfies single-resource fairness.

Proposition 8 (Bottleneck fairness). The DRFH allocation satisfies bottleneck fairness.

Finally, we see that when a user leaves the system and relinquishes all its allocations, the remaining

users will not see any reduction in the number of tasks scheduled. Formally,

Proposition 9 (Population monotonicity). The DRFH allocation satisfies population monotonicity.

Proof: Let $A$ be the resulting DRFH allocation; then for every user $i$ and server $l$, $A_{il} = g_{il} d_i$ and $G_i(A_i) = g$, where $\{g_{il}\}$ and $g$ solve (2.7). Suppose that user $j$ leaves the system, changing the resulting DRFH allocation to $A'$. By DRFH, for every user $i \ne j$ and server $l$, we have $A'_{il} = g'_{il} d_i$ and $G_i(A'_i) = g'$, where $\{g'_{il}\}_{i \ne j}$ and $g'$ solve the following optimization problem:
\[ \begin{aligned} \max_{\{g'_{il}\}_{i \ne j},\, g'} \quad & g' \\ \text{s.t.} \quad & \sum_{i \ne j} g'_{il} d_{ir} \le c_{lr}, \ \forall l \in S, r \in R, \\ & \sum_{l \in S} g'_{il} = g', \ \forall i \ne j. \end{aligned} \tag{2.9} \]
To show $N_i(A'_i) \ge N_i(A_i)$ for every user $i \ne j$, it is equivalent to prove $G_i(A'_i) \ge G_i(A_i)$. It is easy to verify that $g, \{g_{il}\}_{i \ne j}$ satisfy all the constraints of (2.9) and are hence feasible to (2.9). As a result, $g' \ge g$, which is exactly $G_i(A'_i) \ge G_i(A_i)$. □

2.5 Practical Considerations

So far, all our discussions have been based on several assumptions that may not hold in a real-world system. In this section, we relax these assumptions and discuss how DRFH can be implemented in practice.

2.5.1 Weighted Users with a Finite Number of Tasks

In the previous sections, users are assumed to be assigned equal weights and have infinite computing

demands. Both assumptions can be easily removed with some minor modifications of DRFH.


When users are assigned uneven weights, let $w_i$ be the weight associated with user $i$. DRFH seeks an allocation that achieves weighted max-min fairness across users. Specifically, we maximize the minimum normalized global dominant share (w.r.t. the weights) of all users under the same resource constraints as in (2.4), i.e.,
\[ \max_{A} \ \min_{i \in U} \ G_i(A_i) / w_i \quad \text{s.t.} \quad \sum_{i \in U} A_{ilr} \le c_{lr}, \ \forall l \in S, r \in R. \]

When users have a finite number of tasks, the DRFH allocation is computed iteratively. In each round, DRFH increases the global dominant share allocated to all active users, until one of them has all its tasks scheduled, after which the user becomes inactive and is no longer considered in the following allocation rounds. DRFH then starts a new iteration and repeats the allocation process above, until no user is active or no more resources can be allocated. Because each iteration saturates at least one user's resource demand, the allocation is accomplished in at most $n$ rounds, where $n$ is the number of users (for medium- and large-sized cloud clusters, $n$ is in the order of thousands [2, 3]). Our analysis presented in Sec. 2.4 also extends to weighted users with a finite number of tasks.

2.5.2 Scheduling Tasks as Entities

Until now, we have assumed that all tasks are divisible. In a real-world system, however, fractional tasks may not be accepted. To schedule tasks as entities, one can apply progressive filling as a simple implementation of DRFH. (Progressive filling has also been used to implement the DRF allocation [12]; when the system consists of multiple heterogeneous servers, it leads to a DRFH allocation.) That is, whenever there is a scheduling opportunity, the scheduler always accommodates the user with the lowest global dominant share. To do this, it picks the first server whose remaining resources are sufficient to accommodate the user's task. While this First-Fit algorithm offers a fairly good approximation to DRFH, we propose another simple heuristic that can lead to a better allocation with higher resource utilization.

Similar to First-Fit, the heuristic also chooses the user $i$ with the lowest global dominant share to serve. However, instead of simply picking the first server that fits, the heuristic chooses the "best" server, the one whose remaining resources most closely match the demand of user $i$'s tasks; it is hence referred to as Best-Fit DRFH. Specifically, for user $i$ with resource demand vector $D_i = (D_{i1}, \dots, D_{im})^T$ and a server $l$ with available resource vector $\bar c_l = (\bar c_{l1}, \dots, \bar c_{lm})^T$, where $\bar c_{lr}$ is the share of resource $r$ remaining available in server $l$, we define the following heuristic function to quantitatively measure the fitness of the task for



server $l$:
\[ H(i, l) = \left\| D_i / D_{i1} - \bar c_l / \bar c_{l1} \right\|_1, \tag{2.10} \]
where $\|\cdot\|_1$ is the $L_1$-norm. Intuitively, the smaller $H(i, l)$ is, the more similar the resource demand vector $D_i$ appears to the server's available resource vector $\bar c_l$, and the better fit user $i$'s task is for server $l$. Best-Fit DRFH schedules user $i$'s tasks onto the server $l$ with the least $H(i, l)$.

As an illustrative example, suppose that only two types of resources are concerned, CPU and memory. A CPU-heavy task of user $i$ with resource demand vector $D_i = (1/10, 1/30)^T$ is to be scheduled, meaning that the task requires 1/10 of the total CPU availability and 1/30 of the total memory availability of the system. Only two servers have sufficient remaining resources to accommodate this task. Server 1 has the available resource vector $\bar c_1 = (1/5, 1/15)^T$; server 2 has the available resource vector $\bar c_2 = (1/8, 1/4)^T$. Intuitively, because the task is CPU-bound, it is more fit for server 1, which is CPU-abundant. This is indeed the case, as $H(i, 1) = 0 < H(i, 2) = 5/3$, and Best-Fit DRFH places the task onto server 1.
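Putting the pieces together, the following simplified sketch (illustrative Python; function and variable names are ours, and task completions are ignored, so resources only shrink and a blocked user stays blocked) implements Best-Fit DRFH as progressive filling over indivisible tasks, which also accommodates the finite task queues of Sec. 2.5.1:

    import heapq
    import numpy as np

    def H(demand, free):
        """Eq. (2.10): L1 distance after scaling both vectors by their
        first components."""
        return np.abs(demand / demand[0] - free / free[0]).sum()

    def best_fit_drfh(demands, dom_demand, free, num_tasks):
        """demands: per-task demand vectors D_i; dom_demand: D_{i,r_i^*};
        free: remaining availability per server; num_tasks: tasks per user."""
        shares = [(0.0, i) for i in range(len(demands))]
        heapq.heapify(shares)                 # (global dominant share, user)
        placements = []
        while shares:
            share, i = heapq.heappop(shares)  # most "unfair" user first
            if num_tasks[i] == 0:
                continue                      # user is done
            feasible = [l for l in range(len(free))
                        if np.all(demands[i] <= free[l])]
            if not feasible:
                continue                      # user is blocked for good
            l = min(feasible, key=lambda s: H(demands[i], free[s]))
            free[l] -= demands[i]
            num_tasks[i] -= 1
            placements.append((i, l))
            heapq.heappush(shares, (share + dom_demand[i], i))
        return placements

Replacing the best-fit server choice with the first feasible server yields the First-Fit variant.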

Both First-Fit and Best-Fit DRFH can be easily implemented by searching all $k$ servers in $O(k)$ time, which is fast enough for small- and medium-sized clusters. For a large cluster containing tens of thousands of servers, this computation can be quickly approximated by adapting the power of two choices load balancing technique [46]: instead of scanning through all servers, the scheduler randomly probes two servers and places the task on the one that fits the task better.
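A minimal sketch of this two-choices probe (again illustrative Python, with the fitness function H repeated for self-containment) might look as follows:

    import random
    import numpy as np

    def H(demand, free):
        return np.abs(demand / demand[0] - free / free[0]).sum()

    def probe_two(demand, free_list, rng=random):
        """Probe two random servers and return the better-fitting feasible
        one, or None if neither can host the task."""
        l1, l2 = rng.sample(range(len(free_list)), 2)
        feasible = [l for l in (l1, l2) if np.all(demand <= free_list[l])]
        if not feasible:
            return None   # caller may re-probe or fall back to a full scan
        return min(feasible, key=lambda l: H(demand, free_list[l]))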

It is worth mentioning that the definition of the heuristic function (2.10) is not unique. In fact, one can use more complex heuristic functions than (2.10) to measure the fitness of a task for a server, e.g., cosine similarity [47]. However, as we shall show in the next section, Best-Fit DRFH with (2.10) as its heuristic function already improves the utilization to a level where the system capacity is almost saturated. Therefore, the benefit of using more complex fitness measures is very limited, at least for the Google cluster traces [3].

2.6 Evaluation

In this section, we evaluate the performance of DRFH via both prototype implementation and extensive

simulations driven by Google cluster-usage traces [3].

2.6.1 Prototype Implementation

We implemented First-Fit DRFH as a new pluggable allocation module in Apache Mesos 0.18.1 [39,48].

Mesos is an open-source cluster resource management system that has been widely adopted by industry


Table 2.1: Resource configurations of the 50 nodes in the launched Amazon EC2 cluster.

# Cores   Memory (MB)   # Nodes
2         200           7
1         800           10
1         400           9
1         200           9
2         800           8
2         400           7

Table 2.2: Details of tasks submitted by four artificial applications.

App. ID   # Tasks   # Cores per Task   Memory per Task (MB)   Task Runtime (s)
1         6000      0.2                20                     25
2         5000      0.1                40                     60
3         7500      0.1                40                     25
4         8000      0.2                20                     25

companies such as Twitter and Google. Our implementation maintains a priority queue over all active applications: the one with the least global dominant share is deemed the most "unfairly" treated and receives the highest scheduling priority. Whenever resources become available in some server, our DRFH allocator offers them to the application at the head of this queue.

We set up a cluster in Amazon EC2 with a total of 50 nodes (t2.medium). In order to emulate

the machine heterogeneity in a large cluster, we configured Mesos so that each node contributes only a

partial amount of resources (i.e., CPU and memory) to the cluster. Table 2.1 summarizes the resource

configuration of each node.

To evaluate the allocation fairness of our DRFH allocator in the presence of users dynamically joining

and departing, we wrote four artificial Mesos applications and launched them sequentially over time.

Table 2.2 summarizes the number of tasks submitted by all four applications, the resource demands of

each task, and task runtime.

Throughout the experiment, we measure the CPU, memory, and global dominant share of the four

applications every 10 s. Fig. 2.5 shows the measured allocation shares over time with DRFH. We see

that initially, only App1 is active, and is allocated the entire CPU cores of the cluster. This allocation

remains for 300 s, after which App2 joins the system. Both applications now compete for the cluster

resources, leading to a DRFH allocation with both applications receiving the same global dominant

share of 50%. At around 560 s, App1 finishes running and returns all the allocated resources. App2 then

occupies the entire cluster, and is allocated 70% of the total CPU cores and 80% of cluster memory. We

note that not all memory is allocated to App2 because a certain amount of memory has been reserved

for Mesos Slave in each node [48]. At around 800 s, both App3 and App4 become active and start to


[Figure 2.5] CPU, memory, and global dominant share of four applications running in a 50-node Amazon EC2 cluster.

launch computing tasks. The DRFH allocator ensures the same global dominant share of 33% for all

three applications until App2 finishes at 1350 s. After that the entire cluster is shared by App3 and

App4. Since App3 (resp., App4) has the same resource requirements as those of App1 (resp., App2),

the allocation shares received by App3 (resp., App4) are similar to those received by App1 (resp., App2)

from 300 s to 560 s. Finally, we see that App4 increases its CPU share to 100% after App3 departs

the system. To summarize, whenever more than one application shares the cluster, our Mesos implementation quickly ensures that all applications receive the same global dominant share, achieving the precise DRFH allocation at all times.

Our prototype implementation in a small-sized Amazon EC2 cluster suggests that naive First-Fit

DRFH is sufficient to achieve near-perfect fairness. While Best-Fit DRFH cannot do better than First-

Fit in terms of fairness, we shall see in the next subsection that the former outperforms the latter in

terms of other metrics – notably resource utilization – when the cluster is of a large scale.


Table 2.3: Resource utilization of the Slots scheduler with different slot sizes.

Number of Slots         CPU Utilization   Memory Utilization
10 per maximum server   35.1%             23.4%
12 per maximum server   42.2%             27.4%
14 per maximum server   43.9%             28.0%
16 per maximum server   45.4%             24.2%
20 per maximum server   40.6%             20.0%

2.6.2 Trace-Driven Simulation

We now turn to a system at a much larger scale and compare the two DRFH implementations via extensive simulations driven by Google cluster-usage traces [3]. The traces contain resource demand/usage

information of over 900 users (i.e., Google services and engineers) on a cluster of 12K servers. The server

configurations are summarized in Table 1.1, where the CPUs and memory of each server are normalized

so that the maximum server is 1. Each user submits computing jobs, divided into a number of tasks, each

requiring a set of resources (i.e., CPU and memory). From the traces, we extract the computing demand

information – the required amount of resources and task running time – and use it as the demand input

of the allocation algorithms for evaluation.

Resource Utilization

Our first evaluation focuses on the resource utilization. We take the 24-hour computing demand data

from the Google traces and simulate it on a smaller cloud computing system of 2,000 servers so that

fairness becomes relevant. The server configurations are randomly drawn from the distribution of Google

cluster servers in Table 1.1. We compare Best-Fit DRFH with two other benchmarks: the traditional Slots scheduler that schedules tasks onto slots of servers (e.g., Hadoop Fair Scheduler [45]), and First-Fit DRFH, which chooses the first server that fits the task. For the former, we try different slot sizes and choose the one with the highest CPU and memory utilization. Table 2.3 summarizes our observations, where dividing the maximum server (1 CPU and 1 memory in Table 1.1) into 14 slots leads to the highest overall utilization.

Fig. 2.6 depicts the time series of CPU and memory utilization of the three algorithms. We see that

the two DRFH implementations significantly outperform the traditional Slots scheduler with much higher

resource utilization, mainly because the latter ignores the heterogeneity of both servers and workload.

This observation is consistent with findings in the homogeneous environment where all servers are of

the same hardware configurations [12]. As for the DRFH implementations, we see that Best-Fit DRFH

leads to uniformly higher resource utilization than the First-Fit alternative at all times.


[Figure 2.6] Time series of CPU and memory utilization under Best-Fit DRFH, First-Fit DRFH, and the Slots scheduler.

The high resource utilization of Best-Fit DRFH naturally translates to shorter job completion times

shown in Fig. 2.7a, where the CDFs of job completion times for both Best-Fit DRFH and Slots scheduler

are depicted. Fig. 2.7b offers a more detailed breakdown, where jobs are classified into 5 categories based on the number of their computing tasks, and for each category, the mean completion time reduction

is computed. While DRFH shows no improvement over the Slots scheduler for small jobs, a significant

completion time reduction has been observed for those containing more tasks. Generally, the larger

the job is, the more improvement one may expect. Similar observations have also been found in the

homogeneous environments [12].

Fig. 2.7 does not account for partially completed jobs and focuses only on those having all tasks

finished in both Best-Fit and Slots. As a complementary study, Fig. 2.8 computes the task completion

ratio – the number of tasks completed over the number of tasks submitted – for every user using Best-Fit

DRFH and Slots schedulers, respectively. The radius of the circle is scaled logarithmically to the number

of tasks the user submitted. We see that Best-Fit DRFH leads to a higher task completion ratio for almost all users. Around 20% of users have all their tasks completed under Best-Fit DRFH but not under Slots.

Sharing Incentive

Our final evaluation is on the sharing incentive property of DRFH. While we have shown in Sec. 2.4.5

that DRFH may not satisfy the property in the strong sense, it remains unclear how often this property

would be violated in practice. Does it happen frequently, or only in rare cases? We answer this question in


[Figure 2.7] DRFH improvements on job completion times over the Slots scheduler: (a) CDF of job completion times; (b) mean completion time reduction by job size: −1% (1–50 tasks), 2% (51–100), 25% (101–500), 43% (501–1000), and 62% (>1000).

this subsection.

We first compute the task completion ratio under the benchmark per-server partition. To do this, we randomly choose $\lceil k/n \rceil$ servers, where $k$ is the number of servers and $n$ is the number of users in the traces. We then allocate these $\lceil k/n \rceil$ servers to a user and schedule its tasks onto them. These $\lceil k/n \rceil$ servers form a dedicated cloud exclusive to this user. We compute the task completion ratio obtained in this dedicated cloud and compare it with the one obtained under the DRFH allocation. Fig. 2.9 illustrates the comparison results for all users. We see that most users prefer the DRFH allocation to running tasks in a dedicated cloud. In particular, only 2% of users see fewer tasks finished under the DRFH allocation, and even for these users, the task completion ratio decreases only slightly, as shown in Fig. 2.9. As a result, we see that DRFH only violates the property of strong sharing incentive in rare cases in the Google traces.


[Figure 2.8] Task completion ratios of users under the Best-Fit DRFH and Slots schedulers. Each bubble's size is logarithmic in the number of tasks the user submitted.

[Figure 2.9] Comparison of task completion ratios under DRFH and those obtained in dedicated clouds (DCs). Each circle's radius is logarithmic in the number of tasks submitted.

2.7 Related Work

Despite the extensive computing system literature on fair resource allocation, many existing works

limit their discussions to the allocation of a single resource type, e.g., CPU time [49, 50] and link

bandwidth [51–55]. Various fairness notions have also been proposed throughout the years, ranging

from application-specific allocations [56,57] to general fairness measures [51,58,59].

As for multi-resource allocation, state-of-the-art cloud computing systems employ naive single-resource abstractions. For example, the two fair sharing schedulers currently supported in Hadoop [45, 60]

partition a node into slots with fixed fractions of resources, and allocate resources jointly at the slot

granularity. Quincy [61], a fair scheduler developed for Dryad [11], models the fair scheduling problem

as a min-cost flow problem to schedule jobs into slots. The recent work [43] takes the job placement

constraints into consideration, yet it still uses a slot-based single resource abstraction.


Ghodsi et al. [12] are the first in the literature to present a systematic investigation of the multi-resource allocation problem in cloud computing systems. They proposed DRF to equalize the dominant shares of all users, and showed that a number of desirable fairness properties are guaranteed in the resulting

allocation. DRF has quickly attracted a substantial amount of attention and has been generalized in many directions. Notably, Joe-Wong et al. [13] generalized the DRF measure and incorporated it into

a unifying framework that captures the trade-offs between allocation fairness and efficiency. Dolev et

al. [14] suggested another notion of fairness for multi-resource allocation, known as Bottleneck-Based

Fairness (BBF), under which two fairness properties that DRF possesses are also guaranteed. Gutman

and Nisan [15] considered another setting of DRF with a more general domain of user utilities, and

showed their connections to the BBF mechanism. Parkes et al. [16], on the other hand, extended DRF

in several ways, including the presence of zero demands for certain resources, weighted user endowments,

and in particular the case of indivisible tasks. They also studied the loss of social welfare under the

DRF rules. Kash et al. [41] extended the DRF model to allow users to join the system over time but
never leave. Bhattacharya et al. [62] generalized DRF to a hierarchical scheduler that offers service
isolation in a computing system with a hierarchical structure. All these works assume, explicitly or

implicitly, that resources are either concentrated into one super computer, or are distributed to a set of

homogeneous servers with exactly the same resource configuration.

However, server heterogeneity has been widely observed in today’s cloud computing systems. Specif-

ically, Ahmad et al. [8] noted that datacenter clusters usually consist of both high-performance servers

and low-power nodes with different hardware architectures. Reiss et al. [2,3] illustrated a wide range of

server specification in Google clusters. As for public clouds, Farley et al. [6] and Ou et al. [7] observed

significant hardware diversity among Amazon EC2 servers that may lead to substantially different per-

formance across supposedly equivalent VM instances. Ou et al. [7] also pointed out that such server

heterogeneity is not limited to EC2 only, but generally exists in long-lasting public clouds such as

Rackspace.

To our knowledge, the very recent paper [63] is the only work that studied allocation properties

in the presence of server heterogeneity, where a randomized allocation algorithm, called Constrained-

DRF (CDRF), is proposed to schedule discrete jobs. While CDRF possesses all the desirable properties

discussed in this chapter, it is too complex for a job scheduler, and an efficient algorithm remains an

open problem [63]. More recently, Grandl et al. [64] proposed an efficient heuristic algorithm for multi-

resource scheduling in heterogeneous computer clusters. Their work mainly focuses on designing a good

heuristic algorithm, not studying the allocation properties, and is therefore orthogonal to our work.

Other related works include fair-division problems in the economics literature, in particular the


egalitarian division under Leontief preferences [42] and the cake-cutting problem [44]. These works also

assume the all-in-one resource model, and hence cannot be directly applied to cloud computing systems

with heterogeneous servers.

Finally, we note that many bandwidth allocation problems in the networking literature, although

focusing on a single resource type, can be naturally converted to a multi-resource sharing problem.

For example, in the traditional flow control problem (described in Sec. 6.5.2 of [65]), each flow has its

route determined already and is associated with a fixed path in the network. The flow control problem

requires us to determine the traffic rate for each flow such that the entire network bandwidth is fairly

shared. To see how this problem connects to a multi-resource sharing problem, we view each link as a

type of resource, whose availability is the link's bandwidth capacity. Moreover, we view each flow as an

aggregation of many sub-flows, each traversing the same path as the flow at a rate of 1 bit per second.

We hence construct an equivalent multi-resource sharing problem where the equivalent of a user is a

flow and the equivalent of a task is a sub-flow. DRF can therefore be applied as a fair sharing solution:

The dominant resource of a flow is the link with the smallest bandwidth capacity among all links the

flow traverses. It turns out that the resulting DRF allocation reduces to the well-known max-min fair

flow control mechanism [65] (Sec. 6.5.2). It is worth emphasizing that while the flow control problem

can be interpreted as a multi-resource fair sharing problem, the converse is generally not true. This is

because in a network, a flow always consumes the same amount of bandwidth in all links it traverses.

In contrast, in a multi-resource sharing problem, a task may require different amounts of each resource.
This key difference is what makes max-min fair flow control inapplicable in the general
multi-resource setting.
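To make this reduction concrete, the following minimal C++ sketch (purely illustrative, with hypothetical flow and link structures not taken from any system described here) computes a flow's dominant resource under this interpretation, namely the minimum-capacity link on its fixed path:

#include <vector>

// Purely illustrative: each link is a resource whose availability is its
// bandwidth capacity; a flow is identified by the links on its fixed path.
struct Flow {
    std::vector<int> path;  // indices of the links the flow traverses (non-empty)
};

// Since a flow consumes the same amount of bandwidth on every link it
// traverses, its dominant resource is simply the smallest-capacity link
// along its path.
int DominantLink(const Flow& flow, const std::vector<double>& linkCapacity) {
    int bottleneck = flow.path.front();
    for (int link : flow.path) {
        if (linkCapacity[link] < linkCapacity[bottleneck])
            bottleneck = link;
    }
    return bottleneck;
}

Equalizing the flows' (weighted) shares on these bottleneck links then recovers the max-min fair flow control mechanism discussed above.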

2.8 Summary

In this chapter, we have studied a multi-resource allocation problem in a heterogeneous cloud computing

system where the resource pool is composed of a large number of servers with different configurations

in terms of resources such as processing, memory, and storage. The proposed multi-resource allocation

mechanism, known as DRFH, equalizes the global dominant share allocated to each user, and hence

generalizes the DRF allocation from a single server to multiple heterogeneous servers. We have analyzed

DRFH and showed that it retains almost all desirable properties that DRF provides in the single-server

scenario. Notably, DRFH is envy-free, Pareto optimal, and group strategyproof. It also offers the sharing

incentive in a weak sense. We have implemented DRFH as a new pluggable resource allocator in Apache

Mesos. Experimental results show that our implementation leads to a precise DRFH allocation in a


real cluster. Large-scale simulations driven by Google cluster traces further show that, compared to the

traditional single-resource abstraction such as a slot scheduler, DRFH achieves significant improvements

in resource utilization, leading to much shorter job completion times.


Chapter 3

Multi-Resource Fair Queueing for

Network Flows

3.1 Motivation

In Chapter 2, we have studied how multiple types of computing resources – such as CPU cores, memory,

IP addresses, and storage spaces – should be shared for datacenter jobs. In addition to computing

resources, the network is another important concern. Modern datacenters typically deploy a large number

of network appliances or “middleboxes”. According to recent studies [17,18], the number of middleboxes

deployed in enterprise and datacenter networks is already on par with that of traditional switches and

routers. These middleboxes do more than just packet forwarding. They perform a variety of critical

network functions that require deep packet inspection based on the payload of packets, such as IP

security encryption, WAN optimization, and intrusion detection. Performing these complex network

functions requires the support of multiple types of resources, and may bottleneck on either CPU or link

bandwidth [4,28]. For example, flows that require basic forwarding may congest the link bandwidth [4],

while those that require IP security encryption need more CPU processing time [28]. In all cases, link

bandwidth is no longer the only network resource shared by flows. A scheduling algorithm specifically

designed for multiple resource types is therefore needed for sharing these resources fairly and efficiently.

Starting from this chapter, we shall focus on the design and implementation of multi-resource queueing

algorithms for middleboxes. The queueing algorithm determines the order in which packets in various

independent flows are processed, and serves as a fundamental mechanism for allocating resources in

a middlebox. Ideally, a queueing algorithm – also known as a packet scheduler – should provide the


following desirable properties.

Fairness. The middlebox scheduler should provide some measure of service isolation to allow com-

peting flows to have a fair share of middlebox resources. In particular, each flow should receive service

(i.e., throughput) at least at the level where every resource is equally allocated (assuming flows are

equally weighted). Moreover, this service isolation should not be compromised by strategic behaviours

of other flows.

Low complexity. With ever-growing line rates and increasing volumes of traffic passing

through middleboxes [29, 30], it is critical to schedule packets at high speeds. This requires low time

complexity when making scheduling decisions. In particular, it is desirable that this complexity is a small

constant, independent of the number of traffic flows. Equally importantly, the scheduling algorithm

should also be amenable to practical implementations.

Bounded scheduling delay. Interactive network and data analytic applications have stringent

end-to-end delay requirements. It is hence important for a packet scheduler to offer bounded scheduling

delay. Such a delay bound should be a small constant, independent of the number of flows.

Despite recent advances in multi-resource fair queueing (e.g., [4]), how to design a multi-resource packet
scheduler that satisfies all three desirable properties remains an open challenge. Existing designs

are expensive to implement at high speed. In particular, DRFQ [4], the first multi-resource fair queueing

algorithm that implements Dominant Resource Fairness (DRF) [12], associates packets with timestamps,

and schedules the one with the earliest timestamp. It suffers from a sorting bottleneck with scheduling

complexity logarithmic in the number of flows. A similar problem exists in DRWF2Q [66].

In this chapter, we design two new schedulers that take O(1) time to schedule a packet while achieving

near-perfect fairness with the scheduling delay bounded by a small constant. The first scheduler, referred

to as Multi-Resource Round Robin (MR3), is designed for flows with equal weights. While round-robin

schedulers have been successfully applied to fairly sharing a single type of resource, e.g., outgoing

bandwidth in switches and routers [25, 67, 68], we observe that directly applying them to schedule

multiple types of resources may lead to arbitrary unfairness. Nonetheless, we show, analytically, that

near-perfect fairness can be achieved by simply deferring the scheduling opportunity of a packet until

the progress gap between two resources falls below a small threshold. We then implement this principle

in a multi-resource packet scheduling design similar to Elastic Round Robin [68], which we show to be

the most suitable variant for the middlebox environment in the design space of round robin algorithms.

Both theoretical analysis and extensive simulations demonstrate that, compared with DRFQ, the only price
we pay is a slight increase in packet latency. MR3 is the first multi-resource fair queueing algorithm that

offers near-perfect fairness in O(1) time.


Despite its low complexity and amenability to implementation, MR3 may incur a large scheduling
delay for flows with uneven weights, and is hence unsuitable for applications with stringent delay

requirements. To address this problem, we design an improved round-robin scheduler, called Group

Multi-Resource Round Robin (GMR3), that achieves all three desirable scheduling properties. GMR3

groups flows with similar weights into a small number of groups, each associated with a timestamp. The

scheduling decisions are made in a two-level hierarchy. At the higher level, GMR3 makes inter-group

scheduling decisions by choosing the group with the earliest timestamp, while at the lower level, the

intra-group scheduler serves flows within a group in a round-robin fashion. GMR3 is highly efficient, as

it requires only O(1) time per packet in almost all practical scenarios. In addition, we show that GMR3

achieves near-perfect fairness across flows, with its scheduling delay bounded by a small constant. These

desirable properties are proven analytically and validated by simulation. To our knowledge, GMR3

represents the first fair queueing algorithm that offers near-perfect fairness with O(1) time complexity

and a constant scheduling delay bound. GMR3 is amenable to practical implementations, and may find

a variety of applications in other multi-resource scheduling contexts such as VM scheduling inside a

hypervisor.

The remainder of this chapter is organized as follows. After a brief survey of the existing literature
in Sec. 3.2 and the preliminaries in Sec. 3.3, we discuss the challenges of extending round-robin
algorithms to the multi-resource scenario in Sec. 3.4. Our design of MR3 is given in Sec. 3.4.3 and is
evaluated via both theoretical analysis and simulation experiments in Sec. 3.4.4 and Sec. 3.4.5,
respectively. Sec. 3.7 concludes the chapter.

3.2 Related Work

Unlike switches and routers where the output bandwidth is the only shared resource, middleboxes handle

a variety of hardware resources and require a more complex packet scheduler. Many recent measure-

ments, such as [4, 27, 28], report that packet processing in a middlebox may bottleneck on any of CPU,

memory bandwidth, and link bandwidth, depending on the network functions applied to the traffic flow.

Such a multi-resource setting significantly complicates the scheduling algorithm. As pointed out in [4],

simply applying traditional fair queueing schemes [21–26] per resource (i.e., per-resource fairness) or on

the bottleneck resource of the middlebox (i.e., bottleneck fairness [69]) fails to offer service isolation.

Furthermore, by strategically claiming some resources that are not needed, a flow may increase its service

share at the expense of other flows.

Ghodsi et al. [4] proposed a multi-resource fair queueing algorithm, termed DRFQ, that imple-

ments DRF in the time domain, i.e., it schedules packets in a way such that each flow receives roughly the


same processing time on its most congested resource. It is shown to achieve service isolation and guaran-

tee that no rational user would misreport the amount of resources it requires. Following this intuition, we

have previously extended the idealized GPS model [21,22] to Dominant Resource GPS (DRGPS), which

implements strict DRF at all times [66]. By emulating DRGPS, well-known fair queueing algorithms,

such as WFQ [21] and WF2Q [26], can have direct extensions in the multi-resource setting. However,

while all these algorithms achieve nearly perfect service isolation, they are timestamp-based schedulers

and are expensive to implement. They all require selecting a packet with the earliest timestamp among

n active flows, taking O(log n) time per packet processing. With a large n, these algorithms are hard to

implement at high speeds.

Reducing the scheduling complexity is a major concern in many existing studies on high-speed net-

working. When there is only a single resource to schedule, round-robin schedulers [25, 67,68] have been

proposed to multiplex the output bandwidth of switches and routers, in which flows are served in cir-

cular order. These algorithms eliminate the sorting bottleneck associated with the timestamp-based

schedulers, and achieve O(1) time complexity per packet. Due to their extreme simplicity, round-robin

schemes have been widely implemented in high-speed routers such as Cisco GSR [70].

Despite the successful application of round-robin algorithms in traditional L2/L3 devices, we observe

in this study that they may lead to arbitrary unfairness when naively applied to multiple resources.

Therefore, it remains unclear whether their attractiveness, i.e., the implementation simplicity and low

time complexity, extends to the multi-resource environment, and if it does, how a round-robin scheduler

should be designed and implemented in middleboxes. We address these questions in the following
sections.

3.3 Preliminaries and Design Objectives

In this section, we introduce some basic concepts and clarify the detailed design objectives of a multi-

resource scheduler.

3.3.1 Packet Processing Time

Depending on the network functions applied to a flow, processing a packet of the flow may consume

different amounts of middlebox resources. Following [4], we define the packet processing time as a metric

for the resource requirements of a packet. Specifically, for packet p, its packet processing time on resource

r, denoted τ_r(p), is defined as the time required to process the packet on resource r, normalized to the

middlebox’s processing capacity of resource r. For example, a packet may require 10 µs to process using

one CPU core. A middlebox with 2 CPU cores can process 2 such packets in parallel. As a result, the
packet processing time of this packet on the CPU resource of this middlebox is 5 µs.

[Figure: (a) the scheduling discipline, showing packets P1, P2, . . . of flow 1 and Q1, Q2, . . . of flow 2 interleaved on CPU and link over time; (b) the processing time each flow receives on the dominant resource of its packets.]
Figure 3.1: Illustration of a scheduling discipline that achieves DRF.

3.3.2 Dominant Resource Fairness (DRF) over Time

Fairness is the primary design objective for a packet scheduler. We have seen in the previous chapter

that the recently proposed Dominant Resource Fairness [12, 16] serves as a promising notion of fairness

for sharing multiple types of resources. DRF generalizes max-min fairness to the dominant resource in

the space domain, and can also be extended to resource multiplexing over time. In our context, the

dominant resource of a packet is the one that requires the most packet processing time. Specifically,

let m be the number of resources concerned. For a packet p, its dominant resource, denoted r∗(p), is

defined as

$r^*(p) = \arg\max_{1 \le r \le m} \tau_r(p)$.   (3.1)

A schedule is said to implement DRF over time if flows receive the same processing time on the dominant

resource of their packets in all backlogged periods. For example, consider two flows in Fig. 3.1a. Flow

1 sends packets P1, P2, . . ., while flow 2 sends packets Q1, Q2, . . .. Each packet of flow 1 requires 1

time unit for CPU processing and 3 time units for link transmission, and thus has the processing times

〈1, 3〉. Each packet of flow 2, on the other hand, requires the processing times 〈3, 1〉. In this case, the

dominant resource of packets of flow 1 is link bandwidth, while the dominant resource of packets of flow

2 is CPU. We see that the scheduling scheme shown in Fig. 3.1a achieves DRF, under which both flows

receive the same processing time on dominant resources (see Fig. 3.1b).
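To make the definitions concrete, the following minimal C++ sketch (hypothetical helper names, not code from any implementation described here) normalizes raw processing times to a middlebox's capacity and picks the dominant resource as in (3.1); for the example above, it selects the link for flow 1's packets 〈1, 3〉 and the CPU for flow 2's packets 〈3, 1〉:

#include <vector>

// Normalize a raw processing time to the middlebox's capacity of the
// resource, e.g., 10 us on one CPU core with 2 cores available -> 5 us.
double NormalizedTime(double rawTime, double capacity) {
    return rawTime / capacity;
}

// Dominant resource r*(p): the index r maximizing tau_r(p), as in Eq. (3.1).
int DominantResource(const std::vector<double>& tau) {
    int rStar = 0;
    for (int r = 1; r < static_cast<int>(tau.size()); ++r)
        if (tau[r] > tau[rStar]) rStar = r;
    return rStar;
}

// Example: DominantResource({1, 3}) == 1 (link);
//          DominantResource({3, 1}) == 0 (CPU).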

It has been shown in [4,66] that a scheduler that achieves strict DRF at all times offers the following


highly desirable properties:

1. Predictable service isolation. For each flow i, the received service (i.e., throughput) is at least at

the level where every resource is equally allocated.

2. Truthfulness. No flow can receive better service (finish faster) by misreporting the amount of

resources it requires.

3. Work conservation. No resource is left idle if it can be used to increase the throughput of
a backlogged flow.

Therefore, we adopt DRF as the notion of fairness for multi-resource scheduling.

To measure how well a packet scheduler approximates DRF, the following Relative Fairness Bound

(RFB) is used as a fairness metric [4, 66]:

Definition 5. For any packet arrivals, let T_i(t_1, t_2) be the packet processing time flow i receives on the
dominant resources of its packets in the time interval (t_1, t_2). T_i(t_1, t_2) is referred to as the dominant
service flow i receives in (t_1, t_2). Let w_i be the weight associated with flow i, and B(t_1, t_2) the set of
flows that are backlogged in (t_1, t_2). The Relative Fairness Bound (RFB) is defined as

$\mathrm{RFB} = \sup_{t_1, t_2;\, i, j \in B(t_1, t_2)} \left| \frac{T_i(t_1, t_2)}{w_i} - \frac{T_j(t_1, t_2)}{w_j} \right|$.   (3.2)

We require a scheduling scheme to have a small RFB, such that the difference between the normalized

dominant service received by any two flows i and j, over any backlogged time period (t1, t2), is bounded

by a small constant.
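In a simulation, the RFB can be estimated directly from Definition 5. The following minimal C++ sketch (hypothetical trace format) computes the largest normalized dominant-service gap among the flows backlogged in one interval (t_1, t_2); taking the supremum over all such intervals then yields the RFB in (3.2):

#include <cstddef>
#include <vector>

// T[i]: dominant service T_i(t1, t2) received by backlogged flow i in the
// interval; w[i]: its weight. Returns max over i, j of |T_i/w_i - T_j/w_j|,
// which equals the spread between the extreme normalized services.
double MaxNormalizedGap(const std::vector<double>& T,
                        const std::vector<double>& w) {
    double lo = T[0] / w[0], hi = lo;
    for (std::size_t i = 1; i < T.size(); ++i) {
        double s = T[i] / w[i];
        if (s < lo) lo = s;
        if (s > hi) hi = s;
    }
    return hi - lo;
}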

3.3.3 Scheduling Delay

In addition to fairness, scheduling delay is another important concern for a packet scheduler. The

scheduling delay is defined as the time that elapses between the instant a packet reaches the head of its

queue, and the instant the packet is completely processed on all resources. The delay is introduced by

the scheduling algorithm and is also referred to as the single packet delay in the fair queueing literature

[68, 71–73]. Intuitively, flows with larger weights are expected to experience smaller delay. In the ideal

case, the scheduling delay SD_i(p) experienced by any packet p of flow i should be within a small constant
amount that is inversely proportional to the scheduling weight of the flow, i.e.,

$SD_i(p) \le C / w_i$,   (3.3)


where C is a constant, independent of the number of flows.

3.3.4 Scheduling Complexity

To handle a large volume of traffic at high speeds, the scheduler must operate with low scheduling

complexity, defined as the time required to make a packet scheduling decision. Ideally, this complexity

should be a small constant, independent of the number of flows.

In summary, a good packet scheduler should offer near-perfect fairness and a constant delay bound

that is inversely proportional to the flow's weight, while operating with O(1) time complexity.

3.4 Multi-Resource Round Robin

In this section, we present a new scheduler, called Multi-Resource Round Robin (MR3), that achieves all

the aforementioned scheduling properties when flows are assigned the same weights, i.e., w_i = 1 for all
flows i. We shall extend this design to a more general scheduler for flows with uneven weights in Sec. 3.5.

We start by revisiting round-robin algorithms in the traditional fair queueing literature and discussing

the challenges of extending them to the multi-resource setting.

3.4.1 Challenges of Round-Robin Extension

As mentioned in Sec. 3.2, among various scheduling schemes, round-robin algorithms are particularly
attractive for practical implementation due to their extreme simplicity and constant time complexity.

To extend them to the multi-resource setting with DRF, a natural approach is to apply them directly on the

dominant resources of a flow’s packets, such that in each round, flows receive roughly the same dominant

services. Such a general extension can be applied to many well-known round-robin algorithms. However,

such a naive extension may lead to arbitrary unfairness.

Take the well-known Deficit Round Robin (DRR) [25] as an example. When there is a single resource,

DRR assigns some predefined quantum size to each flow. Each flow maintains a deficit counter, whose

value is the currently unused transmission quota. In each round, DRR polls every backlogged flow and

transmits its packets up to an amount of data equal to the sum of its quantum and deficit counter.

The unused transmission quota will be carried over to the next round as the value of the flow’s deficit

counter. Similar to the single-resource case, one can apply DRR [25] on the dominant resources of a

flow’s packets as follows.

Initially, the algorithm assigns a predefined quantum size to each flow, which is also the amount
of dominant service the flow is allowed to receive in one round. Each flow maintains a deficit counter
that measures the currently unused portion of the allocated dominant service.

[Figure: (a) direct application of DRR to schedule multiple resources, showing the packet timeline on CPU and link; (b) the dominant services received by the two flows over time, with flow 1 (CPU processing time) receiving nearly twice the dominant service of flow 2 (transmission time).]
Figure 3.2: Illustration of a direct DRR extension. Each packet of flow 1 has processing times 〈7, 6.9〉, while each packet of flow 2 has processing times 〈1, 7〉.

Packets are scheduled

in rounds, and in each round, each backlogged flow schedules as many packets as it has, as long as the

dominant service consumed does not exceed the sum of its quantum and deficit counter. The unused

portion of this amount is carried over to the next round as the new value of the deficit counter.
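The following minimal C++ sketch (hypothetical structures, one round only) illustrates this naive extension; as the example below shows, the resulting schedule can be arbitrarily unfair:

#include <deque>
#include <vector>

struct Packet {
    double dominantTime;  // processing time on the packet's dominant resource
};

struct FlowState {
    std::deque<Packet> queue;
    double deficit = 0;  // unused dominant-service quota carried across rounds
};

// One round of the naive DRR extension. Note that the dominant service is
// charged as soon as a packet is dispatched, not when the packet is actually
// served on its dominant resource -- this mismatch is the root of the
// unfairness analyzed in Sec. 3.4.2.
void NaiveDrrRound(std::vector<FlowState>& flows, double quantum) {
    for (FlowState& f : flows) {
        if (f.queue.empty()) { f.deficit = 0; continue; }
        double budget = f.deficit + quantum;
        while (!f.queue.empty() && f.queue.front().dominantTime <= budget) {
            budget -= f.queue.front().dominantTime;
            f.queue.pop_front();  // hand the packet to the resource pipeline
        }
        f.deficit = budget;  // carry the unused quota over to the next round
    }
}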

As an example, consider two flows where flow 1 sends P1, P2, . . . , while flow 2 sends Q1, Q2, . . . .

Each packet of flow 1 has processing times 〈7, 6.9〉, i.e., it requires 7 time units for CPU processing and

6.9 time units for link transmission. Each packet of flow 2 requires processing times 〈1, 7〉. Fig. 3.2a

illustrates the resulting schedule of the above naive DRR extension, where the quantum size assigned

to both flows is 7. In round 1, both flows receive a quantum of 7, and can process 1 packet each, which

consumes all the quantum awarded on the dominant resources in this round. Such a process repeats in

the subsequent rounds. As a result, packets of the two flows are scheduled alternately. At the end of

each round, the received quantum is always used up, and the deficit counter remains zero.

Similar to single-resource DRR, the extension above schedules packets in O(1) time.¹ However, such

an extension fails to provide fair services in terms of DRF. Instead, it may lead to arbitrary unfairness

with an unbounded RFB. Fig. 3.2b depicts the dominant services received by the two flows. We see

that flow 1 receives nearly two times the dominant service flow 2 receives. With more packets being

scheduled, the service gap increases, eventually leading to an unbounded RFB.

It is to be emphasized that the problem of arbitrary unfairness is not limited to the DRR extension only,

but generally extends to all round-robin variants. For example, one can extend Surplus Round Robin

(SRR) [67] and Elastic Round Robin (ERR) [68] to the multi-resource setting in a similar way (more

details can be found in Sec. 3.4.3). It is easy to verify that running the example above will give exactly
the same schedule shown in Fig. 3.2a with an unbounded RFB.²

¹ The O(1) time complexity is conditioned on the quantum size being at least the maximum packet processing time.
² In either the SRR or ERR extension, by scheduling 1 packet, each flow uses up all of the quantum awarded in each round. As a result, packets of the two flows are scheduled alternately, the same as in Fig. 3.2a.

[Figure: the naive fix serializes the schedule, processing each packet on CPU and transmitting it on the link before the next packet starts.]
Figure 3.3: Naive fix of the DRR extension shown in Fig. 3.2a by withholding the scheduling opportunity of every packet until its previous packet is completely processed on all resources.

In fact, due to the heterogeneous packet

processing times on different resources, the work progress on one resource may be far ahead of that on

the other. For example, in Fig. 3.2a, when CPU starts to process packet P6, the transmission of packet

P3 remains unfinished. It is this progress mismatch that leads to a significant gap between the two

flows’ dominant services.

In summary, directly applying round-robin algorithms on the dominant resources would fail to provide

fair services. A new design is therefore required. We preview its basic idea in the next subsection.

3.4.2 Basic Intuition

The key reason direct round-robin extensions fail is that they cannot track the flows' dominant
services in real time. Take the DRR extension as an example. In Fig. 3.2a, after packet Q1 is completely

processed on CPU, flow 2’s deficit counter is updated to 0, meaning that flow 2 has already used up the

quantum allocated for dominant services (i.e., link transmission) in round 1. This allows packet P2 to

be processed, but erroneously so, as the actual consumption of this quantum occurs only when packet Q1 is

transmitted on the link, after the transmission of packet P1.

To circumvent this problem, a naive fix is to withhold the scheduling opportunity of every packet

until its previous packet is completely processed on all resources, which allows the scheduler to track the

dominant services accurately. Fig. 3.3 depicts the resulting schedule when applying this fix to the DRR

extension shown in Fig. 3.2a. We see that the difference between the dominant services received by two

flows is bounded by a small constant. However, such a fairness improvement is achieved at the expense

of significantly lower resource utilization. Even though multiple packets can be processed in parallel on

different resources, the scheduler serves only one packet at a time, leading to poor resource utilization

and high packet latency. As a result, this simple fix cannot meet the demands of high-speed networks.

To strike a balance between fairness and latency, packets should not be deferred as long as the
difference between two flows' dominant services is small. This can be achieved by bounding the progress
gap on different resources by a small amount.

[Figure: (a) the schedule produced by MR3 for the same input traffic as Fig. 3.2a; (b) the dominant services received by the two flows, flow 1 (CPU processing time) and flow 2 (transmission time), which now track each other closely.]
Figure 3.4: Illustration of a schedule by MR3.

In particular, we may serve flows in rounds as follows.

Whenever a packet p of a flow i is ready to be processed on the first resource (usually CPU) in round k,

the scheduler checks the work progress on the last resource (usually the link bandwidth). If flow i has

already received services on the last resource in the previous round k − 1, or it has newly arrived, then

packet p is scheduled immediately. Otherwise, packet p is withheld until flow i starts to receive service

on the last resource in round k − 1. As an example, Fig. 3.4a depicts the resulting schedule with the

same input traffic as that in the example of Fig. 3.2a. In round 1, both packets P1 and Q1 are scheduled

without delay because both flows are new arrivals. In round 2, packet P2 (resp., Q2) is also scheduled

without delay, because when it is ready to be processed, flow 1 (resp., flow 2) has already started its

service on the link bandwidth in round 1. In round 3, while packet P3 is ready to be processed right

after packet Q2 is completely processed on CPU, it has to wait until the transmission of P2 starts. A

similar process repeats for all subsequent packets.

We will show later in Sec. 3.4.4 that such a simple idea leads to nearly perfect fairness across flows,

without incurring high packet latency. In fact, the schedule in Fig. 3.4a incurs the same packet latency

as that in Fig. 3.2a, but is much fairer. As we see from Fig. 3.4b, the difference between dominant

services received by the two flows is bounded by a small constant.

3.4.3 Algorithm Design

While the general idea introduced in the previous section is simple, implementing it as a concrete round-

robin algorithm is nontrivial. We next explore the design space of round robin algorithms and identify

Elastic Round Robin [68] as the most suitable variant for multi-resource fair queueing in middleboxes.

The resulting algorithm is referred to as Multi-Resource Round Robin (MR3).


Design Space of Round-Robin Algorithms

Many round-robin variants have been proposed in the traditional fair queueing literature. While all

these variants achieve similar performance and are all feasible for the single-resource scenario, not all of

them are suitable to implement the aforementioned idea in a middlebox. We investigate three typical

variants, i.e., Deficit Round Robin [25], Surplus Round Robin [67], and Elastic Round Robin [68], and

discuss their implementation issues in middleboxes as follows.

Deficit Round Robin (DRR): We have introduced the basic idea of DRR in Sec. 3.4.1. As an

analogy, one can view the behavior of each flow as maintaining a banking account. In each round, a

predefined quantum is deposited into a flow’s account, tracked by the deficit counter. The balance of

the account (i.e., the value of the deficit counter) represents the dominant service the flow is allowed to

receive in the current round. Scheduling a packet is analogous to withdrawing the corresponding packet

processing time on the dominant resource from the account. As long as there is sufficient balance to

withdraw from the account, packet processing is allowed.

However, DRR is not amenable to implementation in middleboxes for the following two reasons.

First, to ensure that a flow has sufficient account balance to schedule a packet, the processing time

required on the dominant resource has to be known before packet processing. However, it is hard to

know what middlebox resources are needed and how much processing time is required until the packet

is processed. Also, the O(1) time complexity of DRR is conditioned on the quantum size that is at least

the same as the maximum packet processing time, which may not be easy to obtain in a real system.

Without satisfying this condition, the time complexity could be as high as O(N) [68].

Surplus Round Robin (SRR): SRR [67] allows a flow to consume more processing time on the

dominant resources of its packets in one round than it has in its account. As a compensation, the

excessive consumption, tracked by a surplus counter, will be deducted from the quantum awarded in

the future rounds. In SRR, as long as the account balance (i.e., surplus counter) is positive, the flow

is allowed to schedule packets, and the corresponding packet processing time is withdrawn from the

account after the packet is processed on its dominant resource. In this case, the packet processing time

is only needed after the packet has been processed.

While SRR does not require knowing packet processing times beforehand, its O(1) time complexity remains
conditioned on the predefined quantum size being at least the maximum packet processing
time. Otherwise, the time complexity could be as high as O(N) [68]. SRR is hence not amenable to

implementation in middleboxes for the same reason mentioned above.

[Figure: flows f1–f4 served in round-robin order across rounds 1–4, with the global sequence number (0–11) shown beneath each service.]
Figure 3.5: Illustration of the round-robin service and the sequence number.

Elastic Round Robin (ERR): Similar to SRR, ERR [68] does not require knowing the processing
time before the packet is processed. It allows a flow to overdraw its permitted processing time in one

round on the dominant resource, with the excessive consumption deducted from the quantum received

in the next round. The difference is that instead of depositing a predefined quantum with a fixed size, in

ERR, the quantum size in one round is dynamically set as the maximum excessive consumption incurred

in the previous round. This ensures that each flow will always have a positive balance in its account at

the beginning of each round, and can schedule at least one packet. In this case, ERR achieves O(1) time

complexity without knowing the maximum packet processing time a priori, and is the most suitable for

implementation in middleboxes at high speeds.
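As a reference point, the core of single-resource ERR can be sketched as follows (a minimal, hypothetical rendering of the scheme in [68]); the per-round allowance is the largest overdraft of the previous round, so the scheduler needs no a priori bound on packet processing times, and every flow can schedule at least one packet per round:

#include <deque>
#include <vector>

struct Pkt {
    double cost;  // processing time, learned only after the packet is served
};

struct FlowQ {
    std::deque<Pkt> q;
    double excess = 0;  // overdraft incurred in the previous round
};

// One ERR round over the active flows. Returns the maximum overdraft of
// this round, which becomes the allowance of the next round.
double ErrRound(std::vector<FlowQ*>& active, double prevMaxExcess) {
    double maxExcess = 0;
    for (FlowQ* f : active) {
        double balance = prevMaxExcess - f->excess;  // always >= 0
        while (balance >= 0 && !f->q.empty()) {
            balance -= f->q.front().cost;  // cost known after processing
            f->q.pop_front();
        }
        // A flow that empties its queue leaves the active list (handled by
        // the caller) and has its excess counter reset.
        f->excess = f->q.empty() ? 0 : -balance;
        if (f->excess > maxExcess) maxExcess = f->excess;
    }
    return maxExcess;
}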

MR3 Design

While ERR serves as a promising round-robin variant for extension to the multi-resource case, there

remain several challenges to implement the idea of scheduling deferral presented in Sec. 3.4.2. How

can the scheduler quickly track the work progress gap of two resources and decide when to withhold a

packet? To ensure efficiency, such a progress comparison must be completed within O(1) time. Note

that simply comparing the numbers of packets that have been processed on two resources does not give

any clue about the progress gap: due to traffic dynamics, each round may consist of different numbers
of packets.

To circumvent this problem, we associate each flow i with a sequence number SeqNum_i, which
starts from 0 and reflects the scheduling order of the flow. We use a global variable NextSeqNum to record
the next sequence number that will be assigned to a flow. The value of NextSeqNum is initialized to
0 and increased by 1 every time a flow is processed. Each flow i also records its sequence number in
the previous round, tracked by PreviousRoundSeqNum_i. For example, consider Fig. 3.5. Initially, flows

1, 2 and 3 are backlogged and are served in sequence in round 1, with sequence numbers 1, 2 and 3,

respectively. While flow 2 is being served in round 1, flow 4 becomes active. Flow 4 is therefore scheduled

right after flow 1 in round 2, with a sequence number 5. After round 2, flow 1 has no packet to serve

and becomes inactive. As a result, only flows 2, 3 and 4 are serviced in round 3, where their sequence

numbers in the previous round are 6, 7 and 5, respectively.


We use sequence numbers to track the work progress on a resource. Whenever a packet p is scheduled

to be processed, it is stamped with its flow's sequence number. By checking the sequence number stamped
on the packet that is being processed on a resource, the scheduler knows exactly the work progress on

that resource.

Besides sequence numbers, the following important variables/functions are also used in the algorithm.

Active list: The algorithm maintains an ActiveFlowList to track backlogged flows. Flows are served

in a round-robin fashion. The algorithm always serves the flow at the head of the list, and after the

service, this flow, if remaining active, will be moved to the tail of the list for service in the next round.

A newly arriving flow is always appended to the tail of the list, and will be served in the next round. We

also use RoundRobinCounter to track the number of flows that have not yet been served in the current

round. Initially, ActiveFlowList is empty and RoundRobinCounter is 0.

Excess counter: Each flow i maintains an excess counter EC_i, recording the excessive dominant
service flow i incurred in one round. The algorithm also uses two variables, MaxEC and PreviousRound-

MaxEC, to track the maximum excessive consumption incurred in the current and the previous round,

respectively. Initially, all these variables are set to 0.

DominantProcessingTime(p): this function profiles a packet p and returns the (estimated) processing

time on its dominant resource. We will discuss how a packet is profiled in Sec. 3.4.5.

Our algorithm, referred to as MR3, consists of two functional modules, PacketArrival (Algorithm 1),

which handles packet arrival events, and Scheduler (Algorithm 2), which decides which packet should be

processed next.

PacketArrival: This module is invoked upon the arrival of a packet. It enqueues the packet to

the input queue of the flow to which the packet belongs. If the flow was previously inactive, it is then

appended to the tail of the active list and will be served in the next round. The sequence number of the

flow is also updated, as shown in Algorithm 1 (line 3 to line 5).

Algorithm 1 MR3 PacketArrival
1: Let i be the flow to which the packet belongs
2: if ActiveFlowList.Contains(i) == FALSE then
3:   PreviousRoundSeqNum_i = SeqNum_i
4:   NextSeqNum = NextSeqNum + 1
5:   SeqNum_i = NextSeqNum
6:   ActiveFlowList.AppendToTail(i)
7: end if
8: Enqueue the packet to queue i

Scheduler: This module decides which packet should be processed next. The scheduler first checks

the value of RoundRobinCounter to see how many flows have not yet been served in the current round.


If the value is 0, then a new round starts. The scheduler sets RoundRobinCounter to the length of the

active list (line 3), and updates PreviousRoundMaxEC as the maximum excessive consumption incurred

in the round that has just passed (line 4), while MaxEC is reset to 0 for the new round (line 5).

Algorithm 2 MR3 Scheduler
1: while TRUE do
2:   if RoundRobinCounter == 0 then
3:     RoundRobinCounter = ActiveFlowList.Length()
4:     PreviousRoundMaxEC = MaxEC
5:     MaxEC = 0
6:   end if
7:   Flow i = ActiveFlowList.RemoveFromHead()
8:   B_i = PreviousRoundMaxEC − EC_i
9:   while B_i ≥ 0 and QueueIsNotEmpty(i) do
10:    Let q be the packet being processed on the last resource
11:    WaitUntil(q.SeqNum ≥ PreviousRoundSeqNum_i)
12:    Packet p = Dequeue(i)
13:    p.SeqNum = SeqNum_i
14:    ProcessPacket(p)
15:    B_i = B_i − DominantProcessingTime(p)
16:  end while
17:  if QueueIsNotEmpty(i) then
18:    ActiveFlowList.AppendToTail(i)
19:    NextSeqNum = NextSeqNum + 1
20:    PreviousRoundSeqNum_i = SeqNum_i
21:    SeqNum_i = NextSeqNum
22:    EC_i = −B_i
23:  else
24:    EC_i = 0
25:  end if
26:  MaxEC = Max(MaxEC, EC_i)
27:  RoundRobinCounter = RoundRobinCounter − 1
28: end while

The scheduler then serves the flow at the head of the active list. Let flow i be such a flow. Flow i

receives a quantum equal to the maximum excessive consumption incurred in the previous round, and

has its account balance B_i equal to the difference between the quantum and the excess counter, i.e.,
B_i = PreviousRoundMaxEC − EC_i. Since PreviousRoundMaxEC ≥ EC_i, we have B_i ≥ 0.

Flow i is allowed to schedule packets (if any) as long as its balance is nonnegative (i.e., B_i ≥ 0). To

ensure a small work progress gap between two resources, the scheduler keeps checking the sequence

number stamped on the packet that is being processed on the last resource³ and compares it with flow

i’s sequence number in the previous round. The scheduler waits until the former exceeds the latter, at

which time the progress gap between any two resources is within 1 round. The scheduler then dequeues

a packet from the input queue of flow i, stamps the sequence number of flow i, and performs deep

packet processing on CPU, which is also the first middlebox resource required by the packet. After CPU

³ If no packet is being processed, we take the sequence number of the packet that was most recently processed.


processing, the scheduler knows exactly how the packet should be processed next and what resources

are required. The packet is then pushed to the buffer of the next resource for processing. Meanwhile,

the packet processing time on each resource can also be accurately estimated after CPU processing, e.g.,

using some simple packet profiling technique introduced in [4]. The scheduler then deducts the dominant

processing time of the packet from flow i’s balance. The service for flow i continues until flow i has no

packet to process or its balance becomes negative.

If flow i is no longer active after service in the current round, its excess counter will be reset to 0.

Otherwise, flow i is appended to the tail of the active list for service in the next round. In this case, a

new sequence number is assigned to flow i. The excess counter EC_i is also updated to the account
deficit of flow i. Finally, before serving the next flow, the scheduler updates MaxEC and decreases

RoundRobinCounter by 1, indicating that one flow has already finished service in the current round.

3.4.4 Performance Analysis

In this section, we analyze the performance of MR3 by deriving its time complexity, fairness, and delay

bound. We also compare MR3 with existing multi-resource fair queueing designs, e.g., DRFQ [4].

Complexity

MR3 is highly efficient as compared with DRFQ. One can verify that under MR3, at least one packet is

scheduled for each flow in one round. Formally, we have

Theorem 1. MR3 makes scheduling decisions in O(1) time per packet.

Proof: We prove the theorem by showing that both enqueuing a packet (Module 1) and scheduling

a packet (Module 2) finish within O(1) time.

In Module 1, determining the flow to which a new packet belongs is an O(1) operation. By maintaining

the per-flow state, the scheduler knows if the flow is contained in the active list (line 2) within O(1)

time. Also, updating the sequence number (line 3 to 5), appending the flow to the active list (line 6),

and enqueueing a packet (line 8) are all of O(1) time complexity.

We now analyze the time complexity of scheduling a packet. In Module 2, since the quantum deposited

into each flow’s account is the maximum excessive consumption incurred in the previous round, a flow

will always have a nonnegative balance at the beginning of each round, i.e., B_i ≥ 0 in line 15. As a

result, at least one packet is scheduled for each flow in one round. The time complexity of scheduling

a packet is therefore no more than the time complexity of all the operations performed during each

service opportunity. These operations include determining the next flow to be served, removing the flow


from the head of the active list and possibly adding it back at the tail, all of which are O(1) operations

if the active list is implemented as a linked list. Additional operations include updating the sequence

number, MaxEC, PreviousRoundMaxEC, RoundRobinCounter, and dequeuing a packet. All of them are

also executed within O(1) time. □

Fairness

In addition to the low scheduling complexity, MR3 provides near-perfect fairness across flows. To see
this, let EC_i^k be the excess counter of flow i after round k, and MaxEC^k the maximum EC_i^k over all
flows i. Let D_i^k be the dominant service flow i receives in round k. Also, let L_i be the maximum packet
processing time of flow i across all resources. Finally, let L be the maximum packet processing time
across all flows, i.e., L = max_i {L_i}. We show that the following lemmas and corollaries hold throughout
the execution of MR3. The proofs are given in Sec. 3.8.1.

Lemma 2. EC_i^k ≤ L_i for all flows i and rounds k.

Corollary 1. MaxEC^k ≤ L for all rounds k.

Lemma 3. For all flows i and rounds k, we have

$D_i^k = \mathrm{MaxEC}^{k-1} - EC_i^{k-1} + EC_i^k$,   (3.4)

where $EC_i^0 = 0$ and $\mathrm{MaxEC}^0 = 0$.

Corollary 2. D_i^k ≤ 2L for all flows i and rounds k.
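For completeness, Corollary 2 follows in one step from Lemma 3 combined with Corollary 1 and Lemma 2, using the fact that the excess counters are nonnegative:

$D_i^k = \mathrm{MaxEC}^{k-1} - EC_i^{k-1} + EC_i^k \le \mathrm{MaxEC}^{k-1} + EC_i^k \le L + L_i \le 2L.$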

We are now ready to bound the difference of dominant services received by two flows under MR3.

Recall that T_i(t_1, t_2) denotes the dominant service flow i receives in time interval (t_1, t_2). We have the

following theorem.

Theorem 2. Under MR3, the following relationship holds for any two flows i, j that are backlogged in
any time interval (t_1, t_2):

$|T_i(t_1, t_2) - T_j(t_1, t_2)| \le 6L$.

Proof: Suppose at time t_1 (resp., t_2), the work progress of MR3 on resource 1 is in round R_1 (resp.,
R_2). For any flow i, we upper bound T_i(t_1, t_2) as follows. At time t_1, the work progress on the last
resource must be at least in round R_1 − 1, as the progress gap between the first and the last resource is
bounded by 1 round under MR3. Also note that at time t_2, the progress on the last resource is at most


in round R_2. As a result, flow i receives dominant service at most in rounds R_1 − 1, R_1, . . . , R_2, i.e.,

$T_i(t_1, t_2) \le \sum_{k=R_1-1}^{R_2} D_i^k = \sum_{k=R_1-1}^{R_2} \mathrm{MaxEC}^{k-1} - EC_i^{R_1-2} + EC_i^{R_2} \le \sum_{k=R_1-1}^{R_2} \mathrm{MaxEC}^{k-1} + L$,   (3.5)

where the equality is derived from Lemma 3, and the last inequality is derived from Lemma 2.

We now derive the lower bound of T_i(t_1, t_2). Note that at time t_1, the algorithm has not yet started
round R_1 + 1, while at time t_2, the algorithm must have finished all processing of round R_2 − 2 on the
last resource. As a result, all packets that belong to rounds R_1 + 1, . . . , R_2 − 2 have been processed on
all resources within (t_1, t_2). We then have

$T_i(t_1, t_2) \ge \sum_{k=R_1+1}^{R_2-2} D_i^k = \sum_{k=R_1+1}^{R_2-2} \mathrm{MaxEC}^{k-1} - EC_i^{R_1} + EC_i^{R_2-2} \ge \sum_{k=R_1+1}^{R_2-2} \mathrm{MaxEC}^{k-1} - L \ge \sum_{k=R_1-1}^{R_2} \mathrm{MaxEC}^{k-1} - 5L$,   (3.6)

where the last inequality is derived from Corollary 1.

For notational simplicity, let $X = \sum_{k=R_1-1}^{R_2} \mathrm{MaxEC}^{k-1}$. By (3.5) and (3.6), we have

$X - 5L \le T_i(t_1, t_2) \le X + L$.   (3.7)

Note that this inequality also holds for flow j, i.e.,

$X - 5L \le T_j(t_1, t_2) \le X + L$.   (3.8)

Taking the difference between (3.7) and (3.8) leads to the statement. □

Corollary 3. The RFB of MR3 is 6L.

Based on Theorem 2 and Corollary 3, we see that MR3 bounds the difference between (normalized)


dominant services received by two backlogged flows in any time interval by a small constant. Note that

the interval (t1, t2) may be arbitrarily large. MR3 therefore achieves nearly perfect DRF across all active

flows, irrespective of their traffic patterns.

We note that this is a stronger fairness guarantee than the one provided by existing multi-resource

fair queueing schemes, e.g., DRFQ, which require that all packets of a flow have the same dominant

resource throughout each backlogged period (referred to as the resource monotonicity assumption in [4]).

Delay Analysis

In addition to the complexity and fairness, latency is also an important concern for a packet scheduling

algorithm. Recall that in Sec. 3.3.3, the scheduling delay is defined as the latency from the time when a

packet reaches the head of the input queue to the time when the scheduler finishes processing the packet

on all resources. Before we proceed to the general analysis for all packets, let us focus on the scheduling

delay of the very first packet of a flow, referred to as the startup latency. In particular, let m be the

number of resources concerned, and n the number of backlogged flows. We have the following theorem.

The proof is given in Sec. 3.8.2.

Theorem 3. When flows are assigned the same weights, the startup latency SL of any newly backlogged

flow is bounded by

$SL \le 2(m + n - 1)L$.   (3.9)

Based on the analysis of the startup latency, we next derive the upper bound of the scheduling delay.

The proof is given in Sec. 3.8.3.

Theorem 4. When flows are assigned the same weights, for any packet p, the scheduling delay D(p) is

bounded by

$D(p) \le (4m + 4n - 2)L$.   (3.10)

Table 3.1 summarizes the derived performance of MR3, in comparison with DRFQ [4]. We see that

MR3 significantly reduces the scheduling complexity per packet, while providing near-perfect fairness in

a more general case with arbitrary traffic patterns. The price we pay, however, is longer startup latency

for newly active flows. Since the number of middlebox resources is typically much smaller than the

number of active flows, i.e., $m \ll n$, the startup latency bound of MR3 is approximately two times that
of DRFQ, i.e.,

$2(m + n - 1)L \approx 2nL$.


Table 3.1: Performance comparison between MR3 and DRFQ, where L is the maximum packet processing time, m is the number of resources, and n is the number of active flows.

Performance       | MR3             | DRFQ
------------------|-----------------|----------
Complexity        | O(1)            | O(log n)
Fairness* (RFB)   | 6L              | 2L
Startup Latency   | 2(m + n − 1)L   | nL
Scheduling Delay  | (4m + 4n − 2)L  | Unknown

* The fairness analysis of DRFQ requires the resource monotonicity assumption [4]. Under the same assumption, the RFB of MR3 is 4L [74].

Also note that, since the scheduling delay is usually hard to analyze, no analytical delay bound is given

in [4]. We shall experimentally compare the latency performance of MR3 and DRFQ later in Sec. 3.4.5.

3.4.5 Simulation Results

As a complementary study of the above theoretical analysis, we evaluate the performance of MR3 via

extensive simulations. First, we would like to confirm experimentally that MR3 offers predictable service

isolation and is superior to the naive first-come-first-served (FCFS) scheduler, as the theory indicates.

Second, we want to confirm that MR3 can quickly adapt to traffic dynamics and achieve nearly perfect

DRF across flows. Third, we compare the latency performance of MR3 with DRFQ [4] to see if the

extremely low time complexity of MR3 is achieved at the expense of significant packet delay. Fourth, we

investigate how sensitive the performance of MR3 is when packet size distributions and arrival patterns

change. Finally, we evaluate the scenario where flows are assigned uneven weights, under which we

compare the delay performance of weighted MR3 with DRFQ.

General Setup

All simulation results are based on our event-driven packet simulator, written in 3,000 lines of C++
code. We assume resources are consumed serially, with CPU processing first, followed by link trans-

mission. We implement 3 schedulers, FCFS, DRFQ and MR3. The last two inspect the flows’ input

queues and decide which packet should be processed next. By default, packets follow Poisson arrivals,

unless otherwise stated. The simulator models resource consumption of packet processing in 3 typical

middlebox modules, each corresponding to one of the flow types: basic forwarding, per-flow statistical

monitoring, and IPSec encryption. The first two modules are bandwidth-bound, with statistical mon-

itoring consuming slightly more CPU resources than basic forwarding, while IPSec is CPU intensive.

For direct comparison, we set the packet processing times required for each middlebox module the same

as those in [4], which are based on real measurements. In particular, the CPU processing time of each


Table 3.2: Linear model for CPU processing time in 3 middlebox modules. Model parameters are based on the measurement results reported in [4].

    Module                   CPU processing time (µs)
    Basic Forwarding         0.00286 × PacketSizeInBytes + 6.2
    Statistical Monitoring   0.0008 × PacketSizeInBytes + 12.1
    IPSec Encryption         0.015 × PacketSizeInBytes + 84.5

[Figure 3.6: Dominant services and packet throughput received by different flows under FCFS and MR3; flows 1, 11 and 21 are ill-behaving. (a) Dominant service received. (b) Packet throughput of flows.]

module is observed to follow a simple linear model based on packet size. Table 3.2 summarizes the

detailed parameters based on the measurement results reported in [4]. The link transmission time is

proportional to the packet size, and the output bandwidth of the middlebox is set to 200 Mbps, the

same as the experiment environment in [4]. Each simulation experiment spans 30 s.
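To make this cost model concrete, the following self-contained C++ sketch evaluates the linear models of Table 3.2. It is an illustration only, not part of the simulator; the enum and function names are hypothetical, while the coefficients are the measured values reported in [4].

#include <cstdio>

// Estimated CPU processing time (in microseconds) of the three simulated
// middlebox modules, using the linear models of Table 3.2.
enum Module { BasicForwarding, StatMonitoring, IPSecEncryption };

double CpuTimeUs(Module m, int packetSizeInBytes) {
    switch (m) {
        case BasicForwarding: return 0.00286 * packetSizeInBytes + 6.2;
        case StatMonitoring:  return 0.0008  * packetSizeInBytes + 12.1;
        case IPSecEncryption: return 0.015   * packetSizeInBytes + 84.5;
    }
    return 0.0;  // unreachable
}

int main() {
    // A 1400-byte packet is cheap to forward but costly to encrypt.
    std::printf("forwarding: %.2f us\n", CpuTimeUs(BasicForwarding, 1400));
    std::printf("ipsec:      %.2f us\n", CpuTimeUs(IPSecEncryption, 1400));
    return 0;
}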

Service Isolation

We start off by confirming that MR3 offers nearly perfect service isolation, which naive FCFS fails to

provide. We initiate 30 flows that send 1400-byte UDP packets. Flows 1 to 10 undergo basic forwarding;

11 to 20 undergo statistical monitoring; 21 to 30 undergo IPSec encryption. We generate 3 rogue flows,

i.e., flows 1, 11 and 21, each sending 10,000 pkts/s. All other flows behave normally, each sending 1,000

pkts/s. Fig. 3.6a shows the dominant services received by different flows under FCFS and MR3. We see

that under FCFS, rogue flows grab an arbitrary share of middlebox resources, while under MR3, flows

receive fair services on their dominant resources. This result is further confirmed in Fig. 3.6b: Under

FCFS, the presence of rogue flows squeezes normal traffic to almost zero. In contrast, MR3 ensures that

all flows receive deserved, though uneven, throughput based on their dominant resource requirements,

irrespective of the presence and (mis)behaviour of other traffic.


[Figure 3.7: Latency comparison between DRFQ and MR3. (a) Startup latency of flows. (b) CDF of the scheduling delay.]

Latency

We next evaluate the latency penalty MR3 pays for its extremely low time complexity, in comparison

with DRFQ [4]. We implement DRFQ and measure the startup latency as well as the single packet delay

of both algorithms. In particular, 150 flows with UDP packets are initiated sequentially, where flow 1

becomes active at time 0, followed by flow 2 at time 0.2 second, and flow 3 at time 0.3 second, and so on.

A flow is randomly assigned to one of the three types. To congest the middlebox resources, the packet

arrival rate of each flow is set to 500 pkts/s, and the packet size is uniformly drawn from 200 bytes

to 1400 bytes. Fig. 3.7a depicts the per-flow startup latency using both DRFQ and MR3. Clearly, the

dense and sequential flow starting times in this example represent a worst-case scenario for a round-robin

scheduler. We see that under MR3, flows joining the system later see larger startup latency, while under

DRFQ, the startup latency is relatively consistent. This is because under MR3, a newly active flow will

have to wait for a whole round before being served. The more active flows, the more time is required to

finish serving one round. As a result, the startup latency is linearly dependent on the number of active

flows. While this is also true for DRFQ in the worst-case analysis (see Table 3.1), our simulation results

show that on average, the startup latency of DRFQ is smaller than MR3. However, we see next that

this advantage of DRFQ comes at the expense of highly uneven single packet delays.

Compared with the startup latency, single packet delay is a much more important delay metric.

As we see from Fig. 3.7b, MR3 exhibits more consistent packet delay performance, with all packets

delayed less than 15 ms. In contrast, the latency distribution of DRFQ is observed to have a long tail:

90% of packets are delayed less than 5 ms, while the remaining 10% are delayed from 5 ms to 50 ms. Further

investigation reveals that these 10% packets are uniformly distributed among all flows. All results above

indicate that the low time complexity and near-perfect fairness of MR3 is achieved at the expense of


only a slight increase in packet latency.

Dynamic Allocation

We further investigate if the DRF allocation achieved by MR3 can quickly adapt to traffic dynamics. To

congest middlebox resources, we initiate 3 UDP flows each sending 20,000 1400-byte packets per second.

Flow 1 undergoes basic forwarding and is active in time interval (0, 15). Flow 2 undergoes statistical

monitoring and is active in two intervals (3, 10) and (20, 30). Flow 3 undergoes IPSec encryption and

is active in (5, 25). The input queue of each flow can cache up to 1,000 packets. Fig. 3.8 shows the

resource share allocated to each flow over time. Since flow 1 is bandwidth-bound and is the only active

flow in (0, 3), it receives 20% CPU share and all bandwidth. In (3, 5), both flows 1 and 2 are active.

They equally share the bandwidth on which both flows bottleneck. Later, when flow 3 becomes active

at time 5, all three flows are backlogged in (5, 10). Because flow 3 is CPU-bound, it grabs only 10%

bandwidth share from flows 1 and 2, respectively, yet is allocated 40% CPU share. Similar DRF allocation is

also observed in subsequent time intervals. We see that MR3 quickly adapts to traffic dynamics, leading

to nearly perfect DRF across flows.

Sensitivity

We also evaluate the performance sensitivity of MR3 under a mixture of different packet size distributions

and arrival patterns. The simulator generates 24 UDP flows with arrival rate 10,000 pkts/s each. Flows 1

to 8 undergo basic forwarding; 9 to 16 undergo statistical monitoring; 17 to 24 undergo IPSec encryption.

The 8 flows passing through the same middlebox module are further divided into 4 groups. Flows in group

1 send large packets with 1400 bytes; Flows in group 2 send small packets with 200 bytes; Flows in group

3 send bimodal packets that alternate between small and large; Flows in group 4 send packets with random
sizes uniformly drawn from 200 bytes to 1400 bytes. Each group contains exactly 2 flows, with exponential

and constant packet interarrival times, respectively. The input queue of each flow can cache up to 1,000

packets. Fig. 3.9a shows the dominant services received by all 24 flows, where no particular pattern

is observed in response to distribution changes of packet sizes and arrivals. Figs. 3.9b, 3.9c and 3.9d

show the average single packet delay observed in three middlebox modules, respectively. We find that

while the latency performance is highly consistent under different arrival patterns, it is affected by the

distribution of packet size. In general, flows with small packets are slightly preferred and will see smaller

latency than those with large packets. Similar preference for small-packet flows has also been observed

in our experiments with DRFQ.


[Figure 3.8: MR3 can quickly adapt to traffic dynamics and achieve DRF across all 3 flows. The three panels plot the CPU share, bandwidth share, and dominant share of flows 1 to 3 over time.]

3.4.6 MR3 for Weighted Flows

So far, we have assumed that all flows have the same weight and are expected to receive the same level

of service. For the purpose of service differentiation, flows may be assigned different weights, based on

their respective Quality of Service (QoS) requirements. A scheduling algorithm should therefore provide

weighted fairness across flows. Specifically, the packet processing time a flow receives on the dominant

resources of its packets should be in proportion to its weight, over any backlogged periods.

A minor modification to the original MR3 is sufficient to achieve weighted fairness. In line 15 of

Algorithm 2, after packet processing, the scheduler deducts the dominant processing time of the packet,

normalized to its flow’s weight, from that flow’s balance, i.e.,

B_i = B_i − DominantProcessingTime(p)/w_i ,    (3.11)

where w_i is the weight associated with flow i. In other words, balance B_i now tracks the normalized


[Figure 3.9: Fairness and delay sensitivity of MR3 in response to mixed packet sizes and arrival distributions. (a) Dominant service received. (b) Basic forwarding. (c) Statistical monitoring. (d) IPSec encryption. Panels (b)–(d) plot the average latency for large, small, bimodal, and random packet sizes under constant and exponential interarrival times.]

dominant processing time available to flow i. All the other parts of the algorithm remain the same as

its unweighted version. The resulting scheduler is referred to as weighted MR3.
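As an illustration, the following C++ sketch captures the weighted deduction rule (3.11). The Flow type and field names here are hypothetical stand-ins for the simulator's own structures; only the deduction rule itself comes from the text.

// Balance bookkeeping for weighted MR3 (a sketch, not the simulator code).
struct Flow {
    double balance = 0.0;  // B_i: normalized dominant processing time available
    double weight  = 1.0;  // w_i
};

// Invoked after a packet of flow f is scheduled (line 15 of Algorithm 2):
// the dominant processing time is normalized by the flow's weight, so a
// flow with twice the weight pays half as much per unit of service.
void DeductBalance(Flow& f, double dominantProcessingTime) {
    f.balance -= dominantProcessingTime / f.weight;
}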

All the analyses presented in Sec. 3.4.4 can also be extended to weighted MR3. Without loss of
generality, we assume the flow weights (w_i's) are normalized such that ∑_{i=1}^{n} w_i = 1.

It is easy to verify that having flows assigned uneven weights will have no impact on the scheduling

complexity of the algorithm. Formally, we have

Theorem 5. Weighted MR3 makes scheduling decisions in O(1) time per packet.

Moreover, the following theorem ensures that weighted MR3 provides near-perfect weighted fairness

across flows, as the difference between the normalized dominant services received by two flows is bounded

by a small constant. The proof is similar to the unweighted scenario and is briefly outlined in Sec. 3.8.4.

Theorem 6. Under weighted MR3, the following relationship holds for any two flows i, j that are
backlogged in any time interval (t_1, t_2):

|T_i(t_1, t_2)/w_i − T_j(t_1, t_2)/w_j| ≤ 6 max_{1≤i≤n} L_i/w_i .

Finally, to analyze the delay performance of weighted MR3, let

W = max_{1≤i,j≤n} w_i/w_j .    (3.12)

The following two theorems bound the startup latency and scheduling delay for weighted MR3, respec-

tively. Their proofs resemble the unweighted case and are briefly described in Sec. 3.8.5.


[Figure 3.10: An MR3 schedule on the CPU and the link, which fails to offer weight-proportional delay when flows are assigned uneven weights. P^i_k denotes the kth packet of flow i.]

Theorem 7. Under weighted MR3, the startup latency is bounded by

SL ≤ (1 + W)(m + n − 1)L .

Theorem 8. Under weighted MR3, for any flow i, the scheduling delay for its packet p is bounded by

D_i(p) ≤ 2(m + W)²L/w_i .

By Theorem 8, we see that the delay bound of weighted MR3 critically depends on the weight
distribution. When flows are assigned significantly uneven weights (i.e., W ≫ 1), the delay bound may
become very large. While our analysis is based on the worst case, the following example shows that MR3
indeed fails to offer a weight-proportional delay bound.

Suppose there are 6 flows competing for both middlebox CPU and link bandwidth. Each packet of
flow 1 requires 1 time unit for CPU processing and 2 for link transmission. Each packet of the other flows
requires 2 time units for CPU processing and 1 for link transmission. Flow 1 weighs 1/2, while flows 2
to 6 each weigh 1/10. The amount of credits flow 1 receives in one round is hence 5 times that given
to each of the other flows. Fig. 3.10 illustrates an MR3 schedule, where P^i_k denotes the kth packet of flow i. We
see that in every round, flow 1 schedules five packets while each of the other flows schedules only one.
Such a schedule offers weight-proportional services (the dominant services flow 1 receives are 5 times
those of each other flow) but not weight-proportional delay. The maximum packet scheduling delay flow
1 experiences is 13 time units (e.g., packet P^1_6 is ready for service at time 5 but finishes service at time
18), more than half of that experienced by the other flows (e.g., packet P^2_2 has been delayed by 20 time
units).

In addition to Theorem 8 and the example above, our simulation results presented in Sec. 3.5.8

further confirm that MR3 may incur large delays under uneven flow weights. MR3 is hence unsuitable
for scheduling weighted flows.


To summarize, we have seen two alternative approaches used to design a multi-resource scheduler,

the timestamp-based DRFQ scheduler [4] and the round-robin variant, MR3. However, neither of these
schedulers achieves all the desirable scheduling properties introduced in Sec. 3.3. We note that sim-

ilar problems have also been a major challenge in the long evolution of single-resource fair queueing

algorithms for bandwidth sharing, where timestamp-based schemes and round robin are the two de-

sign approaches. The former provides tight delay bounds, but requires high complexity to sort packet

timestamps (e.g., [21–23, 26, 31]). The latter approach, on the other hand, has O(1) complexity, yet

incurs high scheduling delays (e.g., [25, 67, 68]). To achieve the best of both worlds, one approach is to

combine the fairness and delay properties of timestamp-based algorithms with the low time complexity

of round-robin schemes [71–73, 75, 76]. This is typically done by grouping flows into a small number

of classes. The scheduler then uses the timestamp-based algorithm to determine which class to serve.

Within a class, the scheduling resembles a round-robin scheme. While this strategy turns out to be an

effective approach for single-resource bandwidth sharing, generalizing it to schedule multiple resource

types imposes non-trivial technical challenges. As we shall show in the later sections, even though flows
may have different dominant resources and make different processing progress, the scheduler has to maintain

a weight-proportional dominant service level across all flows. We shall describe our solution to answer

this challenge in the next section.

3.5 Group Multi-Resource Round Robin

In this section, we present an improved round-robin scheduler, referred to as Group Multi-Resource

Round Robin (GMR3), that addresses the delay problem of MR3. GMR3 not only provides a weight-

proportional delay bound, but also achieves near-perfect fairness in O(1) time.

3.5.1 Basic Intuition

While round robin may incur high delay in a general scenario, Theorem 8 indicates that it provides

a good delay bound when flows are of similar weights. In other words, if we group flows with similar

weights into a flow group, then within a group, round robin serves as an excellent scheduler. The challenge
is to schedule flows across groups with different weights.

We have observed that in MR3, flows are always served in a “burst” mode under uneven flow weights

(see Sec. 3.4.3). For example, in Fig. 3.10, flow 1 schedules 5 packets in a row in round 1, and has to

wait for an entire round to schedule its next packet P^1_6 in round 2, resulting in a long scheduling delay of

that packet. Instead of serving flows in a “burst” mode, a better strategy is to spread their scheduling


[Figure 3.11: An improved schedule over MR3 in the example of Fig. 3.10, where the scheduling delay is significantly reduced. P^i_k denotes the kth packet of flow i.]

opportunities over time, in proportion to their respective weights. Fig. 3.11 illustrates an improved

schedule over MR3 in Fig. 3.10, where the scheduling opportunities of flow 1 are interleaved between

those of other flows. Compared to the MR3 schedule in Fig. 3.10, the maximum scheduling delay of flow

1 is reduced from 13 to 5 (shown in Fig. 3.11), and the delay of other flows is also reduced from 20 to 16.

Note that such a delay improvement is achieved without compromising fairness: the dominant services

each flow receives under this schedule remain proportional to its assigned weight.

Our design follows exactly this intuition. The algorithm aggregates flows with similar weights into a

flow group, and makes scheduling decisions in a two-level hierarchy. At the higher level, the algorithm

makes inter-group scheduling decisions to determine a flow group, with the objective of distributing the

scheduling opportunities over time, in proportion to the approximate weights of flows. Within a group,

the intra-group scheduler serves flows in a round-robin fashion. We shall show in Sec. 3.5.7 that this

simple combination leads to remarkable performance guarantees. For now, we focus on the detailed

design in the following subsections.

3.5.2 Flow Grouping

Suppose there are n backlogged flows sharing m middlebox resources. Without loss of generality, let the
flow weights w_i be normalized such that ∑_{i=1}^{n} w_i = 1.

The scheduler collects flows with similar weights into a flow group. Specifically, flow group G_k is defined
as

G_k = {i : 2^{−k} ≤ w_i < 2^{−k+1}},  k = 1, 2, . . .    (3.13)

Thus, the weights of any two flows belonging to the same flow group are within a factor of 2 of each

other.
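As an illustration, the group index in (3.13) can be computed directly from a flow's weight. The C++ sketch below is ours, not the thesis implementation; it assumes the weight is normalized into (0, 1) and uses std::frexp() to extract the binary exponent.

#include <cmath>

// Returns k such that 2^{-k} <= w < 2^{-k+1}, i.e., the index of the flow
// group G_k that a flow of weight w joins. frexp() decomposes w = f * 2^e
// with f in [0.5, 1), so k = 1 - e.
int GroupIndex(double w) {  // assumes 0 < w < 1
    int e;
    std::frexp(w, &e);
    return 1 - e;
}
// Example: GroupIndex(0.5) == 1 and GroupIndex(0.1) == 4, matching the
// grouping of the weight-1/2 and weight-1/10 flows discussed later.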

A similar grouping strategy has also been adopted in many single-resource fair queueing designs [71–73].


[Figure 3.12: An illustration of the scheduling rounds of flow groups, where R^k_l denotes scheduling round l of flow group G_k.]

The significance of this grouping strategy is that it leads to a small number of flow groups n_g, bounded
by n_g ≤ log₂ W. For any practical flow weight distribution, the number of flow groups n_g ≤ 40, and n_g can
hence be safely assumed to be a small constant. This significantly reduces the complexity of the inter-group
scheduling.

3.5.3 Inter-Group Scheduling

The inter-group scheduler determines a flow group to potentially schedule a flow. Each group is associated

with a timestamp, and the one with the earliest timestamp is selected. With appropriate timestamps, the

scheduling opportunities of a flow group would be weight-proportionally distributed over time. Given a

small number of flow groups ng, the complexity of sorting the group timestamps is also a small constant

O(log ng). Among various timestamp-based algorithms, we find that [71] is particularly attractive for

multi-resource extension, due to its simple timestamp computation. In contrast, extending other inter-

group scheduling algorithms (e.g., [72, 73]) to multiple resources would have required referring to the

idealized fluid DRGPS model [66], incurring relatively high computational complexity.

The scheduler maintains an accounting mechanism consisting of a sequence of virtual slots, indexed

by 0, 1, 2, . . . . Each slot is exclusively assigned to one flow, and is the scheduling opportunity of this

flow. Each flow group G_k is associated with a set of scheduling rounds, each spanning 2^k contiguous
slots. The first scheduling round of flow group G_k, denoted R^k_1, starts at slot 0 and ends at slot 2^k − 1,
while the second scheduling round, denoted R^k_2, starts at slot 2^k and ends at slot 2^{k+1} − 1, and so on.

Fig. 3.12 gives an example. Note that the scheduling rounds of different flow groups overlap by design.

The scheduler assigns each backlogged flow i ∈ G_k exactly one slot per scheduling round of flow group
G_k. This allows flow i to receive one scheduling opportunity every 2^k slots, roughly matching the flow's
weight (i.e., 2^{−k} ≤ w_i < 2^{−k+1}). The scheduling opportunities of flows are hence weight-proportionally

distributed over time.
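To make this slot accounting concrete, the following C++ sketch (our illustration, with hypothetical function names) computes the scheduling round of G_k containing a given slot and the slot at which that round ends; the latter serves as the group's timestamp in the inter-group scheduler described next.

#include <cstdint>

// Rounds of group G_k are numbered 1, 2, ...; round l of G_k spans slots
// [(l-1)*2^k, l*2^k - 1].
uint64_t CurrentRound(uint64_t slot, int k) {
    return (slot >> k) + 1;
}

// Ending slot of G_k's current scheduling round, used as its timestamp.
uint64_t GroupTimestamp(uint64_t slot, int k) {
    return (((slot >> k) + 1) << k) - 1;
}
// Example (cf. Fig. 3.12): at slot 0, the current round of G_1 ends at
// slot 1 while that of G_4 ends at slot 15, so G_1 has the earlier timestamp.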

Following the terminology used in [71], a flow group is called active if it contains at least one backlogged flow.


[Figure 3.13: An illustration of the inter-group scheduler selecting flow groups in the example of Fig. 3.10. Within a group, the flow is determined in a round-robin manner by the intra-group scheduler.]

A backlogged flow i ∈ G_k is called pending if it has not yet been assigned a slot in the

current scheduling round of Gk. A flow group is called pending if it contains at least one pending flow.

For every virtual slot t, the inter-group scheduler chooses among all pending flow groups the one with

the earliest timestamp, defined as the ending slot of the current scheduling round of that flow group. Ties

are broken arbitrarily. From the selected flow group, the intra-group scheduler then chooses a pending

flow and assigns it the current slot t (with details to be described in Sec. 3.5.4). A flow temporarily

ceases to be pending once it has been assigned a slot in the current scheduling round of its flow group,

and will become pending again at the beginning of the next scheduling round, if it remains backlogged.

If no group is pending in slot t, the slot is skipped. Algorithm 3 summarizes this inter-group scheduling

process.

Algorithm 3 InterGroupScheduling

 1: t = 0
 2: P = {flow groups that are pending in slot 0}
 3: while TRUE do
 4:     Choose G_k ∈ P, where G_k has the earliest timestamp
 5:     IntraGroupScheduling(G_k)
 6:     P = P − G_k if G_k is no longer pending
 7:     if P = ∅ then
 8:         Do nothing until there is a backlogged flow
 9:         Advance t to the next slot with pending flows
10:     else
11:         t = t + 1
12:     end if
13:     P = P ∪ {flow groups that become pending in slot t}
14: end while

Fig. 3.13 illustrates an example of the inter-group scheduler assigning slots to groups and flows in
the example of Fig. 3.10. Flow 1 belongs to G_1 as its weight is 1/2, while flows 2 to 6 are grouped into
G_4 as each of them weighs 1/10. At slot 0, both G_1 and G_4 are pending, with the ends of their current
scheduling rounds (i.e., R^1_1 and R^4_1 in Fig. 3.13) at slot 1 and slot 15, respectively. The inter-group
scheduler hence picks G_1, from which the intra-group scheduler selects flow 1 as it is the only backlogged
flow in G_1. Flow 1 then schedules its packets for processing and ceases to be pending in the current


[Figure 3.14: The schedule determined by GMR3 in the example of Fig. 3.10, where f^l_i denotes the packet processing for flow i ∈ G_k in scheduling round l of its flow group G_k. The slot axis is only for the accounting mechanism, while the time axis shows the real elapsed time.]

scheduling round R^1_1. As a result, at slot 1, only flow group G_4 is pending and is hence selected, from
which the intra-group scheduler selects flow 2 and assigns it the slot. Flow 1 becomes pending again at
slot 2 as a new scheduling round of its flow group G_1 starts (i.e., R^1_2 in Fig. 3.13), and is selected because
its current scheduling round R^1_2 ends earlier than R^4_1, the current scheduling round of G_4. Flow groups
G_1 and G_4 are hence selected alternately in the following slots until slot 9, after which all flows of G_4
have been assigned slots in the current scheduling round R^4_1 and cease to be pending in the remaining
slots of R^4_1. For this reason, slots 11, 13, and 15 are not assigned to any flows, as no flow is pending then.
All these slots are simply skipped by the scheduler. It is worth mentioning that the schedule shown in Fig. 3.13

only gives the scheduling order of flows. The corresponding packet-level schedule is further determined

by the intra-group scheduler and is shown in Fig. 3.14, where f^l_i denotes the actual packet processing

for flow i in scheduling round l of its flow group.

Inter-group scheduling distributes flows’ scheduling slots over time, in proportion to their approximate

weights. While this achieves coarse-grained fairness for single-resource scheduling [71–73], it is not the

case in the multi-resource setting, as the flows may not receive dominant services in their assigned

slots. For example, in Fig. 3.14, flow 1 is assigned slot 0 in time interval [0, 1), but receives dominant

services (i.e., link transmission) later in [1, 3). Flow 2, on the other hand, always receives dominant

services (i.e., CPU processing) in its assigned slots. Due to this service asynchronicity, distributing

flows’ scheduling slots weight-proportionally over time does not ensure that the opportunities of receiving

dominant services are also distributed weight-proportionally. Without appropriate control, the potential

service asynchronicity may lead to poor fairness with extremely unbalanced resource utilization. This

is also the key challenge of multi-resource scheduling as compared with its single-resource counterpart

(e.g., [71–73, 76]). We show in the next subsection that this challenge can be effectively addressed by a

carefully designed intra-group scheduler.


3.5.4 Intra-Group Scheduling

Once the flow group is determined, the intra-group scheduler chooses a pending flow from that group

in a round-robin manner. Compared to round robin for bandwidth sharing (e.g., [25, 67, 68, 71–73, 76]),

the intra-group scheduler operates with two important differences. First, for the purpose of DRF,

the scheduler maintains a credit system to keep track of the dominant services a flow receives, not the

amount of bits a flow transmits. Second, the scheduler employs a progress control mechanism to reinforce

a relatively consistent work progress across resources, so as to eliminate the adverse effects caused by

the aforementioned service asynchronicity.

Credit System: Every time a flow i is assigned a slot, it receives a credit ci (whose size is given in

(3.14) below), which is the time given to the flow for packet processing on its dominant resource in the

current scheduling round. As long as there are available credits, flow i is allowed to schedule a packet

for processing, and the corresponding packet processing time on the dominant resource is deducted from

its total credit. A flow i can overdraw the processing time by scheduling at most one more packet than

those allowed by the available credits. The excessive consumption of dominant services is tracked by the

excess counter ei, and will be deducted from the credit given in the next scheduling round as a penalty

of overconsumption.

While MR3 adopts a similar credit system in its design (see Sec. 3.4.3), the intra-group scheduler of

GMR3 operates with an important difference. Every time a flow i is assigned a slot, instead of receiving

an elastic amount of credits in different rounds, it is given a fixed-size credit that is proportional to its

weight w_i. Specifically, for flow i ∈ G_k, the given credit c_i is

c_i = 2^k L w_i ,    (3.14)

where L is the maximum packet processing time. The motivation for defining credit in this manner is

two-fold.

To begin with, even if two flows i, j belong to the same group Gk, flow i’s weight wi may be up

to twice as large as wj . Despite their weight difference, both flows are assigned exactly one slot per

scheduling round of Gk. Therefore, to ensure weight-proportional dominant services, the given credits

as shown in (3.14) are proportional to their respective weights.

Moreover, for each flow i ∈ G_k, since 2^{−k} ≤ w_i < 2^{−k+1}, the scaling factor 2^k L in (3.14) ensures that

L ≤ c_i < 2L .    (3.15)


Because the given credits are larger than the maximum packet processing time, they can always com-

pensate for the overconsumption of dominant service flow i incurs in the previous scheduling round. As

a result, flow i will always have available credits when assigned a slot, and can schedule at least one

packet. In addition, by (3.15), the given credits are roughly the same across all flow groups. This is

significant as flow i ∈ Gk is already assigned slots in proportion to its approximate weight 2−k, so that

in each slot, the scheduler should allocate all flows approximately the same dominant services.
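A one-line C++ sketch of this credit rule (our illustration, not the simulator's code):

// Credit per scheduling round for flow i in group G_k, from (3.14):
// c_i = 2^k * L * w_i. Since 2^{-k} <= w_i < 2^{-k+1}, the result always
// falls in [L, 2L), as guaranteed by (3.15).
double Credit(int k, double L, double w_i) {
    return static_cast<double>(1ULL << k) * L * w_i;
}
// Example: with L = 1, a weight-1/2 flow in G_1 receives credit 1.0 and a
// weight-1/10 flow in G_4 receives credit 1.6 -- both within [L, 2L).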

Progress Control Mechanism: In addition to the credit system, the scheduler also employs a

progress control mechanism to reinforce a relatively consistent processing rate across resources. Specifi-

cally, whenever a flow i ∈ Gk is assigned a slot t in the scheduling round l of Gk, the scheduler checks the

work progress on the last resource (usually the link bandwidth). If flow i has already received services

on the last resource in the previous scheduling round l − 1, or flow i is a new arrival, then its packet is

scheduled immediately. Otherwise, the scheduler defers packet scheduling until flow i starts to receive

service on the last resource in the previous scheduling round l − 1 of Gk. For example, as shown in

Fig. 3.14, in slot 12, the packet processing for flow 1 (i.e., f^7_1) is withheld in round 7 of G_1 until the
packet processed in round 6 (i.e., f^6_1) starts transmission. Similar deferrals are also shown in slots

14 and 16.

Intuitively, this progress control mechanism ensures that the work progress on one resource is never

ahead of that on the other by more than 1 round, hence achieving an approximately consistent processing

rate across resources, in spite of the potential service asynchronicity. This progress control mechanism

is essential to the fairness and delay guarantee of GMR3, as shown in our analysis in Sec. 3.5.7.

To summarize, Algorithm 4 gives the detailed design of the intra-group scheduling. Every flow group G_k
maintains an ActiveFlowList[k] for its backlogged flows. It also uses RoundRobinCounter[k] and Round[k]

to keep track of the current scheduling round. Every time flow group Gk is selected, the intra-group

scheduler chooses flow i ∈ Gk at the head of ActiveFlowList[k]. Flow i is given a credit to compensate

for its overdraft in the previous round, and schedules packets until no credit remains or no packet is
backlogged (lines 6 to 15). After that, the flow ceases to be pending and is appended to the tail of the

active list if it remains backlogged. Flow group Gk ceases to be pending when all its backlogged flows

are serviced in the current scheduling round. If no flow is backlogged, flow group Gk becomes inactive.

3.5.5 Handling New Packet Arrivals

In addition to the inter- and intra-group scheduling, the GMR3 scheduler also needs to handle new packet

arrivals. Algorithm 5 gives the detailed procedure. In addition to enqueueing the newly arrived packet p


Algorithm 4 IntraGroupScheduling(G_k)

 1: if RoundRobinCounter[k] == 0 then
 2:     RoundRobinCounter[k] = ActiveFlowList[k].Length()
 3:     Round[k] += 1                            ▷ The current scheduling round of G_k
 4: end if
 5: Flow i = ActiveFlowList[k].RemoveFromHead()
 6: b_i = 2^k L w_i − e_i                        ▷ b_i tracks the available credit of flow i
 7: while IsBacklogged(i) and b_i ≥ 0 do
 8:     while FlowProgressOnLastResource[i] < Round[k] − 1 do
 9:         Withhold the scheduling opportunity of flow i
10:     end while
11:     Packet p = Queue[i].Dequeue()
12:     p.SchedulingRound = Round[k]
13:     ProcessPacket(p)                         ▷ Schedule for CPU processing
14:     b_i = b_i − DominantProcessingTime(p)
15: end while
16: if IsBacklogged(i) then
17:     e_i = −b_i                               ▷ e_i tracks the overdraft of credits of flow i
18:     ActiveFlowList[k].AppendToTail(i)
19: else
20:     e_i = 0
21: end if
22: RoundRobinCounter[k] -= 1
23: if RoundRobinCounter[k] == 0 then
24:     Flow group G_k ceases to be pending
25: end if
26: if ActiveFlowList[k] == ∅ then
27:     Deactivate(G_k)                          ▷ Flow group G_k ceases to be active
28: end if

to the queue of flow i ∈ Gk to which the packet belongs, the scheduler also appends flow i to the active

list of its flow group G_k if flow i was previously inactive. Flow group G_k is also set to active accordingly.

Algorithm 5 PacketArrival(p)

1: Let i be the flow to which the newly arrived packet p belongs
2: Queue[i].Enqueue(p)
3: Let G_k be the flow group to which flow i belongs
4: if ActiveFlowList[k].Contains(i) == FALSE then
5:     ActiveFlowList[k].AppendToTail(i)
6:     if IsActive(G_k) == FALSE then
7:         Activate(G_k)                         ▷ Flow group G_k becomes active
8:     end if
9: end if

3.5.6 Implementation and Complexity

So far, we have described the design of GMR3. We next show that appropriate implementations allow

GMR3 to make packet scheduling decisions in O(1) time for almost all practical scenarios.

Flow Grouping: To identify the flow group G_k of flow i, it suffices to locate the most significant
bit of w_i that is set to 1, as 2^{−k} ≤ w_i < 2^{−k+1}. Direct computation requires only O(log W) time, which is a


small constant for almost all practical weight distributions (i.e., log W ≤ 40). In fact, with the support

of a standard priority encoder, this operation can be accomplished in a few bitwise operations [71, 73],

and is strictly O(1).

Inter-Group Scheduling: There are three important operations in Algorithm 3, i.e., choosing a

flow group (line 4), advancing to the earliest slot with pending groups (line 9), and updating the pending

set P (line 13). Given a small number of flow groups n_g, all these operations can be accomplished in

O(1) time using the simple methods described in [71], which we briefly mention in the following.

The scheduler uses two bitmaps a = a_{n_g} · · · a_2 a_1 and g = g_{n_g} · · · g_2 g_1 to track the active and pending
flow groups. Bit a_k is set to 1 if flow group G_k is active, and 0 otherwise. Similarly, bit g_k is 1 if group
G_k is pending, and 0 otherwise.

• Choosing a flow group: It is easy to check that, in any slot t, the scheduling round of flow group G_k
ends earlier than those of all flow groups G_{k′} with k′ > k (see Fig. 3.12). Flow group G_k hence
has a higher priority to be chosen than G_{k′}. As a result, the chosen group G_k can be identified by
locating the rightmost bit g_k of bitmap g that is set to 1. Such an operation can be done in O(1)
time by a standard priority encoder [71].

• Advancing to the earliest slot with pending groups: Because the start of a scheduling round of
group G_k is also the start of a scheduling round of every group G_{k′} with k′ > k (see Fig. 3.12), the
scheduler should advance to the start of the next scheduling round of the lowest-numbered flow
group that is active. This group can be identified by locating the rightmost bit a_k that is set to 1, and
the new slot is the smallest multiple of 2^k greater than the current slot t. With the support of a
priority encoder, all these operations are done in O(1) time.

• Updating the pending set: At slot t, an active flow group G_k becomes pending if 2^k divides t. To
identify all these groups, it suffices to locate the least significant bit of t that is set to 1. Let it
be the kth least significant bit of t. Then all active flow groups G_{k′} with k′ ≤ k become pending
at t, and can be found via some simple bit operations in O(1) [71], as sketched below.
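The following C++ sketch illustrates these three operations under our own assumptions: groups G_1, . . . , G_{n_g} map to bits 0, . . . , n_g − 1 of a 64-bit word, and the GCC/Clang intrinsic __builtin_ctzll plays the role of the priority encoder. It sketches the technique of [71]; it is not the thesis implementation.

#include <cstdint>

// Choosing a flow group: the pending group with the earliest timestamp is
// the lowest-numbered one, i.e., the rightmost set bit of the pending bitmap.
int ChooseGroup(uint64_t pending) {        // precondition: pending != 0
    return __builtin_ctzll(pending) + 1;   // bit k-1 <-> group G_k
}

// Advancing to the earliest slot with pending groups: the smallest multiple
// of 2^k greater than slot t, where G_k is the lowest-numbered active group.
uint64_t NextPendingSlot(uint64_t t, uint64_t active) {  // active != 0
    int k = __builtin_ctzll(active) + 1;
    return ((t >> k) + 1) << k;
}

// Updating the pending set at slot t > 0: an active group G_k' becomes
// pending iff 2^k' divides t, i.e., k' is at most the index of t's lowest
// set bit.
uint64_t NewlyPending(uint64_t t, uint64_t active) {     // t != 0
    int k = __builtin_ctzll(t);            // 2^k divides t, 2^{k+1} does not
    return active & ((1ULL << k) - 1);     // bits of groups G_1 .. G_k
}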

Intra-Group Scheduling: In Algorithm 4, an essential operation is to track the work progress on

the last resource of the selected flow i (lines 8 to 10) to determine whether the scheduling opportunity of flow i

should be withheld. For the purpose of efficient implementation, a packet p of flow i, upon scheduling,

is associated with a tag recording the current scheduling round of flow group Gk to which flow i belongs

(line 12 of Algorithm 4). Whenever packet p starts service on the last resource m, the progress of flow

i on that resource is updated to the scheduling round tagged to packet p, which will be used later to


determine the timing of withholding packet processing of flow i (line 8). All these operations can be

done in O(1) time.
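A minimal C++ sketch of this bookkeeping, with our own illustrative types rather than the simulator's:

#include <vector>

struct Packet {
    int flowId = 0;
    long schedulingRound = 0;  // tag written at line 12 of Algorithm 4
};

// Work progress of each flow on the last resource m, indexed by flow id.
// Assumed sized to the number of flows; a newly arrived flow's entry is
// initialized so that it is never withheld (cf. the new-arrival rule above).
std::vector<long> flowProgressOnLastResource;

// Invoked when packet p starts service on the last resource: the flow's
// progress jumps to the round in which p was scheduled.
void OnLastResourceStart(const Packet& p) {
    flowProgressOnLastResource[p.flowId] = p.schedulingRound;
}

// The withhold test of line 8 in Algorithm 4: a flow scheduled in round r
// must wait until its last-resource progress reaches round r - 1.
bool MustWithhold(int flowId, long round) {
    return flowProgressOnLastResource[flowId] < round - 1;
}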

Another operation that may introduce additional complexity is obtaining the packet processing time
on the dominant resource (line 14). Note that such information is required only after the packet has
been processed by the CPU. At that time, the scheduler knows exactly how the packet should be processed
next and what resources are required. The packet processing time on each of the following resources can
hence be accurately inferred via some simple packet profiling techniques in O(1) time. For example, a
simple linear model based on the packet size has proved to be sufficiently accurate for estimation [4].

To conclude, with appropriate implementations mentioned above, both inter-group and intra-group

scheduling decisions can be made in O(1) time per packet, making GMR3 a highly efficient multi-resource

scheduler for middleboxes.

3.5.7 Performance Analysis

In this section, we analyze the properties of GMR3 and show that it achieves near-perfect fairness with

scheduling delays bounded by a small constant.

Fairness

For the purpose of fairness analysis, we derive the RFB of GMR3 defined in Sec. 3.3. We start by

bounding the dominant services a flow receives in any backlogged period (t1, t2) as follows.

Lemma 4. Let T_i(t_1, t_2) be the dominant service a backlogged flow i receives in a time interval (t_1, t_2).
We have

xLw_i − 9L ≤ T_i(t_1, t_2) ≤ xLw_i + 9L ,    (3.16)

where x is the number of slots, complete and partial, that have been assigned to flows in (t_1, t_2).

Proof: Let x_i be the number of slots assigned to flow i ∈ G_k in (t_1, t_2). We show that flow i receives
services on its dominant resource in at least x_i − 2 scheduling rounds, and in at most x_i + 2 scheduling
rounds. To see this, let l_1 (resp. l_2) be the scheduling round of flow group G_k to which the first (resp. last)
slot assigned to flow i belongs in (t_1, t_2). Clearly, we have x_i = l_2 − l_1 + 1. By Algorithm 4, for any
flow, the progress gap between any two resources is upper bounded by one scheduling round. Therefore,
at time t_1, the processing for flow i on the last resource m has progressed to at least scheduling round
l_1 − 2, as illustrated in Fig. 3.15. Also, by time t_2, the work progress of flow i on the last resource m
is at most in scheduling round l_2 (because the work progress on resource m is always behind that on


[Figure 3.15: Illustration of the maximum number of scheduling rounds of flow i on the last resource in (t_1, t_2).]

resource 1). As a result, in (t_1, t_2), flow i receives dominant services in at most l_2 − (l_1 − 2) + 1 = x_i + 2
scheduling rounds.

With a similar argument, we see that the work progress of flow i on the last resource m is at most
in scheduling round l_1 at time t_1, and is at least in scheduling round l_2 − 2 at time t_2. Flow i hence
receives its dominant services in at least x_i − 2 scheduling rounds in (t_1, t_2).

Therefore, the dominant services flow i receives are at least (x_i − 2)c_i − L and at most (x_i + 2)c_i + L,
where c_i = 2^k L w_i is the credit given to flow i in each scheduling round, i.e.,

(x_i − 2)c_i − L ≤ T_i(t_1, t_2) ≤ (x_i + 2)c_i + L .    (3.17)

Also note that for resource 1, the number of scheduling rounds of flow group G_k contained in (t_1, t_2)
is at least x_i − 2, and at most x_i + 2. Because each scheduling round of G_k spans exactly 2^k slots, we
have 2^k(x_i − 2) ≤ x ≤ 2^k(x_i + 2), which is equivalent to

2^{−k}x − 2 ≤ x_i ≤ 2^{−k}x + 2 .    (3.18)

Substituting (3.18) into (3.17), we derive

T_i(t_1, t_2) ≤ (x_i + 2)c_i + L
            ≤ (2^{−k}x + 4)c_i + L
            = 2^{−k}x · 2^k L w_i + 4c_i + L
            ≤ xLw_i + 9L ,    (3.19)

where the last inequality uses c_i < 2L from (3.15). Similarly, we have

T_i(t_1, t_2) ≥ (x_i − 2)c_i − L ≥ xLw_i − 9L .    (3.20)


Combining (3.19) and (3.20) leads to the statement. □

We are now ready to derive the RFB of GMR3 as follows.

Theorem 9. For any time interval (t_1, t_2) and any two flows i, j that are backlogged, we have

|T_i(t_1, t_2)/w_i − T_j(t_1, t_2)/w_j| ≤ 9L (1/w_i + 1/w_j) .

Proof: For any flow i, applying Lemma 4 and dividing both sides of (3.16) by w_i, we have

xL − 9L/w_i ≤ T_i(t_1, t_2)/w_i ≤ xL + 9L/w_i .    (3.21)

Similar inequalities also hold for flow j, i.e.,

xL − 9L/w_j ≤ T_j(t_1, t_2)/w_j ≤ xL + 9L/w_j .    (3.22)

Combining (3.21) and (3.22) leads to the statement. □

Theorem 9 indicates that GMR3 bounds the difference between the normalized dominant services

received by two flows in any backlogged period by a small constant. GMR3 hence provides near-perfect

fairness across flows, irrespective of their traffic patterns. Furthermore, we note that the fairness guaran-

tees provided by existing multi-resource fair queueing schemes, e.g., [4], all assume flows do not change

their dominant resources throughout the backlogged periods (a.k.a. the resource monotonicity assump-

tion [4]), while Theorem 9 does not require such an assumption.

Scheduling Delay

In addition to the fairness guarantees, we show that GMR3 ensures that the scheduling delay is bounded

by a small constant that is inversely proportional to the flow’s weight. To see this, the following two

lemmas are needed in the analysis.

Lemma 5. Let d^l_i be the dominant service flow i ∈ G_k receives in scheduling round l of G_k. We have

0 ≤ d^l_i ≤ 3L .    (3.23)

Proof: Let e^l_i be the excessive consumption of dominant services flow i ∈ G_k incurs in scheduling
round l of G_k. By Algorithm 4, flow i may overdraw the processing time by scheduling at most one


more packet than those allowed by the available credits. Therefore,

0 ≤ e^l_i ≤ L .    (3.24)

Also note that at the beginning of scheduling round l, flow i has available credits c_i − e^{l−1}_i. After round
l, all these credits are used, with an excessive consumption e^l_i. The dominant services flow i receives in
round l are hence

d^l_i = c_i − e^{l−1}_i + e^l_i ≤ c_i + e^l_i ≤ 3L ,

where the last inequality is derived from (3.15) and (3.24). □

Lemma 6. For flow i ∈ G_k and scheduling round l of G_k, let t_0 be the time when flow i is completely
processed on resource 1 in round l of G_k, and t_1 the time when flow i is completely processed on the last
resource m in the same round l. We have

t_1 − t_0 < 12mL/w_i .

Proof: By Algorithm 4, the progress gap between any two resources is bounded by 1 round. There-
fore, at time t_0, flow i must have already received its packet processing on the last resource m in round
l − 1 of G_k. Because there are at most 2^{k+1} slots assigned to flows in rounds l − 1 and l of G_k, and each
slot is assigned to at most one flow, the number of flows served on the last resource in (t_0, t_1), denoted
n_f, is upper bounded by 2^{k+1}, i.e.,

n_f ≤ 2^{k+1} .

Let these flows be j_1, . . . , j_{n_f}, operating in scheduling rounds l_1, . . . , l_{n_f} of their respective flow groups
on the last resource. In particular, j_{n_f} = i and l_{n_f} = l. By Algorithm 4, flow j_1 starts service on
resource 1 in round l_1 of its flow group no later than the time when its previously served flow has been
processed on the last resource m, which is no later than t_0. Therefore, flow j_1 is completely processed
on the last resource in round l_1 of its flow group no later than

t_0 + m·d^{l_1}_{j_1} ≤ t_0 + 3Lm ,    (3.25)

where the inequality is derived from Lemma 5. Similarly, flow j_2 starts its service on resource 1 in round
l_2 no later than the time its previous flow j_1 is completely processed on the last resource m in round
l_1, which is no later than t_0 + 3Lm by (3.25). As a result, flow j_2 is completely processed on the last

Chapter 3. Multi-Resource Fair Queueing for Network Flows 78

f l2j2

Time

...

...

1

m

...

f li

f li

t1t0

f l1j1

f l1j1

t2 t3 tnf

f l+1i

f l2j2 f l+1

i

ResourceScheduling delay Di(p)

... tnf−1

Figure 3.16: The illustration of a scenario where the scheduling delay Di(p) reaches the maximum. Here,f li denotes the processing of flow i in scheduling round l of its flow group.

resource in round l_2 of its flow group no later than

t_0 + 3Lm + m·d^{l_2}_{j_2} ≤ t_0 + 6Lm .

By induction, flow j_u is completely processed on the last resource in round l_u of its flow group no later
than

t_0 + 3uLm ,    u = 1, 2, . . . , n_f .

Now letting u = n_f and noting that n_f ≤ 2^{k+1} and j_{n_f} = i, we have

t_1 − t_0 ≤ 3n_f Lm ≤ 3Lm · 2^{k+1} ≤ 12mL/w_i ,

where the last inequality holds because 2^{−k} ≤ w_i < 2^{−k+1}, which implies 2^{k+1} ≤ 4/w_i. □

We now bound the scheduling delay of GMR3 as follows.

Theorem 10. For any flow i, the scheduling delay of its packet p is bounded by

SD_i(p) < 24mL/w_i ,

where m is the number of resources.

Proof: For any flow i ∈ G_k, the scheduling delay of its packet p reaches its maximum when p reaches
the head of the queue in scheduling round l of G_k, but is not processed until the next round l + 1 of G_k.
Since there are at most 2^{k+1} slots in between and each slot is assigned to one flow, the number of flows
that have been assigned slots during this time, denoted n_f, is upper bounded by 2^{k+1}, i.e.,

n_f ≤ 2^{k+1} .

Let these flows be j_1, . . . , j_{n_f}, with their assigned slots in their current scheduling rounds l_1, . . . , l_{n_f} of


their respective flow groups. In particular, j_{n_f} = i and l_{n_f} = l + 1. By Algorithm 4, flow j_1 starts service
on resource 1 no later than the time its previous flow i is completely processed on the last resource m in
round l of G_k. Similarly, flow j_2 starts its service on resource 1 no later than the time when its previous
flow j_1 is completely processed on the last resource m in round l_1 of its flow group, and so on. Fig. 3.16
illustrates this scenario, where t_u is the latest time instant at which flow j_u receives service on resource 1 in
round l_u of its flow group, u = 1, 2, . . . . We then have

t_{u+1} − t_u ≤ m·d^{l_u}_{j_u} ≤ 3Lm ,    u = 1, 2, . . . ,    (3.26)

tu+1 − tu ≤ mdluju ≤ 3Lm, u = 1, 2, . . . , (3.26)

where the second inequality is derived from Lemma 5. In other words, the time span of processing flow

ju on all resources in round lu reaches its maximum when the processing time is maximized on every

resource.

Now let t_0 be the time when packet p reaches the head of the queue in scheduling round l of its flow
group, which is also the time when flow i is completely processed on resource 1 in the same round l (see
Fig. 3.16). By Lemma 6, we have

t_1 − t_0 ≤ 12mL/w_i .    (3.27)

With (3.26) and (3.27), we bound the scheduling delay SD_i(p) as follows:

SD_i(p) ≤ ∑_{u=1}^{n_f} (t_u − t_{u−1})
        < 12mL/w_i + 3Lm·n_f
        ≤ 12mL/w_i + 3Lm·2^{k+1}
        ≤ 24mL/w_i ,

where the last inequality holds because 2^{−k} ≤ w_i < 2^{−k+1}, which implies 2^{k+1} ≤ 4/w_i. □

Theorem 10 gives a strictly weight-proportional scheduling delay bound that is independent of the

number of flows. This implies that a flow is guaranteed to be scheduled within a small constant amount

of time that is inversely proportional to the processing rate (weight) the flow deserves, irrespective of

the behaviours of other flows. To our knowledge, this is the first multi-resource packet scheduler that

offers this property.

To conclude, Table 3.3 compares the performance of GMR3 with the two existing multi-resource fair

queueing schemes, i.e., DRFQ [4] and MR3. We see that GMR3 is the only scheduler that provides


Table 3.3: Summary of the performance of GMR3 and existing schemes, where n is the number of flows, and m is the number of resources.

    Scheme       Complexity    Fairness* (RFB)       Scheduling Delay
    DRFQ [4]     O(log n)      L(1/w_i + 1/w_j)      Unknown
    MR3          O(1)          2L(1/w_i + 1/w_j)     2(m + W)²L/w_i
    GMR3         O(1)          9L(1/w_i + 1/w_j)     24mL/w_i

    * The fairness analysis of DRFQ requires that flows do not change their dominant resources throughout the backlogged periods [4].

provably good performance guarantees on fairness and delay in O(1) time.

3.5.8 Evaluation

As a complementary study to our theoretical analysis, we experimentally evaluate the fairness and delay

performance of GMR3 via simulations.

General Setup

All simulation results are based on our event-driven packet simulator written in 3,000 lines of C++
code. Packets follow Poisson arrivals and are processed serially on resources, with CPU processing first,

followed by link transmission. In addition to GMR3, we also implement DRFQ [4] and MR3 for the

purpose of comparison. The simulator simulates packet processing in 3 typical middlebox modules, i.e.,

basic forwarding (Basic), statistical monitoring (Stat. Mon.), and IP security encryption (IPsec). The

first two modules are bandwidth intensive, with monitoring consuming slightly more CPU resources,

while IPsec is CPU intensive. According to the measurement results reported in [4], the CPU processing

time required by each middlebox module follows a simple linear model based on packet size x, and is

α_k x + β_k, where α_k and β_k are parameters of module k and are summarized in Table 3.2 (see Sec. 3.4.5).

The link transmission time is proportional to the packet size, and the output bandwidth of the middlebox

is 200 Mbps, the same as [4].
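For concreteness, the link cost in this setup reduces to a one-line helper (our illustration, not the simulator's code):

// Transmission time in microseconds on the 200 Mbps output link:
// (bytes * 8) bits divided by (Mbit/s) yields microseconds.
double LinkTimeUs(int packetSizeInBytes) {
    const double kLinkMbps = 200.0;
    return packetSizeInBytes * 8.0 / kLinkMbps;
}
// Example: a 1400-byte packet occupies the link for 1400 * 8 / 200 = 56 us.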

Fairness

We confirm experimentally that GMR3 provides near-perfect service isolation across flows, irrespective of

their traffic behaviours. The simulator generates 30 traffic flows that send 1300-byte UDP packets for 30

seconds. Flows 1 to 10 pass through the Basic module; flows 11 to 20 undergo statistical monitoring; while

flows 21 to 30 require IPsec encryption. Among all these flows, flows 1, 11, and 21 are rogue traffic, each

sending 30,000 pkts/s. All other flows behave normally, each sending 3,000 pkts/s. Flows are assigned

random weights uniformly drawn from 1 to 1000. Fig. 3.17a depicts the dominant services, in seconds,



[Figure 3.17: Simulation results of the fairness and delay performance of GMR3, as compared to DRFQ and MR3. Panel (a) is dedicated to the fairness evaluation (normalized dominant service), while panels (b), (c), and (d) compare the scheduling delay of the three schedulers: (b) CDF of the scheduling delay; (c) mean scheduling delay; (d) maximum scheduling delay.]

received by different flows under GMR3, normalized to their respective weights. We see that despite

the presence of ill-behaving traffic, GMR3 allows flows through different modules to receive weight-

proportional dominant services, enforcing service isolation. Similar results have also been observed

using DRFQ and MR3, and are not shown in the figure.

Scheduling Delay

We next confirm experimentally that GMR3 significantly improves the packet scheduling delay, as com-

pared to existing multi-resource scheduling alternatives. The simulator generates 150 UDP flows with

flow weights uniformly drawn from 1 to 1000. A flow randomly chooses one of the three middlebox

modules to pass through. To congest the middlebox resources, the flow rate is set to 500 pkts/s, with

packet sizes uniformly drawn from 200 B to 1400 B, which are the typical settings for Ethernet. For

each processed packet, we record its scheduling delay, using DRFQ, MR3, and GMR3, respectively. The

simulation spans 30 seconds.

Fig. 3.17b shows the CDF of the scheduling delay a packet experiences, from which we see the significance of GMR3 on delay improvement: using GMR3, over 95% of packets are scheduled within 20 ms, which is roughly the minimum time a packet has to wait under DRFQ and MR3! A detailed statistics breakdown is given in Fig. 3.17c and 3.17d. Fig. 3.17c shows the mean scheduling delay a flow experiences with respect to its weight. We see that GMR3 consistently leads to a smaller mean delay

improvement is not limited to the average case. Fig. 3.17d gives the maximum delay a flow experiences

with respect to its weight. We see that both GMR3 and DRFQ offer a weight-proportional delay bound.

While DRFQ achieves a smaller delay bound for flows with smaller weights, GMR3 is generally better

for more important flows with medium to large weights. MR3, on the other hand, fails to provide service differentiation among flows. Intuitively, since flows are served in rounds, in the worst case, a packet has


to wait for the entire scheduling round until it is processed, incurring a worst-case delay that is as long

as the span of an entire round. GMR3 avoids this problem by distributing the scheduling opportunities

over time, in proportion to the flows’ weights.

3.6 Discussion and Future Work

While both analysis and simulation show that GMR3 offers provably good performance on fairness,

delay, and complexity, we point out that it does not outperform existing schemes in every aspect, and

is by no means a “final solution” to multi-resource fair queueing. In particular, as shown in Table 3.3,

GMR3 offers a relatively looser fairness guarantee with a larger RFB than those of DRFQ and MR3.

This means that GMR3 is not as good as the other two schemes in terms of the short-term fairness, and

is not preferred by short-lived flows (a.k.a. mice flows). However, for long-lived flows (a.k.a. elephant

flows), all three schemes are similar as their RFBs are all bounded by a small constant. As a result,

GMR3 is not a perfect fit when strict short-term fairness is of particular importance to the system.

Also, even excluding the slight disadvantage of short-term fairness, GMR3 is not an all-around

improvement over MR3 for the following two reasons. To begin with, while both MR3 and GMR3 operate

at O(1) complexity, the former has a simpler mechanism and is easier to implement than the latter. Under MR3, unlike GMR3, flows are simply served in rounds; no flow grouping or inter-group scheduling strategy is needed. Moreover, MR3 operates without any a priori knowledge of packet processing,

which is not the case for GMR3. Specifically, in each round, MR3 awards each flow an elastic amount

of credits based on its excessive credit consumption in the previous round, such that the flow is always

given sufficient credits to process at least one packet (see Sec. 3.4.3). Such an elastic credit-awarding

mechanism enables MR3 to operate without a priori knowledge of the maximum packet processing time

L. In contrast, this information is required by GMR3 to compute the amount of credits awarded to a

flow in each round (see (3.14) and line 6 of Algorithm 4). We also note that the elastic credit-awarding

mechanism adopted by MR3 cannot be applied to GMR3: it only respects the intra-group fairness for

flows in one flow group, yet violates the inter-group fairness among flow groups.

In summary, GMR3 can be viewed as a balanced design in terms of fairness, delay, and complexity. It

remains open to investigate if and how these performance metrics can be further improved. Furthermore,

what is the ultimate performance tradeoff that one can expect? Given that similar questions have been

answered for single-resource fair queueing [77,78], investigations into these open questions in the multi-

resource setting may present a promising future direction.


3.7 Summary

The potential congestion of packet processing with respect to multiple resource types in a middlebox

complicates the design of packet scheduling algorithms. Previously proposed multi-resource fair queueing

schemes require O(log n) complexity per packet and are expensive to implement at high speeds. In this

chapter, we present two new schedulers both operating in O(1) time.

Our first scheduler, MR3, was designed for unweighted flows. MR3 serves flows in rounds. At the

same time, it keeps track of the work progress on each resource and withholds the scheduling opportunity

of a packet until the progress gap between any two resources falls below one round. Theoretical analyses

have indicated that MR3 implements near-perfect DRF across flows. MR3 is very easy to implement,

and is the first multi-resource fair scheduler that offers near-perfect fairness with O(1) time complexity.

Our second scheduler, GMR3, was designed for weighted flows. GMR3 collects flows with similar

weights into the same flow group, and makes scheduling decisions in a two-level hierarchy. The inter-

group scheduler determines a flow group, from which the intra-group scheduler picks a flow in a round-

robin manner. Through this design, GMR3 eliminates the sorting bottlenecks suffered by existing multi-

resource scheduling alternatives such as DRFQ, and is able to handle a large volume of traffic at high

speeds. More importantly, we have shown, both analytically and experimentally, that GMR3 ensures

a constant scheduling delay bound that is inversely proportional to the flow weight, hence offering

predictable delay guarantees for individual flows. To our knowledge, GMR3 is the first multi-resource

fair queueing algorithm that offers near-perfect fairness with a constant scheduling delay bound in O(1)

complexity.

We believe that both MR3 and GMR3 may find general applications in other multi-resource scheduling

contexts where jobs must be scheduled as entities, such as multi-tenant scheduling in deep software stacks

and VM scheduling inside a hypervisor.

3.8 Proofs

3.8.1 Fairness Analysis for MR3

In this subsection, we give detailed proofs of lemmas and corollaries that are used in the fairness analysis

presented in Sec. 3.4.4. We start by proving Lemma 2.

Proof of Lemma 2: If flow i has no packets to serve (i.e., the input queue is empty) after round

k, then $EC_i^k = 0$ (line 24 in Module 2), and the statement holds. Otherwise, let packet p be the last


packet of flow i served in round k. Let $B_i'$ be its account balance before packet p is processed. We have
$$EC_i^k = \mathrm{DominantProcessingTime}(p) - B_i' \le L_i,$$
where the inequality holds because $B_i' \ge 0$ and $\mathrm{DominantProcessingTime}(p) \le L_i$. □

We next show Lemma 3.

Proof of Lemma 3: At the beginning of round k, flow i has an account balance $B_i = \mathrm{MaxEC}^{k-1} - EC_i^{k-1}$. After round k, all this amount of processing time has been consumed on the dominant resources of the processed packets, with an excessive consumption $EC_i^k$. The dominant service flow i receives in round k is therefore $D_i^k = B_i + EC_i^k$, which is the RHS of (3.4). □

Finally, we prove Corollary 2.

Proof of Corollary 2: By Lemma 3 and noticing that $EC_i^{k-1} \ge 0$, we have
$$D_i^k \le \mathrm{MaxEC}^{k-1} + EC_i^k \le 2L,$$
where the last inequality holds because of Lemma 2 and Corollary 1. □

3.8.2 Analysis of Startup Latency of MR3

In this subsection, we bound the startup latency of MR3 by proving Theorem 3.

Proof of Theorem 3: Without loss of generality, suppose n flows are backlogged in round k − 1,

with flow 1 being served first, followed by flow 2, and so on. After flow n has been served on resource 1,

flow n + 1 becomes active and is appended to the tail of the active list. Flow n + 1 is therefore served

right after flow n in round k. In particular, suppose flow n+1 becomes active at time 0. Let $S_{i,r}^k$ be the time when flow i starts to receive service in round k on resource r, and let $F_{i,r}^k$ be the time when the scheduler finishes serving flow i in round k on resource r. Specifically,
$$F_{n,1}^{k-1} = 0.$$
As shown in Fig. 3.18, the startup latency of flow n+1 is
$$SL = F_{n,1}^k. \tag{3.28}$$

The following two relationships are useful in the analysis. For all flows i = 1, . . . , n and resources


[Figure 3.18: Illustration of the startup latency. In the figure, flow i in round k is denoted as $i^k$. Flow n+1 becomes active when flow n has been served on resource 1 in round k−1, and will be served right after flow n in round k.]

$r = 1, \ldots, m$, we have
$$F_{i,r}^k \le S_{i,r}^k + D_i^k \le S_{i,r}^k + 2L, \tag{3.29}$$
where the last inequality holds because of Corollary 2. Further, for all flows $i = 1, \ldots, n$, we have

$$S_{i,1}^k = \begin{cases} \max\{S_{1,m}^{k-1},\, F_{n,1}^{k-1}\}, & i = 1,\\[2pt] \max\{S_{i,m}^{k-1},\, F_{i-1,1}^{k}\}, & i = 2, \ldots, n. \end{cases} \tag{3.30}$$

That is, flow i is scheduled in round k right after its previous-round service has started on the last resource m (in this case, the progress gap between any two resources is no more than one round) and its previous flow has been served on resource 1.

The following lemma is also required in the analysis.

Lemma 7. For all flows $i = 1, \ldots, n$ and all resources $r = 1, \ldots, m$, we have
$$S_{i,r}^{k-1} \le 2(i+r-2)L, \qquad F_{i,r}^{k-1} \le 2(i+r-1)L. \tag{3.31}$$

Proof of Lemma 7: We observe the following relationship for all resources $r = 2, \ldots, m$:
$$S_{i,r}^{k-1} \le \begin{cases} \max\{F_{n,r}^{k-2},\, F_{1,r-1}^{k-1}\}, & i = 1,\\[2pt] \max\{F_{i-1,r}^{k-1},\, F_{i,r-1}^{k-1}\}, & i = 2, \ldots, n. \end{cases} \tag{3.32}$$

That is, flow i starts to receive service on resource r no later than the later of two times: when it has been served on resource r−1, and when its previous flow has been served on the same resource (see Fig. 3.18).

To see the statement, we apply induction on r and i. First, when r = 1, the statement trivially holds because
$$S_{i,1}^{k-1} \le F_{i,1}^{k-1} \le F_{n,1}^{k-1} = 0, \quad i = 1, \ldots, n. \tag{3.33}$$

When r = 2, i = 1, we have
$$S_{1,2}^{k-1} \le \max\{F_{n,2}^{k-2},\, F_{1,1}^{k-1}\} \le 2L, \qquad F_{1,2}^{k-1} \le S_{1,2}^{k-1} + 2L \le 4L. \tag{3.34}$$

This is because
$$\begin{aligned}
F_{n,2}^{k-2} &\le F_{n,m}^{k-2}\\
&\le S_{n,m}^{k-2} + 2L && (\text{by } (3.29))\\
&\le S_{1,1}^{k-1} + 2L && (\text{by the MR3 algorithm})\\
&\le 2L && (\text{by } (3.33)),
\end{aligned} \tag{3.35}$$

and
$$S_{1,2}^{k-1} \le \max\{F_{n,2}^{k-2},\, F_{1,1}^{k-1}\} \le \max\{2L, 0\} = 2L. \tag{3.36}$$

Now assume the statement holds for some (r, i) and for (r−1, i+1). Note that for (r, i+1), we have
$$S_{i+1,r}^{k-1} \le \max\{F_{i,r}^{k-1},\, F_{i+1,r-1}^{k-1}\} \le 2(i+r-1)L, \tag{3.37}$$
where the first inequality is by (3.32) and the second by induction,

and
$$F_{i+1,r}^{k-1} \le S_{i+1,r}^{k-1} + 2L \le 2(i+r)L. \tag{3.38}$$

Therefore, the statement holds for $(r, i)$, $i = 1, \ldots, n$. We then consider the case of $(r+1, 1)$. We have
$$\begin{aligned}
S_{1,r+1}^{k-1} &\le \max\{F_{n,r+1}^{k-2},\, F_{1,r}^{k-1}\} && (\text{by } (3.32))\\
&\le \max\{F_{n,m}^{k-2},\, 2rL\}\\
&\le \max\{2L,\, 2rL\} && (\text{by } (3.35))\\
&= 2rL,
\end{aligned} \tag{3.39}$$

and
$$F_{1,r+1}^{k-1} \le S_{1,r+1}^{k-1} + 2L \le 2(r+1)L. \tag{3.40}$$
Therefore, by induction, the statement holds. □

Lemma 7 leads to the following lemma.


Lemma 8. The following relationship holds for all flows $i = 1, \ldots, n$:
$$S_{i,1}^k \le 2(m+i-2)L, \qquad F_{i,1}^k \le 2(m+i-1)L. \tag{3.41}$$

Proof of Lemma 8: For i = 1, we have
$$\begin{aligned}
S_{1,1}^k &= \max\{S_{1,m}^{k-1},\, F_{n,1}^{k-1}\} && (\text{by } (3.30))\\
&\le \max\{2(m-1)L,\, 0\} && (\text{by Lemma 7})\\
&= 2(m-1)L,
\end{aligned} \tag{3.42}$$

and
$$F_{1,1}^k \le S_{1,1}^k + 2L \le 2mL. \tag{3.43}$$

Assume the statement holds for some i. Then for flow i+1, we have
$$S_{i+1,1}^k = \max\{S_{i+1,m}^{k-1},\, F_{i,1}^k\} \le 2(i+m-1)L,$$
where the equality is by (3.30) and the inequality by Lemma 7 and (3.41), and
$$F_{i+1,1}^k \le S_{i+1,1}^k + 2L \le 2(i+m)L. \tag{3.44}$$

Hence by induction, the statement holds. □

We are now ready to prove Theorem 3. By (3.28), it is equivalent to show
$$SL = F_{n,1}^k \le 2(m+n-1)L,$$
which is a direct result of Lemma 8. □

3.8.3 Analysis of Scheduling Delay of MR3

We next analyze the single packet delay of MR3 by proving Theorem 4.

Proof of Theorem 4: For any packet p, let a(p) be the time when packet p reaches the head of

the input queue and is ready for service. Let d(p) be the time when packet p has been processed on all

resources and leaves the system. The scheduling delay of packet p is defined as
$$SD(p) = d(p) - a(p). \tag{3.45}$$


[Figure 3.19: Illustration of the packet latency, where flow i in round k is denoted as $i^k$. The figure shows the scenario under which the latency reaches its maximal value: a packet p is pushed to the top of the input queue in one round but is scheduled in the next round because of the account deficit.]

Without loss of generality, assume packet p belongs to flow n, and is pushed to the top of the input

queue at time 0 in round k−1. The delay SD(p) reaches its maximal value when packet p is scheduled

in the next round k, as shown in Fig. 3.19.

We use the same notations as those in the proof of Theorem 3. Let $S_{i,r}^k$ be the time when flow i starts to receive service in round k on resource r, and let $F_{i,r}^k$ be the time when flow i has been served in round k on resource r. As shown in Fig. 3.19, the scheduling delay satisfies
$$SD(p) = d(p) \le F_{n,m}^k. \tag{3.46}$$

We claim the following relationships for all flows $i = 1, \ldots, n$ and resources $r = 1, \ldots, m$, from which the statement follows:
$$S_{i,r}^k \le 2(m+n+i+r-2)L, \qquad F_{i,r}^k \le 2(m+n+i+r-1)L. \tag{3.47}$$

To see this, we extend (3.32) to round k by replacing k−1 in (3.32) with k:
$$S_{i,r}^k \le \begin{cases} \max\{F_{n,r}^{k-1},\, F_{1,r-1}^k\}, & i = 1,\\[2pt] \max\{F_{i-1,r}^k,\, F_{i,r-1}^k\}, & i = 2, \ldots, n. \end{cases} \tag{3.48}$$

We now show (3.47) by induction. First, by Lemma 8, (3.47) holds when r = 1. Also, for r = 2, i = 1, we have
$$\begin{aligned}
S_{1,2}^k &\le \max\{F_{n,2}^{k-1},\, F_{1,1}^k\} && (\text{by } (3.48))\\
&\le \max\{2(n+1)L,\, 2mL\} && (\text{by } (3.31), (3.41))\\
&\le 2(m+n+1)L,
\end{aligned}$$
and
$$F_{1,2}^k \le S_{1,2}^k + 2L \le 2(m+n+2)L.$$

Now assume (3.47) holds for some (r, i) and for (r−1, i+1). Note that for (r, i+1), we have
$$S_{i+1,r}^k \le \max\{F_{i,r}^k,\, F_{i+1,r-1}^k\} \le 2(m+n+i+r-1)L,$$
where the first inequality is by (3.48) and the second by induction, and
$$F_{i+1,r}^k \le S_{i+1,r}^k + 2L \le 2(m+n+i+r)L.$$

Therefore, by induction, (3.47) holds for $(r, i)$, $i = 1, \ldots, n$. We then consider the case of $(r+1, 1)$. We have
$$\begin{aligned}
S_{1,r+1}^k &\le \max\{F_{n,r+1}^{k-1},\, F_{1,r}^k\} && (\text{by } (3.48))\\
&\le \max\{2(n+r)L,\, 2(m+n+r)L\} && (\text{by Lemma 7 and induction})\\
&= 2(m+n+r)L,
\end{aligned}$$
and
$$F_{1,r+1}^k \le S_{1,r+1}^k + 2L \le 2(m+n+r+1)L.$$

Hence by induction, (3.47) holds. □

3.8.4 Fairness Analysis of Weighted MR3

To prove Theorem 6, we require the following relationships, which are natural extensions of those given in Sec. 3.4.4.

Lemma 9. Under weighted MR3, for every flow i and round k, we have
$$EC_i^k \le L_i/w_i.$$

Corollary 4. Under weighted MR3, for every round k, we have
$$\mathrm{MaxEC}^k \le \max_{1 \le i \le n} L_i/w_i.$$

Lemma 10. Under weighted MR3, for every flow i and round k, we have
$$D_i^k/w_i = \mathrm{MaxEC}^{k-1} - EC_i^{k-1} + EC_i^k,$$
where $EC_i^0 = 0$ and $\mathrm{MaxEC}^0 = 0$.

Corollary 5. Under weighted MR3, for every flow i and round k, the following relationship holds:
$$D_i^k \le (1+W)L.$$

Proof: By Lemma 10, we derive
$$\begin{aligned}
D_i^k &\le w_i\, \mathrm{MaxEC}^{k-1} + w_i\, EC_i^k\\
&\le w_i \max_{1 \le j \le n} L_j/w_j + L_i\\
&\le (1+W)L,
\end{aligned}$$
where the second inequality is derived from Lemma 9 and Corollary 4. □

With all these relationships, Theorem 6 can be proved in a similar way as Theorem 2.

Proof sketch of Theorem 6: Following the notation and the analysis given in the proof of Theorem 2, we bound the normalized dominant service flow i receives in $(t_1, t_2)$ as follows:
$$\sum_{k=R_1+1}^{R_2-2} \frac{D_i^k}{w_i} \;\le\; \frac{T_i(t_1, t_2)}{w_i} \;\le\; \sum_{k=R_1-1}^{R_2} \frac{D_i^k}{w_i}.$$

Now applying Lemmas 9 and 10, we have
$$X - 5L_i/w_i \le T_i(t_1, t_2)/w_i \le X + L_i/w_i, \tag{3.49}$$
where $X = \sum_{k=R_1-1}^{R_2} \mathrm{MaxEC}^{k-1}$. A similar inequality also holds for flow j, i.e.,
$$X - 5L_j/w_j \le T_j(t_1, t_2)/w_j \le X + L_j/w_j. \tag{3.50}$$
Taking the difference between (3.49) and (3.50) leads to the statement. □

3.8.5 Delay Analysis of Weighted MR3

We now briefly outline how the analyses of startup latency and scheduling delay presented in Secs. 3.8.2 and 3.8.3 extend to weighted MR3.

Proof sketch of Theorem 7: Following the notations used in the proof of Theorem 3, it is equivalent to show
$$SL = F_{n,1}^k \le (1+W)(m+n-1)L. \tag{3.51}$$


By Corollary 5, we rewrite (3.29) as
$$F_{i,r}^k \le S_{i,r}^k + D_i^k \le S_{i,r}^k + (1+W)L. \tag{3.52}$$

Combining (3.30) and (3.52) and following the analysis of Lemma 7, one can easily show that for all flows $i = 1, \ldots, n$ and resources $r = 1, \ldots, m$, there must be
$$S_{i,r}^{k-1} \le (1+W)(i+r-2)L, \qquad F_{i,r}^{k-1} \le (1+W)(i+r-1)L.$$

This leads to an extension of Lemma 8, where the following relationship holds for all flows $i = 1, \ldots, n$:
$$S_{i,1}^k \le (1+W)(m+i-2)L, \qquad F_{i,1}^k \le (1+W)(m+i-1)L.$$

Taking i = n, we see that (3.51) holds. □

Similarly, the analysis for the scheduling delay also extends to weighted MR3.

Proof sketch of Theorem 8: We can easily extend (3.47) to the following inequalities, following exactly the same induction steps given in the proof of Theorem 4:
$$S_{i,r}^k \le (1+W)(m+n+i+r-2)L, \qquad F_{i,r}^k \le (1+W)(m+n+i+r-1)L. \tag{3.53}$$

As a result, we have
$$SD_i(p) = d(p) \le F_{n,m}^k \le (1+W)(2m+2n-1)L < 2(1+W)(m+n)L. \tag{3.54}$$

Also notice that flow weights are normalized such that
$$\sum_{j=1}^n w_j = 1.$$
Dividing both sides by $w_i$ and noting that $w_j/w_i \ge 1/W$ for every j, we have
$$\frac{1}{w_i} = \sum_{j=1}^n \frac{w_j}{w_i} \ge \frac{n}{W}, \quad\text{i.e.,}\quad n \le W/w_i.$$


Substituting this inequality into (3.54), we have
$$\begin{aligned}
SD_i(p) &< 2(1+W)(m+n)L\\
&\le 2(1+W)(m+W/w_i)L\\
&\le 2(m+W)^2 L/w_i,
\end{aligned}$$
where the last inequality holds because $m \ge 1$ and $w_i \le 1$. □


Chapter 4

Fairness-Efficiency Tradeoff for Multi-Resource Scheduling

4.1 Motivation

In the previous chapter, we have mainly focused on fairness in the design of a queueing algorithm. Here,

fairness means predictable service isolation among flows, and is embodied by Dominant Resource Fairness

(DRF) in the presence of multiple types of resources. However, fairness is just one side of the story.

In addition to fairness, resources should also be shared efficiently. In the context of packet scheduling,

efficiency measures the resource utilization achieved by a queueing algorithm. High resource utilization

naturally translates into high traffic throughput. This is of particular importance to enterprise networks,

given the surging volume of traffic passing through middleboxes [18,30].

Both fairness and efficiency can be achieved at the same time in traditional single-resource fair queue-

ing, where bandwidth is the only concern. As long as the schedule is work conserving [24], bandwidth

utilization is 100% given a non-empty system. That leaves fairness as an independent objective to

optimize.

However, in the presence of multiple resource types, fairness is often a conflicting objective against

efficiency. To see this, consider two schedules shown in Fig. 4.1 with two flows whose packets need CPU

processing before transmission. Packets that finish CPU processing are placed into a buffer in front of

the output link. Each packet in Flow 1 has a processing time vector 〈2, 3〉, meaning that it requires 2

time units for CPU processing and 3 time units for transmission; each packet in Flow 2 has a processing

time vector 〈9, 1〉. The dominant resource of Flow 1 is link bandwidth, as it takes more time to transmit



[Figure 4.1: An example showing the tradeoff between fairness and efficiency for multi-resource packet scheduling. Packets that finish CPU processing are placed into a buffer in front of the output link. Flow 1 sends packets p1, p2, ..., each having a processing time vector ⟨2, 3⟩; Flow 2 sends packets q1, q2, ..., each having a processing time vector ⟨9, 1⟩. Panel (a): a packet schedule that is fair but inefficient. Panel (b): a packet schedule that is efficient but violates DRF.]

a packet than processing it using CPU; similarly, the dominant resource of Flow 2 is CPU. To achieve

DRF, the transmission time Flow 1 receives should be approximately equal to the CPU processing time

Flow 2 receives. In this sense, Flow 1 should schedule three packets whenever Flow 2 schedules one,

so that each flow receives 9 time units of service on its dominant resource, as shown in Fig. 4.1a. This schedule, though fair, leads to poor bandwidth utilization: the link is idle for 1/3 of the time. On the

other hand, Fig. 4.1b shows a schedule that achieves 100% CPU and bandwidth utilization by serving

eight packets of Flow 1 and one packet of Flow 2 alternately. The schedule, though efficient, violates

DRF. While Flow 1 receives 24/25 of the link bandwidth, Flow 2 receives only 9/25 of the CPU time.
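To make these ratios explicit, the per-cycle resource usage of the two schedules works out as follows (each packet of Flow 1 costs ⟨2, 3⟩ and each packet of Flow 2 costs ⟨9, 1⟩; we count one repeating cycle and ignore pipeline startup):
$$\begin{aligned}
\text{Fig. 4.1a (fair):}\quad & \text{CPU} = 3\cdot 2 + 9 = 15, \quad \text{link} = 3\cdot 3 + 1 = 10
\;\Rightarrow\; \text{link idle } 5/15 = 1/3;\\
\text{Fig. 4.1b (efficient):}\quad & \text{CPU} = 8\cdot 2 + 9 = 25, \quad \text{link} = 8\cdot 3 + 1 = 25
\;\Rightarrow\; \text{Flow 1: } \tfrac{24}{25} \text{ of the link},\; \text{Flow 2: } \tfrac{9}{25} \text{ of the CPU}.
\end{aligned}$$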

The fairness-efficiency tradeoff shown in the example above generally exists for multi-resource packet

scheduling. However, existing multi-resource queueing algorithms focus solely on fairness, e.g., DRFQ [4],

MR3 (Sec. 3.4), and GMR3 (Sec. 3.5). On the other hand, for applications having a loose fairness

requirement, trading off a modest degree of fairness for higher efficiency and higher throughput is well

justified. In general, depending on the underlying applications, a network operator may weigh fairness

and efficiency differently. Ideally, a multi-resource queueing algorithm should allow network operators to

flexibly specify their tradeoff preference and implement the specified tradeoff by determining the “right”

packet scheduling order.

However, designing such a queueing algorithm is non-trivial. It remains to be seen how efficiency can

be quantitatively defined. Further, it remains open how the tradeoff requirement should be appropriately

specified. But most importantly, given a specific tradeoff requirement, how can the scheduling decision


be correctly made to implement it?

This chapter represents the first attempt to address these challenges. We clarify the efficiency mea-

sure as the schedule makespan, which is the completion time of the last flow. We show that achieving

a flexible tradeoff between fairness and efficiency is generally NP-hard. We hence limit our discussion

to a typical scenario where CPU and link bandwidth are the two types of resources required for packet

processing, which is usually the case in middleboxes. We show that the fairness-efficiency tradeoff can

be strictly enforced by a GPS-like (Generalized Processor Sharing [21, 22]) fluid model, where packets

are served in arbitrarily small increments on both resources. To implement the idealized fluid in the

real world, we design a packet-by-packet tracking algorithm, using an approach similar to the virtual

time implementation of Weighted Fair Queueing (WFQ) [21, 22, 79]. We have prototyped our tradeoff

algorithm in the Click modular router [80]. Both our prototype implementation and trace-driven simu-

lation show that a 15% ∼ 20% fairness tradeoff is sufficient to achieve the optimal efficiency, leading to

a nearly 20% improvement in bandwidth throughput with a significantly higher resource utilization.

4.2 Fairness and Efficiency

Before discussing the tradeoff between fairness and efficiency, we shall first clarify how the notion of

fairness is to be defined, and how efficiency is to be measured quantitatively. We model packet processing

as going through a resource pipeline, where the first resource is consumed to process the packet first,

followed by the second, and so on. A packet is not available for the downstream resource until the

processing on the upstream resource finishes. For example, a packet cannot be transmitted (which

consumes link bandwidth) before it has been processed by CPU.

4.2.1 Dominant Resource Fairness

As we have seen in Chapter 3, fairness is one of the primary design objectives for a queueing algorithm.

A fair schedule offers service isolation among flows by allowing each flow to receive the throughput at

least at the level when every resource is evenly allocated. The notion of Dominant Resource Fairness

(DRF) embodies this isolation property by achieving the max-min fairness on the dominant resources of

packets in their respective flows. In Chapter 3, we have defined the dominant resource of a packet as the

one that requires the maximum packet processing time. In particular, let $\tau_r(p)$ be the time required to process packet p on resource r. The dominant resource of packet p is $r_p = \arg\max_r \tau_r(p)$. Given a packet schedule, let $D_i(t_1, t_2)$ be the time flow i receives to process the dominant resources of its packets in a backlogged period $(t_1, t_2)$. The function $D_i(t_1, t_2)$ is referred to as the dominant service flow i receives


in (t1, t2). A schedule is said to strictly implement DRF if for all flows i and j, and for any period (t1, t2)

they backlog, we have

$$D_i(t_1, t_2) = D_j(t_1, t_2). \tag{4.1}$$

In other words, a strict DRF schedule allows each flow to receive the same dominant service in any

backlogged period.

However, because packets are scheduled as separate entities and are transmitted in sequence, strictly

implementing DRF at all times may not be possible in practice. For this reason, a practical fair schedule

only requires flows to receive approximately the same dominant services over time [4], as shown in the

previous example of Fig. 4.1a.

4.2.2 The Efficiency Measure

In addition to fairness, efficiency is another important concern for a multi-resource scheduling algorithm,

but has received no significant attention before. Even the definition of efficiency needs clarification.

Perhaps the most widely adopted efficiency measure is system throughput, whose conventional def-

inition is the rate of completions [81], computed as the processed workload divided by the elapsed

time (e.g., bits per second). While this performance metric is well defined for single-resource systems,

extending its definition to multiple types of resources leads to a throughput vector, where each compo-

nent is the throughput of one type of resource (e.g., 10 CPU instruction completions per second and

5 bits transmitted through the output link per second), and different throughput vectors may not be

comparable.

Another possible efficiency measure is resource utilization given a non-empty system, or simply resource utilization in the remainder of this chapter.1 However, in a middlebox, different resources may see

different levels of utilization. The question is: how should the “system utilization” be properly defined?

One possible definition is to add up the utilization rates of all resources. This definition implicitly

assumes exchangeable resources, say, 1% CPU usage is equivalent to 1% bandwidth consumption, which

may not be well justified in many circumstances, especially when one type of resource is scarce in the

system and is valued more than the other.

In this chapter, we measure efficiency with the schedule makespan. Given input flows with a finite

number of packets, the makespan of a schedule is defined as the time elapsed from the arrival of the first

packet to the time when all packets finish processing on all resources. One can also view makespan as

1 This definition is different from that of queueing theory, where the utilization is defined as the fraction of time a device is busy [81]. Under this definition, high utilization usually means a high congestion level with a large queue backlog and long delays [82], and is usually not desired.


the completion time of the last flow. Intuitively, given a finite traffic input, the shorter the makespan is,

the faster the input traffic is processed, and the more efficient the schedule is.2

4.2.3 Tradeoff between Fairness and Efficiency

With the precise measure of efficiency, we are curious to know how much efficiency is sacrificed for fair

queueing. To answer this question, we first generalize the definition of work conserving schedules from

traditional single-resource fair queueing to multiple resources. In particular, we say a schedule is work

conserving if at least one resource is fully utilized for packet processing when there is a backlogged flow.

In other words, a work conserving schedule does not allow resources to sit idle if they can be

used to process a backlogged packet. Existing multi-resource fair queueing algorithms (e.g., DRFQ [4],

MR3, and GMR3) use the goal of achieving work conservation as an indication of efficiency. However,

in the theorem below, we observe that such an approach is ineffective.

Theorem 11. Let m be the number of resource types concerned. Given any traffic input I, let $T_\sigma(I)$ be the makespan of a work conserving schedule σ, and $T^*(I)$ the minimum makespan of an optimal schedule. We have
$$T_\sigma(I) \le m\, T^*(I). \tag{4.2}$$

Proof. Given a traffic input I, let the work conserving schedule σ consist of $n_b$ busy periods. A busy

period is a time interval during which at least one type of resource is used for packet processing. When

the system is empty and a new packet arrives, a new busy period starts. The busy period ends when

the system becomes empty again. We consider the following two cases.

Case 1: $n_b = 1$. Let traffic input I consist of N packets, ordered based on their arrival times, where packet 1 arrives first. For packet i, let $\tau_r^{(i)}$ be its packet processing time on resource r. It is easy to check that the following inequality holds for the optimal schedule with the minimum makespan:
$$T^*(I) \ge \max_r \sum_{i=1}^N \tau_r^{(i)}. \tag{4.3}$$

On the other hand, for work conserving schedule σ, its makespan reaches the maximum when packet

2 In general, makespan is not the only efficiency measure that one can define. For example, we can also measure efficiency with the average flow completion time. We choose makespan as the efficiency measure in this chapter because it leads to tractable analysis. More importantly, makespan closely relates to "system utilization" and is conceptually easy to understand. The discussion of other possible efficiency measures is out of the scope of this chapter.


processing does not overlap in time, across all resources, i.e.,
$$T_\sigma(I) \le \sum_{i=1}^N \sum_{r=1}^m \tau_r^{(i)}. \tag{4.4}$$

This leads to the following inequalities:
$$T_\sigma(I) \le \sum_{i=1}^N \sum_{r=1}^m \tau_r^{(i)} \le \sum_{i=1}^N m \max_r \tau_r^{(i)} \le m\, T^*(I). \tag{4.5}$$

Case 2: $n_b > 1$. Given traffic input I, let $I(t^+)$ be the set of packets that arrive at or after time t. For schedule σ, let $t_0$ be the time when its second last busy period ($n_b - 1$) ends, and $t_1$ the time when the last busy period ($n_b$) starts. Because schedule σ is work conserving, no packet arrives between $t_0$ and $t_1$. We have
$$T_\sigma(I) = t_1 + T_\sigma(I(t_1^+)), \tag{4.6}$$
and
$$T^*(I) = t_1 + T^*(I(t_1^+)). \tag{4.7}$$

Note that given traffic input $I(t_1^+)$, schedule σ consists of only one busy period. By the discussion of Case 1, we have
$$T_\sigma(I) = t_1 + T_\sigma(I(t_1^+)) \le t_1 + m\, T^*(I(t_1^+)) \le m\, T^*(I), \tag{4.8}$$
where the last inequality is derived from (4.7). □

We make the following three observations from Theorem 11. First, the tradeoff between fairness

and efficiency is a unique challenge facing multi-resource scheduling. When the system consists of only

one type of resource (i.e., m = 1), work conservation is sufficient to achieve the minimum makespan,

leaving fairness as the only concern. For this reason, efficiency has never been a problem for traditional

single-resource fair queueing. Second, while work conservation also provides some efficiency guarantee

for multi-resource scheduling, the more types of resources, the weaker the guarantee. Third, even

with a small number of resource types, the efficiency loss could be quite significant. Since bandwidth

throughput is inversely proportional to the schedule makespan, Theorem 11 implies that solely relying

on work conservation may incur up to 50% loss of bandwidth throughput when there are two types of


resources. While this is based on the worst case, as we shall see later in Sec. 4.6, our experiments confirm

that a throughput loss of as much as 20% is introduced by the existing fair queueing algorithms. Trading

off some degree of fairness for higher efficiency is therefore well justified, especially for applications with

loose fairness requirements.

4.2.4 Challenges

Unfortunately, striking a desired balance between fairness and efficiency in a multi-resource system

is technically non-trivial. Even minimizing the makespan without regard to fairness – a special case

of fairness-efficiency tradeoff – is NP-hard. In particular, we note that minimizing the makespan of

a packet schedule can be modeled as a multi-stage flow shop problem [83–85] studied in operations

research, where the equivalent of a packet is a job, and the equivalent of a type of resource is a machine.

However, flow shop scheduling is a notoriously hard problem, even in its offline setting where the entire

input is known beforehand. Specifically, when all jobs (packets) are available at the very beginning,

finding the minimum makespan is strongly NP-hard when the number of machines (resources) is greater

than two [86].

Given the hardness results above, in this chapter, we limit our discussion to two types of resources,

CPU and link bandwidth, as these are the two most concerned middlebox resources [4,17]. We note that

even with two types of resources, minimizing the schedule makespan remains a hard problem. Because

packets arrive dynamically over time, the problem resembles a 2-machine online flow shop scheduling

problem where jobs (packets) do not reveal their information until they arrive. For this problem, only a

limited number of negative results are known [83–85, 87, 88]. Specifically, no online algorithm can ensure a

makespan within a factor of 1.349 of the optimum in all cases [89]. We also notice that no existing work

gives a concrete solution, even a heuristic algorithm, that jointly considers both makespan and fairness.

4.3 Fairness, Efficiency, and Their Tradeoff in the Fluid Model

The difficulty of makespan minimization is mainly introduced by the combinatorial nature of multi-

resource scheduling. One approach to circumvent this problem is to consider a fluid relaxation, where

packets are served in arbitrarily small increments on all resources. For each packet, this is equivalent

to processing it simultaneously on all resources with the same progress, and head-of-line packets of

backlogged flows can also be served in parallel, at (potentially) different processing rates. Such a parallel

processing fluid model eliminates the need for discussing the scheduling orders of flows. Instead, it

allows us to focus on the resource shares allocated to flows, hence relaxing a combinatorial optimization


problem to a simpler dynamic resource allocation problem. While in general, optimally solving such

a dynamic problem requires knowing future packet arrivals, we show in this section that, under some

practical assumptions, a greedy algorithm gives an optimal online schedule with the minimum makespan.

We can then strike a balance between efficiency and fairness by imposing some fairness constraints to

the fluid schedule. We shall discuss later in Sec. 4.4 and Sec. 4.5 how this fluid schedule is implemented

in practice with a packet-by-packet tracking algorithm at acceptable complexity.

4.3.1 Fluid Relaxation

In the fluid model, a flow is relaxed to a fluid where each of its packets is served simultaneously on all

resources with the same progress. Packets of different flows are also served in parallel. The schedule

needs to decide, at each time, the resource share allocated to each backlogged flow. In particular, let $B^t$ be the set of flows that are backlogged at time t. Let $a_{i,r}^t$ be the fraction (share) of resource r allocated to flow i at time t. The fluid schedule determines, at each time t, the resource allocation $a_{i,r}^t$ for each backlogged flow i and each resource r.

Two constraints must be satisfied when making resource allocation decisions. First, we must ensure that no resource is allocated more than its total availability:
$$\sum_{i \in B^t} a_{i,r}^t \le 1, \quad r = 1, 2. \tag{4.9}$$

The second constraint ensures that a packet is processed at a consistent rate across resources. In particular, for a backlogged flow i and its head-of-line packet at time t, let $\tau_{i,r}^t$ be its packet processing time on resource r, and
$$r_i = \arg\max_r \tau_{i,r}^t \tag{4.10}$$
be its dominant resource. The processing rate that this packet receives on resource r is computed as the ratio between the resource share allocated and the processing time required: $a_{i,r}^t/\tau_{i,r}^t$. To ensure a consistent processing rate, we have
$$a_{i,r}^t/\tau_{i,r}^t = a_{i,r'}^t/\tau_{i,r'}^t, \quad \text{for all } r \text{ and } r'.$$
Substituting $r_i$ for $r'$ above, we see a linear relation between the allocation share of resource r and that of the dominant resource:
$$a_{i,r}^t = \frac{\tau_{i,r}^t}{\tau_{i,r_i}^t}\, a_{i,r_i}^t = \bar{\tau}_{i,r}^t\, d_i^t, \tag{4.11}$$


Table 4.1: Main notations used in the fluid model. The superscript t is dropped when time can be clearly inferred from the context.

    Notation                                       Explanation
    n                                              maximum number of flows that are concurrently backlogged
    $\alpha$                                       fairness knob specified by the network operator
    $B$ (or $B^t$)                                 set of flows that are currently backlogged (at time t)
    $d_i$ (or $d_i^t$)                             dominant share allocated to flow i (at time t)
    $d$ (or $d^t$)                                 fair dominant share (at time t), given by (4.16)
    $\tau_{i,r}$ (or $\tau_{i,r}^t$)               packet processing time on resource r required by the head-of-line packet of flow i (at time t)
    $\bar{\tau}_{i,r}$ (or $\bar{\tau}_{i,r}^t$)   normalized $\tau_{i,r}$ (or $\tau_{i,r}^t$), defined by (4.12)

where
$$\bar{\tau}_{i,r}^t = \tau_{i,r}^t/\tau_{i,r_i}^t \tag{4.12}$$
is the normalized packet processing time on resource r, and
$$d_i^t = a_{i,r_i}^t \tag{4.13}$$
is the dominant share allocated to flow i at time t. Plugging (4.11) into (4.9), we combine the two constraints into one feasibility constraint of a fluid schedule:
$$\sum_{i \in B^t} \bar{\tau}_{i,r}^t\, d_i^t \le 1, \quad r = 1, 2. \tag{4.14}$$
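As a minimal illustration of these definitions, the following C++ sketch (the struct and function names are ours, not part of any implementation in this thesis) computes the normalized processing times of (4.12) and checks the feasibility constraint (4.14) for the two-resource case:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // A toy model of the two-resource fluid state (CPU = 0, link = 1).
    // tau[r] is the processing time of the flow's head-of-line packet on r.
    struct Flow {
        double tau[2];
    };

    // Normalized processing time (4.12): tau_{i,r} divided by the time on
    // the dominant resource r_i (the resource with the larger time).
    double normalized(const Flow& f, int r) {
        return f.tau[r] / std::max(f.tau[0], f.tau[1]);
    }

    // Feasibility constraint (4.14): with dominant shares d[i], neither
    // resource may be allocated beyond its unit capacity.
    bool feasible(const std::vector<Flow>& flows, const std::vector<double>& d) {
        for (int r = 0; r < 2; ++r) {
            double used = 0.0;
            for (std::size_t i = 0; i < flows.size(); ++i)
                used += normalized(flows[i], r) * d[i];
            if (used > 1.0) return false;
        }
        return true;
    }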

Before we discuss the tradeoff between fairness and efficiency, we first consider two special cases,

where either fairness or efficiency is the only objective to optimize in the fluid model. For ease of

presentation, we drop the superscript t when time can be clearly inferred from the context. Table 4.1

summarizes the main notations used in the fluid model.

4.3.2 Fluid Schedule with Perfect Fairness

We first consider the fairness objective. To achieve perfect DRF, the fluid schedule enforces strict max-

min fairness on flows’ dominant shares, under the feasibility constraint. Specifically, the fluid schedule


[Figure 4.2: The DRGPS fluid schedule that implements perfect fairness in the example of Fig. 4.1. Flow 1 sends packets p1, p2, ..., and receives ⟨2/5 CPU, 3/5 bandwidth⟩; Flow 2 sends packets q1, q2, ..., and receives ⟨3/5 CPU, 1/15 bandwidth⟩. Only 2/3 of the link bandwidth is utilized.]

solves the following DRF allocation problem [12, 16] at each time t:
$$\begin{aligned}
\max_{d_i}\;& \min_{i \in B} d_i\\
\text{s.t.}\;& \sum_{i \in B} \bar{\tau}_{i,r}\, d_i \le 1, \quad r = 1, 2.
\end{aligned} \tag{4.15}$$
Let n be the number of backlogged flows. The optimal solution, denoted by $\mathbf{d} = (d_1, \ldots, d_n)$, allocates each backlogged flow the same dominant share, i.e.,
$$d_i = d = 1\Big/\max\Big\{\sum_i \bar{\tau}_{i,1},\, \sum_i \bar{\tau}_{i,2}\Big\}. \tag{4.16}$$

In any backlogged period, because flows are allocated the same dominant shares, they receive the same dominant services, achieving strict DRF at all times. The resulting fluid schedule is also known as DRGPS [66], a multi-resource generalization of the well-known GPS [21, 22].

Any discrete fair schedule is essentially a packet-by-packet approximation of DRGPS. For instance, applying DRGPS to the example of Fig. 4.1 leads to the fluid schedule shown in Fig. 4.2, where the normalized packet processing times of Flow 1 and Flow 2 are $\langle\bar{\tau}_{1,1}, \bar{\tau}_{1,2}\rangle = \langle 2/3, 1\rangle$ and $\langle\bar{\tau}_{2,1}, \bar{\tau}_{2,2}\rangle = \langle 1, 1/9\rangle$, respectively. By (4.16), both flows are allocated the same dominant share $d = 3/5$. Specifically, Flow 1 receives ⟨2/5 CPU, 3/5 bandwidth⟩; Flow 2 receives ⟨3/5 CPU, 1/15 bandwidth⟩. In total, only 2/3 of the bandwidth is utilized, the same as in the discrete fair schedule shown in Fig. 4.1a.
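The fair share in (4.16) is straightforward to compute. The following C++ sketch (a toy illustration under our own naming, not the prototype implementation) reproduces the running example, printing the per-resource allocations ⟨2/5, 3/5⟩ and ⟨3/5, 1/15⟩ via (4.11):

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Fair dominant share under DRGPS, Eq. (4.16):
    //   d = 1 / max{ sum_i bar_tau_{i,1}, sum_i bar_tau_{i,2} },
    // where bar_tau[i][r] is the normalized processing time of flow i on r.
    double drgps_share(const std::vector<std::array<double, 2>>& bar_tau) {
        double sum0 = 0.0, sum1 = 0.0;
        for (const auto& t : bar_tau) { sum0 += t[0]; sum1 += t[1]; }
        return 1.0 / std::max(sum0, sum1);
    }

    int main() {
        // Running example: Flow 1 = <2/3, 1>, Flow 2 = <1, 1/9>.
        std::vector<std::array<double, 2>> bar_tau = {{2.0 / 3, 1.0}, {1.0, 1.0 / 9}};
        double d = drgps_share(bar_tau);  // = 3/5
        for (std::size_t i = 0; i < bar_tau.size(); ++i)
            // Per-resource allocation, Eq. (4.11): a_{i,r} = bar_tau_{i,r} * d.
            std::printf("flow %zu: CPU %.3f, link %.3f\n",
                        i + 1, bar_tau[i][0] * d, bar_tau[i][1] * d);
        return 0;
    }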

4.3.3 Fluid Schedule with Optimal Efficiency

We next discuss the efficiency objective. While there are schedules proposed in the operations research literature that achieve the minimum makespan for a flow shop problem, none of them applies in the context of packet scheduling: they either assume no packet arrivals (e.g., [90]) or require full knowledge of future information (e.g., [91]). We propose a simple greedy fluid schedule as follows.

For a given time instant, we define the system's instantaneous dominant throughput as the sum of the dominant shares allocated, i.e., $\sum_{i \in B} d_i$. Intuitively, by maximizing $\sum_{i \in B} d_i$ at all times, one would expect a high average dominant throughput $\sum_{i \in B} D_i / T$, where T is the schedule makespan and $D_i$ is the total dominant service (processing time) required by flow i. Given the dominant workload $\sum_{i \in B} D_i$, maximizing the average dominant throughput is equivalent to minimizing the schedule makespan T.

Following this intuition, we propose a greedy fluid schedule that solves the following resource allocation problem to maximize the instantaneous dominant throughput at every time:
$$\begin{aligned}
\max_{d_i \ge 0}\;& \sum_{i \in B} d_i\\
\text{s.t.}\;& \sum_{i \in B} \bar{\tau}_{i,r}\, d_i \le 1, \quad r = 1, 2.
\end{aligned} \tag{4.17}$$

In case the optimal solution, denoted $\mathbf{d}^* = (d_1^*, \ldots, d_n^*)$, is not unique, the schedule chooses the one with the maximum overall utilization:
$$\max_{\mathbf{d}^*} \sum_r \sum_{i \in B} \bar{\tau}_{i,r}\, d_i^*. \tag{4.18}$$

In the example of Fig. 4.1, solving (4.17) allocates Flow 1 the dominant share $d_1^* = 24/25$ and Flow 2 the dominant share $d_2^* = 9/25$. It is easy to check that both CPU and link bandwidth are fully utilized.
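These values follow from making both constraints of (4.17) tight in the example, where $\langle\bar{\tau}_{1,1}, \bar{\tau}_{1,2}\rangle = \langle 2/3, 1\rangle$ and $\langle\bar{\tau}_{2,1}, \bar{\tau}_{2,2}\rangle = \langle 1, 1/9\rangle$:
$$\tfrac{2}{3}d_1 + d_2 = 1 \;(\text{CPU}), \qquad d_1 + \tfrac{1}{9}d_2 = 1 \;(\text{link})
\;\;\Longrightarrow\;\; d_1^* = \tfrac{24}{25}, \quad d_2^* = \tfrac{9}{25}.$$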

Compared to those schedules proposed in the operations research literature, the greedy schedule

defined by (4.17) is particularly attractive for packet scheduling due to the following three properties.

First, it is an online algorithm without any a priori knowledge of future packet arrivals. Further, among

all packets that are backlogged, only the information regarding head-of-line packets is required. This

suggests that the schedule only needs to maintain a very simple per-flow state. Most importantly, the

greedy schedule is more than a simple heuristic. Below we show that under some practical assumptions,

greedily maximizing the dominant throughput gives the minimum makespan. Our analysis requires the

following lemma, where we show that the schedule will not waste any resource in idle, unless all flows

bottleneck on the same resource, in which case the other resource cannot be fully utilized anyway. The

proof is given in Sec. 4.9.1.

Lemma 11. The fluid schedule defined by (4.17) fully utilizes both resources if there are two head-of-line packets with different dominant resources, i.e., there exist two flows j and l such that $\bar{\tau}_{j,1} = 1 > \bar{\tau}_{j,2}$ and $\bar{\tau}_{l,1} < \bar{\tau}_{l,2} = 1$.

With Lemma 11, we analyze the makespan of the fluid schedule defined by (4.17). Following [4], we

say a flow is dominant-resource monotonic if it does not change its dominant resource during backlogged

periods. To make the analysis tractable, we assume that flows are dominant-resource monotonic. This


is often true in practice as packets in the same flow usually undergo the same processing, and hence

have the same dominant resource. The following lemma, whose proof can be found in Sec. 4.9.2, states

the optimality of the fluid schedule in a static scenario without dynamic packet arrivals.

Lemma 12. For dominant-resource monotonic flows, the fluid schedule defined by (4.17) gives the

minimum makespan if all packets are available at the beginning.

We now extend the results of Lemma 12 to an online case where packets dynamically arrive over

time. The following theorem gives the optimality condition of the fluid schedule. The proof can be found

in Sec. 4.9.3.

Theorem 12. For dominant-resource monotonic flows, the fluid schedule defined by (4.17) gives the

minimum makespan among all schedules, if after the system has two flows with different dominant

resources, whenever a new flow arrives, there exist two backlogged flows with different dominant resources.

The optimality conditions required by Theorem 12 can be easily met in practice. Because the number

of backlogged flows is usually large, one can almost always find two flows with different dominant resources. In fact, even in the very unfortunate case where all flows bottleneck on the same resource, the greedy fluid schedule does not deviate far from the optimum: no matter what fluid schedule is used, the bottleneck resource is always fully utilized when the system is non-empty, and hence carries the same backlog, which is the dominant factor in determining the schedule makespan.

The significance of Theorem 12 is that it connects makespan, a measure defined in the time domain,

to the instantaneous dominant throughput, a measure defined in the space domain. More importantly,

it shows that minimizing the former is, in a practical sense, equivalent to maximizing the latter at all

times, without the need to know future packet arrivals. We shall use this intuition to strike a balance

between fairness and efficiency in the next subsection.

4.3.4 Tradeoff between Fairness and Efficiency

When both fairness and efficiency are considered, we express the tradeoff between the two conflicting

objectives as a constrained optimization problem – minimizing makespan under some specified fairness

requirements.3 Recall that when perfect fairness is enforced, all flows receive the same dominant share

d computed by (4.16), i.e., $d_i = d$ for all i. When fairness is not a strict requirement, we introduce a fairness knob $\alpha \in [0, 1]$ to specify the allowed fairness degradation. In particular, an allocation $\mathbf{d}$ is called α-portion fair if $d_i \ge \alpha d$ for every backlogged flow i. In other words, each flow receives at least an α-portion

3 In general, the formulation of the tradeoff is not unique. However, we found that our formulation is particularly attractive: it leads to both tractable analysis and practical implementation, as we shall show later in this chapter.


of its fair dominant share d. A fluid schedule is called α-portion fair if it achieves the α-portion fair

allocation at all times.

By choosing different values for α, a network operator can precisely control the fairness degradation.

As two extreme cases, setting α = 0 means that fairness is not considered at all; setting α = 1 means

that perfect fairness must be enforced at all times.

Given the specified fairness knob α, the fluid schedule tries to minimize makespan under the corre-

sponding α-portion fairness constraints. Since minimizing makespan is, in a practical sense, equivalent

to maximizing the system’s dominant throughput, we obtain a simple tradeoff heuristic that maximizes

the dominant throughput, subject to the required α-portion fairness, at every time t:
$$\begin{aligned}
\max_{d_i}\;& \sum_{i \in B} d_i\\
\text{s.t.}\;& \sum_{i \in B} \bar{\tau}_{i,r}\, d_i \le 1, \quad r = 1, 2,\\
& d_i \ge \alpha d, \quad \forall i \in B,
\end{aligned} \tag{4.19}$$

where the fair share d is given by (4.16). We see that the fluid schedule captures both DRGPS and the

greedy schedule defined by (4.17) as special cases with α = 1 and 0, respectively.

Special Solution Structure. The tradeoff problem (4.19) has a closed-form solution, based on

which the tradeoff schedule can be easily computed. We first allocate each flow its guaranteed portion

of dominant share αd. We then denote

$$\hat{d}_i = d_i - \alpha d \tag{4.20}$$
as the bonus dominant share allocated to flow i. Substituting (4.20) into (4.19), we equivalently rewrite

(4.19) as a problem of determining the bonus dominant share received by each flow:
$$\begin{aligned}
\max_{\hat{d}_i \ge 0}\;& \sum_{i \in B} \hat{d}_i + |B|\,\alpha d\\
\text{s.t.}\;& \sum_{i \in B} \bar{\tau}_{i,r}\, \hat{d}_i \le \mu_r, \quad r = 1, 2,
\end{aligned} \tag{4.21}$$
where
$$\mu_r = 1 - \alpha d \sum_{i \in B} \bar{\tau}_{i,r}, \quad r = 1, 2, \tag{4.22}$$
is the remaining share of resource r after each flow receives its guaranteed dominant share $\alpha d$.

Without loss of generality, we sort all the backlogged flows based on the processing demands on the two types of resources required by their head-of-line packets as follows:
$$\bar{\tau}_{1,1}/\bar{\tau}_{1,2} \ge \cdots \ge \bar{\tau}_{n,1}/\bar{\tau}_{n,2}. \tag{4.23}$$

The following theorem shows that at most two flows are awarded the bonus share at a time. Its proof

is given in Sec. 4.9.4.

Theorem 13. There exists an optimal solution $\hat{\mathbf{d}}^*$ to (4.21) where $\hat{d}_i^* = 0$ for all $2 \le i \le n-1$. In particular, $\hat{\mathbf{d}}^*$ is given in the following three cases:

Case 1: $\mu_1/\mu_2 < \bar{\tau}_{n,1}/\bar{\tau}_{n,2}$. In this case, resource 1 is fully utilized, with $\hat{d}_n^* = \mu_1/\bar{\tau}_{n,1}$ and $\hat{d}_i^* = 0$ for all $i < n$.

Case 2: $\mu_1/\mu_2 > \bar{\tau}_{1,1}/\bar{\tau}_{1,2}$. In this case, resource 2 is fully utilized, with $\hat{d}_1^* = \mu_2/\bar{\tau}_{1,2}$ and $\hat{d}_i^* = 0$ for all $i > 1$.

Case 3: $\bar{\tau}_{n,1}/\bar{\tau}_{n,2} \le \mu_1/\mu_2 \le \bar{\tau}_{1,1}/\bar{\tau}_{1,2}$. In this case, both resources are fully utilized, and we have
$$\hat{d}_i^* = \begin{cases}
(\mu_1\bar{\tau}_{n,2} - \mu_2\bar{\tau}_{n,1})/(\bar{\tau}_{1,1}\bar{\tau}_{n,2} - \bar{\tau}_{1,2}\bar{\tau}_{n,1}), & i = 1;\\[2pt]
(\mu_2\bar{\tau}_{1,1} - \mu_1\bar{\tau}_{1,2})/(\bar{\tau}_{1,1}\bar{\tau}_{n,2} - \bar{\tau}_{1,2}\bar{\tau}_{n,1}), & i = n;\\[2pt]
0, & \text{otherwise}.
\end{cases}$$

Once the optimal bonus dominant shares have been determined as shown above, the optimal solution $\mathbf{d}^*$ to (4.19), which gives the dominant share allocated to each flow, can be easily computed as the sum of the bonus share and the guaranteed share:
$$d_i^* = \hat{d}_i^* + \alpha d, \quad \text{for all } i. \tag{4.24}$$

We give an intuitive explanation of Theorem 13 as follows. The first two cases of Theorem 13

correspond to the scenario where after each flow receives its guaranteed share, the remaining amounts of

the two types of resources are unbalanced and cannot be fully utilized simultaneously. In this case, the

schedule awards the bonus share to the flow (either Flow 1 or Flow n) whose processing demands can

better utilize the remaining resources. The third case covers the scenario where the remaining amounts

of the two types of resources are balanced, and can be fully utilized when the system is non-empty. In

this case, they are allocated to two flows (Flow 1 and Flow n) with complementary resource demands

as their bonus shares.

Theorem 13 reveals an important structural property: at most two flows are allocated more dominant share than the others. We refer to these flows as the favored flows and all the others as the regular flows.


We shall show in Sec. 4.5 that this structure leads to an efficient O(log n) implementation of the fluid

schedule.

4.4 Packet-by-Packet Tracking

So far, all our discussions are based on an idealized fluid model. In practice, however, packets are

processed as separate entities. In this section, we present a discrete tracking algorithm that implements

the fluid schedule as a packet-by-packet schedule in practice. We show that the discrete schedule is

asymptotically close to the fluid schedule, in terms of both fairness and efficiency. We start with a

comparison between two typical tracking approaches.

4.4.1 Start-Time Tracking vs. Finish-Time Tracking

Two common tracking algorithms may be used to implement a fluid schedule in practice, start-time

tracking and finish-time tracking. The former tracks the order of packet start times—among all packets

that have already started in the fluid schedule, the one that starts the earliest is scheduled first. Finish-

time tracking, on the other hand, assigns the highest scheduling priority to the packet that completes

service the earliest in the fluid schedule. In traditional single-resource fair queueing, FQS [79] uses the

former approach to track GPS, while WFQ [21,22,26] adopts the latter approach.

While both algorithms closely track the fluid schedule of fair queueing, only start-time tracking is

well defined for the tradeoff schedule given by (4.19). This is due to the fact that, in the tradeoff

schedule, future traffic arrivals may lead to a different allocation of packet processing rates and may

subsequently change the packet finish times of current packets. As a result, determining the order of

finish times requires future traffic arrival information and hence is unrealistic.4 Start-time tracking

avoids this problem as packets are scheduled only after they start in the fluid schedule.

For this reason, we use start-time tracking to implement the fluid schedule. We say a discrete

schedule and a fluid schedule correspond to each other if the former tracks the latter by the packet start

time. Specifically, we maintain the fluid schedule in the background. Whenever there is a scheduling

opportunity, among all head-of-line packets that have already started in the fluid schedule, the one that

starts the earliest is chosen. Below we show that this discrete schedule is asymptotically close to its

corresponding fluid schedule.

4 This is not a problem of single-resource fair queueing as different flows are allocated the same processing rate, so that future traffic arrivals will not affect the order of finish times of current packets.


4.4.2 Performance Analysis

To analyze the performance of start-time tracking, we introduce the following notations. Let τmax be

the maximum packet processing time required by any packet on any resource. Let n be the maximum

number of flows that are concurrently backlogged. Let TF be the makespan of the fluid schedule, and

TD the makespan of its corresponding discrete schedule. All proofs are given in Secs. 4.9.5, 4.9.6, and

4.9.7.

The following theorem bounds the difference between the makespan of the fluid schedule and its

corresponding discrete schedule.

Theorem 14. For the fluid schedule with α > 0 and its corresponding discrete schedule, we have

T^D \le T^F + n\tau_{\max}. \qquad (4.25)

The error bound nτmax can be intuitively explained as the total packet processing time required by

all n concurrent flows, each sending only one packet. In practice, the number of packets a flow sends

is usually significantly larger than one. As a result, the traffic makespan is significantly larger than the

error bound, i.e., $T^F \gg n\tau_{\max}$. Theorem 14 essentially indicates that in terms of makespan, the two

schedules are asymptotically close to each other.

We next analyze the fairness performance of the discrete schedule by comparing the dominant services

a flow receives under both schedules. In particular, let $D_i^F(0,t)$ be the dominant services flow $i$ receives in $(0,t)$ under the fluid schedule, and $D_i^D(0,t)$ the dominant services flow $i$ receives in $(0,t)$ under the

corresponding discrete schedule. The following theorem shows that flows receive approximately the same

dominant services under both schedules.

Theorem 15. For the fluid schedule with α > 0 and its corresponding discrete schedule, the following

inequality holds for any flow i and any time t:

D_i^F(0,t) - 2(n-1)\tau_{\max} \le D_i^D(0,t) \le D_i^F(0,t) + \tau_{\max}. \qquad (4.26)

In other words, the difference between the dominant services a flow receives under the two corre-

sponding schedules is bounded by a constant amount, irrespective of the time t. Over the long run, the

discrete schedule achieves the same α-portion fairness as its corresponding fluid schedule. To summarize,

start-time tracking retains both the efficiency and fairness properties of its corresponding fluid schedule

in the asymptotic regime.


4.5 An O(log n) Implementation

To implement the aforementioned start-time tracking algorithm, two modules are required: packet pro-

filing and fluid scheduling. The former estimates the packet processing time on both CPU and link

bandwidth; the latter maintains the fluid schedule as a reference system based on the packet profiling

results. We show in this section that packet profiling can be quickly accomplished in O(1) time using

a simple approach proposed in [4]. The main challenge comes from the complexity of maintaining the

fluid schedule, where direct implementation requires O(n) time. Here, n is the number of backlogged

flows. We give an O(log n) implementation based on an approach similar to virtual time. We shall show

in Sec. 4.6 that the implementation can be easily prototyped in the Click modular router [80].

4.5.1 Packet Profiling

As pointed out by Ghodsi et al. [4], any multi-resource fair queueing algorithm, including our fluid

schedule, requires knowledge of the packet processing time on each resource. Fortunately, as shown

in [4], CPU processing time can be accurately estimated as a linear function of packet size. Specifically,

for a packet of size l, the CPU processing time is estimated as al + b, where a and b are the coefficients

depending on the type of packet processing (e.g., IPsec). We have validated this linear model through

an upfront experiment using Click [80]. For each type of packet processing, we measure the exact CPU

processing time required by packets of different sizes. This allows us to determine the coefficients a and

b. We fit such a linear model to the scheduler and use it to estimate the CPU processing time required

by a packet. As for the packet transmission time, the estimation is simply the packet size divided by

the outgoing bandwidth, which is known a priori.
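As an illustration, the profiling step can be sketched in a few lines of C++; the function and parameter names are ours, and the per-module coefficients (a, b) would come from the upfront Click measurements described above.

    // Estimated processing times (in microseconds) of a packet on each resource.
    struct Profile { double cpu_us, tx_us; };

    // Linear CPU model from [4]: cpu = a * size + b, with (a, b) measured
    // offline for each type of packet processing (e.g., IPsec). Transmission
    // time is the packet size divided by the outgoing link rate, known a priori.
    Profile EstimateProfile(int packet_bytes, double a, double b,
                            double link_bytes_per_us) {
      Profile p;
      p.cpu_us = a * packet_bytes + b;              // O(1) per packet
      p.tx_us  = packet_bytes / link_bytes_per_us;  // deterministic link time
      return p;
    }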

4.5.2 Direct Implementation of Fluid Scheduling

Based on the packet profiling results, the fluid schedule is constructed and is maintained by the fluid

scheduler. In particular, we need to determine the next packet that starts in the fluid schedule. This

requires tracking the work progress of all n flows. Below we give a direct implementation that will be

used later in our virtual time implementation.

For each flow i, we record d∗i , which is the dominant share the flow receives in the fluid schedule at

the current time and is computed by (4.24). We also record Ri, the remaining dominant processing time

required by the head-of-line packet of the flow at the current time.

For flow i, its head-of-line packet will finish in Ri/d∗i time if no event occurs in the meantime. An event is either a packet departure or a packet becoming the new head-of-line in the fluid schedule. Either of them may change


the head-of-line packet of a flow, leading to different coefficients of the tradeoff problem (4.19). With d∗i

and Ri, we can accurately track the work progress of flow i on an event-driven basis. Specifically, upon

the occurrence of an event, let ∆t be the time elapsed since the last update. If ∆t < Ri/d∗i , meaning

that the event occurs before the head-of-line packet finishes, we update Ri ← Ri− d∗i∆t. If ∆t = Ri/d∗i ,

meaning that the event occurs at the time when the head-of-line packet finishes, we check if flow i has

a next packet p to process. If it does, then packet p becomes the new head-of-line and should start in

the fluid schedule. We update Ri as the dominant processing time required by p. Otherwise, we reset

Ri ← 0, and flow i leaves the fluid system. We also recompute d∗i after Ri is updated. (Note that it is

impossible to have ∆t > Ri/d∗i .)
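In code form, a compact sketch of this per-flow bookkeeping might read as follows (names ours; the caller recomputes d∗i via Theorem 13 after each update):

    // Per-flow state in the fluid schedule: the dominant share d_star
    // currently allocated by (4.24), and the remaining dominant processing
    // time R of the head-of-line packet.
    struct FlowState {
      double d_star;    // dominant share currently allocated
      double R;         // remaining dominant work of the head-of-line packet
      bool backlogged;
    };

    // Advance a flow by dt seconds, the time elapsed since the last event.
    // next_hol_work is the dominant processing time of the next queued
    // packet, or a negative value if the flow queue is empty. In the fluid
    // model, events land exactly on or before departures, never past them.
    void AdvanceOnEvent(FlowState& f, double dt, double next_hol_work) {
      f.R -= f.d_star * dt;          // drain work at rate d_star
      if (f.R == 0.0) {              // head-of-line packet just finished
        if (next_hol_work >= 0.0) {
          f.R = next_hol_work;       // next packet starts in the fluid schedule
        } else {
          f.backlogged = false;      // flow leaves the fluid system
        }
      }
    }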

However, purely relying on the approach above to track the work progress of all n flows is highly

inefficient. Whenever an event occurs, each flow must be updated individually, which requires at least

O(n) time per event and is too expensive. We next introduce a more efficient implementation that

requires the above procedure for at most two flows.

4.5.3 Virtual Time Implementation of Fluid Scheduling

To avoid the high complexity required by the direct implementation above, we have noted, by The-

orem 13, that at most two flows are favored and are allocated more dominant shares than others.

Therefore, it suffices to maintain at most three dominant shares at a time—two for the favored flows

and one for the other regular flows. For regular flows, we track their work progress using an approach

similar to the virtual time implementation of GPS [21, 22]. Our intuition is that, by Theorem 13, all

the regular flows are allocated the same dominant share, and their scheduling resembles fair queueing.

For favored flows, since there are at most two of them, we track their work progress directly, using the

direct implementation above. Our approach is detailed below.

Identifying Favored and Regular Flows

We first discuss how favored and regular flows can be quickly identified upon the occurrence of an

event. By Theorem 13, it suffices to sort flows in order (4.23) and examine the three cases. Flows

that receive the bonus share (i.e., d∗i > 0) are favored. Note that the entire computation requires only

information regarding the head-of-line packet of the first and the last flow in order (4.23) (τ1,r and

τn,r, the normalized dominant processing time). We store all the head-of-line packets in a double-ended

priority queue maintained by a min-max heap for fast retrieval, where the packet order is defined by

(4.23). This allows us to apply Theorem 13 and identify the favored and regular flows in O(log n) time.
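As a stand-in for the min-max heap, the double-ended priority queue can be sketched with std::multiset (our simplification; any structure with O(log n) insert, erase, and min/max retrieval works equally well):

    #include <set>

    struct Demand { double tau1, tau2; };  // normalized head-of-line demands

    // Order head-of-line packets by the ratio tau1/tau2 of (4.23); comparing
    // cross products avoids the division.
    struct ByRatio {
      bool operator()(const Demand& a, const Demand& b) const {
        return a.tau1 * b.tau2 > b.tau1 * a.tau2;  // non-increasing ratio
      }
    };

    // *hol.begin() is the flow with the largest ratio (flow 1 in (4.23)),
    // and *hol.rbegin() the smallest (flow n); both are found in constant
    // time, and insertions and deletions take O(log n). This is all that
    // the three cases of Theorem 13 require.
    using HolQueue = std::multiset<Demand, ByRatio>;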


Tracking Favored Flows

For favored flows, because there are at most two of them, we track their work progress using the direct

implementation mentioned in Sec. 4.5.2, where we record d∗i and Ri for each favored flow i. It is easy to

see that the update complexity is dominated by the computation of d∗i . As mentioned in the previous

discussion, this can be done in O(log n) time by Theorem 13. Also, since there are at most two favored

flows, the overall tracking complexity remains O(log n) per event.

Tracking Regular Flows

For regular flows, since they receive the same dominant share, their scheduling resembles fair queueing.

We hence track their work progress using virtual time [21, 22, 26]. Specifically, we define virtual time

V (t) as a function of real time t evolving as follows:

V(0) = 0, \qquad V'(t) = \alpha d^t, \quad t > 0. \qquad (4.27)

Here, $d^t$ is the fair dominant share computed by (4.16) at time $t$, and is fixed between two consecutive events; $\alpha d^t$ is the dominant share each regular flow receives.5 Thus, $V$ can be interpreted as increasing

at the marginal rate at which regular flows receive dominant services. Each regular flow i also maintains

virtual finish time Fi, indicating the virtual time at which its head-of-line packet finishes in the fluid

schedule. The virtual finish time Fi is updated as follows when flow i has a new head-of-line packet p

at time t:

F_i = V(t) + \tau^*(p), \qquad (4.28)

where τ∗(p) is the dominant packet processing time required by p. Among all the regular flows, the

one with the smallest Fi has its head-of-line packet finishing first in the fluid schedule. Unless some

event occurs in between, at time $t$, the next packet departure for the regular flows would be in $t_N = (\min_i F_i - V(t))/\alpha d$ time.

Using virtual time defined by (4.27), we can accurately track the work progress of regular flows on an event-driven basis. Specifically, upon the occurrence of an event at time t, let t0 be the time of the

last update, and ∆t = t− t0 the time elapsed since the last update. If ∆t < tN , meaning that the event

occurs before the next packet departure of regular flows, we simply update the virtual time following

5We restore the superscript t here to emphasize that the fair dominant share computed by (4.16) may change over time.


(4.27):

V(t) = V(t_0) + \alpha d \, \Delta t. \qquad (4.29)

If ∆t = tN , then the event occurs at the time when a packet of a regular flow, say flow i, finishes in the

fluid schedule. In addition to updating the virtual time, we check to see if flow i has a next packet p to

process. If it does, meaning that the packet p should start in the fluid schedule, we update its virtual

finish time Fi following (4.28). Otherwise, flow i departs the system. We also recompute d by (4.16).

The tracking complexity is dominated by the computation of the minimum virtual finish time, i.e., $\min_i F_i$. By storing the $F_i$'s in a priority queue maintained by a heap, we see that the tracking complexity

is O(log n) per event.
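A sketch of this virtual-time bookkeeping follows (naming ours; alpha_d is the per-regular-flow rate αd^t of (4.27), and a production version would pair the heap with the per-flow index described below to support arbitrary deletion):

    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Virtual clock for regular flows, following (4.27)-(4.29).
    class VirtualClock {
     public:
      // Advance virtual time by dt seconds of real time, per (4.29).
      void Advance(double dt, double alpha_d) { v_ += alpha_d * dt; }

      // Virtual finish time of a new head-of-line packet with dominant
      // processing time tau_star, per (4.28).
      double VirtualFinish(double tau_star) const { return v_ + tau_star; }

      double now() const { return v_; }

     private:
      double v_ = 0.0;  // V(t), with V(0) = 0
    };

    // Min-heap of (virtual finish time, flow id): the top entry is the
    // next regular-flow departure, retrieved in O(1) and updated in
    // O(log n) per event.
    using FinishHeap = std::priority_queue<
        std::pair<double, int>, std::vector<std::pair<double, int>>,
        std::greater<std::pair<double, int>>>;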

Handling Identity Switching

We note that the identity of a flow is not fixed: upon the occurrence of an event, a favored flow may

switch to a regular flow, and vice versa. We show that such identity switching can also be easily handled

in O(log n) time.

We first consider a favored flow i switching to a regular one at time t, which requires the computation

of the virtual finish time Fi. Recall that we have recorded Ri, the remaining dominant processing time

required by the head-of-line packet, for flow i as it is previously favored. By definition, the virtual finish

time Fi can be simply computed as

F_i = V(t) + R_i. \qquad (4.30)

Adding Fi to the heap takes at most O(log n) time.

We next consider a regular flow i switching to a favored one at time t, which requires the computation

of Ri. Recall that we have recorded the virtual finish time Fi for flow i. By definition, the remaining

dominant processing time required by its head-of-line packet is simply

R_i = F_i - V(t), \qquad (4.31)

which is a dual of (4.30).

We also need to remove the virtual finish time, Fi, from the heap. To do so, we maintain an index

for each regular flow, recording the location of its virtual finish time stored in the heap. Following this

index, we can easily locate the position of Fi and delete it from the heap, followed by some standard

“trickle-down” operations to preserve the heap property in O(log n) time.
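The two conversions are one-liners; a sketch (v_t stands for the current virtual time V(t)):

    // Favored -> regular at time t: derive the virtual finish time from
    // the recorded remaining dominant work R_i, per (4.30).
    double ToRegular(double v_t, double R_i) { return v_t + R_i; }

    // Regular -> favored at time t: recover the remaining dominant work
    // from the recorded virtual finish time F_i, per (4.31).
    double ToFavored(double v_t, double F_i) { return F_i - v_t; }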

To summarize, our approach maintains the fluid schedule by identifying favored and regular flows,


tracking their work progress, and handling the potential identity switching. We show that any of these

operations can be accomplished in O(log n) time. As a result, maintaining the fluid schedule takes

O(log n) time per event.

4.5.4 Start-Time Tracking and Complexity

With the fluid schedule maintained as a reference system, the implementation of start-time tracking is

straightforward. Whenever a packet starts in the fluid schedule, it is added to a FIFO queue. Upon a

scheduling opportunity, the scheduler polls the queue and retrieves a packet to schedule. This ensures

that packets are scheduled in order of their start times in the fluid schedule. To minimize the update

frequency, the scheduler lazily updates the fluid schedule only when the FIFO queue is empty.
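For concreteness, the polling logic might look as follows (a sketch under our naming; the actual prototype wires this into Click's packet pull path):

    #include <deque>

    struct Packet;  // opaque packet handle

    // Implemented by the fluid scheduler of Sec. 4.5.3: advances the fluid
    // schedule until at least one more packet starts (if any flow is
    // backlogged), appending started packets to *out in start-time order.
    void AdvanceFluidScheduleUntilStart(std::deque<Packet*>* out);

    std::deque<Packet*> started;  // packets already started in the fluid schedule

    // Called at each scheduling opportunity: serve packets in the order of
    // their fluid start times, lazily updating the fluid schedule only
    // when the FIFO runs dry.
    Packet* NextPacket() {
      if (started.empty()) AdvanceFluidScheduleUntilStart(&started);
      if (started.empty()) return nullptr;  // no backlogged traffic
      Packet* p = started.front();
      started.pop_front();
      return p;
    }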

We now analyze the scheduling complexity of the aforementioned implementation. The scheduling

decisions are made by updating the fluid schedule on an event-driven basis. For each event, the update

takes O(log n) time, where n is the number of backlogged flows. Note that there are only two types

of events in the fluid schedule, new head-of-line and packet departure. Because a packet served in the

fluid schedule triggers exactly these two events over the entire scheduling period, scheduling N packets

triggers 2N updates in the fluid schedule, with the overall complexity O(2N log n). On average, the

scheduling decision is made in O(2 log n) time per packet, the same order as that of DRFQ [4].

4.6 Evaluation

We evaluate the tradeoff algorithm via both our prototype implementation and trace-driven simulation.

We use a prototype implementation to investigate the detailed functioning of the algorithm, in a microscopic view. We then take a macroscopic view to evaluate the algorithm using trace-driven simulation,

where flows dynamically join and depart the system.

4.6.1 Experimental Results

We have prototyped our tradeoff algorithm as a new scheduler in the Click modular router [80], based

on the O(log n) implementation given in the previous section. The scheduler classifies packets to flows

(based on the IP prefix and port number) and identifies the types of packet processing based on the port

number specified by a flow class table. The scheduler also exposes an interface that allows the operator

to dynamically configure the fairness knob α. Our implementation consists of roughly 1,000 lines of

C++ code.


Table 4.2: Schedule makespan observed in Click at different fairness levels. The queue capacity is infinite.

Fairness knob α | Makespan (s) | Normalized Makespan
1.00            | 55.68        | 100.00%
0.95            | 52.50        | 94.28%
0.90            | 48.97        | 87.95%
0.85            | 47.17        | 84.72%
0.70            | 47.13        | 84.64%
0.60            | 47.07        | 84.54%
0.50            | 47.07        | 84.54%

We run our Click implementation in user mode on a Dell PowerEdge server with an Intel Xeon

3.0 GHz processor and 1 Gbps Ethernet interface. To make fairness relevant, we throttle the outgoing

bandwidth to 200 Mbps while keeping the inbound bandwidth as is. We also throttle the Click module

to use only 20% CPU so that CPU could also be a bottleneck.6 We configure three packet processing

modules in Click to emulate a multi-functioning middlebox: packet checking, statistical monitoring, and

IPsec. The former two modules are bandwidth-bound, though statistical monitoring requires more CPU

processing time than packet checking does. The IPsec module encrypts packets using AES (128-bit key

length) and is CPU-bound. We configure another server as a traffic source, initiating 60 UDP flows each

sending 2000 800-byte packets per second to the Click router. The first 20 flows pass through the packet

checking module; the next 20 flows pass through the statistical monitoring module; and the last 20 flows

pass through the IPsec module.

Fairness-Efficiency Tradeoff

We first evaluate the achieved tradeoff between schedule fairness and makespan. To fairly compare the

makespan at different fairness levels, it is critical to ensure the same traffic input when running the

algorithm with different values of fairness knob α. Therefore, we initially consider an idealized scenario

where each flow queue has infinite capacity and never drops packets. Table 4.2 lists the observed

makespans with various fairness requirements, in an experiment where each flow keeps sending packets

for 10 seconds. We see that, as expected, trading off some level of fairness leads to a shorter makespan

and higher efficiency. Furthermore, the marginal improvement of efficiency is decreasing. This suggests

that one does not need to compromise too much fairness in order to achieve high efficiency. In our

experiment, trading off 15% of fairness shortens the makespan by 15.3% from the strictly fair schedule

(α = 1), which is equivalent to an 18.1% bandwidth throughput enhancement and is near-optimal as seen

in Table 4.2. Fig. 4.3 gives a detailed look into the achieved resource utilization over time, at four fairness

6 We configured Click to run in the user mode, which does not support high-speed networking. We therefore throttle the CPU cycles so as to match the processing speed with the transmission speed.


Figure 4.3: Overall CPU and bandwidth utilization (%) over time (s) observed in Click, at α = 0.85, 0.90, 0.95, and 1.00. No packet drops.

Figure 4.4: Dominant share (%) each flow receives per second in Click, against the fairness knob α (%). No packet drops. The strict fair share is 2%.

levels. We see that strictly fair queueing (α = 1) wastes 30% of CPU cycles, leaving the bandwidth as

the bottleneck at the beginning. This situation remains until bandwidth-bound flows finish, at which

time the bottleneck shifts to CPU. By relaxing fairness, CPU-bound flows receive more services, leading

to a steady increase of CPU utilization up to 100%. Meanwhile, bandwidth-bound flows experience

slightly longer completion times due to the fairness tradeoff.

We now verify the fairness guarantee. We run the scheduler at various fairness levels. At each level,

for each flow, we measure its received dominant share every second for the first 20 seconds, during which

all flows are backlogged. Fig. 4.4 shows the results, where each cross (“x”) corresponds to the dominant

share of a flow measured in one second. As expected, under strict fairness (α = 1), all flows receive the

same dominant share (around 2%). As α decreases, the fairness requirement relaxes. Some flows are

hence favored and are allocated more dominant share, while others receive less. However, the minimum


Figure 4.5: Average resource utilization and dominant share each flow receives in Click at different fairness levels: (a) resource utilization; (b) dominant share; (c) per-packet latency. The queue capacity is 200 packets. The measurement of resource utilization and dominant share is conducted every second over the entire schedule.

dominant share a flow receives is lower bounded by the α-portion of the fair share, shown as the solid

line in Fig. 4.4. This shows that the algorithm is correctly operating at the desired fairness level.

We next extend the experiment to a more practical setup, where each flow queue has a limited

capacity and drops packets when it is full. We set the queue size to 200 packets for each flow and repeat

the previous experiments. In this case, comparing makespan is inappropriate as the scheduler may drop

different packets when running at different fairness levels. We instead measure the resource utilization

achieved every second over the entire scheduling period. Fig. 4.5a illustrates the average utilization of

both CPU and bandwidth, where the error bar shows one standard deviation. Similar to the previous

experiments, a fairness degradation of 15% is sufficient to achieve the optimal efficiency, enhancing the

CPU utilization from 71% to 100%. Further trading off fairness is not well justified. As shown in

Fig. 4.5b, the increased CPU throughput is mainly used to process those CPU-bound flows (Flows 41 to

60), doubling their dominant shares. Meanwhile, the dominant share received by all the other flows is

at least 85% of the fair share, as promised by the algorithm. We also depict the per-packet latency CDF

in Fig. 4.5c. We see that trading off fairness for efficiency significantly improves the tail latency, usually

caused by flows that finish the last. On the other hand, flows whose shares have been traded off see

slightly longer delays of their packets. Fortunately, these latency penalties are strictly bounded—thanks

to the fairness guarantee—and are compensated by the significant latency improvement of favored flows.

We have also measured the scheduling overhead in the experiments. In particular, we configure the

tradeoff scheduler for strict fair queueing by setting α = 1. We then compare the incurred CPU overhead

with that of MR3, a low-complexity fair scheduler introduced in Sec. 3.4.3. Our measurement shows

that the tradeoff scheduler introduces 1% CPU overhead compared with MR3.


Figure 4.6: Mean per-packet latency (ms) of (a) elephant flows (sending 20,000 pkts/s) and (b) mice flows (sending 2 pkts/s) in Click, at α = 0.8, 0.9, and 1.0. The error bar shows the standard deviation.

Service Isolation

We next examine the impact of fairness tradeoff on service isolation. We initiate 6 UDP flows sending

800-byte packets. Flows 1 to 3 are elephant flows, each sending 20,000 packets per second, and undergo

the checking, monitoring, and IPsec modules, respectively. Flows 4 to 6 are mice flows, each sending 2

packets per second, and undergo the checking, monitoring, and IPsec modules, respectively. The queue

capacity is set to 200 packets. Fig. 4.6 shows the per-packet latency of each flow at different fairness

levels. We see that the tradeoff mainly affects those high-rate flows. For mice flows, even if they may

receive less resource share when α < 1, the guaranteed share is sufficient to accommodate their low-rate

traffic. As a result, their packets are scheduled almost immediately upon arrival, with two orders of

magnitude lower latency than the elephant flows.

We also compare our tradeoff scheduler against other fair queueing algorithms. In particular, we have

implemented MR3 and GMR3 as two other round-robin O(1) schedulers in Click, and conducted the

same experiments mentioned above. We find that they achieve almost the same makespan and resource

utilization as that of the tradeoff scheduler running with the strict fairness requirement (α = 1). This

should come as no surprise, as all the existing multi-resource fair queueing algorithms are essentially

different approximations to the fluid schedule with perfect DRF. These results are omitted to avoid

redundancy.

4.6.2 Trace-Driven Simulation

Next, we use trace-driven simulation to further evaluate the proposed algorithm from a macroscopic

perspective. We have written a packet-level simulator consisting of 3,000 lines of C++ code and fed it

with real-world traces [92] captured in a university switch. The traces are dominated by UDP packets.


Figure 4.7: Resource utilization achieved at different fairness levels in the simulation, averaged over 10 runs, for CPU-bound, balanced, and bandwidth-bound traffic: (a) CPU utilization; (b) bandwidth utilization.

Figure 4.8: The improvement of per-packet latency and packet drop rate due to the fairness tradeoff (α = 0.8 vs. α = 1.0 vs. MR3): (a) CPU-bound; (b) balanced; (c) bandwidth-bound; (d) packet drop rate.

Based on the IP prefix and port number, we classify packets in the traces into nearly 3,000 flows and

synthesize the input traffic by randomly assigning each flow to one of three middlebox modules: basic

forwarding, statistical monitoring, and IPsec. The CPU processing time of each module follows a linear

model based on the measurement results of [4]. The flow queue size is set to 200 packets, and the

outgoing bandwidth is set to 200 Mbps. We linearly scale up the traffic by 5× to simulate a heavy

load. Depending on the total resource consumption, the synthesized traffic is classified into the following

three patterns: CPU-bound traffic where the CPU processing time exceeds 1.2× the transmission time,

bandwidth-bound traffic where the transmission time exceeds 1.2× the CPU time, and balanced traffic

otherwise.
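In code form, this classification rule is simply (a sketch; names ours):

    enum class Pattern { kCpuBound, kBwBound, kBalanced };

    // Classify synthesized traffic by total resource consumption, using
    // the 1.2x threshold described above.
    Pattern Classify(double cpu_time, double tx_time) {
      if (cpu_time > 1.2 * tx_time) return Pattern::kCpuBound;
      if (tx_time > 1.2 * cpu_time) return Pattern::kBwBound;
      return Pattern::kBalanced;
    }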

Fig. 4.7 shows the mean utilization achieved at various fairness levels, where each data point is

averaged over 10 runs under the corresponding traffic pattern. The error bar shows one standard

deviation. We observe similar trends in all three patterns, that trading off fairness leads to higher

utilization on both resources. Similar to our Click implementation results, we see that the marginal

improvement in utilization is decreasing, with the optimal utilization achieved by trading off more than

15% of fairness. Among all three patterns, traffic with balanced resource consumption has the least

incentive to trade off fairness, as flows have complementary resource demands and can dovetail one


Figure 4.9: Mean per-packet latency (ms) against flow size (B) in the first 20 s of simulation, feeding CPU-bound traffic, with and without the fairness tradeoff: (a) α = 1; (b) α = 0.8.

another [4]. In this case, fair queueing is sufficient to realize high efficiency. We have also simulated the

other two multi-resource fair queueing algorithms, DRFQ [4] and MR3, and observed almost the same

performance as that of the strictly fair queueing (α = 1). We omit these results to avoid redundancy.

We examine in Fig. 4.8 the tradeoff impact on other measures relevant to efficiency. Specifically,

we depict the per-packet latency CDF of three scheduling algorithms – tradeoff with α = 0.8, complete

fairness (α = 1.0), and MR3 – in Figs. 4.8a, 4.8b, and 4.8c, for each of the three traffic patterns. In

general, the enhanced resource utilization due to the fairness tradeoff translates into shorter latencies

in all three traffic patterns. The improvements are mainly attributed to the shortened packet latency

of favored flows. Furthermore, the packet drop rates are compared in Fig. 4.8d. We observe an average

of 15% to 20% decrease in the packet drop rate under all three traffic patterns, suggesting that higher

bandwidth throughput is achieved.

Finally, we investigate the tradeoff impact on service isolation in a dynamic environment. Ideally, we

would like to see that compromising a small percentage of fairness will not affect the per-packet latency

of mice flows as their guaranteed resource share, even when traded off, is sufficient to support their low

packet rate. Fig. 4.9 confirms this isolation property with CPU-bound traffic, where we depict the mean

packet latency for each flow in the first 20 seconds of the simulation, running with complete fairness and

80% of fairness, respectively. We see that compared to strictly fair queueing, trading off fairness has

no impact on the latency of small flows, but it affects those medium and large ones: Some see shorter

latency while others experience longer delay, depending on whether they are favored flows or not. We have

similar observations in the other two traffic patterns.


4.7 Related Work

Ghodsi et al. [4] identified the need for multi-resource fair queueing for deep packet inspection in middle-

boxes. They compared a set of queueing alternatives and proposed DRFQ, the first multi-resource fair

queueing algorithm, which implements DRF in the time domain. We have proposed two follow-up queue-

ing algorithms in Chapter 3 with lower scheduling complexity and bounded scheduling delay. All these

works focus solely on fairness, and use the goal of achieving work conservation as the only indication of

efficiency, similar to traditional single-resource fair queueing [21,22].

However, as we have shown in this chapter, unlike single-resource scheduling, there is a general

tradeoff between fairness and efficiency when packet processing requires multiple types of resources. We

have briefly mentioned this problem in our previous position paper [93], where we shared our visions on

several possible directions that may lead to a concrete solution. We have materialized some of our visions

in this chapter by formally characterizing the tradeoff problem and proposing an implementable queueing

algorithm to achieve a flexible balance between fairness and efficiency in the two-resource setting.

While the tradeoff problem has received little attention in the fair queueing literature, striking a

balance between allocation fairness and efficiency has been a focus of many recent works in both net-

working and operations research. Specifically, Danna et al. [94] have presented an efficient bandwidth

allocation algorithm to achieve a flexible tradeoff between fairness and throughput for traffic engineer-

ing. Joe-Wong et al. [13] have proposed a unifying framework with fairness and efficiency requirements

specified by two parameters for a given multi-resource allocation problem. Discussions on the tradeoff

between fairness and performance have also been given in the context of P2P networks [95, 96]. In the

literature of operations research, Bertsimas et al. [97] have derived a tight bound to characterize the

efficiency loss under proportional fairness and max-min fairness, respectively. They have later developed

a more general framework to characterize the fairness-efficiency tradeoff in a family of “α-fair” welfare

functions [98]. All these works focus on one-shot resource allocation in the space domain. In contrast,

our focus in this chapter is a packet scheduling problem where resources are shared in the time domain.

As explained in Sec. 4.2.4, our tradeoff problem captures the flow shop problem [83–86, 88, 89, 91,

99, 100] as a special case when efficiency is the only concern. Our approach borrows the idea of WFQ

[21, 22, 26], in relaxing a discrete packet flow to an idealized fluid and tracking the fluid schedule based

on virtual time. However, for single-resource fair queueing, the main challenge is to design a packet-

by-packet tracking algorithm, because the fluid schedule, GPS [21, 22, 26], is fairly straightforward and

easy to compute. Our problem is more complex, requiring more careful modeling of the fluid schedule,

packet-by-packet tracking with multiple resources, and tradeoff analysis.


4.8 Summary and Future Work

Middleboxes perform complex network functions whose packet processing requires the support of multiple

types of hardware resources. A multi-resource packet scheduling algorithm is therefore needed. Unlike

traditional single-resource fair queueing, where bandwidth is the only concern, there exists a general

tradeoff between fairness and efficiency in the presence of multiple resources. Ideally, we would like

to achieve flexible tradeoff to meet QoS requirements while maintaining the system at a high resource

utilization level. We show the difficulty of the general problem and limit our discussion to a common

scenario where CPU and link bandwidth are the two resources required for packet processing. We

propose an efficient scheduling algorithm by tracking an idealized fluid schedule. We show through both

our Click implementation and trace-driven simulation that our algorithm achieves a flexible tradeoff

between fairness and efficiency in various scenarios.

Despite the initial progress made by this dissertation, many challenges remain open. First, while

optimized, the current implementation requires O(log n) time per packet scheduling, which may be a

concern given a large number of flows. A simpler scheduler with lower complexity may be desired. Given

the extensive techniques developed for low-complexity schedulers in the fair queueing literature, it would

be interesting to see if and how these techniques extend to the multi-resource setting. Also, as shown by

Theorem 11, in general, the more types of resources a system has, the more salient the fairness-efficiency

tradeoff would be. For a system with more than two types of resources, we believe the intuition and

technique developed in this chapter may still be applied. We could define a similar fluid schedule that

maximizes the dominant throughput under the specified fairness constraint, and use start-time tracking

to implement it in practice.

4.9 Proofs

4.9.1 Proof of Lemma 11

Suppose there are n backlogged flows at time t, where flow i has head-of-line packet pi. Let d∗ =

(d∗1, . . . , d∗n) be the optimal solution to (4.17). It is equivalent to show that d∗ leads to tight constraints

of (4.17) if the stated condition is met. Let

A = \{ p_i \mid \tau_{i,1} = 1 > \tau_{i,2} \}


be the set of head-of-line packets whose dominant resource is resource 1, and let

B = \{ p_i \mid \tau_{i,1} < \tau_{i,2} = 1 \}

be the set of head-of-line packets whose dominant resource is resource 2. We know that $A, B \neq \emptyset$

according to the stated condition.

We first claim that for the optimal solution d∗, there exist pj ∈ A and pl ∈ B, such that d∗j > 0 and

d∗l > 0. To prove this claim, let us assume the opposite and see what happens. Without loss of generality,

suppose for all $d_i^* > 0$, we have either $p_i \in A$ or $p_i \notin A \cup B$ (i.e., $\tau_{i,1} = \tau_{i,2} = 1$). This implies $\sum_i d_i^* = 1$.

We show a contradiction by constructing a feasible allocation that leads to a dominant throughput higher

than 1. Consider two packets pj ∈ A and pl ∈ B. We construct the following allocation:

d_i = \begin{cases} \min\{1 - \tau_{l,1}/2, \; 1/(2\tau_{j,2})\}, & i = j, \\ 1/2, & i = l, \\ 0, & \text{otherwise}. \end{cases}

To see that the allocation is feasible, we substitute $d_i$ into the constraints of (4.17) and have

\sum_i \tau_{i,1} d_i = d_j + \tau_{l,1} d_l = \min\Big\{1 - \frac{\tau_{l,1}}{2}, \; \frac{1}{2\tau_{j,2}}\Big\} + \frac{\tau_{l,1}}{2} \le 1

for resource 1 and

\sum_i \tau_{i,2} d_i = \tau_{j,2} d_j + d_l \le 1

for resource 2. To see that the allocation leads to a dominant throughput higher than 1, we first have

d_j = \min\Big\{1 - \frac{\tau_{l,1}}{2}, \; \frac{1}{2\tau_{j,2}}\Big\} > \frac{1}{2}

by noting that τj,2, τl,1 < 1 (because pj ∈ A and pl ∈ B). We then have

\sum_i d_i = d_j + d_l > 1 = \sum_i d_i^*,

contradicting the fact that d∗ optimally solves (4.17).

We are now ready to prove the statement of the lemma. We assume the opposite that at least one

resource is not fully utilized under allocation d∗, hoping to show a contradiction. Since (4.17) is a linear

program, one constraint must be tight for the optimal solution. Without loss of generality, we assume


that only resource 1 is fully utilized under allocation d∗. That is, for resource 1 we have

\sum_i \tau_{i,1} d_i^* = 1,

but for resource 2 we have

\sum_i \tau_{i,2} d_i^* = 1 - \Delta < 1

for some ∆ ∈ (0, 1). We construct an allocation d with the same dominant throughput as that of d∗ that does not fully utilize either of the two resources. This suggests that a higher dominant throughput

can be achieved, leading to a contradiction.

By our previous claim, there exist pj ∈ A and pl ∈ B such that d∗j > 0 and d∗l > 0. We construct the

following allocation:

d_i = \begin{cases} d_j^* - \delta, & i = j, \\ d_l^* + \delta, & i = l, \\ d_i^*, & \text{otherwise}, \end{cases}

where $\delta = \min\{d_j^*, \; \Delta/(2(1 - \tau_{j,2}))\}$. Clearly, the constructed allocation d leads to the same dominant throughput as that of d∗, i.e., $\sum_i d_i = \sum_i d_i^*$, but does not fully utilize either of the two resources. In

particular, we have

\sum_i \tau_{i,1} d_i = \sum_i \tau_{i,1} d_i^* - (\tau_{j,1} - \tau_{l,1})\delta = 1 - (1 - \tau_{l,1})\delta < 1

for resource 1 and

\sum_i \tau_{i,2} d_i = \sum_i \tau_{i,2} d_i^* + (\tau_{l,2} - \tau_{j,2})\delta = 1 - \Delta + (1 - \tau_{j,2})\delta \le 1 - \Delta/2 < 1

for resource 2. This suggests that strictly higher dominant throughput can be achieved (by increasing

some di by some sufficiently small ε > 0). We hence see a contradiction. □


4.9.2 Proof of Lemma 12

It suffices to consider the following two cases.

Case 1: All flows have the same dominant resource. Without loss of generality, assume that resource

1 is the dominant resource of all flows. In this case, the constraint of resource 2 in (4.17) is ineffective,

and the constraint of resource 1 is tight under the optimal solution of (4.17). Resource 1 is hence fully

utilized until the end of the schedule. The makespan of the schedule is equal to the total amount of time

required to process all packets on resource 1, and is the minimum.

Case 2: There are two flows with different dominant resources. By Lemma 11, the fluid schedule fully utilizes both resources until some flows complete service and all the remaining backlogged flows have the same dominant resource, say, resource 1. From then on, the fluid schedule fully utilizes resource 1 until the end of the schedule. Therefore, resource 1 is fully utilized in the entire schedule. The makespan is equal to the total amount of time required to process all packets on resource 1, and is the minimum. □

4.9.3 Proof of Theorem 12

Let t0 be the first time when the system has two flows having different dominant resources. A flow is

called an early flow if it arrives before t0. All early flows have the same dominant resource. Without loss

of generality, let resource 1 be their dominant resource. Flows that arrive at or after t0 are called late

flows. Let Wr be the total amount of time required to process all packets on resource r. For an arbitrary

schedule σ, let Pσr (t) be the total amount of time that resource r is busy in (0, t). The makespan of the

schedule σ is lower bounded by the following equation:

T^\sigma \ge t_0 + \max_{r=1,2} \{ W_r - P_r^\sigma(t_0) \}, \qquad (4.32)

where Wr − Pσr (t0) is the remaining amount of packet processing time required on resource r.

Now for the fluid schedule ρ, since all the early flows have resource 1 as their dominant resource, any vector d∗ with $\sum_i d_i^* = 1$ optimally solves (4.17). By the fluid schedule model, this implies that resource 1 is fully utilized while the utilization rate of resource 2 is also maximized, at all times before t0. Therefore, we have

P_1^\rho(t_0) = t_0, \qquad P_2^\rho(t_0) = \max_{\sigma'} P_2^{\sigma'}(t_0). \qquad (4.33)

Let t1 be the last flow arrival time after t0. (If no flow arrives after t0, let t1 = t0.) One can always


find two backlogged flows with different dominant resources in (t0, t1). By Lemma 11, both resources

are fully utilized in (t0, t1), i.e.,

P_r^\rho(t_1) - P_r^\rho(t_0) = t_1 - t_0. \qquad (4.34)

After t1, since there are no flow arrivals, no newly arrived packet would become the head-of-line packet. Therefore, for the fluid schedule, packets are scheduled as if they were available at t1. By Lemma 12, the schedule fully utilizes one resource until the end of the makespan, i.e.,

T^\rho - t_1 = \max_{r=1,2} \{ W_r - P_r^\rho(t_1) \}. \qquad (4.35)

Plugging (4.33) and (4.34) into (4.35), we have

T^\rho = t_0 + \max_{r=1,2} \{ W_r - P_r^\rho(t_0) \} = t_0 + \max\{ W_1 - t_0, \; W_2 - \max_{\sigma'} P_2^{\sigma'}(t_0) \}. \qquad (4.36)

Now for any schedule σ, we have

P_1^\sigma(t_0) \le t_0 = P_1^\rho(t_0), \qquad P_2^\sigma(t_0) \le \max_{\sigma'} P_2^{\sigma'}(t_0) = P_2^\rho(t_0). \qquad (4.37)

Substituting (4.37) into (4.36), we have

T^\rho \le t_0 + \max_{r=1,2} \{ W_r - P_r^\sigma(t_0) \} \le T^\sigma, \qquad (4.38)

where the last inequality is derived from (4.32). Since (4.38) holds for any schedule σ, we see that the fluid schedule ρ is optimal. □

4.9.4 Proof of Theorem 13

We first show that given any optimal solution $\bar{d}^*$ with $\bar{d}_j^* > 0$ for some $1 < j < n$, we can convert it to another optimal solution $\bar{d}$ with $\bar{d}_j = 0$ and $\bar{d}_i = \bar{d}_i^*$ for all $i \neq 1, j, n$. It is easy to check that there exists some $k$ such that

1 = \tau_{1,1} = \cdots = \tau_{k,1} > \tau_{k+1,1} \ge \cdots \ge \tau_{n,1},
\tau_{1,2} \le \cdots \le \tau_{k,2} \le \tau_{k+1,2} = \cdots = \tau_{n,2} = 1.


In particular, if $j \le k$, we let

\bar{d}_i = \begin{cases} \bar{d}_1^* + \bar{d}_j^*, & i = 1, \\ 0, & i = j, \\ \bar{d}_i^*, & \text{otherwise}. \end{cases}

We show that $\bar{d}$ is a feasible solution to (4.21). Consider the constraint of resource 1 in (4.21). Because $\tau_{1,1} = \tau_{j,1} = 1$ for $j \le k$, we have

\sum_i \tau_{i,1} \bar{d}_i = \sum_{i \neq j} \tau_{i,1} \bar{d}_i^* + \tau_{1,1} \bar{d}_j^* = \sum_i \tau_{i,1} \bar{d}_i^* \le \mu_1.

For the constraint of resource 2 in (4.21), because $\tau_{1,2} \le \tau_{j,2}$ when $j \le k$, we have

\sum_i \tau_{i,2} \bar{d}_i = \sum_{i \neq j} \tau_{i,2} \bar{d}_i^* + \tau_{1,2} \bar{d}_j^* \le \sum_i \tau_{i,2} \bar{d}_i^* \le \mu_2.

Noting also that $\sum_i \bar{d}_i = \sum_i \bar{d}_i^*$, we see that $\bar{d}$ is an optimal solution to (4.21).

If $j > k$, we let

\bar{d}_i = \begin{cases} 0, & i = j, \\ \bar{d}_n^* + \bar{d}_j^*, & i = n, \\ \bar{d}_i^*, & \text{otherwise}. \end{cases}

Repeatedly applying the approach above to all the non-zero components, except the first and the

last, of an optimal solution, we see that the statement holds. Because only d1, dn > 0, problem (4.21)

reduces to a simple linear program with two variables and two constraints. Exhaustive discussions lead

to the results in all the three cases. ut

4.9.5 Preliminaries for the Proofs of Theorems 14 and 15

Because the discrete schedule tracks the fluid schedule based on the packet start times, the packet scheduling orders of both schedules are the same. Let the order be $p_1, \ldots, p_N$, where $p_i$ is the $i$th packet scheduled. For the discrete schedule, let $s_{i,r}^D$ be the start time of packet $p_i$ on resource $r$, and $f_{i,r}^D$ the finish time of $p_i$ on resource $r$. For the fluid schedule, since each packet starts (completes) processing at the same time on both resources, let $s_i^F$ and $f_i^F$ be the start and finish times of packet $p_i$, respectively. Let $\tau_r(p_i)$ be the packet processing time required by $p_i$ on resource $r$, and $\tau_{\max} = \max_{i,r} \tau_r(p_i)$ the maximum packet processing time required by any packet on any resource. The following lemma generally holds for


any discrete schedule.

Lemma 13. For any packet pi, there exists some pivotal packet pk, 1 ≤ k ≤ i, such that the processing

of pk on resource 2 follows immediately after the processing on resource 1 completes, and the processing

of pk, . . . , pi is continuous on resource 2, i.e.,

f_{i,2}^D = f_{k,1}^D + \sum_{j=k}^{i} \tau_2(p_j). \qquad (4.39)

Proof: For the pivotal packet pk, the processing on resource 2 starts immediately after its processing

on resource 1 completes, i.e.,

f_{k,1}^D = s_{k,2}^D. \qquad (4.40)

To find the pivotal packet, we search from packet pi and check if it satisfies (4.40). If it does, then

pi is the pivotal packet and the search stops. Otherwise, the processing of packet pi on resource 2 is

delayed for a certain amount of time after the processing completes on resource 1. This implies that the

processing of pi on resource 2 starts immediately after its previous packet pi−1 completes processing on

resource 2:

f_{i,1}^D < s_{i,2}^D = f_{i-1,2}^D. \qquad (4.41)

We continue the search to pi−1 and check if it satisfies (4.40). If it does, then pi−1 is the pivotal packet

because the processing of pi−1 and pi is continuous on resource 2, as suggested by (4.41), and the search

stops. Otherwise, we must have

f_{i-1,1}^D < s_{i-1,2}^D = f_{i-2,2}^D, \qquad (4.42)

for a similar reason as that of (4.41). We continue the search to pi−2. Note that the search is

guaranteed to stop at packet p1, because it is the first packet scheduled and there is no delay between

the processing on the two resources. In this case, p1 is the pivotal packet. □

Lemma 14. For any fluid schedule with α > 0 and its corresponding discrete schedule, we have

s_{i,1}^D \le s_i^F + (n-1)\tau_{\max}, \quad \text{for all } i, \qquad (4.43)

where n is the maximum number of concurrent flows in the fluid schedule.


Proof: For any packet $p_i$, let $p_j$ be the earliest-scheduled packet such that $p_j, p_{j+1}, \ldots, p_i$ are continuously processed on resource 1 in the discrete schedule, i.e.,

j = \min\big\{\, l \le i \;:\; s_{m,1}^D = f_{m-1,1}^D \text{ for all } m = l+1, \ldots, i \,\big\}.

The reason that $p_j$ is not processed right after its previous packet completes processing on resource 1 is that it has not yet started processing in the corresponding fluid schedule. Packet $p_j$ is hence scheduled immediately in the discrete schedule upon its start in the fluid counterpart, i.e.,

s_{j,1}^D = s_j^F. \qquad (4.44)

Now for the fluid schedule, consider the time interval $[s_j^F, s_i^F)$, during which packets $p_j, \ldots, p_{i-1}$ start processing. When packet $p_i$ starts processing at time $s_i^F$, there are at most $n-1$ other packets that have not yet completed processing, all of which must have started earlier than $p_i$. Therefore, the length of the busy period of the fluid schedule in $[s_j^F, s_i^F)$ is at least $\sum_{l=j}^{i-1} \tau_1(p_l) - (n-1)\tau_{\max}$, and hence

\sum_{l=j}^{i-1} \tau_1(p_l) - (n-1)\tau_{\max} \le s_i^F - s_j^F = s_i^F - s_{j,1}^D, \qquad (4.45)

where the equality is derived from (4.44). We rewrite (4.45) as

s_i^F \ge s_{j,1}^D + \sum_{l=j}^{i-1} \tau_1(p_l) - (n-1)\tau_{\max} = s_{i,1}^D - (n-1)\tau_{\max}, \qquad (4.46)

where the equality holds because packets pj , . . . , pi are processed continuously on resource 1 in the

discrete schedule. □

With Lemma 13 and Lemma 14, we give proofs of Theorem 14 and Theorem 15 in the following two

subsections.

4.9.6 Proof of Theorem 14

By Lemma 13, for the discrete schedule, there exists a pivotal packet pk, such that

f_{N,2}^D = f_{k,1}^D + \sum_{i=k}^{N} \tau_2(p_i) = s_{k,1}^D + \tau_1(p_k) + \sum_{i=k}^{N} \tau_2(p_i). \qquad (4.47)


Since packets pk, . . . , pN are all processed after sFk in the corresponding fluid schedule, we have

\sum_{i=k}^{N} \tau_2(p_i) \le T^F - s_k^F. \qquad (4.48)

On the other hand, Lemma 14 suggests that

s_{k,1}^D \le s_k^F + (n-1)\tau_{\max}. \qquad (4.49)

Plugging (4.48) and (4.49) into (4.47), we have

f_{N,2}^D \le T^F + (n-1)\tau_{\max} + \tau_1(p_k) \le T^F + n\tau_{\max}.

By noting that $f_{N,2}^D = T^D$, we see that the statement holds. □

4.9.7 Proof of Theorem 15

Without loss of generality, we limit the discussion to a busy period of the fluid schedule. We first prove the lower bound of $D_i^D(0,t)$. This requires the following lemma.

Lemma 15. For any fluid schedule with α > 0 and its corresponding discrete schedule, we have

f_{i,2}^D \le f_i^F + (2n-1)\tau_{\max}, \quad \text{for all } i.

Proof: For any packet pi, by Lemma 13, there exists a pivotal packet pk where

f_{i,2}^D = f_{k,1}^D + \sum_{j=k}^{i} \tau_2(p_j) = \tau_1(p_k) + s_{k,1}^D + \sum_{j=k}^{i} \tau_2(p_j). \qquad (4.50)

For the fluid schedule, consider the time interval $[s_k^F, f_i^F)$, during which packets $p_k, \ldots, p_i$ start processing. Right before packet $p_i$ completes processing at time $f_i^F$, there are at most $n-1$ other packets processed in parallel, all of which start earlier than $p_i$. Therefore, the workload processed by the fluid schedule on resource 2 during $[s_k^F, f_i^F)$ is at least $\sum_{l=k}^{i} \tau_2(p_l) - (n-1)\tau_{\max}$, which cannot exceed the interval length:

\sum_{l=k}^{i} \tau_2(p_l) - (n-1)\tau_{\max} \le f_i^F - s_k^F.


We then have

f_i^F \ge s_k^F + \sum_{l=k}^{i} \tau_2(p_l) - (n-1)\tau_{\max}
     \ge s_{k,1}^D + \sum_{l=k}^{i} \tau_2(p_l) - 2(n-1)\tau_{\max} \quad \text{(by Lemma 14)}
     = f_{i,2}^D - \tau_1(p_k) - 2(n-1)\tau_{\max} \quad \text{(by (4.50))}
     \ge f_{i,2}^D - (2n-1)\tau_{\max}. \qquad \square

We are now ready to establish the lower bound of $D_i^D(0,t)$.

Lemma 16. For any fluid schedule with α > 0 and its corresponding discrete schedule, at any time t, we have

D_i^D(0,t) \ge D_i^F(0,t) - 2(n-1)\tau_{\max}, \quad \text{for all } i.

Proof: For any flow $i$, because $0 \le \frac{d}{dt} D_i^F(0,t) \le 1$ and $\frac{d}{dt} D_i^D(0,t) \in \{0, 1\}$, the difference $D_i^F(0,t) - D_i^D(0,t)$ reaches its maximum at some time $t$ when a packet $p$ of flow $i$ starts processing on its dominant resource in the discrete schedule. Packet $p$ completes its dominant processing at time $t + \tau^*$, where $\tau^*$ is the packet processing time required by $p$ on its dominant resource. Let $f^D$ be the time when packet $p$ completes processing (on resource 2) in the discrete schedule. We have

f^D \ge t + \tau^*. \qquad (4.51)

Let fF be the time when packet p completes processing in the fluid schedule. We have

DFi (0, fF) = DD

i (0, t+ τ∗) = DDi (0, t) + τ∗ . (4.52)

By Lemma 15 and (4.51), we have

fF ≥ fD − (2n− 1)τmax ≥ t+ τ∗ − (2n− 1)τmax .

This immediately suggests that

DFi (0, t+ τ∗ − (2n− 1)τmax) ≤ DF

i (0, fF)

= DDi (0, t) + τ∗ .

(4.53)


On the other hand, because the dominant processing rate of flow $i$ in the fluid schedule never exceeds 1, we have
\begin{equation}
D_i^F\bigl(0,\, t + \tau^* - (2n-1)\tau_{\max}\bigr) \ge D_i^F(0,t) + \tau^* - (2n-1)\tau_{\max}. \tag{4.54}
\end{equation}
Combining (4.53) and (4.54), we obtain
\begin{equation*}
D_i^F(0,t) + \tau^* - (2n-1)\tau_{\max} \le D_i^D(0,t) + \tau^*,
\end{equation*}
and canceling $\tau^*$ on both sides shows that the statement holds. $\square$

We next prove the upper bound of $D_i^D(0,t)$ using the following lemma, hence completing the proof of Theorem 15.

Lemma 17. For any fluid schedule with $\alpha > 0$ and its corresponding discrete schedule, at any time $t$, we have
\begin{equation*}
D_i^D(0,t) \le D_i^F(0,t) + \tau_{\max}, \quad \text{for all } i.
\end{equation*}

Proof: Let $p_{(i,k)}$ be the $k$th packet of flow $i$. Let $s_{(i,k)}^D$ be the time when $p_{(i,k)}$ starts processing (on resource 1) in the discrete schedule, and $f_{(i,k)}^D$ the time when $p_{(i,k)}$ completes processing (on resource 2) in the discrete schedule. Let $s_{(i,k)}^F$ and $f_{(i,k)}^F$ be similarly defined for the fluid schedule. We note the following two facts. First, a packet does not start processing in the discrete schedule until it starts in the fluid schedule, i.e.,
\begin{equation}
s_{(i,k)}^D \ge s_{(i,k)}^F, \quad \text{for all } p_{(i,k)}. \tag{4.55}
\end{equation}
Second, because packets of a flow are processed in sequence in both the fluid schedule and the discrete schedule, the following must hold for all $p_{(i,k)}$:
\begin{equation}
D_i^D\bigl(0, s_{(i,k)}^D\bigr) = D_i^F\bigl(0, s_{(i,k)}^F\bigr). \tag{4.56}
\end{equation}
Without loss of generality, we assume $s_{(i,k)}^F \le t \le s_{(i,k+1)}^F$ for some $k$. It suffices to consider the following two cases.

Case 1: $s_{(i,k)}^D \ge t$. In this case, we have
\begin{equation*}
D_i^D(0,t) \le D_i^D\bigl(0, s_{(i,k)}^D\bigr) = D_i^F\bigl(0, s_{(i,k)}^F\bigr) \le D_i^F(0,t).
\end{equation*}

Case 2: $s_{(i,k)}^D < t$. Let $\tau_{(i,k)}^*$ be the packet processing time required by $p_{(i,k)}$ on its dominant resource. Because the dominant service flow $i$ receives in $[s_{(i,k)}^D, t)$ in the discrete schedule is limited by the dominant processing time of $p_{(i,k)}$ and the processing rate, we have
\begin{equation}
D_i^D\bigl(s_{(i,k)}^D, t\bigr) \le \min\bigl\{\tau_{(i,k)}^*,\, t - s_{(i,k)}^D\bigr\} \le \min\bigl\{\tau_{(i,k)}^*,\, t - s_{(i,k)}^F\bigr\}, \tag{4.57}
\end{equation}
where the second inequality holds because of (4.55). Now consider the dominant service flow $i$ receives in the fluid schedule in $[s_{(i,k)}^F, t)$. Let $\beta$ be the minimum dominant processing rate flow $i$ receives in $[s_{(i,k)}^F, t)$. We have $0 < \beta \le 1$ and
\begin{equation}
D_i^F\bigl(s_{(i,k)}^F, t\bigr) \ge \min\bigl\{\tau_{(i,k)}^*,\, (t - s_{(i,k)}^F)\beta\bigr\}. \tag{4.58}
\end{equation}

By (4.56), (4.57), and (4.58), we have
\begin{align*}
D_i^D(0,t) - D_i^F(0,t)
&= D_i^D\bigl(0, s_{(i,k)}^D\bigr) + D_i^D\bigl(s_{(i,k)}^D, t\bigr) - D_i^F\bigl(0, s_{(i,k)}^F\bigr) - D_i^F\bigl(s_{(i,k)}^F, t\bigr) \\
&= D_i^D\bigl(s_{(i,k)}^D, t\bigr) - D_i^F\bigl(s_{(i,k)}^F, t\bigr) \\
&\le \min\bigl\{\tau_{(i,k)}^*,\, t - s_{(i,k)}^F\bigr\} - \min\bigl\{\tau_{(i,k)}^*,\, (t - s_{(i,k)}^F)\beta\bigr\} \\
&= \begin{cases}
(t - s_{(i,k)}^F)(1 - \beta), & t - s_{(i,k)}^F < \tau_{(i,k)}^*, \\
\tau_{(i,k)}^* - (t - s_{(i,k)}^F)\beta, & (t - s_{(i,k)}^F)\beta < \tau_{(i,k)}^* \le t - s_{(i,k)}^F, \\
0, & \tau_{(i,k)}^* \le (t - s_{(i,k)}^F)\beta.
\end{cases}
\end{align*}
In each of the three cases above, the right-hand side is at most $\tau_{(i,k)}^*$: in the first case, $(t - s_{(i,k)}^F)(1 - \beta) \le t - s_{(i,k)}^F < \tau_{(i,k)}^*$; in the second, $\tau_{(i,k)}^* - (t - s_{(i,k)}^F)\beta \le \tau_{(i,k)}^*$; the third is immediate. Hence
\begin{equation*}
D_i^D(0,t) - D_i^F(0,t) \le \tau_{(i,k)}^* \le \tau_{\max}. \qquad \square
\end{equation*}
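As a quick numerical sanity check of the case analysis above (an illustrative aside, not part of the formal argument), one can verify the elementary fact that $\min\{\tau, \Delta\} - \min\{\tau, \Delta\beta\} \le \tau$ for any $\tau, \Delta \ge 0$ and $0 < \beta \le 1$. The short Python script below is a hypothetical helper written for this purpose only:

# Illustrative sanity check: verify min(tau, d) - min(tau, d*beta) <= tau,
# the algebraic fact behind the three-case bound in the proof of Lemma 17.
import random

random.seed(42)
for _ in range(100_000):
    tau = random.uniform(0.0, 10.0)    # dominant processing time of a packet
    d = random.uniform(0.0, 10.0)      # interval length t - s
    beta = random.uniform(1e-6, 1.0)   # minimum dominant processing rate
    gap = min(tau, d) - min(tau, d * beta)
    assert gap <= tau + 1e-12, (tau, d, beta, gap)
print("bound verified on 100,000 random samples")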


Chapter 5

Concluding Remarks

5.1 Conclusions

This dissertation studies several fundamental fair sharing problems with multiple resource types in cloud

datacenters, and makes contributions in both algorithmic design and prototype implementation.

We started by studying the multi-resource allocation problem in cloud computing systems where the

resource pool is constructed from a large number of heterogeneous servers, representing different points in

the configuration space of resources such as processing, memory, and storage. We have designed a multi-

resource sharing policy, called DRFH, that generalizes the notion of Dominant Resource Fairness (DRF)

from a single server to multiple heterogeneous servers. DRFH provides a number of highly desirable

“fair” properties. In particular, with DRFH, no user prefers the allocation of another user; no one can

improve its allocation without decreasing that of the others; and more importantly, no coalition behaviour

of misreporting resource demands can benefit all its members. DRFH also ensures some level of service

isolation among the users. We have prototyped DRFH as a pluggable resource allocator in Apache Mesos,

an open-source cluster management system widely adopted by industry. Experimental studies show that

our implementation can lead to accurate DRFH allocation at all times. Large-scale simulations driven

by Google cluster traces show that DRFH significantly outperforms the traditional slot-based scheduler,

leading to much higher resource utilization with substantially shorter job completion times.
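To illustrate the dominant-share max-min idea that DRFH builds on, the following Python sketch implements progressive filling over a pooled capacity vector. It is a simplified illustration under assumptions DRFH does not make: it lumps all servers into a single capacity vector and ignores per-server task placement, which is exactly the heterogeneity DRFH is designed to handle. All identifiers are hypothetical.

# A toy progressive-filling allocator in the spirit of dominant-share max-min
# fairness (illustrative sketch only; not the DRFH allocator of this thesis).
# We repeatedly schedule one task for the user with the smallest dominant share.

def allocate(capacities, demands, max_tasks=1000):
    """capacities: pooled cluster resources, e.g., {"cpu": 18, "mem": 36}.
    demands: user -> per-task demand, e.g., {"u1": {"cpu": 1, "mem": 4}}."""
    used = {r: 0.0 for r in capacities}
    tasks = {u: 0 for u in demands}

    def dominant_share(u):
        # Share of the pooled capacity on the user's most-demanded resource.
        return max(tasks[u] * d / capacities[r] for r, d in demands[u].items())

    def fits(u):
        return all(used[r] + d <= capacities[r] for r, d in demands[u].items())

    for _ in range(max_tasks):
        candidates = [u for u in demands if fits(u)]
        if not candidates:
            break  # some resource is saturated
        u = min(candidates, key=dominant_share)
        tasks[u] += 1
        for r, d in demands[u].items():
            used[r] += d
    return tasks, used

if __name__ == "__main__":
    caps = {"cpu": 18, "mem": 36}
    dem = {"u1": {"cpu": 1, "mem": 4}, "u2": {"cpu": 3, "mem": 1}}
    print(allocate(caps, dem))

On this example (user 1 demands 1 CPU and 4 GB per task, user 2 demands 3 CPU and 1 GB, on an 18-CPU, 36-GB pool), the sketch converges to 6 and 4 tasks respectively, equalizing the dominant shares at 2/3.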

Multi-resource fair sharing is not only a fundamental system design problem for large computer

clusters, but also a new challenge for middleboxes, software routers, and other appliances that are

widely deployed in datacenters. These appliances perform a wide range of important network functions,

including WAN optimizations, intrusion detection systems, and network- and application-level firewalls.



Depending on the underlying applications, packet processing for different traffic flows may consume vastly

different amounts of hardware resources (e.g., CPU and link bandwidth). Multi-resource fair queueing

allows each traffic flow to receive a fair share of multiple middlebox resources. Previous schemes for

multi-resource fair queueing, however, are expensive to implement at high speeds. Specifically, the time

complexity to schedule a packet is O(log n), where n is the number of backlogged flows. We have designed

Multi-Resource Round Robin (MR3), a new fair queueing scheme for unweighted flows that schedules

packets in a manner similar to Elastic Round Robin. MR3 requires only O(1) work to schedule a packet

and is simple enough to implement in practice. It serves as a foundation for a more generalized fair

scheduler, called Group Multi-Resource Round Robin (GMR3), that also runs in O(1) time, yet provides

a weight-proportional delay bound for flows with uneven weights. We have shown, both analytically and experimentally, that both MR3 and GMR3 achieve nearly perfect Dominant Resource Fairness.
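To make the round-robin mechanism concrete, here is a minimal Python sketch of a deficit-style scheduler driven by dominant processing times, in the spirit of Elastic Round Robin. It is an illustration only, not the MR3 algorithm of Chapter 3: it uses a fixed per-round allowance with no surplus-count bookkeeping, and all identifiers are hypothetical.

# A toy dominant-resource round-robin scheduler (simplified sketch only).
# Each backlogged flow is visited in round-robin order and may send packets
# until the dominant processing time consumed in the visit reaches its allowance.
from collections import deque

def dominant_time(packet_cost):
    """Dominant processing time: the largest per-resource cost (e.g., CPU, link)."""
    return max(packet_cost)

def round_robin_schedule(flows, rounds=3, allowance=1.0):
    """flows: dict flow_id -> deque of packets, each packet a tuple of resource costs."""
    order = []
    active = deque(fid for fid, q in flows.items() if q)
    for _ in range(rounds):
        for _ in range(len(active)):
            fid = active.popleft()
            spent = 0.0
            q = flows[fid]
            # Serve packets until this flow uses up its per-visit allowance.
            while q and spent < allowance:
                pkt = q.popleft()
                spent += dominant_time(pkt)
                order.append((fid, pkt))
            if q:
                active.append(fid)  # still backlogged: revisit next round
    return order

if __name__ == "__main__":
    flows = {
        "A": deque([(0.4, 0.1)] * 5),  # CPU-heavy packets
        "B": deque([(0.1, 0.5)] * 5),  # bandwidth-heavy packets
    }
    for fid, pkt in round_robin_schedule(flows):
        print(fid, pkt)

In this sketch, every visit to a flow either forwards at least one packet or drops the flow from the active list, so the scheduling work per packet is constant; this is the intuition behind the O(1) complexity of round-robin schedulers, though MR3 itself achieves its bounds with more careful accounting.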

We have also identified a new challenge that is unique to multi-resource scheduling. Unlike tradi-

tional fair queueing where bandwidth is the only concern, we have shown in Chapter 4 that fairness and

efficiency are conflicting objectives that cannot be achieved simultaneously in the presence of multiple

resource types. Ideally, a scheduling algorithm should allow network operators to flexibly specify their

fairness and efficiency requirements, so as to meet the Quality of Service demands while keeping the sys-

tem at a high utilization level. Yet, existing multi-resource scheduling algorithms focus on fairness only,

and may lead to poor resource utilization. Driven by this problem, we have proposed a new scheduling

algorithm to achieve a flexible tradeoff between fairness and efficiency for packet processing, consuming

both CPU and link bandwidth. Experimental results based on both real-world implementation and

trace-driven simulation suggested that trading off a modest level of fairness can potentially improve the

efficiency to the point where the system capacity is almost saturated.
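As a concrete numerical illustration of the conflict (a hypothetical example, not one drawn from Chapter 4), consider a middlebox with unit CPU and unit link capacity serving two flows, where each packet of flow 1 consumes $\langle 1, 0.2 \rangle$ units of $\langle \mathrm{CPU}, \mathrm{bandwidth} \rangle$, each packet of flow 2 consumes $\langle 0.5, 1 \rangle$, and $x_i$ denotes the packet rate granted to flow $i$:
\begin{align*}
\text{DRF (equal dominant shares):}\quad & x_1 = x_2 = \tfrac{2}{3}, && \text{throughput } \tfrac{4}{3}, \quad \text{CPU } 100\%, \text{ bandwidth } 80\%; \\
\text{Maximum throughput:}\quad & x_1 = \tfrac{5}{9},\ x_2 = \tfrac{8}{9}, && \text{throughput } \tfrac{13}{9}, \quad \text{both resources } 100\%.
\end{align*}
Moving from the first operating point to the second raises total throughput by roughly 8%, but leaves the dominant shares unequal ($\tfrac{5}{9}$ versus $\tfrac{8}{9}$); a tradeoff knob of the kind proposed in Chapter 4 interpolates between such operating points.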

5.2 Future Directions

Our work on resource scheduling in cloud systems is by no means complete. Large-scale datacenter

and analytic systems are undergoing fundamental shifts, from application-specific to general-purpose,

from batch processing to interactive analysis, from a small number of long-lasting tasks with low fan-

out to a large number of short-lived tasks that are highly parallelized. All these shifts pose significant

scheduling challenges to system availability, scalability, and responsiveness. These challenges provide

rich research problems requiring both analytical studies and new implementations.

Availability. Production computing tasks typically have placement requirements, some of which

are hard constraints that must be enforced (e.g., some tasks can only run on nodes with public IPs),


while others are preferences that could be violated at the expense of performance degradation (e.g.,

data locality). In addition to placement requirements, tasks may have dynamic resource demands. For

example, a Spark task is usually CPU-intensive in the map stage, but soon becomes bandwidth-bound in the reduce stage. All these complexities make resource availability a severe problem, giving

rise to many delicate issues that must be carefully addressed. In particular, how should tasks be scheduled

under complex placement requirements while ensuring basic fairness and high efficiency? How should the

system quickly respond to these resource dynamics, and make task scheduling adjustment accordingly?

These questions are arising in Hadoop, Mesos, and Spark deployments, and require both new system

abstractions and analytical work.

Scalability. Driven by demand for lower-latency interactive data analytics, large-scale clusters are

shifting towards shorter tasks with higher degrees of parallelism. The foreseeable trend of adopting sub-

second tasks in new-generation analytic frameworks (e.g., Dremel, Spark, Impala) would further increase

the number of tasks by two orders of magnitude. Scheduling such a large number of tasks in a short

period of time would easily congest the centralized scheduler. Moreover, having much higher degrees of

parallelism will further aggravate the already severe Incast problem in clusters. Some promising ideas

are to parallelize the computation of task scheduling via distributed schedulers and to reduce traffic load

in the reduce stage via in-network aggregation. However, these ideas are far from maturity and require

further analytical investigations, engineering innovations, and implementation efforts.

Responsiveness. The demand for lower-latency interactive data analytics requires large-scale clus-

ters to provide fluid responsiveness, which naturally translates into a short tail of the latency distribution.

While many techniques have been proposed to reduce the latency tail, most of them are based on con-

ventional parallel computing frameworks such as MapReduce. It remains open whether and how these

techniques extend to new-generation data analytic frameworks that support in-memory computing and

data streaming (e.g., Spark and Shark). It is also unclear how these techniques can be implemented

in distributed scheduling given the scalability problem suffered by centralized scheduling. Answering

these questions requires new insights derived from extensive measurement studies in production data analytic clusters, based on which new techniques can be proposed and implemented.

