Greener Big Data: Optimizing Data Exchange and Power Procurement for Data Centers
by
Zakia Asad
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2017 by Zakia Asad
Abstract
Greener Big Data: Optimizing Data Exchange and Power Procurement for Data
Centers
Zakia Asad
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
Big data is revolutionizing modern day life at an environmental cost due to the “un-
green” practices associated with the operation of data centers, a workhorse for big data
processing. Turning the big data enterprise into a greener enterprise requires two funda-
mental steps. The first step is curtailing resource consumption, and making data centers
energy efficient. The second step is using greener energy sources to power data centers.
Big data processing has challenged the ability of data centers to operate in a greener
fashion. In particular, the rise of the cloud and distributed data-intensive (“big data”)
applications puts pressure on data center networks due to the movement of massive
volumes of data. Reducing the volume of communication is crucial for embracing greener
data exchange through efficient utilization of resources. This thesis proposes the use of a
mixing technique, spate coding, as a means of dynamically-controlled reduction in volume
of communication. We introduce real-world use-cases, and present a novel spate coding
algorithm for data center networks. We also analyze the computational complexity
of minimizing the volume of communication in a distributed data center application, and
provide theoretical limits. Moreover, we proceed to bridge the gap between theory and
practice by performing a proof-of-concept implementation of the proposed system in a
real-world data center. We use MapReduce, the most widely used Big Data processing
framework, as our target. Experimental results employing industry standard benchmarks
show the advantage of our proposed system compared to the current state-of-the-art.
Another important step in promoting green practices for the big data enterprise is
reshaping the energy mix required to operate data centers. In this context, we con-
sider power procurement for data centers while incorporating the green initiatives and
cyber-physical constraints imposed by evolving power systems. Moreover, we consider
implications of our formulation to solution complexity. We use the concept of matroid
theory to model the problem as a combinatorial problem and propose a distributed and
optimal solution. We account for communications and computational complexity and
provide solutions for practical scenarios.
In the end, we propose a futuristic cross-plane green orchestrator.
Dedicated to my love, Asad
Acknowledgements
I owe all to my God, the Lord of the heavens and the earth, the compassionate and the
merciful. I thank Him for opening up my heart and mind to the wonders of knowledge.
It is with immense gratitude that I acknowledge the valuable support, availability, and
help of Prof. Frank Kschischang. Long discussions with him have been very enriching.
His intriguing questions provided me with an opportunity to better organize my thoughts
as well as my writing. I would like to credit Prof. Cristiana Amza for being very kind, supportive,
and encouraging throughout this process. I am very thankful to Prof. Baochun Li for
his help, guidance, and support in reviewing my work and providing me with very
helpful early feedback. I am also very thankful to Prof. Ben Liang, whose questions
helped me to better position the work. His suggestions and comments
refined various aspects of my work. I also want to acknowledge Prof. David Lie and Ms.
Darlene Gorzo for their valuable assistance when it was most needed. Special thanks to
Schlumberger Foundation, and Soptimizer for their support.
I am truly indebted to my parents, and in-laws for their continuous support and
prayers that kept me in high spirits throughout this journey. I owe my deepest gratitude
to my father-in-law who has always been there to provide unconditional support by being
my coach, mentor, counselor, and doctor. I also want to thank my father who envisioned
this path for me and helped me to grow and become what I am today. He has always been
a source of confidence and inspiration for me. I can only imagine the happiness and joy my
late mother would have felt on successful completion of my PhD. Her friendship is always
remembered, and has always been a source of resilience and strength. My mother-in-law,
and my sisters-in-law have always been very supportive of my goals and I thank them
from the core of my heart. I cannot express in words the support that I receive from my
wonderful kids Abd-Ur-Rehman, Marium, and Abd-Ur-Rahim for always understanding
my workload and assisting me in their own little ways. Their involvement and interest in
my work have always sparked extra endurance in my inner-self and brightened my life.
My kids are among the strongest reasons I have come so far.
Last but certainly not least, my PhD would have remained nothing more than a dream
had Asad not been there. His unreserved friendship, definite support, and outstanding love have shaped
my career. He has not only been my buddy but also my collaborator. He helped me to
carry out my research by transforming our home into a small research center. He has
been a beacon of happiness and source of inspiration for me. His everlasting commitment
to the family has bound us together as a strong and happy unit. He has always been
there to accompany the kids during my work, and cover for my role—as a mother—so
they are never left unloved even for a moment. His compassion and encouragement have
been a great gift from Allah. I cannot thank Allah, the Almighty, enough for having such
a perfect life partner, and a wonderful family that kept me exuberant no matter what.
Love, support and joy—every bit to enjoy,
Unconditional support to keep me buoy,
Indebted to Asad for helping me to fly,
And reach my dreams high up in the sky
Contents
1 Introduction 1
1.1 Energy Insights of the Ungreen Big Data Enterprise . . . . . . . . . . . 3
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Dissecting Big Data Enterprise . . . . . . . . . . . . . . . . . . . 5
1.2.2 A Coding Based Optimization for Big Data Processing . . . . . . 5
1.2.3 Distributed Power Procurement . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Preliminaries 9
2.1 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Matroid Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Matroid Intersection . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Hadoop MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Combiner and In-network Combiner . . . . . . . . . . . . . . . . . 18
3 Five Vital Planes for Green Big Data Enterprise 21
3.1 Server Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Virtualization Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Interconnect Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Frameworks Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Power Procurement Plane . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Connecting the Dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 A Coding Based Optimization for Big Data Processing 26
4.1 Facing the Challenge: High Volume of Communication and Big Data Pro-
cessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Motivating Use-Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.1 Hadoop MapReduce for Electricity Theft Detection . . . . . . . . 34
4.4.1.1 Hadoop Job . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1.2 Standard Mechanism . . . . . . . . . . . . . . . . . . . . 38
4.4.1.3 Proposed Coding Based Mechanism . . . . . . . . . . . . 39
4.4.2 Storm for DNA Sequencing . . . . . . . . . . . . . . . . . . . . . 41
4.4.2.1 Storm Job . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.2.2 Standard Mechanism . . . . . . . . . . . . . . . . . . . . 44
4.4.2.3 Proposed Coding Based Mechanism . . . . . . . . . . . . 45
4.5 Spate Coding Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.2 Spate Coding and its Relationship with Network Coding and Index
Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.2.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.3 Solution for Spate Coding Problem . . . . . . . . . . . . . . . . . 51
4.6 Challenges and Considerations . . . . . . . . . . . . . . . . . . . . . . . . 55
4.7 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8 Theory to Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8.1 Coding Based Middlebox and its Components . . . . . . . . . . . 58
4.8.1.1 Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8.1.2 Coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.8.2 PreReducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.8.3 Seamless Integration using OpenFlow . . . . . . . . . . . . . . . . 68
4.9 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.9.1 Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.9.2 Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.9.2.1 Volume-of-Communication . . . . . . . . . . . . . . . . . 80
4.9.2.2 Goodput . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.9.2.3 Disk IO Statistics . . . . . . . . . . . . . . . . . . . . . . 84
4.9.2.4 Bits-per-Joule . . . . . . . . . . . . . . . . . . . . . . . . 85
4.10 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.11 Connecting the Dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5 Distributed Power Procurement for Data Centers 91
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.1 Example Case Study: Power Procurement over a 12 bus power
system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.1 Why Matroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Restricted Constrained Power Procurement Problem . . . . . . . . . . . 107
5.5.1 Relation to Matroids . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.1.1 Policy Constraints . . . . . . . . . . . . . . . . . . . . . 108
5.5.1.2 Transmission Line Constraints . . . . . . . . . . . . . . . 109
5.5.2 A Distributed Algorithm for the 1-RCPP Problem . . . . . . . . . 110
5.5.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5.4 T-Restricted Constrained Power Procurement Problem . . . . . . 116
5.5.4.1 Extended Ground Set . . . . . . . . . . . . . . . . . . . 116
5.5.4.2 Policy Constraints . . . . . . . . . . . . . . . . . . . . . 116
5.5.4.3 Transmission Line Constraints . . . . . . . . . . . . . . . 117
5.6 Constrained Power Procurement Problem . . . . . . . . . . . . . . . . . . 119
5.7 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.1 An overview of relevant solution techniques . . . . . . . . . . . . . 122
5.7.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7.3 Percentage procurement and Cost . . . . . . . . . . . . . . . . . . 124
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 Concluding Remarks and Path Forward 132
6.1 Open Issues and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Green Orchestrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Appendices 141
A 5Ps: Literature Survey and Discussions 142
A.1 Server Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
A.2 Virtualization Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A.3 Interconnect Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
A.4 Frameworks Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
A.5 Power Procurement Plane . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B Proof: A Coding Based Optimization for Big Data Processing 151
B.1 Proof of Theorem 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
B.2 Proof of Theorem 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
B.3 Proof of Theorem 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
B.4 Proof of Theorem 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.5 Proof of Theorem 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.6 Proof of Theorem 28 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.7 Proof of Theorem 29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
C Proof: Distributed Power Procurement for Data Centers 158
C.1 Proof of Lemma 33 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
C.2 Proof of Theorem 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Bibliography 160
List of Figures
1.1 Big data: greener optimization. . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 A graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Graph H for iterations 1,2, and 3 during execution of the matroid intersec-
tion algorithm for finding least cost common independent set of maximum
cardinality of two matroids M1 and M2. . . . . . . . . . . . . . . . . . . 16
2.3 A toy word count example demonstrating basic tasks of Hadoop Mapreduce. 18
2.4 A toy word count example demonstrating the use of combiner. . . . . . . 19
2.5 A toy word count example demonstrating the use of in-network combiner. 20
3.1 6Ps of the big data enterprise. . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Topology and mapper/reducer output/input in the electricity theft detec-
tion example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Twelve records emitted by spout for the DNA-sequence processing job. . 42
4.3 Storm topology for DNA sequencing example. . . . . . . . . . . . . . . . 43
4.4 Placement of tasks of bolt A and bolt B in the DNA sequencing example
on a four node Storm cluster. . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 An instance of spate coding for the example 4.4.1. . . . . . . . . . . . . . 47
4.6 (a) An instance of spate coding problem. (b) A similar instance of index
coding problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Dependency graph for the transmissions crossing AS for the Example given
in 4.4.1. The execution of the Algorithm 1 results in four multicast packets
shown by four cliques each of size 2. . . . . . . . . . . . . . . . . . . . . 53
4.8 The proposed coding based Hadoop MapReduce data flow. . . . . . . . . 59
4.9 Architecture for middlebox (sampler, and coder). . . . . . . . . . . . 60
4.10 Architecture for preReducer. . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.11 Coding for variable packet lengths. . . . . . . . . . . . . . . . . . . . . . 65
4.12 The proposed packet payload format for the coding based shuffle at the
application-layer. Depending on size, the format shown may be trans-
ported using multiple segments/packets. . . . . . . . . . . . . . . . . . . 67
4.13 OpenFlow based multicasting using coding groups coordinator for the ex-
ample given in section 4.4.1. Flow Tables and the corresponding Group
Tables for each switch in the multicast are also shown. . . . . . . . . . . 71
4.14 A word count example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.15 Prototype setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.16 Testbed architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.17 Topologies used for testbed: (a) Top-1 (b) Top-2. . . . . . . . . . . . . . 79
4.18 Normalized VoC using Grep benchmark for both topologies Top-1 and
Top-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.19 Normalized VoC using Terasort benchmark for both topologies Top-1 and
Top-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.20 Normalized VoC for different values of λ (maximum clique size) used in
Algorithm 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.21 Goodput versus link rates for sorting benchmark for topology Top-1. . . 84
4.22 Goodput for different oversubscription ratios using sorting benchmark for
topology Top-1 with link rate at 500 Mbps. . . . . . . . . . . . . . . . . . 85
4.23 Goodput for different oversubscription ratios using sorting benchmark for
topology Top-1 with link rate at 1000 Mbps. . . . . . . . . . . . . . . . . 86
4.24 %d-util versus link rates using sorting benchmark for topology Top-1. . . 86
4.25 Qsz versus link rates using sorting benchmark for topology Top-1. . . . . 87
4.26 BpJ versus link rates using sorting benchmark for topology Top-1. . . . . 88
5.1 Three data center facilities A1, A2, A3 and a 12 bus power system with
eight generating sources g1, g2, . . . , g8. The constrained transmission lines,
γ1 and γ2, are shown in bold. . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 An equivalent representation of the imposed constraints for the 12 bus
case study shown in Figure 5.1. . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Startup delays and generation cost associated with each generating point
for each time unit t0, t1, t2 and a planning horizon of three time units for
the 12 bus case study shown in Figure 5.1. . . . . . . . . . . . . . . . . 96
5.4 An instance of the CPP problem. . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Algorithm 1-RCPP . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6 Algorithm b = ChkL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.7 Algorithm T-RCPP . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.8 A sample execution of Algorithm T-RCPP for the system shown in Ex-
ample 5.1.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.9 Average Percentage procurement α for random power networks as a func-
tion of Transmission Network Density TND for PCF = 2. . . . . . . . 125
5.10 Average Percentage procurement α for random power networks as a func-
tion of Transmission Network Density TND for PCF = 4. . . . . . . . 125
5.11 Average Percentage procurement α for random power networks as a func-
tion of Transmission Network Density TND for PCF = 8. . . . . . . . 126
5.12 Average Procurement Cost per Facility β for random power networks as
a function of Transmission Network Density TND for PCF = 2. . . . . 127
5.13 Average Procurement Cost per Facility β for random power networks as
a function of Transmission Network Density TND for PCF = 4. . . . . 127
5.14 Average Procurement Cost per Facility β for random power networks as
a function of Transmission Network Density TND for PCF = 8. . . . . 128
5.15 Average Percentage procurement α for random power networks as a func-
tion of Procurement Choices per Facility PCF for TND = 30. . . . . . 129
5.16 Average Percentage procurement α for random power networks as a func-
tion of Procurement Choices per Facility PCF for TND = 45. . . . . . 129
5.17 Average Procurement Cost per Facility β for random power networks as
a function of Procurement Choices per Facility PCF for TND = 30. . . 130
5.18 Average Procurement Cost per Facility β for random power networks as
a function of Procurement Choices per Facility PCF for TND = 45. . . 130
6.1 Green Orchestrator for big data processing. . . . . . . . . . . . . . . . . . 137
B.1 An instance of the Problem ES. . . . . . . . . . . . . . . . . . . . . . . . 154
Chapter 1
Introduction
The tremendous growth in the data generated by different sources is reshaping modern
day life. In fact, more than 90% of all the data in the world has been generated in the
last four years [1]. Big data is revolutionizing many fields including, but not limited to,
health care, law enforcement, financial markets, manufacturing, town planning, online
profiling, and targeted advertisement. The term big data is usually associated with the
following “five V’s” [2, 3]:
• Volume: The volume of data is predicted to soar as high as 43 trillion gigabytes
by the year 2020, an almost 300-fold increase over 2005, driven by the addition
of 2.3 trillion gigabytes of new data every day.
• Velocity: The rate at which data is generated is growing tremendously. For
instance, more than 400 billion hours of video are watched on YouTube each
month, about 400 million tweets are sent each day, and these rates are expected
to keep increasing.
• Variety: The data is generated by a variety of sources including emails, social
media—e.g., Facebook, Twitter, and YouTube—stock markets, and sensors. This
highly diversified data has left behind the era of well-behaved, rational, and
structured data. Almost 80 percent of today's data does not fit into traditional
databases, and cannot be harnessed by populating highly structured data objects.
• Veracity: It is crucial for the data to be accurate and trustworthy. The applicability
of big data in critical fields—like health care, trade markets, and smart grid, to
name a few—puts even more pressure on its integrity as well as accuracy.
• Value: The pursuit of big data is all about the value churned out of its volume,
velocity, variety, and veracity. Therefore, it is important that the data is not only
collected correctly but also processed wisely to extract valuable information.
The rise of the big data enterprise has opened up several new challenges. To offer
practical benefits, big data solutions require distributed processing and storage across
clusters of servers housed in data centers. In fact, data centers are the backbone of big
data processing.
Big data processing is reported to be very “dirty” in terms of the environmental cost due
to the superfluous energy consumption as well as “ungreen” practices associated with
the operation of data centers [4]. This thesis is focussed on turning data centers into a
greener enterprise. For realizing a greener big data ecosystem, data center efficiency and
power procurement from greener sources should be coupled together [5]. In particular,
turning the big data enterprise into a greener enterprise requires two fundamental steps,
as shown in Figure 1.1. The first step is curtailing resource consumption, and making
data centers energy efficient. The second step is using greener energy sources to power
data centers. Specifically, we explore the following two fundamental approaches as means
towards holistic greener data centers:
• Optimizing big data processing in data centers by curtailing resource consumption
and making the system energy efficient.
• Optimizing the power procurement to power data centers by employing green policy
directives in presence of the system level constraints.
Figure 1.1: Big data: greener optimization.
1.1 Energy Insights of the Ungreen Big Data Enter-
prise
Big data processing units, data centers, have shown a tendency to become energy black
holes. This section highlights energy insights with reference to the “ungreen” big data
enterprise.
Data centers are among the leading electricity consumers in the United States
[6]. A 2008 study showed that around 2% of the world’s total electricity is being consumed
by data centers [4,7]. A more recent report shows that data centers consume around 3%
of the world’s total electricity, and produce 200 million metric tons of carbon dioxide [8,9].
Moreover, electricity consumption by data centers is growing at a rate of 12% per year,
and is about 56% higher than in the preceding five years [10–13].
In Europe, energy consumption by data centers is set to go as high as 93 billion
kWh by 2020, which is equivalent to one hundred million 100-watt light-bulbs burning
24 hours a day, 365 days a year [14]. Similarly, in 2013 the power consumption by U.S.
data centers was around 91 billion kWh, which is equivalent to deriving energy from
thirty-four 500-MW coal-fired power plants on an annual basis. This power consumption
is estimated to soar as high as 140 billion kWh per year by 2020, resulting in $13 billion
per year in electricity bills and almost 150 million metric tons of carbon emissions [6].
However, soaring electricity bills and staggering carbon emissions have not dispirited the
data center industry from further expansion. More precisely, in the past five years nearly
66% of data center facilities added over 1000 kWh per data center to meet increasing
electricity demands [15].
Moreover, a big challenge is extending current peta-scale computing clusters to exa-
scale computing clusters of the future, where scaling-up current data centers is not an
option due to preposterous power demand. For example, simply scaling up a current
state-of-the-art fastest peta-scale computing platform, like Blue Waters, to an exa-scale
platform will need 1.5 GW of power for each such platform, which is almost 0.1% of
the total U.S. generation capacity required per platform, thus implying that each exa-
scale platform will require a reasonably sized nuclear power plant just to fulfill its energy
demands [16]!
Until 2011, more than half of the massive-scale data centers procured up to 80% of
their power from coal [4]. Industry giants like Apple, IBM, and Facebook were reportedly
powering their data centers with more than 50% coal-generated electricity. However,
in more recent years, the industry has faced growing pres-
sure to integrate renewable energy sources mainly due to environmental responsibility
campaigns as well as federal legislation (at least in the U.S. [17]) to limit carbon emis-
sions. This has had very favorable impact; for instance, Apple, Box, Facebook, Google,
Rackspace, and Salesforce have committed to incorporate 100% renewable energy sources
to power their data centers [18]. However, many companies still opt for the economical
rather than the greenest energy from the grid, e.g., Amazon Web Services, YouTube,
and Netflix [19, 20].
Big data frameworks like Hadoop MapReduce [21], Amazon Elastic MapReduce [22],
and Google MapReduce [23] are the prominent growing segment of data centers’ work-
load, used primarily for harnessing the power of analytics [24]. Server clusters running
these frameworks consume a lot of energy. Moreover, these frameworks inherently induce
low utilization of the data center resources. As an example, it has been reported that a
200-node Hadoop cluster draws around 40 kW of power for a certain workload [25].
Moreover, in a multi-job Hadoop workload, around 38% of the time CPUs, disk, and
network remain inactive [24], thus resulting in unnecessary power drainage.
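To put these figures in perspective, a rough back-of-envelope estimate can be made from the numbers above. The 40 kW draw and the 38% idle fraction are the reported figures; continuous year-round operation at full draw is a simplifying assumption, since idle servers in practice draw somewhat less than peak:

```python
# Rough estimate of the energy drained while cluster resources sit idle.
# Figures: reported 40 kW draw of a 200-node Hadoop cluster, reported 38%
# idle time; year-round operation at full draw is a simplifying assumption.
cluster_power_kw = 40.0
idle_fraction = 0.38
hours_per_year = 24 * 365  # 8760 hours

idle_energy_kwh = cluster_power_kw * idle_fraction * hours_per_year
print(round(idle_energy_kwh))  # ≈ 133152 kWh drained per year per cluster
```

Even under these crude assumptions, a single mid-sized cluster drains on the order of 10^5 kWh per year during idle periods, which motivates the resource-curtailment step pursued in this thesis.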
1.2 Contributions
This thesis revolves around providing greener big data, and gives a comprehensive view
of several “energy-wise” planes and solutions to remediate the resource-hungry big data
enterprise. In particular, we make the following contributions.
1.2.1 Dissecting Big Data Enterprise
Turning big data enterprise into a greener enterprise requires two fundamental steps.
The first is curtailing resources for making the system resource efficient. The second is
using greener energy sources to power it up.
Accordingly, we identify five planes (5Ps) impacting the green footprints of data
centers, namely, server plane, virtualization plane, interconnect plane, framework plane,
and power procurement plane.
1.2.2 A Coding Based Optimization for Big Data Processing
It has been established that the reduction in volume of communication directly corre-
sponds to energy savings in a data center [26–30]. Hence, improving the energy efficiency
by reducing volume of communication is fundamental in bringing green practices to big
data processing.
We propose a system for optimizing big data processing in a data center. Particularly,
the focus is to make the big data processing greener by reducing the volume of com-
munication and help making server, virtualization, interconnect, and framework planes
greener. We motivate our work by presenting real world use-cases. Our contributions are
both theoretically and practically significant.
Our major theoretical contributions include analyzing the computational complexity
of a general scheme that tries to minimize volume of communication in a distributed
data center application without degrading the rate of information exchange. In addi-
tion to this, we present theoretical limits of such schemes by providing the upper and
lower bounds. Moreover, we prove that, in contrast to a frequent practice in many data
centers, network bisection is not an optimal location for middlebox placement for some
applications. Furthermore, we formulate a spate coding problem which helps in reducing
volume of communication in big data applications. We also present an efficient solution
for spate coding.
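The coding gain that such schemes exploit can be illustrated with a toy two-receiver exchange. This is a generic index-coding-style sketch for intuition only, not the spate coding algorithm developed in Chapter 4, and the packet contents are made up: when each of two receivers already holds the packet the other one needs, a single multicast of the bitwise XOR of the two packets replaces two separate unicasts.

```python
# Toy coding gain: one multicast of a XOR b serves two receivers that
# each hold, as side information, the packet the other receiver wants.
# Packet contents are illustrative; equal packet lengths are assumed.
a = b"SMART-METER-DATA"   # wanted by receiver 1, already held by receiver 2
b = b"BILLING-RECORDSX"   # wanted by receiver 2, already held by receiver 1

def xor(x, y):
    """Bytewise XOR of two equal-length packets."""
    return bytes(p ^ q for p, q in zip(x, y))

coded = xor(a, b)          # the single coded transmission

# Each receiver decodes by XORing the coded packet with its side information.
assert xor(coded, b) == a  # receiver 1 recovers a
assert xor(coded, a) == b  # receiver 2 recovers b
# Volume of communication: 1 coded packet instead of 2 uncoded packets.
```

The same XOR trick generalizes to larger coding groups: each clique of mutually decodable transmissions can be collapsed into one multicast packet, which is where clique-based constructions come in.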
Our major practical contributions include a proof-of-concept implementation in a
data center, and a testbed. We have used Hadoop, the most widely used big data
processing framework, as our target framework. We have used two industry-standard
benchmarks—Terasort and Grep—to evaluate the performance of the proposed system.
The experimental results exhibit coding advantage, for different parameters, compared
to both a vanilla Hadoop implementation and an in-network combiner. In addition, we
have combined our scheme with in-network combiner, referred to as Combine-N-Code, to
further reduce the network traffic.
1.2.3 Distributed Power Procurement
The increasing trend of geographically dispersed data center facilities as well as the rise
of geographically diverse energy pricing over the traditionally shared power networks
is marking a new era of complex power-procurement paradigms. Adding to the com-
plexity are the rising demands for inclusion of policy-related constraints for the power
procurement of “power-hungry” data center enterprises.
This thesis presents a model to capture the emerging constraints for the power pro-
curement problem. We have used matroid theory to gain a better insight into the com-
binatorics associated with the problem. Specifically, we consider power procurement in
the presence of both the policy-driven constraints, set forward by data center operators,
as well as the physical constraints imposed by the evolving smart grid.
We have developed a framework for constrained power procurement (CPP), in which
the problem is broken down into building blocks, effective in relating power procurement
complexity to policy-driven as well as physical system constraints. Specifically, we
introduce the 1-restricted constrained power procurement (1-RCPP) and the T-restricted
constrained power procurement (T-RCPP) problems. Moreover, we present distributed polynomial
time solutions to the 1-RCPP and T-RCPP problems. Our distributed solutions provide
a novel way to assign IDs to the generating sources to facilitate the distributed power
procurement for data center facilities. Furthermore, we show that for less restrictive
physical constraints, the CPP problem is NP-hard to approximate within a ratio of n^{1−ε}
for any constant ε > 0.
1.3 Thesis Outline
This thesis is organized as follows. Chapter 2 provides some preliminaries. In Chapter
3, we dissect the big data enterprise into five planes vital for turning the big data
enterprise greener. Moreover, we provide a detailed overview of the related efforts by academia and
industry. In Chapter 4, we present a novel spate-coding-based scheme that reduces the
volume of communication, and thereby helps make data centers greener. This chapter
also provides a theoretical analysis of schemes that reduce the volume of communication,
and presents a detailed framework for Hadoop MapReduce. We present detailed insights
covering several performance aspects, using a data center setup as well as a testbed
setup. Chapter 5 models the power procurement problem under green-policy-driven as
well as emerging constraints imposed by distributed generation. We provide an equivalent
matroid representation of the constraints, and provide a distributed solution. To
conclude, we present open issues and associated challenges in Chapter 6. This chapter
also synthesizes a layout for a futuristic cross-plane green orchestrator.
Chapter 2
Preliminaries
This chapter introduces some of the tools used later in the thesis. Section 2.1 provides
an overview of graph theory and related definitions. Section 2.2 describes basics related
to matroid theory. Section 2.3 presents an overview of the MapReduce framework in
context of Apache Hadoop.
2.1 Graph Theory
Graph theory is a branch of mathematics concerned with the study of graphs. Graphs
are mathematical structures used to model pairwise relations between objects. The
origin of graph theory can be traced back to 1736, when Leonhard Euler solved the
Königsberg bridge problem. Although graph theory has its origin in recreational math,
it has found applications in many disciplines, including but not limited to, computer
science, network theory, chemistry, social sciences, and operations research [31–33].
A formal definition of a graph is given below.
Definition 1 (Graph) An undirected graph G(V,E) is a non-empty finite set V of ele-
ments called vertices, and a finite set E of unordered pairs of elements (u, v) in V called
edges. If the set E consists of ordered pairs, then the graph is a directed graph.
Figure 2.1: A graph.
We will only consider graphs with a finite number of vertices.
Two vertices u and v in G(V,E) are called adjacent if there is an edge (u, v) ∈ E.
Moreover, a weighted graph is a graph for which each edge has an associated weight,
typically given by a weight function w : E → R [34].
Definition 2 (Bipartite Graph) A bipartite graph G(V,E) is a graph in which V can
be partitioned into two nonempty disjoint subsets V1 and V2, such that ∀ (u, v) ∈ E, u ∈ V1
and v ∈ V2.
Next, we introduce the concept of a clique in a graph and the clique packing problem.
Definition 3 (Clique) Given a graph G(V,E), a clique ϑ is a nonempty set of vertices
ϑ ⊆ V such that ∀ u, v ∈ ϑ, u ≠ v, (u, v) ∈ E.
A clique ϑ is called a k-clique, denoted by ϑk, if |ϑ| = k, where |ϑ| denotes the number
of elements in ϑ.
For example, {a, b, c, d} is a 4-clique (ϑ4) in the graph of Figure 2.1. Any single vertex
is a 1-clique.
The following problem will be important later in this thesis.
Definition 4 (Clique Partition Problem) Given a graph G(V,E), partition V into
the minimum number of disjoint subsets ϑ1, ϑ2, . . . , ϑk, such that each ϑi, 1 ≤ i ≤ k, is a
clique.
Clique Partition is an NP-hard problem [35]. One possible clique partition for the
graph shown in Figure 2.1 results in two cliques ϑ1 = {a, b, c} and ϑ2 = {d, e}. Another
possibility is ϑ1 = {a, b, c, d} and ϑ2 = {e}. Clearly there can be more than one way to
partition a graph into the minimum number of cliques.
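Since Clique Partition is NP-hard, exact solutions are impractical for large graphs, and greedy heuristics are commonly used instead. The sketch below is an illustrative heuristic (not an algorithm from this thesis), run on an adjacency structure consistent with the description of Figure 2.1 (the 4-clique {a, b, c, d} plus the edge (d, e)); the exact edge list is an assumption.

```python
def greedy_clique_partition(vertices, adj):
    """Greedily partition `vertices` into cliques.

    `adj` maps each vertex to the set of its neighbours. The heuristic
    returns *a* clique partition, not necessarily a minimum one.
    """
    remaining = list(vertices)
    cliques = []
    while remaining:
        clique = [remaining.pop(0)]              # seed a new clique
        for v in remaining[:]:
            # v joins only if it is adjacent to every current member
            if all(v in adj[u] for u in clique):
                clique.append(v)
                remaining.remove(v)
        cliques.append(set(clique))
    return cliques

# Adjacency consistent with Figure 2.1 (assumed for illustration).
adj = {
    'a': {'b', 'c', 'd'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b', 'd'},
    'd': {'a', 'b', 'c', 'e'},
    'e': {'d'},
}
cliques = greedy_clique_partition('abcde', adj)
print([sorted(c) for c in cliques])  # [['a', 'b', 'c', 'd'], ['e']]
```

On this instance the greedy pass happens to find a minimum partition into two cliques; in general it may use more cliques than necessary.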
Next, we define a path, and a shortest path in a graph.
Definition 5 (Path) A path P between vertices u1 and ux in G(V,E) is a subgraph
P(V′, E′) such that
V′ = {u1, u2, · · · , ux},
and
E′ = {(u1, u2), (u2, u3), · · · , (ux−1, ux)},
where the ui are all distinct. The length of path P is |E′|. The weight of path P is
Σ_{(u,v)∈E′} w(u, v).
Definition 6 (Shortest Path) A shortest path between vertices u and v is a path of
weight W such that no other path between u and v has weight smaller than W .
For the case where w(u, v) = 1 ∀ (u, v) ∈ E for the graph shown in Figure 2.1, a
shortest path between the vertices a and e is P = (a, d, e), of weight 2.
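Shortest paths in weighted graphs can be computed with Dijkstra's algorithm. The sketch below runs it on a unit-weight edge set consistent with the description of Figure 2.1 (the 4-clique {a, b, c, d} plus the edge (d, e)); the exact edge list is an assumption made for illustration.

```python
import heapq

def dijkstra(adj, source):
    """Return a dict of shortest-path weights from `source`.

    `adj` maps a vertex to a list of (neighbour, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                          # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Unit-weight edges consistent with Figure 2.1 (assumed).
edges = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd'), ('d', 'e')]
adj = {v: [] for v in 'abcde'}
for u, v in edges:
    adj[u].append((v, 1))
    adj[v].append((u, 1))

print(dijkstra(adj, 'a')['e'])  # 2, attained by the path a, d, e
```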
2.2 Matroid Theory
Matroids were first introduced by Whitney in the early 20th century [36]. Matroid the-
ory has applications in many disciplines including linear algebra, graph theory, network
theory, coding theory, and combinatorial optimization.
Definition 7 (Matroid) A matroid M(X,L) is an ordered pair formed by a ground
set X and a collection L of subsets of X, that satisfy the following three conditions:
1. ∅ ∈ L;
2. If Y ∈ L and Y ′ ⊆ Y , then Y ′ ∈ L (hereditary property);
3. If Y1 ∈ L and Y2 ∈ L with |Y1| > |Y2|, then there exists x ∈ Y1 \ Y2 such that
Y2 ∪ {x} ∈ L (augmentation property).
We will only consider matroids where the ground set is finite.
Each Y ∈ L is referred to as an independent set. A maximal independent subset
of X is referred to as a base. Condition 3 implies that all bases of M have the same
cardinality, referred to as the rank ofM. Elements of X can be associated with a weight
function c : X → R. The weight of a subset Y of X is defined as the sum of the weights
of its elements:
w(Y) = Σ_{x∈Y} c(x).
We next provide the definition of vector matroid and partition matroid [37].
Definition 8 (Vector Matroid) Let E be a set of column labels of an m×n matrix A
over a field F, and let L be the set of subsets S of E for which the multiset of columns
labelled by S is a set and is linearly independent in the vector space Fm. Then (E,L) is
a vector matroid.
For example, consider the binary matrix

A =
[ 1 0 0 0 ]
[ 0 1 1 0 ]
[ 0 0 1 1 ],

in which the four columns are labelled, from left to right, as c1, c2, c3, c4. Then the
set of columns {c1, c2, c3} constitutes a maximal independent set, or a base.
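Independence of a set of columns can be checked by Gaussian elimination over the underlying field. Assuming the binary matrix A above is taken over GF(2), the sketch below encodes each column as an integer bitmask (top row as the most significant bit, an illustrative choice) and incrementally builds a reduced basis.

```python
def gf2_independent(columns):
    """Check linear independence over GF(2) of columns given as bitmasks."""
    basis = {}                                   # highest set bit -> basis vector
    for v in columns:
        while v and (v.bit_length() - 1) in basis:
            v ^= basis[v.bit_length() - 1]       # eliminate the leading bit
        if v == 0:
            return False                         # v lies in the span of the others
        basis[v.bit_length() - 1] = v
    return True

# Columns of A: c1 = (1,0,0), c2 = (0,1,0), c3 = (0,1,1), c4 = (0,0,1).
c1, c2, c3, c4 = 0b100, 0b010, 0b011, 0b001

print(gf2_independent([c1, c2, c3]))      # True: {c1, c2, c3} is a base
print(gf2_independent([c1, c2, c3, c4]))  # False: A has rank 3, so all four columns are dependent
```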
Definition 9 (Partition Matroid) Let (E1, E2, · · · , Er) be a partition π of a set E
into nonempty subsets. Then (E,L) is a partition matroid when L is defined as:
L = {X ⊆ E : |X ∩ Ei| ≤ ki ∀ i ∈ {1, 2, · · · , r}},
for given parameters k1, k2, · · · , kr.
Let B consist of those subsets of E that contain exactly ki elements from each Ei.
Then B is the set of bases of the corresponding partition matroid. For example, consider
the set E = {a, b, c, d}, a partition of E into two subsets E1 = {a, b}, E2 = {c, d},
and let k1 = k2 = 1. Then L = {∅, {a}, {b}, {c}, {d}, {a, c}, {a, d}, {b, c}, {b, d}}. One of
the bases of this partition matroid is {a, d}.
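For a ground set this small, the family L can be enumerated directly, which is a convenient way to cross-check the example. The sketch below (an illustrative check, not part of the thesis) reproduces the nine independent sets listed above.

```python
from itertools import combinations

E = ['a', 'b', 'c', 'd']
blocks = [{'a', 'b'}, {'c', 'd'}]   # partition E1, E2 with k1 = k2 = 1

def independent(S):
    """Independence oracle of the partition matroid: at most k_i = 1 per block."""
    return all(len(set(S) & blk) <= 1 for blk in blocks)

# Enumerate L, the family of independent sets.
L = [frozenset(S) for r in range(len(E) + 1)
     for S in combinations(E, r) if independent(S)]

print(len(L))  # 9: the empty set, four singletons, and four cross-block pairs

# Bases are the maximal independent sets: one element from each block.
bases = [S for S in L if len(S) == 2]
print(frozenset({'a', 'd'}) in bases)  # True
```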
Definition 10 (Partial Transversal [37]) Let J = {1, 2, · · · , r} be the indices of the
nonempty subsets (E1, E2, · · · , Er) of E. A transversal of (E1, E2, · · · , Er) is a subset
{e1, e2, · · · , er} of E such that ej ∈ Ej for all j ∈ J, and e1, e2, · · · , er are distinct.
Then, X ⊆ E is a partial transversal of (Ej : j ∈ J) if, for some subset K of J, X is a
transversal of (Ej : j ∈ K).
When the subsets (E1, E2, · · · , Er) form a partition of E, the set of partial
transversals of E coincides with the set of independent sets of the corresponding partition matroid.
2.2.1 Matroid Intersection
Definition 11 (Matroid Intersection) Let M1 = (X,L1), and M2 = (X,L2) be two
matroids over the same ground set X, with families of independent sets L1, and L2. The
set I = L1∩L2 is said to be the intersection of the matroidsM1 andM2. An element of
I is called a common independent set of M1 and M2. An element of I with maximum
cardinality is called a common base of M1 and M2.
Next, we present a slight variation of the weighted common independent set augmenting
algorithm given in [38] for finding a minimum-weight common base (i.e., a common
independent set of maximum cardinality and minimum weight) of two matroids
M1 = (X,L1) and M2 = (X,L2). The algorithm maintains a current common independent
set X̄ ⊆ X, which it repeatedly augments.
1. Initialization: X̄ := ∅.

2. Define a directed graph H with node set X and the following edge set. For
any xi ∈ X̄ and xj ∈ X \ X̄:

• If (X̄ \ {xi}) ∪ {xj} ∈ L1, then (xi, xj) is an edge in H;

• If (X̄ \ {xi}) ∪ {xj} ∈ L2, then (xj, xi) is an edge in H.

3. Define sets:

X1 = {xi ∈ X \ X̄ | X̄ ∪ {xi} ∈ L1}

X2 = {xi ∈ X \ X̄ | X̄ ∪ {xi} ∈ L2}

4. For any node xi ∈ X define its cost l(xi) by:

l(xi) = −c(xi) if xi ∈ X̄

l(xi) = c(xi) if xi ∉ X̄

The cost of a path m in H, denoted by c(m), is equal to the sum of the costs of
the nodes traversed by m.

5. We consider two cases:

Case 1: There exists a directed path m in H from a node in X1 to a node in X2.

• Choose the path m so that c(m) is minimal and m has a minimum number of
edges among all minimum-cost paths from a node in X1 to a node in X2.

• Let the path m traverse the nodes y0, z1, y1, z2, . . . , yg−1, zg, yg of H, in this
order, and set

X̄ := (X̄ \ {z1, . . . , zg}) ∪ {y0, . . . , yg}

• Go to Step 2.

Case 2: There is no directed path in H from a node in X1 to a node in X2.
Then, return X̄.

Note that the set X̄ returned by the algorithm is a maximum-cardinality common
independent set of least weight.
Below we present a simple example illustrating the execution of the matroid intersection
algorithm.

Example 12 Consider two matroids M1 and M2, where M1 is a vector matroid and M2
is a partition matroid over the same ground set X = {a1, a2, a3, a4, a5}. The weights
associated with the elements of the ground set are: c(a1) = 1, c(a2) = 2, c(a3) = 3, c(a4) = 1,
and c(a5) = 2. The following binary matrix A, corresponding to M1, represents element
ai as the ith column vector:

A =
[ 1 0 1 1 1 ]
[ 0 1 1 0 1 ]
[ 0 0 0 1 0 ].

The partition of the ground set for matroid M2 is given by E1 = {a1}, E2 = {a2, a3},
E3 = {a4, a5}, where k1 = k2 = k3 = 1. The algorithm starts with X̄ := ∅ and iteratively
augments it, such that the invariant X̄ ∈ I is maintained. The corresponding graph H for
each iteration is shown in Figure 2.2, whereas the corresponding sets X1, X2, and X̄ for
each iteration are given below.

Figure 2.2: Graph H for iterations 1, 2, and 3 during execution of the matroid intersection
algorithm for finding a least-cost common independent set of maximum cardinality of two
matroids M1 and M2.

First Iteration:
X1 = {a1, a2, a3, a4, a5}
X2 = {a1, a2, a3, a4, a5}
X̄ = {a1}

Second Iteration:
X1 = {a2, a3, a4, a5}
X2 = {a2, a3, a4, a5}
X̄ = {a1, a4}

Third Iteration:
X1 = {a2, a3, a5}
X2 = {a2, a3}
X̄ = {a1, a4, a2}

The cost of this maximum-cardinality common independent set is 4.
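For a ground set this small, the outcome can also be cross-checked by brute force: enumerate all subsets, keep those independent in both matroids, and take a maximum-cardinality one of minimum weight. The sketch below is a verification aid (not the augmenting algorithm itself); it encodes the columns of A as bitmasks over GF(2), an illustrative assumption about the underlying field.

```python
from itertools import combinations

# Columns of A (top row = most significant bit) and element weights from Example 12.
col = {'a1': 0b100, 'a2': 0b010, 'a3': 0b110, 'a4': 0b101, 'a5': 0b110}
cost = {'a1': 1, 'a2': 2, 'a3': 3, 'a4': 1, 'a5': 2}
blocks = [{'a1'}, {'a2', 'a3'}, {'a4', 'a5'}]   # partition, k_i = 1

def gf2_independent(vecs):
    """Linear independence over GF(2) of vectors given as bitmasks."""
    basis = {}
    for v in vecs:
        while v and (v.bit_length() - 1) in basis:
            v ^= basis[v.bit_length() - 1]
        if v == 0:
            return False
        basis[v.bit_length() - 1] = v
    return True

best = None
for r in range(len(col), 0, -1):                # try the largest cardinality first
    for sub in combinations(sorted(col), r):
        if any(len(set(sub) & blk) > 1 for blk in blocks):
            continue                            # not independent in M2
        if not gf2_independent([col[e] for e in sub]):
            continue                            # not independent in M1
        w = sum(cost[e] for e in sub)
        if best is None or w < best[1]:
            best = (sub, w)
    if best:
        break                                   # maximum cardinality reached

print(best)  # (('a1', 'a2', 'a4'), 4), matching the algorithm's result
```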
2.3 Hadoop MapReduce
Apache Hadoop [21], an open-source implementation of the MapReduce framework, is
widely used for processing large amounts of data in parallel on thousands of nodes. It
has been shown to scale out to thousands of nodes and is used across various domains
(from bio-sciences to commercial analytics). More than half of the Fortune 50 companies use
Hadoop [39]. Facebook’s Hadoop cluster holds more than 100 Petabytes of data which
grows at a rate of half a Petabyte every day [40].
Following is an overview of the MapReduce workflow for a job.
• The input data set (file) is split into smaller independent chunks to be processed
in parallel across many nodes.
• Each independent chunk (file split) is processed by a Mapper locally at a node.
Each Mapper generates intermediate 〈key, value〉 pairs.
• The role of a Partitioner is to partition intermediate 〈key, value〉 pairs based on
keys.
• During the Shuffle process, the intermediate data is delivered from the Mappers to
the corresponding Reducers via a network-intensive transfer.
• The Sorter merges and sorts the map outputs before they are presented to the Reducer.
For any particular key, however, the values are not sorted.

• Each Reducer is responsible for aggregating a unique set of keys and processes all
corresponding 〈key, value〉 pairs to generate the output.
• The final output is written back to the disk.
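The workflow above can be sketched in a Hadoop-streaming style, with the mapper and reducer as plain functions over 〈key, value〉 pairs. The snippet below is a minimal single-process simulation of a word count job; the toy input and function names are illustrative assumptions, not Hadoop's API.

```python
from itertools import groupby

def mapper(chunk):
    """Emit an intermediate <key, value> pair for every word in a file split."""
    for word in chunk.split():
        yield (word, 1)

def reducer(key, values):
    """Aggregate all values for a single key."""
    return (key, sum(values))

chunks = ["the cat chases the rat", "the dog chases the cat"]  # two file splits

# Map phase: each chunk is processed independently (in parallel on a cluster).
intermediate = [pair for chunk in chunks for pair in mapper(chunk)]

# Shuffle + sort phase: group the intermediate pairs by key.
intermediate.sort(key=lambda kv: kv[0])
output = [reducer(k, (v for _, v in group))
          for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(dict(output))  # {'cat': 2, 'chases': 2, 'dog': 1, 'rat': 1, 'the': 4}
```

On a real cluster the Partitioner decides which Reducer receives each key, and the grouped pairs travel over the network during the shuffle; here both steps collapse into the in-memory sort and groupby.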
Figure 2.3: A toy word count example demonstrating basic tasks of Hadoop MapReduce.
Figure 2.3 depicts the MapReduce process for a toy example focused on the word
count, where the goal of the job is to count the occurrences (values) for each word (key)
in an input file. In this example, the input file is divided into three chunks, each processed
in parallel by a separate mapper. The mappers' outputs (intermediate 〈key, value〉 pairs)
are further processed (aggregated) by two reducers into a smaller set of 〈key, value〉 pairs.
Precisely, the values for the keys cat, man, and the are aggregated at reducer 1,
whereas the values for the keys chases, dog, and rat are aggregated at reducer 2. The
route from a mapper to a reducer consists of two hops. Assuming each 〈key, value〉 pair
can be transmitted by one packet, there is a total of 30 link-level packet transmissions
for 15 〈key, value〉 pairs exchanged during the shuffle.
2.3.1 Combiner and In-network Combiner
MapReduce jobs are typically constrained by the limited bandwidth between different
nodes, so it is important to optimize the data transfer during shuffle [41]. A combiner
function, applied at the output of mappers, helps to reduce the size of data transfer
during shuffle. For instance, utilizing a combiner in the toy example of Figure 2.3 results
Figure 2.4: A toy word count example demonstrating the use of combiner.
in 24 link-level packet transmissions for 12 〈key, value〉 pairs exchanged during the shuffle
(as shown in Figure 2.4). This results in a 20% reduction in link-level network traffic
compared to the standard Hadoop mechanism.
Previous works [42–44] have proposed using the combiner at the network level to
further reduce network traffic; we refer to this as the in-network combiner. Later,
in Chapter 4, we compare our proposed solution with the in-network combiner.
Figure 2.5 describes the operation of an in-network combiner for the toy example of
Figure 2.3. Utilizing the in-network combiner results in a total of 21 link-level packet
transmissions: 15 link-level packet transmissions to the network switch and 6 link-level
packet transmissions from the network switch to the corresponding reducers (as shown
in Figure 2.5). This results in a 30% reduction in network traffic compared to the standard
Hadoop mechanism. Note that the in-network combiner also outperforms the combiner.
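The savings quoted above follow from simple accounting: each 〈key, value〉 pair costs one packet per hop, and the mapper-to-reducer route has two hops with the switch midway. The snippet below plugs in the pair counts from Figures 2.3–2.5 (15 intermediate pairs, reduced to 12 by the combiner, and to 6 at the switch by the in-network combiner).

```python
HOPS = 2                                # mapper -> switch -> reducer

vanilla    = 15 * HOPS                  # 15 pairs travel both hops
combiner   = 12 * HOPS                  # combiner shrinks mapper output to 12 pairs
in_network = 15 * 1 + 6 * 1             # 15 pairs to the switch, 6 pairs onward

print(vanilla, combiner, in_network)            # 30 24 21
print(round(1 - combiner / vanilla, 2))         # 0.2  -> 20% traffic reduction
print(round(1 - in_network / vanilla, 2))       # 0.3  -> 30% traffic reduction
```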
Figure 2.5: A toy word count example demonstrating the use of in-network combiner.
Chapter 3
Five Vital Planes for Green Big
Data Enterprise
Curtailing resource and energy consumption by data centers is being overlooked despite
rising energy costs. Part of the reason for this oversight is the tremendous growth of the
big data business, which is pushing energy efficiency down the priority list. A recent data
center industry survey by the Uptime Institute points out that 50% of the North American
respondents (the survey included individuals associated with 1,000 different data centers)
did not recognize energy efficiency to be a very important consideration [15].
The key to bringing down the huge electricity consumption by data centers is to fix
the most prominent factors triggering the power appetite. A report by Cisco points out
that almost 56% of the power consumed by a data center is used for the IT loads, followed
by cooling and distribution [46]. Contrary to the best practices guidelines for achieving
energy efficiency, the vast over-provisioning of IT resources has led to unnecessary energy
consumption as well as inefficient and low utilization of IT equipment. Indeed, it has been
widely accepted that the most prominent factors negatively impacting the energy usage
of IT loads are low utilization of servers, limited adoption of virtualization, inefficient
Part of this work is to appear in [45].
Figure 3.1: 6Ps of the big data enterprise.
network interconnects, and lack of appropriate business models [6].
Furthermore, challenges associated with integrating renewable energy solutions for
data centers are a major hurdle for maintaining greener energy portfolios.
In this vein we dissect the big data enterprise into five planes (5Ps) vital to turn
data centers into greener enterprises. These 5Ps are: server plane, virtualization plane,
interconnect plane, framework plane, and power procurement plane, as shown in
Figure 3.1. Appendix A surveys several key approaches to transform the 5Ps into greener
planes.
3.1 Server Plane
According to research done by Google, typical server cluster utilization averages
merely 10% to 50% [47]. Since the computing fabric of any big data processing platform
relies entirely on servers, an efficient server plane is essential for greener big
data processing. The increased workload at the servers bounds their efficiency by the
amount of free shared resources, both locally and globally. Greener optimization that
results in better management of resources is fundamental for an efficient server plane.
3.2 Virtualization Plane
Virtualization is a technique for sharing physical resources among different Virtual Ma-
chines (VMs). A VM is an emulation of a server with its very own operating system and
applications. A physical server hosts different VMs by deploying a virtualization layer
(hypervisor). Virtualization can lead to significant improvements in energy efficiency as
well as improved server utilization. Virtualization is accepted as the most popular power
management approach used by data centers [6, 48]. The computing resources like CPU
and memory required for each VM are assigned dynamically according to the require-
ments of the applications running on that VM. This dynamic resource allocation helps
to save a considerable amount of energy by maximizing the utilization of free computing
resources among VMs.
Furthermore, virtualization is an effective way to provide server consolidation. In
many scenarios an idle server claims more than 40% of the power used by a fully utilized
server. In such scenarios, server consolidation by using virtualized environments results
in energy savings by switching off the servers that are partially used or are nearly idle [49].
In essence, the virtualization plane is meant to be an aid to achieving energy efficiency in
the distributed computing environment. However, the success of the virtualization plane
depends on better server resource utilization and intelligent traffic engineering. Hence,
schemes that reshape network traffic as well as free the resources at the compute nodes,
helping them efficiently cope with multiple virtual instances, are a necessity for a greener
virtualization plane.
3.3 Interconnect Plane
The interconnect plane claims a significant proportion of the total energy consumption
by a data center. Moreover, the energy share of the interconnect plane is predicted to
soar further [26,50–52].
The high volume of communication, elemental to big data processing, is one of the major
causes of the staggering amount of energy consumed by the network components of data
centers. One aspect of bringing energy efficiency to data center networks
is the provision of load-proportional energy consumption in links and switches. In such a
network device, the energy consumption depends on the amount of data flowing through
it [26–30]. Therefore, a greener interconnect plane calls for intelligent communication
schemes that focus on reducing the amount of data flowing through the network devices.
3.4 Frameworks Plane
The core of major big data frameworks is to divide huge data sets into smaller subsets,
process them at separate nodes, communicate the processed information over the network
connecting different nodes (also referred to as the communication phase), and then com-
bine the results in a time-efficient manner. Therefore, the efficiency of the framework plane
and the infrastructure are tightly coupled. Hence, a solution applicable to the general
spirit of big data processing, i.e., processing in parallel across a large number of nodes, that
optimizes communication among servers is crucial for greener framework planes.
3.5 Power Procurement Plane
Efficient power procurement for data center entities comprises several components,
including the procurement process itself, a proper business model, and greener portfolio
management. The power procurement plane is a complex plane with socioeconomic
multi-party interplay; therefore, an important step for reshaping the power procurement
plane is to investigate the basic models that can capture distributed generation in the
presence of preferential green energy choices and the power transmission infrastructure
constraints.
3.6 Connecting the Dots
This chapter dissected the big data enterprise into five pivotal planes from the perspective
of energy efficiency. Chapter 4 presents a solution to optimize the communication in
distributed data center applications. We want to emphasize that although our focus
is reducing the volume of communication, the performance evaluation of the proposed
solution exhibits a performance improvement in all four planes internal to a data center,
i.e., server plane, virtualization plane, interconnect plane, and framework plane. Chapter
5 focuses on optimizing an important aspect related to the power procurement plane.
Chapter 4
A Coding Based Optimization for
Big Data Processing
The rise of the cloud and distributed data-intensive (“big data”) applications puts pres-
sure on data center networks due to the movement of massive volumes of data. Reducing
the volume of communication is pivotal for embracing greener data exchange by efficient
utilization of network resources. This chapter proposes the use of a mixing technique,
spate coding, as a means of dynamically-controlled reduction in the volume of communi-
cation. We introduce motivating real-world use-cases, and present a novel spate coding
algorithm for data center networks. We also analyze the computational complexity of
the general problem of minimizing the volume of communication in a distributed data
center application without degrading the rate of information exchange, and provide the-
oretical limits of such schemes. Moreover, we proceed to bridge the gap between theory
and practice by performing a proof-of-concept implementation of the proposed system
in a real world data center. We use Hadoop MapReduce, the most widely used big data
processing framework, as our target. The experimental results, employing two industry-standard
benchmarks, show the advantage of our proposed system compared to a vanilla
Part of this work appears in [53–55].
Hadoop implementation, the in-network combiner, and Combine-N-Code. The proposed
coding-based scheme shows performance improvements in terms of volume of communication,
goodput, disk utilization, and the number of bits that can be transmitted per Joule of
energy.
4.1 Facing the Challenge: High Volume of Communication and Big Data Processing
Cisco's global cloud index predicts that by 2017, cloud traffic (traffic generated by data
centers as well as other entities in the cloud) will represent 69% of data center traffic [56].
Furthermore, recent business insights into the evolution of data center networks also forecast
unprecedented growth in data center traffic, with 76% of the aggregate traffic never
exiting the data center [56]. The increasing migration of applications to the cloud is
probably the major driver of this trend: even without a fundamental change in the
average communication/computation ratio exhibited by data center workloads, increasing
the compute density per server by means of virtualization leads to a proportional increase
in traffic generated per server. Orthogonally to the effects of virtualization,
the onset of applications crunching and moving large volumes of data—captured by the
market-coined term big data applications—is also foreseen to significantly contribute to
higher network traffic within data centers [57].
The burden of the high volume of communication, elemental to big data processing, is
anchored to network components like switches and routers. With the advent of energy-proportional
computation at servers, the energy quota of network components is predicted
to soar as high as 50% of the total energy consumption of data centers [26]. Data
centers' network-related energy consumption was reported to be around 15.6 billion
kWh in 2008 [7].
Energy proportional network components play a vital role in reducing a data center’s
energy consumption. In energy proportional network components, the energy consump-
tion is proportional to the volume of communication. Reduction in volume of communi-
cation directly corresponds to energy savings in a data center [26–30]. Furthermore, some
approaches conserve data center networks' energy consumption by adaptively choosing
link rates to match the volume of traffic going through a network component. For
instance, an energy saving of approximately 4 Watts per link can be achieved by decreasing
the link rate from 1 Gbps to 100 Mbps [58]. In such adaptive approaches, decreasing the
volume of communication improves the energy profile by favoring lower link rates. Hence,
the opportunity to improve the energy efficiency by reducing volume of communication
serves as a cornerstone in bringing green practices to big data processing.
Even with the provision of energy-efficient network architectures, current data center
network architectures pose a serious challenge towards embracing big data in an efficient
and greener fashion. Specifically, data center networks are still not ready to digest
petabytes of network data traffic crossing the bisection [59, 60]. Processing data at this
speed needs a network with enough bisectional bandwidth where every server (node) can
send data at full speed to every other server (node). This means that network bisection
bandwidth is going to cap the rate at which different servers can communicate with each
other [61]. Unfortunately, the data center networks are optimized for North-South traffic
(connections between servers and clients) rather than East-West traffic (servers commu-
nicating with each other within data center) since most of the software architectures
in pre-cloud era were meant for clients accessing the servers within data centers. How-
ever, with the rise of cloud and distributed architectures like Hadoop MapReduce [21],
Storm [62], and Dryad/DryadLINQ [63,64] that are implemented on the different nodes
across the data centers, applications rely on East-West traffic for most of their
computations. In fact, as a de facto standard, data centers use a tree topology, assuming
that the network bisection (the top layer of the network, which is typically highly
oversubscribed, e.g., 240:1) offers enough bandwidth for the machines at the lower levels
in the network topology [60]. However, most of the distributed big data applications like Hadoop MapReduce
exchange tremendous amounts of data over highly oversubscribed links resulting in high
packet-drop rates [59]. Therefore, unfolding the full potential of big data by greener data
centers calls for efficient utilization of network resources, and new non-intrusive frame-
works that can seamlessly integrate with existing network architectures. Consequently,
such new frameworks are expected to heavily rely on managing network traffic.
The main driving force for the adoption of cloud data centers is rooted in their ability to
deliver services and data at a faster rate, resulting in improved application performance
as well as higher operational efficiencies. In order to achieve the desired performance,
in terms of bringing down the job-completion times of applications and enhancing the
operational characteristics of the applications, most of the research is focused on schedul-
ing computation, jobs, and resources (e.g., see [65–67] and references therein). However,
since more than 50% of the job completion-time of many jobs is consumed in the com-
munication phase [57], it is essential to explore novel means by which the communication
time can be reduced. In this context different approaches have been studied for ame-
liorating the network effects due to communication-intensive workloads by managing
flows [57,68,69], and demand-driven path alterations [70–72]. Bringing greener prospects
in the data center networks are as important as ensuring higher operational efficiency.
In this vein this work focus on presenting an efficient way to manage the network traf-
fic by minimizing its volume which results in improved resource usage and their energy
footprints. Our work is complementary to existing approaches. We argue that gauging
towards the right gains is of fundamental importance. For example, some approaches
related to optimizing big data frameworks focus on job completion time, whereas we
assert that optimizing resource utilization is more valuable when it comes to energy
efficiency. For example, many Hadoop jobs are batch jobs for which time efficiency is not
critical; overnight log analysis to capture trends, for instance, is not time-critical.
In such scenarios, if the perspective of optimization is shifted
from time towards resource utilization, the energy-gains can be improved.
4.2 Contributions
We propose a scheme for optimizing big data processing in a data center,
and present motivating use-cases.
Our major theoretical contributions involve analyzing the computational complexity
of a general scheme that tries to minimize volume of communication in a distributed data
center application without degrading the rate of information exchange. We then present
theoretical limits of such schemes by providing upper and lower bounds. Moreover, we
have proven that in contrast to a frequent practice in many data centers and high speed
networks, the network bisection is not an optimal location for middlebox placement
for some applications. Furthermore, we explore the potential of integrating the novel
concepts and techniques of spate coding into big data processing frameworks. We propose
to exploit a fundamental property common to most big data frameworks, i.e., sharing
of a physical node among several processes, giving rise to side information. This side
information arises because a process's output is available to other processes hosted by
the same node without going through the network. The concept
of spate coding is inspired by the index coding problem [73, 74], though these two are
different problems with different solution spaces. The core idea in spate coding is to
mix (encode) packets, while leveraging side information, with the objective to reduce the
overall volume of communication. In contrast to index coding, spate coding does not
necessarily consider a broadcast environment. We have presented an efficient solution for
spate coding.
Our major practical contributions include a proof-of-concept implementation of the
proposed system in a real-world data center and a testbed. Although our proposed
solution is general, for demonstration purposes we use Hadoop MapReduce
as our target framework. We focus on the shuffle phase of Hadoop, which is known to
be the communication-intensive phase of the application [75]. We evaluate the
proposed approach using two industry-standard benchmarks, TeraSort and Grep.
We compare the proposed approach not only with the vanilla Hadoop implementation,
but also with the in-network combiner (see Section 2.3.1 for background on the
in-network combiner). The performance evaluation parameters include the volume of
communication, goodput, disk utilization, queue size, and the number of bits that can
be transmitted per joule of energy. Moreover, we combine the coding-based scheme with
the in-network combiner into a scheme called Combine-N-Code, which further decreases
the network traffic.
4.3 Related Work
Several studies have been conducted to optimize data transfer during the communication-
intensive phase of big data frameworks. These studies can be classified into three broad
thrusts, namely traffic reduction, traffic engineering, and data center architectures.
Network traffic reduction is an important aspect of network optimization. Redundancy
elimination schemes identify and remove repeated content from network transfers
(e.g., [76–78]) to increase end-to-end application throughput. MapReduce jobs are
typically constrained by the limited bandwidth between different nodes, so it is
fundamentally important to optimize the data transfer during shuffle [41]. A combiner
function, applied at the output of the mappers, helps reduce the size of the data
transferred during shuffle. Recently, it has been proposed to gain further advantage
from the combiner by extending the combining operation into the network core
(see, e.g., Section 2.3.1). The in-network combiner has been applied both at the rack
level [44] and network-wide [42], [43].
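To make the combiner idea concrete, the following is a minimal Python sketch of how map-side pre-aggregation shrinks the number of 〈key, value〉 pairs that must cross the network during shuffle; the function names are ours, not Hadoop's, and the sketch applies only when the reduce operation is commutative and associative, as a count is.

```python
from collections import Counter

def map_phase(lines):
    # Emit a (word, 1) pair for every word, as a word-count mapper would.
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    # Pre-aggregate pairs locally before the shuffle, like a combiner.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

lines = ["a b a", "b b a"]
raw = map_phase(lines)          # 6 pairs would cross the network uncombined
combined = combine(raw)         # only 2 pairs cross the network combined
print(len(raw), len(combined))  # 6 2
```

The combined output carries the same information to the reducer with a third of the pairs, which is exactly the volume reduction the in-network variants extend into the network core.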
The traffic engineering thrust focuses on reshaping network traffic to improve
network performance through scheduling, routing, and prediction-based schemes. For
example, the scheme proposed in [57] focuses on network-state-aware scheduling by
dynamically manipulating flow rates. Moreover, [79, 80] focus on traffic-flow-prediction-based
scheduling to reduce the impact of network transfers and improve job processing.
Furthermore, the scheme in [81] improves bandwidth utilization by breaking bulky
(elephant) flows into mice flows, exploiting unused bandwidth to counter the inherent
problems of equal-cost multipathing.
To accommodate the growing needs of big data applications that call for tremendous
network-level transfers within a data center, many novel data center architectures have
been proposed. In particular, redesigning the network to expand the bandwidth among
servers has steered the design of full-bisection data center networks [60, 82, 83].
Furthermore, network coding has been used to reduce network traffic in two broad
categories of applications: traffic reduction for wireless networks, including sensor
networks, and traffic reduction for repairing failed nodes in distributed storage.
With respect to wireless networks, network coding has been shown to reduce the number
of transmissions by taking advantage of opportunistic listening [84–87]. For instance,
COPE is a wireless routing protocol that uses network coding together with
opportunistic listening to reduce the number of network transmissions [88]. A similar
application is code updates during the early development stage of wireless sensor
networks [89]. With regard to the application of network coding in distributed storage,
where network bandwidth is a critical resource, repairing a failed storage node using
network coding can reduce the network traffic needed for the repair [90–92].
Comparison to Related Work
In contrast to all the related work, which treats data flow as commodity flow
(routing), we utilize the concept of mixing (coding); it has been proven that mixing
at the network nodes can provide the optimal rate of information exchange in many
scenarios where routing cannot [93]. Furthermore, it has been shown that mixing
techniques such as index coding can significantly increase the rate of information
transfer in a network [74, 94, 95]. Still, index coding [73, 74, 94–97] assumes a
broadcast channel, an assumption that is far from reality in a data center. We
therefore propose spate coding to improve the rate of information transfer in a
non-broadcast environment, as is the case in a data center network. We point out that
neither the solution techniques nor the theoretical results for index coding carry
over to spate coding.
Furthermore, the usability of traffic reduction techniques based on an in-network
combiner is limited to jobs whose reduce task is commutative and associative, e.g.,
max, min, and sum. For reduce tasks that do not satisfy these properties, e.g., median,
the use of combiners yields incorrect results [42], [43]. In contrast, our proposed
scheme is not limited by the type of aggregation function and provides a greater
reduction in the volume of communication than the in-network combiner.
Moreover, in contrast to traffic reduction approaches based on redundancy elimination,
where intermediate nodes must participate in the decoding process, in our scheme only
the destination nodes perform the decoding. Furthermore, our scheme provides
instantaneous encoding and decoding of the packets. Additionally, our scheme can work
in conjunction with redundancy elimination schemes.
Moreover, the schemes belonging to the traffic engineering thrust can work in
conjunction with our proposed scheme. We point out that traffic-engineering-based
schemes rely on knowledge of demand profiles, whereas in many scenarios demand
profiles are either unknown or hard to predict. In addition, flow-based prediction
and scheduling suffers from an inherent problem with the bursty traffic patterns
common to big data frameworks, like Hadoop, where flows can be short-lived. In
contrast, our proposed scheme is oblivious to demand profiles, scheduling, and the
underlying routing protocols.
Additionally, our scheme reduces the network traffic without modifying the data
center architecture, and can prove to be a less costly alternative to proposals that
call for a fundamental design shift.
To the best of our knowledge, our work is the first to propose the use of coding in
the communication-intensive phase of big data applications to reduce network traffic,
and it differs from both the wireless scenarios and the repair of failed storage
nodes. More specifically, network coding for wireless scenarios is the index coding
problem, which relies on opportunistic listening; Section 4.5.2 compares spate coding
with index coding and discusses the inapplicability of a solution developed for a
broadcast medium to a non-broadcast medium. The storage node repair problem, on the
other hand, deals with the recovery of a piece of a file stored in coded format,
whereas we focus on the information exchange occurring during the communication phase
of a distributed application, which is not related to recovering lost data.
Furthermore, the problem of repairing a storage node while reducing network traffic
can be transformed into the classical single-source multicast network coding problem,
which is solvable in polynomial time; Section 4.5.2 contrasts it with the spate
coding problem, which is NP-hard.
4.4 Motivating Use-Cases
Before delving into system specification details, we present two motivating use-cases
using real, albeit toy, examples. The examples showcase the use of mixing (coding) in
big data applications, and its potential for reducing the volume of communication.
The first example focuses on Hadoop MapReduce, whereas the second focuses on Storm,
which can be thought of as Hadoop's counterpart for stream processing.
4.4.1 Hadoop MapReduce for Electricity Theft Detection
This example shows the advantage of the proposed mechanism for Grep (Global Regular
Expression) applications. Grep is a fundamental use case for benchmarking big data
frameworks; in fact, it is one of the core use cases taken up by Google to demonstrate
the utility of the MapReduce architecture [23].
Threshold detection, relative to the base load profile, is one of the methods to
detect non-technical losses and electricity theft [98]. This example highlights a
use-case for detecting atypically high and unaccounted-for electricity consumption
in an area by utilizing the Hadoop MapReduce framework. The Hadoop job scans the data
to count the number of times the power consumption was higher than a threshold. The
data used in this example is regenerated and anonymized (for privacy and
confidentiality reasons) from real-world smart meter data records accessed via the
Irish Social Science Data Archive [99], where readings were taken every 30 minutes
from 5000 smart meters over a period of two years.
Each smart meter reading consists of the following three components: the meter ID, a
day and time code specifying the day of the year as well as the time interval of the
day, and the power consumed.
As described in Section 2.3, a MapReduce task usually splits the input into indepen-
dent chunks which are first processed by the mappers placed across different data center
nodes in a completely parallel manner. The outputs of the mappers are then commu-
nicated to the reducers which are also placed across different nodes, during the shuffle
phase, for further processing to complete the job.
We note that in a typical real-world Hadoop cluster a node hosts multiple mappers
and reducers [41]; hence many mappers and reducers share common information.
Namely, a data center node (server) often hosts both a mapper and a reducer, whereby
the intermediate 〈key, value〉 pairs stored by the mapper on local storage are accessible
to the reducer running on the same server without the need to go through the
network. This gives rise to ample opportunity to code the packets exchanged during
the shuffle phase. Placement of mappers and reducers on a node is handled by the
Figure 4.1: Topology and mapper/reducer output/input in the electricity theft detection example.
tasktracker. A tasktracker is configured with a fixed number of slots for both mappers
and reducers; by default, it runs multiple mappers and reducers on a node
simultaneously [41]. Under default settings, a tasktracker has two slots for mappers
and two slots for reducers; however, the maximum number of mappers and reducers per
node can be controlled by setting mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, respectively [41]. As a general practice, the
number of mappers and reducers is set equal to the number of cores in a node.
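For instance, a mapred-site.xml fragment for classic (MRv1) Hadoop raising both limits to four slots, a hypothetical value chosen to match a quad-core node, might look as follows:

```xml
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- at most four concurrent mappers per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value> <!-- at most four concurrent reducers per node -->
  </property>
</configuration>
```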
4.4.1.1 Hadoop Job
We consider a four-node Hadoop cluster in a standard data center network architecture
as shown in Figure 4.1. The objective is to count the total number of instances where
the smart meter with ID 1400 reported a power consumption exceeding a baseline of 0.5
for the four day and time codes considered in this example: 11518 (April 25th,
30-minute interval starting at 8:30 am), 35010 (December 16th, 30-minute interval
starting at 4:30 am), 00120 (January 1st, 30-minute interval starting at 9:30 am),
and 20513 (July 24th, 30-minute interval starting at 6:00 am).
The input log of smart meter records is divided by Hadoop into four splits (split1, ..., split4).
A split is stored on one of the four DataNodes (node1, ..., node4) subscribed to the
Hadoop Distributed File System (HDFS), where we assume that spliti is stored on nodei.
Each mapperi residing on nodei parses file spliti and emits the corresponding 〈key, value〉
pairs, where a key is the smart meter ID along with the associated day and time code,
and a value is the number of times the power consumption exceeds 0.5. In this job the
role of the map function is only to extract the parts of the smart meter data log
containing information about meter ID 1400 for one of the four targeted times and dates.
Before being stored, intermediate 〈key, value〉 pairs output by each mapper are parti-
tioned such that each reducer receives (fetches) pairs corresponding to only one key. The
reduce function processes the 〈key, value〉 pairs emitted by the mappers and counts the
total frequency with which the power consumption exceeds the threshold; for this, four
reducer instances are created, one at each of the four DataNodes (node1, ..., node4).
Due to the partitioning strategy, each reducer (reduceri on nodei) processes the
〈key, value〉 pairs emitted by the mappers and counts the total frequency corresponding
to a single key, as shown in Figure 4.1 (e.g., reducer2 residing on node2 is
responsible for counting the 〈key, value〉 pairs with all keys equal to "1400 35010").
During the shuffle phase the mappers' outputs (P1, ..., P12 as shown in Figure 4.1)
are delivered to the corresponding reducers; e.g., the mappers' outputs with the key
1400 11518, namely P4, P8, and P12, are delivered to reducer1. Without loss of
generality, we assume that a 〈key, value〉 pair is communicated by a mapper to a reducer
through a single packet transmission. Specifically, for our example,
Pi = (1400 20513, 1) for i = 3, 5, 7; Pj = (1400 00120, 1) for j = 2, 6, 10;
Pk = (1400 35010, 1) for k = 1, 9, 11; and Pn = (1400 11518, 1) for n = 4, 8, 12.
Packets P4, P8, and P12 are to be delivered to reducer1. Packets P1, P9, and P11 are
to be delivered to reducer2. Packets P2, P6, and P10 are to be delivered to reducer3.
Packets P3, P5, and P7 are to be delivered to reducer4.
4.4.1.2 Standard Mechanism
Using the standard Hadoop mechanism, it is easy to see that a total of 10 link-level
packet transmissions are required per reducer to fetch all of its desired 〈key, value〉
pairs from the respective mappers. For example, reducer1 residing on node1 is
responsible for counting the 〈key, value〉 pairs with all keys equal to "1400 11518";
this results in 10 link-level packet transmissions, calculated as follows:
• Two link-level packet transmissions to fetch packet P4 from mapper2 residing on
node2,
• Four link-level packet transmissions to fetch packet P8 from mapper3 residing on
node3,
• Four link-level packet transmissions to fetch packet P12 from mapper4 residing on
node4.
It then follows by symmetry that 40 link-level packet transmissions are required
to complete the shuffle phase. Note that a total of 16 link-level packet transmissions
cross the network bisection (AS − S1, AS − S2) during the shuffle phase.
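These counts can be reproduced mechanically. The following Python sketch hard-codes the tree topology of Figure 4.1, routes every remote 〈key, value〉 pair along its unique tree path, and tallies both the total link-level transmissions and the bisection crossings; the helper names are ours.

```python
from itertools import product

# Tree topology of Figure 4.1: two L2 edge switches under one aggregate switch.
links = [("node1", "S1"), ("node2", "S1"),
         ("node3", "S2"), ("node4", "S2"),
         ("S1", "AS"), ("S2", "AS")]
adj = {}
for a, b in links:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

def path(src, dst):
    # Breadth-first search; in a tree the path found is the unique one.
    frontier, seen = [[src]], {src}
    while frontier:
        p = frontier.pop(0)
        if p[-1] == dst:
            return list(zip(p, p[1:]))  # the links traversed
        for nxt in adj[p[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(p + [nxt])

bisection = {frozenset(("AS", "S1")), frozenset(("AS", "S2"))}
total = crossings = 0
nodes = ["node1", "node2", "node3", "node4"]
# Each reducer fetches exactly one packet from each of the other three mappers.
for src, dst in product(nodes, nodes):
    if src == dst:
        continue  # the local 〈key, value〉 pair never enters the network
    hops = path(src, dst)
    total += len(hops)
    crossings += sum(frozenset(h) in bisection for h in hops)
print(total, crossings)  # 40 16
```

Same-switch transfers cost two hops and cross-aggregate transfers cost four, which is exactly the 2 + 4 + 4 breakdown listed above.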
4.4.1.3 Proposed Coding Based Mechanism
We note here that a reducer residing on the same node as a mapper has access to that
mapper's output without going through the network; for instance, the output of mapper1 is
locally available to reducer1. We call this locally stored and accessible information the
side information available to a reducer; e.g., the side information of reducer1 consists
of P1, P2, and P3.
Leveraging this side information, we propose coding the packets, where coding refers
to applying a simple XOR (⊕) function to a set of input packets. In this example, the
coding is employed at the bisection, using a middlebox attached to the L2 aggregate
switch AS as shown in Figure 4.1. The coding results in only two packets,
P2 ⊕ P5 ⊕ P8 ⊕ P11 and P3 ⊕ P6 ⊕ P9 ⊕ P12, crossing the bisection. Each reducer can
use its side information to decode its required packets from the coded packet it
receives, as explained below. For example, when reducer1 receives P2 ⊕ P5 ⊕ P8 ⊕ P11,
it can decode its required packet P8 by XORing the received packet with the packets it
already has (P1, P2, and P3), i.e., P8 = (P2 ⊕ P5 ⊕ P8 ⊕ P11) ⊕ P1 ⊕ P2 ⊕ P3. This
follows from the fact that although P5 = (1400 20513, 1) and P3 = (1400 20513, 1) are
generated by different mappers, they contain the same information and hence
P5 ⊕ P3 = 0. Similarly, P1 ⊕ P11 = 0 since P1 = (1400 35010, 1) and P11 = (1400 35010, 1).
To explain the decoding process in an intuitive way, we introduce ≡ to denote equality
among packets: two different packets are equal if they convey the same information.
For example, although packets P4, P8, and P12 are different packets, they carry the
same information (1400 11518, 1), so we represent them by an equal packet A, i.e.,
P4 ≡ P8 ≡ P12 := A. Similarly, P1 ≡ P9 ≡ P11 := B, P2 ≡ P6 ≡ P10 := C, and
P3 ≡ P5 ≡ P7 := D. Let us explore how reducer1 decodes the packet it requires from
mapper3. We start with the coded packet P2 ⊕ P5 ⊕ P8 ⊕ P11 received by reducer1:

P2 ⊕ P5 ⊕ P8 ⊕ P11 ≡ C ⊕ D ⊕ A ⊕ B

Then, utilizing the side information available at reducer1:

(P2 ⊕ P5 ⊕ P8 ⊕ P11) ⊕ P1 ⊕ P2 ⊕ P3 ≡ C ⊕ D ⊕ A ⊕ B ⊕ B ⊕ C ⊕ D

= A ≡ P8 = (1400 11518, 1),

i.e., the packet required by reducer1 from mapper3.
Counting the packet exchange occurring via the access switches together with the
transmission of the packets input to the coding function to the point of coding (the
aggregation switch in our example), we find that in this case a total of 36 link-level
packet transmissions are required to complete the shuffle phase. More specifically:

• Each of the four reducers fetches one record from the DataNode one switch away,
connected via the L2 switch S1 (S2), incurring 2 link-level packet transmissions
per reducer and a total of 8 link-level packet transmissions for all the reducers;

• 16 link-level packet transmissions, four from each of the four DataNodes to the
L2 aggregate switch (AS);

• A total of 12 link-level packet transmissions to deliver the two coded packets from
the L2 aggregate switch (AS) to all the DataNodes.
Note that with coding a total of 12 link-level packet transmissions cross the
network bisection during the shuffle phase, i.e., a 25% reduction in network bisection
traffic compared to the baseline Hadoop implementation. The use of identical values for
each distinct key generated by a mapper in this example, picked deliberately to ease
presentation, favours the efficiency of coding, but obviously may not hold for
production Hadoop computations. We generalize the concept of the coding-based shuffle
beyond this simplifying assumption in Section 4.8.1.
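The XOR algebra of the decoding step can be checked mechanically. A minimal Python sketch, using byte strings for the four distinct payloads A, B, C, and D (the exact packet format is immaterial to the XOR identities):

```python
def xor(*packets):
    # Bitwise XOR of equal-length packets, byte by byte.
    out = bytes(len(packets[0]))
    for p in packets:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

A = b"1400 11518,1"   # payload of P4, P8, P12
B = b"1400 35010,1"   # payload of P1, P9, P11
C = b"1400 00120,1"   # payload of P2, P6, P10
D = b"1400 20513,1"   # payload of P3, P5, P7

# The coded packet crossing the bisection: P2 xor P5 xor P8 xor P11.
coded = xor(C, D, A, B)
# reducer1 XORs in its side information P1, P2, P3 (payloads B, C, D) ...
decoded = xor(coded, B, C, D)
# ... and recovers P8's payload, the pair it needed from mapper3.
print(decoded == A)  # True
```

Because every payload cancels itself (X ⊕ X = 0), the three side-information payloads strip B, C, and D out of the coded packet, leaving exactly A.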
4.4.2 Storm for DNA Sequencing
This example introduces the concept of partial coding, and shows that coding
generalizes to various 〈key, value〉 pair patterns and is oblivious to their semantics.
Here we focus on Storm (an event processor) [62], a distributed real-time computation
framework for processing unbounded data streams. Apache Storm is an emerging big
data platform used by Taobao, Ooyala, Infochimps, the Weather Channel, and Groupon.
The core abstraction in Storm is the stream, a continuous sequence of tuples.
The basic components of Storm for performing stream transformations are spouts
and bolts. A spout can be considered a source of streams, whereas a bolt processes
a number of input streams and emits the transformed streams. Spouts and bolts are
connected through a topology, a connectivity graph, which is an abstraction of how the
stream transformation takes place. In fact, Storm clusters and Hadoop clusters are
very much alike, except that Hadoop works on a MapReduce job while Storm works on
topologies of spouts and bolts. Spout and bolt tasks are spread across the Storm
cluster. A Storm topology represents an arbitrarily complex multi-stage stream
computation [100]. Storm is used along with Cassandra [101] as an external database
in order to preserve the bolts' state in the event of a failure.
To demonstrate the versatility of our proposed solution, we consider a network
architecture different from the one considered in Example 4.4.1. More specifically, we
consider a network architecture that features split multi-link trunking (IEEE
802.3ad) [102], NIC (network interface controller) teaming with adaptive load
balancing (ALB) [103], and load balancing on servers where different virtual machines
are assigned to different NICs.

Figure 4.2: Twelve records emitted by spout for the DNA-sequence processing job.
4.4.2.1 Storm Job
The Storm job considered is a DNA-sequencing application processing samples of short
DNA-sequence records. DNA sequencing is fundamental to cancer treatment and is one of
the major big data applications in the health sector [104]. Each DNA-sequence record
consists of two parts: a short DNA sequence (the value) and a unique ID associated
with this sequence (the key). The objective of this job is to cluster short DNA
sequences based on the second digit of their unique ID. For demonstration purposes, we
assume that the input data for this DNA-sequence processing job contains only the
twelve records given in Figure 4.2.
Figure 4.3 shows the Storm topology for the DNA-sequencing job. The topology
consists of one spout and two bolts, bolt A and bolt B, where each bolt has four tasks.
The output from the spout is fed to the four tasks of bolt A. Bolt A emits modified
streams through NIC eth0. These modified streams are fed, via NIC eth1, to bolt B,
Figure 4.3: Storm topology for DNA sequencing example.
Figure 4.4: Placement of tasks of bolt A and bolt B in the DNA sequencing example on a four-node Storm cluster.
which is also running four tasks. The tasks of bolt A and bolt B run on the four
DataNodes (node1, ..., node4), subscribed to Cassandra, as shown in Figure 4.4; taskki
represents the task of bolt k running on nodei. Furthermore, the stream grouping for
bolt A is shuffle grouping [105], which distributes the tuples to its tasks in a random
fashion. The stream grouping for bolt B is fields grouping based on the second digit of
the key, which ensures clustering of the short DNA sequences based on the second digit
of their unique ID. Specifically, the tuples from output streams of bolt A whose key
has second digit i will be delivered to taskBi+1 (e.g., the output of bolt A where the
second digit of the key is 0 will be delivered to taskB1). Without loss of generality,
we assume that a 〈key, value〉 pair is transmitted from bolt A to bolt B through a single
packet transmission, and that the spout's output is such that P1, ..., P3 are received
by taskA1, P4, ..., P6 by taskA2, P7, ..., P9 by taskA3, and P10, ..., P12 by taskA4.
4.4.2.2 Standard Mechanism
We proceed to analyze this scenario from the perspective of the standard Storm
mechanism. Assuming each link has the same capacity, it is easy to see that the links
on the path from switch S1 to S2, i.e., S1 − AS − S2, are the bottleneck links.
Calculating the total link utilization incurred by a taskAi to communicate the
transformed streams to the corresponding tasks of bolt B, we find that a total of 12
link-level packet transmissions are required using the standard Storm mechanism. It
then follows by symmetry that 48 link-level packet transmissions are required to
complete the transfer from all tasks of bolt A to the corresponding tasks of bolt B.
Note that a total of 24 link-level packet transmissions cross the network bisection.
4.4.2.3 Proposed Coding Based Mechanism
Observing the output tuples from the tasks of bolt A, it is clear that, in contrast
to Example 1, coding on the basis of an entire 〈key, value〉 pair offers no advantage
in this example. This follows from the fact that each of P1, ..., P12 has a unique
key; hence no task of bolt B has sufficient side information to decode a subset of
entire packets coded together. Therefore, we exploit the concept of partial (a.k.a.
fractional) coding. At S1, the L2 switch of the toy network topology in Figure 4.4,
the coding is performed only on portions of the packets (the second digit of the key
and the complete value). More specifically, the coded packets are: Ω1 ⊕ Ω4 ⊕ Ω7 ⊕ Ω10,
Ω2 ⊕ Ω5 ⊕ Ω8 ⊕ Ω11, and Ω3 ⊕ Ω6 ⊕ Ω9 ⊕ Ω12, where Ωi represents the part of Pi
comprising the second digit of the key and the complete value; for example, Ω1 is
highlighted in a block in Figure 4.2. Each of these coded data units is then combined,
by concatenation, with its corresponding raw (i.e., not coded) portions to form a
partially coded packet. Further details on the format of a partially coded packet are
given in Section 4.8.1.
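The mechanics of partial coding can be sketched in a few lines of Python. The record layout below is hypothetical (a one-byte first key digit followed by a one-byte second key digit and a four-byte DNA fragment), and the helper names are ours; the sketch shows only how the raw portions are concatenated with the XOR of the codable portions, and how a receiver holding all but one codable portion as side information recovers the missing one.

```python
def xor_bytes(parts):
    # Bitwise XOR of equal-length byte strings.
    out = bytes(len(parts[0]))
    for p in parts:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

def partial_encode(packets, split):
    # Keep the first `split` bytes of every packet raw (here: the first key
    # digit) and XOR only the remainders (second key digit plus value).
    raw = b"".join(p[:split] for p in packets)
    coded = xor_bytes([p[split:] for p in packets])
    return raw + coded

def partial_decode(partially_coded, split, n, side_info):
    # A receiver holding the codable portions of all but one packet XORs
    # them out of the coded tail to recover the missing portion.
    coded_tail = partially_coded[split * n:]
    return xor_bytes([coded_tail] + side_info)

# Hypothetical fixed-width records standing in for P1, P4, P7, P10.
p1, p4, p7, p10 = b"10ACGT", b"41TGCA", b"72GATC", b"03CTAG"
packet = partial_encode([p1, p4, p7, p10], split=1)

# With the codable parts of p1, p4, p7 as side information, the codable
# part of p10 is recovered from the coded tail.
missing = partial_decode(packet, 1, 4, [p1[1:], p4[1:], p7[1:]])
print(missing == p10[1:])  # True
```

The partially coded packet carries four raw bytes plus one coded tail instead of four full tails, which is the source of the bisection savings computed below.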
Following arguments similar to those used in Section 4.4.1, it can be shown that each
task of bolt B can decode its required 〈key, value〉 pairs. It is interesting to note
that in this scenario, where coding is performed only on a part of each packet,
excluding the first digit of the keys, the coding-based scheme results in
significantly, namely 75%, lower utilization of the network bisection links, while
maintaining the same amount of information transfer as without coding. Depending on
the use-case, this translates to 75% more Storm jobs running simultaneously, or to a
75% decrease in job completion time if the links AS − S1 and AS − S2 are congested.
4.5 Spate Coding Problem
4.5.1 Problem Formulation
This section formalizes the concept of coding by defining the spate coding problem.
While spate coding has flavours similar to network coding and index coding,
Section 4.5.2 highlights the differences.
Definition 13 (Spate coding problem) An instance of the spate coding problem is
defined by a coding server s, a set of clients C = {c1, . . . , cm}, a set of
equal-length packets P = {p1, . . . , pn}, and a network topology specified by a tree
H(V, E) rooted at s. A certain subset of packets, known as its Want set Wi ⊆ P, is to
be delivered to each ci via s. Each ci has access to some side information known as
its Has set Hi ⊆ P. Let (X1, X, X2) be a partition of V; then X1 = {s}, the vertices
in X = {v1, . . . , vk} host clients, and the vertices in X2 are dumb vertices, which
can only forward or route packets. The coding server can transmit and route the
packets in P, or their combinations (packets coded over a finite field), over the
edges (s, vi) ∈ E, where vi ∈ X ∪ X2. The goal is to find a transmission scheme that
requires the minimum number of link-level packet transmissions to satisfy the requests
of all clients¹.
In short, if ρe denotes the set of link-level packets traversing e ∈ E, and
U = ⊎_{e∈E} ρe², then the objective is to find a mapping ζ : P → U, subject to packet
routing constraints, that minimizes |U|, such that for each ci there exists a decoding
function ψi with ψi(ui, Hi) = Wi, where ui represents the packets received by ci;
i.e., each ci should be able to recover its Want set from the packets received and the
side information available to it.
We remark that two different packets might carry the same payload, and the coding
server exploits this to reduce the number of link-level transmissions; we elaborate on
this below in the context of the example of Section 4.4.1. Moreover, for
¹A packet traversing an edge e ∈ E corresponds to one link-level packet transmission.
²⊎ denotes disjoint union, where ⊎_i ρ_i := ⋃_i {(ϱ, i) : ϱ ∈ ρ_i}.
Figure 4.5: An instance of spate coding for Example 4.4.1.
practical implementation scenarios a client might need some additional information,
e.g., a coding vector in the form of a header, to decode the received coded packet.
We discuss this in Section 4.8.1.2.
The example in Section 4.4.1 can be represented as an instance of the spate coding
problem, where P = {P1, P2, ..., P12}, X1 = {AS}, X2 = {S1, S2}, and
X = {node1, node2, node3, node4}. The coding server is co-located with the L2
aggregate switch AS. There is a set of four clients
C = {reducer1, reducer2, reducer3, reducer4}, where ci = reduceri is hosted on nodei.
The Want set of c1 is W1 = {P8, P12}. The side information available to c1 is given by
its Has set H1 = {P1, P2, P3}, and similarly for the rest of the clients, as shown in
Figure 4.5. Note that although reducer1 needs P4, it is not included in W1 because
reducer1 does not fetch it via AS; rather, it is fetched via S1 from mapper2.
Furthermore, u1 = u2 = {P8 ⊕ P11, P9 ⊕ P12}, and u3 = u4 = {P2 ⊕ P5, P3 ⊕ P6}. The
total number of link-level packets transmitted, |U|, is 12, where each of the four
packets transmitted by the coding server results in three link-level transmissions.
For instance, packet P8 ⊕ P11 traverses the three edges (AS, S1), (S1, node1), and
(S1, node2). Furthermore, ρ(AS,S1) = {P8 ⊕ P11, P9 ⊕ P12}. Note that P4, P8, and P12,
though different packets, carry the same payload; the same is true for the following
sets of packets: {P1, P9, P11}, {P2, P6, P10}, and {P3, P5, P7}. Furthermore, to
extract packet P8 from P8 ⊕ P11 by computing (P8 ⊕ P11) ⊕ P1, reducer1 needs
information about the equivalence of packets P11 and P1. The mapping ζ can be computed
using Algorithm 1, whereas the decoding function ψi is specified by Algorithm 2. We
present the proposed solution to the spate coding problem in Section 4.5.3.
4.5.2 Spate Coding and its Relationship with Network Coding
and Index Coding
In this section, we describe the spate coding problem in the context of the
traditional network coding and index coding problems. We begin by noting that,
although all these coding problems exploit mixing of packets, each targets a different
environment, which affects the problem characterization and the solution space. The
index coding problem was first introduced in 1998 [94] and network coding in
2000 [106], yet it was not until 2015 [107, 108] that a first comprehensive
relationship was established between the two, though some previous work focused on
establishing the relationship for specific scenarios (e.g., [109]). We emphasize that
the spate coding problem is different from both the standard network coding and index
coding problems [93–97, 110–112].
We first highlight the relationship between the index coding problem and the spate
coding problem. Specifically, every instance of the index coding problem can be translated
to a corresponding instance of the spate coding problem, as given by Theorem 14.
Theorem 14 For each instance of the index coding problem there exists a corresponding
instance of the spate coding problem.
The proof of Theorem 14 is given in Appendix B.1.
Next, we briefly describe the relationship between the network coding problem and the
spate coding problem. Theorem 15 shows that every instance of the network coding
problem can be translated to a corresponding instance of the spate coding problem.
Theorem 15 For each instance of the network coding problem there exists a corresponding
instance of the spate coding problem.
The proof of Theorem 15 is given in Appendix B.2.
We point out that it has already been shown that the network coding and index
coding problems are equivalent [108]; however, it is still unknown whether network
coding/index coding can subsume spate coding as a special case.
4.5.2.1 Discussion
The index coding problem is closely related to the spate coding problem by virtue of
incorporating side information. However, the index coding problem is defined for networks
communicating over a broadcast channel, whereas the spate coding problem addresses
networks communicating over a non-broadcast channel. Spate coding thus extends and
generalizes the index coding problem to non-broadcast environments. To highlight the
difference that a non-broadcast environment can make, we present a simple example below
showing that, for similar instances, the solution to the index coding problem can be
counterproductive in the non-broadcast environments for which spate coding has been
developed. This example also highlights the inability of existing index coding solutions to
capture spate coding scenarios.
Example 16 Consider an instance of the spate coding problem as shown in Figure 4.6(a).
The similar instance of the index coding problem is shown in Figure 4.6(b). The similarity
of the instances is in terms of the number of clients and the same Want and Has sets
associated with each client. It can be verified that, with traditional routing, 16 link-level
packet transmissions are required to complete the exchange task in Figure 4.6(a). In
contrast, even if we use the optimal solution for the similar instance of the index coding
problem shown in Figure 4.6(b) to complete the exchange task in Figure 4.6(a), 26 link-level
packet transmissions are required in total (eight link-level packet transmissions for the
uplink plus eighteen for the downlink). Therefore, the transition from a broadcast medium
to a non-broadcast environment needs a new problem framework and modelling.
Moreover, current solutions to the network coding problem do not address the spate coding
problem. Most of the work in the network coding domain is focused on multicast scenarios,
where network coding is a polynomial-time problem; accordingly, the proposed solutions are
of polynomial-time complexity. Keeping in mind that spate coding is an NP-hard problem,
such solutions and algorithms cannot solve the spate coding problem unless P = NP.
Furthermore, an invariant necessary for the correctness of network coding solutions is the
ability to perform coding at nodes of degree two or more in the information flow graph,
whereas in spate coding only one node can perform coding, violating this invariant.
Moreover, we propose to exploit two fundamental properties common to most distributed
big data frameworks. First, the sharing of a physical node among several processes gives
rise to side information, i.e., the availability of a process's output to other processes hosted
by the same node without going through the network. Second, data exchange among
processes distributed across different physical nodes to complete the desired task gives rise
to the opportunity of mixing different flows to optimize east-west traffic. None of the
existing solutions to the network coding problem considers the side information available
at the receivers while performing the coding operation at network nodes, and hence they
cannot be used to solve the problem setup presented in this work.
Furthermore, although we initially formulated the spate coding problem while targeting
distributed application scenarios, the proposed scheme is applicable to any setup where
the nodes contributing to a data exchange share some common information.
Figure 4.6: (a) An instance of the spate coding problem. (b) A similar instance of the index coding problem.
4.5.3 Solution for Spate Coding Problem
This section presents a solution for the spate coding problem. For efficiency, the only
coding operation the server is permitted to perform is ⊕, i.e., operations over GF(2).
We start by defining the dependency graph to capture information shared between
the clients (nodes).
Definition 17 (Dependency Graph) Without loss of generality, and for ease of presentation,
we assume each client ci requires just one packet. In case a client χ requires
|W| > 1 packets, we replace it by |W| clients, ci,1, · · · , ci,|W|, each having one of the |W|
packets as its Want set and each having the same Has set as that of χ.
Given an instance of the spate coding problem, we define a dependency graph G(V,E)
with vertex set V and edge set E as follows:
• For each client ci there is a vertex vi in G(V,E);
• For any two clients ci and cj not residing on the same node (server), there is
a directed edge in G(V,E) from vi to vj if and only if for Pi ∈ Wi, ∃ P ∈ Hj : P ≡ Pi,
where ≡ represents equality as described in Section 4.4.1; in other words, an edge is
directed from vi to vj whenever cj has a packet equal to the one wanted by ci.
Algorithm 1 ComputeCode
Input: An instance of spate coding problem
1: X′ := φ
2: while X ≠ φ AND X ≠ X′ do
3:   For each set of clients X = {c1, c2, · · · , cy} such that W1 ≡ W2 ≡ · · · ≡ Wy, X′ := X \ X \ c1
4:   From the given instance of spate coding problem for the clients in X′, construct the dependency graph G(V,E);
5:   for k = λ down to 2 do
6:     while ∃ a clique ϑ in G(V,E) of size k do
7:       Divide the clique into x smaller cliques θi, each belonging to a different subtree in the multicast topology.
8:       for i = 1 to x do
9:         Multicast a packet that satisfies all clients corresponding to the vertices in θi;
10:        V := V \ θi;
11:        Delete the clients corresponding to the vertices in θi from the client set X
12:      end for
13:    end while
14:  end for
15: end while
16: if X ≠ φ then
17:   Send the uncoded packets corresponding to the clients in X
18: end if
Figure 4.7 shows the dependency graph for the Example in Section 4.4.1, where each
client ci is split into two vertices vi,a and vi,b as |Wi| = 2.
Lemma 18 A clique of size n in the dependency graph represents a group of n clients,
whose requests can be satisfied with only one coded packet.
Proof: Follows from the fact that all the clients represented by a clique in the
dependency graph can be satisfied by one transmission, which consists of an ⊕ of all the
packets in their Want sets.
Figure 4.7: Dependency graph for the transmissions crossing AS for the Example given in Section 4.4.1. The execution of Algorithm 1 results in four multicast packets, shown by four cliques each of size 2.
The proposed Algorithm 1, named ComputeCode, greedily packs as many cliques
as possible, starting from the largest cliques of size λ. Each clique corresponds to only
one multicast packet; therefore, packing larger cliques first heuristically
ensures more savings in terms of the number of transmissions. However, finding all cliques
can become computationally prohibitive; we therefore focus on finding all the cliques
of size λ ≤ 4 to allow close-to-realtime execution of the coding algorithm. Increasing
λ could possibly increase the coding benefit. However, in addition to this time-versus-coding-benefit
tradeoff, another important factor to consider while selecting λ is the maximum
number of leaf nodes in the subtree rooted at the middlebox. λ should not exceed the
maximum number of leaf nodes in the subtree, since the maximum coding advantage is
achieved when all the packets heading to the leaf nodes are coded together
as one packet.
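A minimal sketch of the dependency-graph and clique-packing steps is given below. It is a simplified rendering of Algorithm 1 that omits Want-set de-duplication and the multicast-subtree split; the client data and helper names are our own, chosen for illustration:

```python
from itertools import combinations

# Illustrative client set: each client wants exactly one packet
# (per Definition 17) and holds a Has set; node names are invented.
clients = {
    "c1": {"node": "n1", "want": "P2", "has": {"P1", "P5"}},
    "c2": {"node": "n1", "want": "P5", "has": {"P2", "P6"}},
    "c3": {"node": "n2", "want": "P1", "has": {"P2", "P5"}},
}

def has_edge(ci, cj):
    """Directed edge vi -> vj: cj (on a different node) holds ci's wanted packet."""
    return (clients[ci]["node"] != clients[cj]["node"]
            and clients[ci]["want"] in clients[cj]["has"])

def is_clique(group):
    """All pairs mutually connected: one XOR packet satisfies the whole group."""
    return all(has_edge(a, b) and has_edge(b, a) for a, b in combinations(group, 2))

def pack_cliques(names, lam=4):
    """Greedily pack cliques from size lam down to 2; leftovers go uncoded."""
    remaining, transmissions = set(names), []
    for k in range(lam, 1, -1):
        found = True
        while found:
            found = False
            for group in combinations(sorted(remaining), k):
                if is_clique(group):
                    transmissions.append(set(group))  # one multicast coded packet
                    remaining -= set(group)
                    found = True
                    break
    transmissions += [{c} for c in sorted(remaining)]  # uncoded packets
    return transmissions

print(pack_cliques(clients))
```

Here c1 and c2 share node n1, so no edge connects them; c1 and c3 mutually hold each other's wanted packets and form a size-2 clique served by one coded packet, while c2's packet is sent uncoded.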
The computational complexity of our clique packing algorithm is O(|V|^4). In general,
extending the same algorithm to find all cliques of size λ and smaller results in O(|V|^λ)
computational complexity. We note that the computational complexity of clique finding
can be significantly improved to O(κ^(λ−2)|E|) using algorithms similar to [113], where κ
is the arboricity of the dependency graph. For graphs with small κ, e.g., planar graphs
with κ = 3, clique finding is a very efficient operation, O(|E|).
To reduce the overhead associated with each multicast packet, we exploit the relationship
between a multicast packet, its corresponding multicast tree, and its corresponding clique.
We note that each sub-clique, a subset of the vertices and corresponding edges of a clique,
encodes fewer packets than the original clique. Moreover, a sub-clique corresponds to a
sub-tree of the multicast tree associated with the original clique. Hence, we divide a clique
into sub-cliques such that each multicast sub-tree carries a packet that encodes
a subset of the packets in the original encoded packet, resulting in less coding overhead.
For instance, for the packet format proposed for Hadoop in Section 4.8.1.2, the header of each
multicast packet conveys the information regarding each of the packets in CODED SEG
using CODING VECTOR, RIDK, and UNCODED SEG; if CODED SEG contains fewer
packets, the sizes of CODING VECTOR, RIDK, and UNCODED SEG are smaller. Also note
that since each sub-clique is also a clique, the validity of the solution holds, i.e., each
client can decode instantaneously without any need to buffer the packets it receives.
Further, dividing a clique into smaller cliques does not increase the overall number of
link-level packet transmissions, as it is equivalent to separating information intended for
different clients.
The execution of the proposed algorithm results in two cliques, each of size 4 (shown
with dotted lines in Figure 4.7), for the Example in Section 4.4.1. These cliques are further
subdivided into two sub-cliques each; the sub-cliques associated with the multicast sub-tree
rooted at switch S1 are {c1,1, c2,1} and {c1,2, c2,2}, and the sub-cliques associated with
the multicast sub-tree rooted at switch S2 are {c3,1, c4,1} and {c3,2, c4,2}, as shown in
Figure 4.7. This corresponds to multicasting packets p2 ⊕ p5 and p3 ⊕ p6 to reducer3 and
reducer4, and multicasting packets p8 ⊕ p11 and p9 ⊕ p12 to reducer1 and reducer2.³
4.6 Challenges and Considerations
Some of the obvious questions with regard to our proposal of using mixing (coding)
are:
1. How does the coder (middlebox) collect the side information, and determine the
destination of each coded packet?
2. How is the spate coding problem incorporated in the proposed architecture?
3. What is the coding format, and how can one determine which parts of 〈key, value〉
pairs should be encoded?
4. How does one encode the packets while being able to perform practical line-rate
processing?
5. What extra information is required in each packet, in addition to a packet's payload,
so that each client can decode the required packets instantaneously?
6. How does a client decode a packet?
7. Can we say something about computational complexity of such schemes in general,
and provide some bounds on the advantage?
8. Can the proposed architecture integrate seamlessly in the current big data archi-
tectures (like Hadoop)?
9. How does the proposed scheme perform in practice?
³ Note that multicasting four packets p2 ⊕ p5, p3 ⊕ p6, p8 ⊕ p11, and p9 ⊕ p12 instead of two packets p2 ⊕ p5 ⊕ p8 ⊕ p11 and p3 ⊕ p6 ⊕ p9 ⊕ p12 (as explained in the example given in Section 4.4.1) does not increase the overall number of link-level packet transmissions crossing the network. For instance, multicasting packet p2 ⊕ p5 ⊕ p8 ⊕ p11 to all four reducers uses both links AS−S1 and AS−S2, whereas multicasting packet p2 ⊕ p5 (p8 ⊕ p11) uses only one link, AS−S1 (AS−S2).
We present a complete architecture that addresses questions 1 to 6 in Section 4.8.1.
We present analysis related to question 7 in Section 4.7. The answer to question 8 is
discussed in Section 4.8.3, and regarding question 9 we present evaluation results from a
clustered prototype implementation of the proposed scheme in Section 4.9.
4.7 Analysis
In this section, we analyze the computational complexity and the advantage of schemes
that aim to minimize data transmission in distributed data center applications.
A distributed data center application runs over a set of DataNodes hosting the sender
service instances that send data to other nodes, and the receiver service instances that
receive data from other nodes. A DataNode can host both sender service instances
and receiver service instances. The data is transferred over a communication graph
I(V,E) from the sender service instances to the receiver service instances in the
communication phase. The vertex set V consists of the DataNodes as well as the routing and
forwarding nodes in the communication graph. The edge set E consists of all the
communication links connecting the vertices in V. For instance, in Hadoop MapReduce, a
mapper is a sender service instance, a reducer is a receiver service instance, and the shuffle
is the communication phase.
We first define the Transmission Intensity of a distributed data center application.
Definition 19 (Transmission Intensity) Transmission intensity is defined as the total
number of link-level packet transmissions during the communication phase of a distributed
data center application. The transmission intensity of a communication scheme
Si is denoted by |P(Si)|.
A scheme S is called Transmission Intensity optimal if for all other schemes Si:
|P (S)| ≤ |P (Si)|.
We next define the problem efficient scheme (ES).
Definition 20 (Problem ES) Find a scheme S that is Transmission Intensity optimal.
Let OPT (S) denote the optimal solution to the problem ES.
Theorem 21 The Problem ES is NP-hard, and it is NP-hard to find a polynomial-time
constant-factor approximation as well.
The proof of Theorem 21 is given in Appendix B.3.
We now analyze the maximum and minimum advantage that the proposed coding-based
scheme can offer compared to the current standard non-coding (routing-based)
techniques. We start by defining the Utilization Ratio.
Definition 22 (Utilization Ratio (µ)) The Utilization Ratio µ is the ratio of the number of
link-level packet transmissions when employing a coding based solution to the number of
link-level packet transmissions when not employing a coding based solution.
We also define the diameter of a network, represented by a communication graph, as
follows.
Definition 23 (Diameter of a Network (d)) The diameter of a network d is the longest
of all the shortest paths (in terms of number of hops) between the vertices representing
distinct DataNodes in the communication graph.
Theorem 24 provides a lower bound on the utilization ratio for a general network in
terms of its diameter d.
Theorem 24 µ ≥ 1/d.
The proof of Theorem 24 is given in Appendix B.4.
Corollary 25 For distributed data center applications running on a large number of nodes
(larger than the network diameter), µ ≥ 1/(number of nodes).
Theorem 26 µ ≤ 1, and this bound is a tight bound.
The proof of Theorem 26 is given in Appendix B.5.
Next, we analyze the maximum advantage a coding based scheme can offer with refer-
ence to a network’s bisection when the coding is also performed at the network’s bisection.
We proceed by defining Utilization Ratio with reference to a network’s bisection.
Definition 27 (Bisection Utilization Ratio (µ(bisection))) The Bisection Utilization
Ratio µ(bisection) is the ratio of the number of link-level packet transmissions crossing the
network's bisection when employing a coding based solution to the number of link-level packet
transmissions crossing the network's bisection when not employing a coding based solution.
Theorem 28 provides a lower bound on µ(bisection), i.e., an upper bound on the achievable
coding advantage, when coding is also performed at the network's bisection.

Theorem 28 µ(bisection) ≥ 1/2 when coding is also performed at the network's bisection,
and this bound is a tight bound.
The proof of Theorem 28 is given in Appendix B.6.
Theorem 29 Network Bisection is not an optimal location to place the middlebox for
performing coding.
The proof of Theorem 29 is given in Appendix B.7.
4.8 Theory to Practice
4.8.1 Coding Based Middlebox and its Components
For ease of explanation, as well as in accordance with our proof-of-concept prototype
implementation, we present the details of our architecture with reference to Hadoop
MapReduce. It can be easily extended to other distributed data center applications.
Moreover, the proposed coding scheme is independent of the underlying application.
Figure 4.8: The proposed coding based Hadoop MapReduce data flow.
We introduce three new stages, namely the sampler, the coder, and the preReducer, to the
traditional Hadoop MapReduce. The primary objective of the sampler is to gather side
information. Similarly, the primary objective of the coder is to code, and that of the
preReducer is to decode. The overall architecture is shown in Figure 4.8; while the figure
shows only two nodes, the architecture is in fact replicated across all the nodes.
In this section we focus on the middlebox, and its components (sampler, and coder).
We begin by introducing the notation.
We consider n nodes or physical machines, where a mapper j on physical machine i is
denoted by Mij, and a reducer j on physical machine i is denoted by Rij. Each mapper
Mij emits its 〈key, value〉 pairs to the partition, a memory buffer, Γi.
Figure 4.9: Architecture for the middlebox (sampler and coder).
Figure 4.10: Architecture for preReducer.
4.8.1.1 Sampler
The sampler gathers the side information. Specifically, the sampler resides in the middle-
box and fetches ℵ random records stored in the sorted and partitioned file spills from the
mappers at each of the physical machines. These random records are fetched in parallel
to the shuffle phase. This sampling process does not interfere with the shuffle process.
These sampled records not only help to start building the side information available at
each node but also play a pivotal role in the decision of which portions (segments) of the
packets should be coded (please refer to the Example in Section 4.4.2 regarding partial
coding).
The overall data flow from a mapper to the sampler is shown in Figure 4.9. Specifically,
a mapper first completes emitting the 〈key, value〉 pairs to the intermediate spills,
which are formed as a result of the circular memory buffer being filled up to a certain
threshold (io.sort.spill.percent). These intermediate spills are then merged to form a final
sorted and partitioned sequence file. In the final sequence file, each partition corresponds
to a particular reducer, and is made available to the reducer over HTTP. The sampler,
co-residing with the coder, fetches the random records, in parallel to the shuffle phase, from
the final sequence file, which is stored in SequenceFile format at the locations specified
by mapred.local.dir [41]. Records from the sequence file are read using the Java class
org.apache.hadoop.io.SequenceFile.Reader [114]. These sampled 〈key, value〉 pairs serve
as the initial Has set for each physical machine.
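The sampling step can be sketched as follows, with a mapper's final partitioned output mocked as an in-memory list (the actual implementation reads Hadoop SequenceFiles via org.apache.hadoop.io.SequenceFile.Reader); the function and variable names are our own:

```python
import random

# Hedged sketch of the sampler: fetch a small number of random records
# from a mapper's sorted, partitioned output to seed a node's Has set.

def sample_records(partition, aleph, seed=0):
    """Fetch aleph random <key, value> records from one mapper partition."""
    rng = random.Random(seed)   # fixed seed only for reproducibility here
    k = min(aleph, len(partition))
    return rng.sample(partition, k)

# Mock partition: the mapper's sorted <key, value> pairs for one reducer.
partition = [(f"key{i}", f"value{i}") for i in range(100)]

# The sampled pairs form the initial Has set for this physical machine.
has_set = set(sample_records(partition, aleph=8))
print(len(has_set))
```

Keeping aleph small bounds the sampling overhead, matching the point above that only a limited number of records is fetched rather than the complete file.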
Note that gathering the side information incurs an overhead, which is why the proposed
scheme suggests taking a limited number of samples instead of fetching the complete file.
This limited side information affects the coding advantage. However, for the sake of
practicality, it is important that the communication overhead associated with the collection
of side information be small. Note that later on, as the reducers fetch the data from the
respective mappers, the packets crossing the network middlebox gradually enhance the
side information for each of the contributing nodes without incurring additional network
overhead. The aggregated side information results in an improved coding advantage.
Spate coding needs access to the side information of each of the participating nodes
in order to decide on the combination of packets to be coded. More precisely, some knowledge
about the packets, not crossing the middlebox, from the mappers' outputs at each of the nodes
is required. For instance, the server can compute packet p8 ⊕ p11 for node1 and node2 in
the example given in Section 4.4 only if it knows that node1 has access to packet p1 and
node2 has access to packet p4.
4.8.1.2 Coder
The coder C is a dedicated software appliance, strategically placed in the middlebox,
for wire-speed packet payload processing (typically XORing packet payloads) and re-
injection of coded packets into the network. Specifically, the coder C initially receives ℵ
inputs from the partitions of all mappers via the sampler. After that it only receives the
information that passes through the part of the network where it is placed. The coder
performs the following three functions:
(1) Format Decision Making: Based on the sampled data records the coder de-
cides on a coding format consisting of byte indices ω1, ω2. The coder treats a 〈key, value〉
pair as a data chunk. In a data chunk the bytes starting from index ω1 and ending at
index ω2 are the ones that are anticipated for coding for the following ℵc generations; we
name these bytes the encodable chunk. A generation specifies a group of packets to be
processed together by the coder. The rest of the bytes in the data chunk, named the un-
codable chunk, are forwarded without coding. In other words, the coding format specifies
which portions of the collective 〈key, value〉 pairs to code. The logic behind choosing ω1,
ω2 is to find partitions of the data chunk that maximize coding advantage, as for a specific
Hadoop job some bytes of 〈key, value〉 pairs might contain more common information
than others. Note that coding exploits the common information between different file
splits residing on different physical machines. For the example in Section 4.4.2, if each
digit in the 〈key, value〉 pair is represented by a byte, then ω1 = 2 and ω2 = 28; the
uncodable chunk is just the first byte, and the rest of the packet is the encodable chunk.
The decision on the coding format is based on comparing the coding advantage over
the ℵ random records fetched by the sampler from different mappers. The coding advantage
is evaluated on different chunks of the data packet for different values of ω1, ω2.
The computational complexity of determining the best contiguous encodable chunk is
logarithmic (a very efficient procedure based on binary partition). Finding an arbitrary
number of noncontiguous encodable chunks can result in a higher coding advantage, but
also increases the computational complexity, requires more book-keeping, and leads to a
more complex packet header.
The coding format is periodically re-computed after every ℵc generations to fine-tune
the decision based on the particularities of the Hadoop job.
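A hedged sketch of the format decision is given below. The selection metric (byte positions on which all sampled records carry the same byte) is our illustrative stand-in for the coding-advantage comparison, not the thesis's actual procedure:

```python
# Illustrative sketch of choosing the coding format (omega1, omega2):
# pick the longest contiguous run of byte positions on which every
# sampled record agrees, as a proxy for the most codable region.

def best_contiguous_chunk(samples):
    """Return half-open (omega1, omega2) bounding the longest run of byte
    positions where all sampled records carry the same byte."""
    n = min(len(s) for s in samples)
    common = [len({s[i] for s in samples}) == 1 for i in range(n)]
    best = (0, 0)
    start = None
    for i, ok in enumerate(common + [False]):   # sentinel closes a final run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

# Mock sampled records from three mappers: only the first byte differs.
samples = [b"A-shared-payload", b"B-shared-payload", b"C-shared-payload"]
print(best_contiguous_chunk(samples))   # bytes 1..15 agree: (1, 16)
```

This mirrors the partial-coding example: the first byte stays in the uncodable chunk while the common tail becomes the encodable chunk.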
(2) Coding: This step performs binary coding (bitwise XOR) on the bytes
ω1 to ω2 (the encodable chunk) from each generation of received 〈key, value〉 pairs. The
coding algorithm is based on solving the spate coding problem in an efficient fashion, and is
presented in Section 4.5.3. We note that the algorithm requires information about the
Want set and the side information (Has set) of each node in the cluster participating in
the Hadoop job under consideration, as well as about which packets are equal. In our scheme,
the Want set is determined based on the key of the 〈key, value〉 pairs of the current generation,
whereas knowledge about the Has set keeps building at each generation by adding to the
side information set already obtained from the sampler. During each generation, the new
side information is extracted from the bytes ω1 to ω2 of the 〈key, value〉 pairs available
at the coder. For the coding purpose, we maintain the Has set of all the nodes without any
assumption on the extent of information overlap between different physical machines;
the information overlap between different physical machines can be arbitrary or none.
Moreover, the coding can result in multiple multicast packets to be forwarded to the
reducers. Each multicast packet is computed for a subset of reducers sharing a common
down-link path originating at the coder node. To enable processing at line rate, the
algorithm ensures that the coded packets are instantaneously decodable at the receiver
nodes, i.e., just based on the packets received and without any need to buffer a set of
incoming packets.
(3) Packaging: This step packs the outcome of the coding process into a custom
packet payload format, as shown in Figure 4.12, that is then re-injected into the network
towards the reducer nodes. Different fields in the packet payload format collectively
ensure that each reducer finally receives the 〈key, value〉 pairs intended for it. The fields
are:
• NUM ENCODED: It is a numerical value describing the number of packets that
are coded together. For instance, for the packet p2 ⊕ p5 ⊕ p8 ⊕ p11 of the Example in
Section 4.4.1, this value is 4. In the case where no encoding is performed, this field is
set to 0. This value is bounded by the number of reducers.
• CODING VECTOR: It is a vector of size NUM ENCODED, and contains
hashes of the encodable chunks. There can be a large set of encodable chunks, so
we use hashes for their fast matching with the Has set to identify side information
for the decoding process given in Section 4.8.2.
• RIDK: It is a vector of size NUM ENCODED, and specifies the intended reducers
along with their corresponding keys. We associate an ID (a bit vector) RID with the
reducer R, and RK denotes the key assigned to the reducer R. RIDK contains
hashes of RID RK pairs. There can be a large number of reducers in a Hadoop
cluster, and one coded transmission is multicast to only a subset of the reducers.
Moreover, a reducer might work on multiple keys, and does not know in advance
what 〈key, value〉 pair to expect in the received packet, so it is necessary that a
packet contain information about the intended reducers and their corresponding
keys. Hashes are used for fast identification of the reducer and the key.

Figure 4.11: Coding for variable packet lengths.
• CODED SEG: It is the actual coded chunk, formed by bitwise XOR of the
NUM ENCODED encodable chunks. For instance, for one of the coded packets of the
Example in Section 4.4.1, CODED SEG contains p2 ⊕ p5 ⊕ p8 ⊕ p11.
• UNCODED SEG: It contains NUM ENCODED uncodable chunks.
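The packaging step can be sketched as follows; the field widths, hash function, and helper names are illustrative assumptions rather than the actual wire format:

```python
import hashlib

# Hedged sketch of the Packaging step: pack NUM_ENCODED, CODING_VECTOR,
# RIDK, CODED_SEG, and UNCODED_SEG into one payload. A 4-byte truncated
# SHA-1 stands in for the (unspecified) hash; h = 4 here.

def h(data: bytes) -> bytes:
    """Short hash used for CODING_VECTOR / RIDK entries."""
    return hashlib.sha1(data).digest()[:4]

def package(encodable_chunks, uncodable_chunks, ridk_pairs):
    """Build the coded payload from equal-length encodable chunks."""
    num = len(encodable_chunks)
    coded = bytearray(len(encodable_chunks[0]))
    for chunk in encodable_chunks:                 # bitwise XOR over GF(2)
        coded = bytearray(a ^ b for a, b in zip(coded, chunk))
    payload = bytes([num])                                      # NUM_ENCODED
    payload += b"".join(h(c) for c in encodable_chunks)         # CODING_VECTOR
    payload += b"".join(h(rid + rk) for rid, rk in ridk_pairs)  # RIDK
    payload += bytes(coded)                                     # CODED_SEG
    payload += b"".join(uncodable_chunks)                       # UNCODED_SEG
    return payload

pkt = package([b"chunkAAA", b"chunkBBB"], [b"u1", b"u2"],
              [(b"R1", b"k1"), (b"R2", b"k2")])
print(len(pkt))
```

With two 8-byte encodable chunks, 4-byte hashes, and two 2-byte uncodable chunks, the payload is 1 + 8 + 8 + 8 + 4 = 29 bytes, illustrating how fewer packets in CODED SEG shrinks every per-packet field.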
Discussion
Note that there is no assumption that the packets (〈key, value〉 pairs) are of equal
length. To elaborate, consider two packets of different lengths, as shown in Figure
4.11, which highlights their encodable chunks. The coder performs XOR on the encodable
chunks while the uncodable chunks are concatenated for each packet. The resulting payload
for the coded packet is shown in Figure 4.11, elaborating the packaging of the encodable
and the uncodable chunks. The coded packet also contains the required header.
Furthermore, we argue that, in practical scenarios as demonstrated in Section 4.9, the
overall reduction in communication is much greater than the overhead required to support
the coding/decoding operations. To support our argument, we calculate the net savings,
i.e., the total reduction in volume of communication minus the total overhead, per coded
packet at the output link from the middlebox as follows:
( Σ_{i=1}^{τ} li − ( k + Σ_{i=1}^{τ} (li − k) ) ) − (2hτ + x)

where k = ω2 − ω1 bytes from each of the τ packets p1, p2, · · · , pτ, of lengths l1, l2, · · · , lτ
respectively, are coded together. The overhead comprises h bytes of hashes, used for each
entry in the CODING VECTOR and RIDK fields, and x bytes for the NUM ENCODED field.
For example, consider a scenario where a part of each of three 100-byte packets
is coded together. Specifically, 90 bytes from each packet are coded together,
with h = x = 1. The net savings per coded packet at the output link from the middlebox are
(300 − (90 + 3 · 10)) − (2 · 1 · 3 + 1) = 173 bytes; i.e., per coded packet, the gross savings are
180 bytes, the overhead is 7 bytes, and the net savings are 173 bytes. Note that the number of
bytes resulting from an XOR operation over k bytes from each of τ packets is still k. Furthermore,
the overall number of hops for an encoded or a traditional packet is the same.
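The net-savings expression can be checked numerically; the function name is ours, and the values follow the worked example above:

```python
# Numerical check of the net-savings expression: gross reduction in
# communication volume minus header overhead, per coded packet.

def net_savings(lengths, k, h, x):
    """Net savings per coded packet at the middlebox output link."""
    tau = len(lengths)
    # Gross savings: original bytes minus one coded chunk plus uncoded rests.
    gross = sum(lengths) - (k + sum(l - k for l in lengths))  # = (tau - 1) * k
    # Overhead: one h-byte hash each for CODING_VECTOR and RIDK per packet,
    # plus x bytes for NUM_ENCODED.
    overhead = 2 * h * tau + x
    return gross - overhead

print(net_savings([100, 100, 100], k=90, h=1, x=1))   # -> 173
```

The expression simplifies to (τ − 1)k − (2hτ + x), so savings grow with both the number of packets coded together and the size of the encodable chunk.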
4.8.2 PreReducer
The major role of the preReducer component is to ingest the custom-made packets sent
by the coder, decode their payloads and extract the 〈key, value〉 pairs which are to be
input to the standard Hadoop Sorter.
We recall that a mapper stores its output 〈key, value〉 pairs in a set of partitions,
where each partition contains the 〈key, value〉 pairs for a particular reducer. We further note
that, based on the placement of the middlebox or some routing intricacies, some of the
〈key, value〉 pairs fetched by a reducer might not pass through the middlebox. Hence,
Figure 4.12: The proposed packet payload format for the coding based shuffle at the application layer. Depending on size, the format shown may be transported using multiple segments/packets.
the preReducer should tackle both types of packets, namely packets coming through
the middlebox with the additional header, and packets not coming through the middlebox.
Specifically, the preReducer passes packets not coming through the middlebox to the
reducer without any changes. The preReducer performs the decoding and 〈key, value〉
extraction process on packets coming through the middlebox. The complete architecture
for the preReducer is shown in Figure 4.10.
The preReducer decodes the coded part of the packet based on the coding format,
and forms 〈key, value〉 pairs by inserting the decoded chunk from byte index ω1 to ω2
into the corresponding uncoded chunk of the packet. Once a node receives a packet,
the preReducer first checks the RIDK field to determine the intended reducer and the
corresponding key. Then it identifies the side information required to decode its intended
encodable chunk by comparing the hashes of the mappers' outputs stored locally with
the hashes contained in CODING VECTOR. The decoding is performed by XORing
CODED SEG with the side information. The detailed decoding process is given in
Algorithm 2.
Algorithm 2 PreReducer
 1: n = NUM_ENCODED
 2: if n ≠ 0 then
 3:   Retrieve the index myIndex from the packet by sifting through the RIDK vector
 4:   decoded_Seg = CODED_SEG
 5:   for i = 1 to n do
 6:     if i ≠ myIndex then
 7:       decoded_Seg = decoded_Seg ⊕ lookup(Hash_i)
 8:     end if
 9:   end for
10:   Retrieve 〈key, value〉 by inserting decoded_Seg at index ω1 + 1 in the UNCODED_SEG at index myIndex
11: else
12:   〈key, value〉 is the same as the packet received
13: end if
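The decoding steps of Algorithm 2 can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the packet is modeled as a plain dictionary whose field names mirror the format of Figure 4.12, and `local_store` (mapping a hash to the locally cached mapper-output segment) stands in for the `lookup` routine of Algorithm 2.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def pre_reduce(packet: dict, local_store: dict) -> bytes:
    """Sketch of Algorithm 2: recover this reducer's chunk from a coded packet.

    local_store maps a hash to the locally cached mapper-output segment
    (the side information gathered during the map phase)."""
    n = packet["NUM_ENCODED"]
    if n == 0:
        # Packet did not pass through the middlebox: forward unchanged.
        return packet["UNCODED_SEG"][0]

    # Step 3: find this reducer's position by sifting through the RIDK vector.
    my_index = packet["RIDK"].index(packet["my_reducer_id"])
    decoded = packet["CODED_SEG"]
    for i in range(n):
        if i != my_index:
            # Steps 5-9: XOR out every other reducer's segment, located via
            # the hash carried in CODING_VECTOR.
            decoded = xor_bytes(decoded, local_store[packet["CODING_VECTOR"][i]])

    # Step 10: splice the decoded chunk into this reducer's uncoded chunk at
    # byte index omega_1 to rebuild the <key, value> pair.
    uncoded = packet["UNCODED_SEG"][my_index]
    w1 = packet["omega_1"]
    return uncoded[:w1] + decoded + uncoded[w1:]
```

The XOR of the coded segment with all other reducers' side information leaves exactly the segment intended for this reducer, which is then re-inserted into the uncoded chunk.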
4.8.3 Seamless Integration using OpenFlow
For a seamless integration of the proposed scheme into the Hadoop architecture, we
propose a novel software-defined networking (specifically OpenFlow) based scheme for
multicasting coded packets to the right reducers. Unlike conventional receiver-initiated
multicast, the use of OpenFlow (version 1.2) provides the ability to have the middlebox
control the dynamics of multicast groups on-demand. Hadoop contains information about
the placement of mappers and reducers in the cluster, and the topology they are connected
through [115]. Using this information mappers and reducers can be grouped into subtrees.
The proposed scheme implements multicast state across a multicast group G—at each
switch along the tree topology—using a Group Table entry with Group ID=G. The Group
Type is set to all (multicast), and respective Action Bucket is set to forward the packets
to the set of destination ports corresponding to the multicast tree links. We overload
the semantics of two IP header fields here, namely the ECN and DS fields (8 bits long
in total), to match multicast group state on a switch with packets of a specific multicast
group, allowing us to use 256 multicast groups in total. An alternative is to use locally
administered multicast MAC addresses, which can allow handling of up to 2^46 multicast
groups.
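As an illustration of the header overload described above, the following sketch (with hypothetical helper names, not part of the implementation) packs a group ID into the 8-bit DS+ECN byte and resolves it against a Group Table keyed by that byte; the port sets reproduce the two multicast trees of the example in Figure 4.13.

```python
def tag_group(group_id: int) -> int:
    """Value written into the overloaded 8-bit DS+ECN byte of the IP header."""
    if not 0 <= group_id < 256:          # 8 bits -> at most 256 groups
        raise ValueError("group ID must fit in DS+ECN (8 bits)")
    return group_id

def split_ds_ecn(tos: int):
    """Split the ToS byte back into DS (upper 6 bits) and ECN (lower 2 bits)."""
    return tos >> 2, tos & 0b11

def match_group(tos: int, group_table: dict) -> list:
    """Return the Action Bucket (set of output ports) for the packet's group."""
    return group_table[tos]

# Group Table for the example of Figure 4.13: two multicast trees.
group_table = {1: ["a", "b", "x"],   # ports a, b of S1 and port x of AS
               2: ["c", "d", "y"]}   # ports c, d of S2 and port y of AS
```

The 256-group limit falls directly out of the 8-bit DS+ECN budget checked in `tag_group`.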
The following two practical constraints are associated with the implementation of
multicast in large-scale data centers. First, keeping a multicast group for every coding
combination, determined by the coder, may be prohibitively expensive. Second, creating
multicast groups on-demand can incur high latency. To meet these two practical
constraints, we introduce a component called the coding groups coordinator (CGC) within
the coder. The CGC coordinates, in tandem with the OpenFlow controller, the dynamic
creation of multicast groups using a small (constant) working set of multicast groups.
Specifically, the CGC maintains at most α (a small constant) group-queues. Each group-
queue is associated with a distinct set of reducers (a group of receivers) and is identified
by a unique Queue-ID. In addition to this, the CGC maintains one general queue that
is drained to the network. Packets from the coder destined to the same multicast group
are buffered in the same group-queue, whereby upon adding the first packet to
a group-queue, the CGC notifies the OpenFlow controller to create network state for the
multicast group that the packet corresponds to. Once a queue is fully flushed, the CGC
picks the next multicast group from the general queue and assigns the emptied queue
to a new multicast group, while in parallel the next group-queue containing the packets
corresponding to another multicast group is drained to the network (i.e., one group-queue
is drained at every instance of time). This ensures a temporal overlap of multicast state
creation for a future multicast group with injection of packets to the network, while also
operating on a fixed budget of multicast state (#groups). In short, our approach somewhat
resembles the concept of virtual queues in network routers, though without the
starvation constraint, since in this case the workload does not impose deadlines on
individual virtual queues, Hadoop being throughput-bound rather than latency-bound.
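The CGC's queue discipline can be sketched as follows. This is a simplified model under stated assumptions: `notify_controller` stands in for the call that asks the OpenFlow controller to install multicast state for a group, and draining is modeled as flushing one whole group-queue at a time.

```python
from collections import deque

class CGC:
    """Sketch of the coding groups coordinator: at most alpha live group-queues
    plus one general queue holding packets for groups without a live queue."""

    def __init__(self, alpha, notify_controller):
        self.alpha = alpha            # fixed budget of multicast state (#groups)
        self.queues = {}              # group -> deque of buffered packets
        self.general = deque()        # (group, packet) pairs awaiting a slot
        self.notify = notify_controller

    def enqueue(self, group, packet):
        if group in self.queues:
            self.queues[group].append(packet)
        elif len(self.queues) < self.alpha:
            # First packet of this group: ask the controller to create network
            # state now, overlapping state setup with draining of other queues.
            self.queues[group] = deque([packet])
            self.notify(group)
        else:
            self.general.append((group, packet))

    def drain(self, group):
        """Flush one group-queue; the freed slot is claimed by the next group
        waiting in the general queue."""
        sent = list(self.queues.pop(group, []))
        backlog, self.general = self.general, deque()
        for g, p in backlog:
            self.enqueue(g, p)        # next waiting group claims the free slot
        return sent
```

In this model the controller notification always precedes the group-queue's drain, which captures the temporal overlap of state creation with packet injection described above.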
Figure 4.13(b) describes the process of multicasting four encoded packets P2 ⊕ P5,
P3⊕P6, P8⊕P11, and P9⊕P12, found as a result of execution of Algorithm 1 applied on
the example in Section 4.4.1. Figure 4.13(a) shows the entries for Flow Tables and the
corresponding Group Tables for each switch in the multicast tree. Note that the encoded
packets P8 ⊕ P11 and P9 ⊕ P12 are to be multicasted to the multicast group consisting of
node1 (hosting reducer1) and node2 (hosting reducer2). Similarly, the encoded packets
P2 ⊕ P5 and P3 ⊕ P6 are to be multicasted to the multicast group consisting of node3
(hosting reducer3) and node4 (hosting reducer4). In particular, the CGC creates two
queues, the queue with Queue-ID=1 for the multicast group consisting of node1 and
node2 (destination of P8 ⊕ P11 and P9 ⊕ P12), and the queue with Queue-ID=2 for the
multicast group consisting of node3 and node4 (destination of P2⊕P5 and P3⊕P6). The
CGC then creates the network state for these multicast groups. For all the packets in
the queue with Queue-ID=1 the DS+ECN field is set equal to 1, and for all the packets
in the queue with Queue-ID=2 the DS+ECN field is set equal to 2. The Flow Table
entries and corresponding match-actions inspect the DS+ECN field of a packet,
then utilize the Group Table to find the corresponding multicast group, and forward
based on the Action Bucket. For example, the packets with DS+ECN field equal to 1 are
forwarded according to the Action Bucket under Group ID 110, i.e., multicasted to the
ports a and b of the switch S1, and port x of the switch AS. Similarly, the packets with
DS+ECN field equal to 2 are forwarded according to the Action Bucket under Group ID
100, i.e., multicasted to the ports c and d of the switch S2, and port y of the switch AS.
4.9 Performance Evaluation
We developed a prototype as well as a testbed to evaluate the performance of the proposed
coding based approach. Section 4.9.1 describes the prototype and Section 4.9.2 describes
the testbed. We want to emphasize that the prototype and the testbed represent two of
the most commonly used real-world system development models, i.e., proprietary versus
open-source. The prototype was implemented in a data center using costly proprietary
tools and hardware. The testbed, on the other hand, was implemented using open-source
tools, virtualized environments, and commodity off-the-shelf components. The
Figure 4.13: OpenFlow based multicasting using the coding groups coordinator for the example given in Section 4.4.1. The Flow Tables and the corresponding Group Tables for each switch in the multicast tree are also shown.
experimental results showed the advantage of our proposed scheme in both models. We
use data from Hadoop shuffle to benchmark our proposed solution. The Hadoop jobs
consisted of the following two industry standard benchmarks used widely for performance
evaluation of big data frameworks. These benchmarks represent a wide spectrum of big
data workloads. Specifically, the benchmarks are:
1. Terasort, a benchmark to sort 10 billion records (1 terabyte (TB)). These records
are generated using a program called TeraGen [116]. Each record consists of 100
bytes, with the first 10 bytes being the key and the remaining 90 bytes being the
value. This benchmark represents workloads ranging from mathematical applications
to applications using artificial intelligence.
2. Grep (Global Regular Expression) represents generic pattern matching and search-
ing on a data set. Our data set consisted of an organization's data-logs, whereby
the goal was to calculate the frequency of occurrence of eight different types of
events in the data-log input. Applications of Grep vary from data mining and
sequencing to anomaly detection.
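The Terasort record layout above can be expressed directly; the helper name below is ours, for illustration, and not part of the benchmark:

```python
def split_record(record: bytes):
    """Split a Terasort record into its (key, value) parts: each record is
    exactly 100 bytes, a 10-byte key followed by a 90-byte value."""
    if len(record) != 100:
        raise ValueError("Terasort records are exactly 100 bytes")
    return record[:10], record[10:]
```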
We want to point out that these two benchmarks not only cover a wide range of big
data applications, but also cover a wide spectrum of the network traffic generated by big
data applications. In fact, Cisco has used the same two types of benchmarks to study
big data infrastructure considerations, and performance of their proposed solutions. Fur-
thermore, these two benchmarks capture the workloads with two very different traffic
patterns, namely “Business Intelligence (BI)” workload benchmark captured by Grep,
and “Extract, Transform, and Load (ETL)” workload captured by Terasort [117]. More-
over, these benchmarks are two of the most widely used standard practice benchmarks for
assessing performance of the Hadoop MapReduce implementations [42,68,79,118–120].
In addition to vanilla Hadoop, we have used the in-network combiner as explained in
Section 2.3.1—a network-level optimization focused on aggregating mappers' outputs for
reducing network traffic—as another baseline to compare the performance of the proposed
scheme. We want to point out that the in-network combiner is the current state of the
art, outperforming the conventional combiner as shown in Section 2.3.1. Below we present a
word count example demonstrating the in-network combiner and the proposed coding based
scheme with the objective of reducing network traffic.
Consider an eight node Hadoop cluster consisting of eight mappers and eight reducers.
The output from each of the mappers, as well as the key that each of the reducers is
responsible for, is shown in Figure 4.14. For an even comparison of the benefit, the in-network
combiner as well as the coder are located at the aggregate switch AS. As shown in the
figure, the in-network combiner aggregates the packets 〈MAT, 1〉 originating from node1
and node2 into a new packet 〈MAT, 2〉, and 〈FAT, 1〉 from node3 and node4 to create
a new packet 〈FAT, 2〉. This results in 8% less traffic crossing the network bisection.
On the other hand, employing spate coding results in 33% less traffic by generating a
set of four packets that cross the network bisection, as shown in Figure 4.14. We want
to point out that the above discussion pertains to the packets that cross the network
bisection where both the coder and the combiner are placed. Specifically, the packets
that are exchanged via s1/s2 can neither be combined nor coded. For example, the
packets 〈MAT, 1〉 are to be transferred from nodes node1, node2, node5, node7, and
node8 to node6; however, only the packets originating at nodes node1 and node2
cross the bisection. A similar discussion holds for the packet 〈FAT, 1〉.
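The combiner side of the example can be sketched as follows (an illustrative model, not the in-network implementation): 〈key, 1〉 packets arriving at the aggregation point for the same key are merged into a single 〈key, n〉 packet.

```python
from collections import Counter

def combine(packets):
    """Aggregate (key, count) packets seen at the combiner into one
    (key, total) packet per key, as in the word count example."""
    merged = Counter()
    for key, count in packets:
        merged[key] += count
    return sorted(merged.items())
```

In the example above, the two 〈MAT, 1〉 packets and the two 〈FAT, 1〉 packets merge into 〈MAT, 2〉 and 〈FAT, 2〉, halving those four bisection-crossing packets to two.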
4.9.1 Prototype
We have prototyped the sampler, the coder, and the preReducer (decoder) in a data center
as an initial proof-of-concept implementation.
Our testbed consisted of the following types of racks:
• Compute Racks
Figure 4.14: A word count example.
• Management and Storage Racks
Compute Racks hosted the mappers, reducers, and the middlebox. Each compute rack
consisted of blade-servers. Each server was equipped with twelve x86_64 Intel Xeon
cores, 128 GB of memory, and a single 1 TB hard disk drive. Each processor had 12
MB of Smartcache with the maximum memory bandwidth of 32 GB/s. The servers
used in prototyping were arranged in three racks. All the servers were running Red Hat
Enterprise Linux 6.2 operating system [121]. The components were implemented using
Java, and Octave [122]. Management Rack provided remote access to the compute racks.
Storage Rack was used for storing the home directory of the users.
The interconnect consisted of OpenFlow enabled IBM RackSwitch G8264 switches as Top-of-
Rack switches, and OpenFlow enabled Pronto 3780 switches as Aggregation switches. The
RackSwitch G8264 provided non-blocking line-rate switching with a throughput support of
1.28 Tbps. The Pronto 3780 provided a bandwidth of 960 Gbps and supported up to 16K route
rules.
Figure 4.15 depicts the prototype setup.
We use the following metrics for quantitative evaluation:
• Job Gain, defined as the increase (in %) in the number of parallel Hadoop jobs
that can be run simultaneously with coding based shuffle compared to the number
of parallel Hadoop jobs achieved by standard Hadoop
• Utilization Ratio, defined as the ratio of link-level packet transmissions when em-
ploying coding based shuffle to the number of link-level packet transmissions incurred
by the standard Hadoop implementation.
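The two metrics above are simple ratios; the following is a direct transcription of their definitions (the example values in the usage note reproduce the Sorting row of Table 4.1, for illustration only):

```python
def job_gain(jobs_coded: int, jobs_vanilla: int) -> float:
    """Increase (in %) in parallel Hadoop jobs under coding based shuffle
    relative to standard Hadoop."""
    return 100.0 * (jobs_coded - jobs_vanilla) / jobs_vanilla

def utilization_ratio(tx_coded: int, tx_vanilla: int) -> float:
    """Ratio of link-level packet transmissions, coded shuffle vs standard
    Hadoop (lower is better)."""
    return tx_coded / tx_vanilla
```

For instance, 129 parallel jobs against a baseline of 100 gives a Job Gain of 29%, and 71 transmissions against 100 gives a Utilization Ratio of 0.71.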
Our experimental study shows that for both of the tested benchmarks, the overhead to
implement coding based shuffle (in terms of transmission of extra bookkeeping data in
packet headers) was less than 4% in all the experiments. Table 4.1 shows the results
Figure 4.15: Prototype setup.
Hadoop Job    Job Gain    Utilization Ratio
Sorting       29%         0.71
Grep          31%         0.69

Table 4.1: Job Gain and Utilization Ratio using the proposed coding based shuffle.
across the two metrics for the two benchmarks. The results show a significant performance
advantage of our scheme over the standard Hadoop implementation.
Noting that our coding based scheme requires only XORing of packets, which
is a computationally very fast operation, and given the high memory bandwidth of the
servers, we were able to process close to line rate. Specifically, even in the worst case
scenario, the throughput of the coder was 809 Mbps on a 1 Gbps link.
4.9.2 Testbed
Our testbed consisted of eight virtual machines (VMs), each running CentOS 7 as the
operating system [123], on top of x86_64 Intel i7 cores. CentOS is a free Linux
distribution aimed at providing enterprise-class functionality compatible with Red Hat
Enterprise Linux [121]. We used Citrix XenServer 6.5 as the underlying hypervisor [124].
In addition to more than 1 TB of local hard disk and solid state drives, each VM had
access to 3 TB of network attached storage. Citrix XenCenter was used to manage the
XenServer environment and deploy, manage and monitor VMs and remote storage [125].
Open vSwitch [126] was used as the underlying switch providing network connectivity to
the VMs. Open vSwitch is a production quality, distributed, multilayer virtual switch
designed to enable massive network automation supporting OpenFlow as well as NIC-
teaming. The rest of the software implementation was the same as used in Section 4.9.1.
Each virtual rack contained VMs, and one of the VMs was used as a virtual appliance
for the middlebox, as shown in Figure 4.16.
Moreover, we have implemented a stand-alone split-shuffle, to provide better insights
Figure 4.16: Testbed architecture.
into shuffle dynamics, where receiver service instances (e.g., reducers) fetch file spills from
sender service instances (e.g., mappers) in a split fashion using the standard Hadoop HTTP
mechanism.
To investigate the performance of the proposed scheme as well as middlebox placement
in different scenarios, we used the following two commonly-used data center topologies:
1. Tree topology (shown in Figure 4.17(a)) with the middlebox at the bisection. We
denote this topology by Top-1.
2. Tree topology with NIC-Teaming (shown in Figure 4.17(b)). Moreover, in this
topology the middlebox is placed at the first L2-switch. We denote this topology by
Top-2.
We measured the following parameters of interest:
• Volume-of-Communication (VoC), defined as the amount of data crossing the net-
work bisection. Note that VoC and utilization ratio, although related to each other,
Figure 4.17: Topologies used for testbed: (a) Top-1 (b) Top-2.
measure two different quantities. Note that VoC consists of the true payload, i.e., the
total size of 〈key, value〉 pairs, as well as the metadata (e.g., packet header including
hashes, coding vector, etc.) needed for the proposed solution, thus proportionately
reducing the coding advantage. Furthermore, VoC provides a fair comparison ground
irrespective of the underlying system. Note that VoC is explicitly chosen to be different
from the metric used in the theoretical analysis, i.e., the number of packet transmissions.
The primary reason for the selection of VoC is that it captures network related parameters
like the underlying network protocol, packet size, etc., as well as the packet overhead
used for the coding based scheme. On the other hand, a system dependent parameter
like VoC is not suitable for deriving theoretical results related to complexity and
bounds that are general enough to be applicable to practical setups.
• Goodput (GP), defined as the number of useful information bits delivered to the
receiver service instance per unit of time.
• %disk-utilization (%d-util), defined as the percentage of CPU time during which
I/O requests were issued to the storage disk. Disk saturation occurs when this
value is close to 100%. Higher %d-util means poor performance (disk bottleneck).
• Queue-size (Qsz), defined as average queue length of the requests that were issued
to the storage disk. Higher Qsz means longer wait time.
• Bits-per-Joule (BpJ), defined as the number of bits that can be transmitted per
Joule of energy.
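Two of these metrics reduce to direct formulas, transcribed below (the disk-side metrics, %d-util and Qsz, are instead read from standard disk statistics tools):

```python
def goodput_bps(useful_bits: float, seconds: float) -> float:
    """GP: useful information bits delivered to the receiver service
    instance per unit of time."""
    return useful_bits / seconds

def bits_per_joule(bits: float, joules: float) -> float:
    """BpJ: number of bits transmitted per Joule of energy."""
    return bits / joules
```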
The selected performance parameters are closely connected and comprehensively
express the benefits that the proposed scheme has to offer in a holistic fashion, capturing
both the network and the nodes. Specifically, the volume of communication and the number
of bits transmitted per Joule of energy focus on the network aspects, whereas goodput,
disk utilization, and queue size capture a node's perspective.
VoC measures the overall data transfer (network load) during the shuffle phase. On the
other hand, goodput measures the average effective throughput, crucial from an applica-
tion's perspective. Furthermore, aside from the amount of data and the data rate, the energy
required to shuffle data is an important parameter indicating the level of greener opti-
mization offered by the proposed scheme. Similarly important is measuring device
utilization and queue size, capturing the advantage of the proposed scheme in terms of
the availability of system resources to different processes and virtual machines.
4.9.2.1 Volume-of-Communication
In this section we compare the VoC of our proposed approach with vanilla Hadoop and a
state-of-the-art in-network combiner. An in-network combiner reduces the VoC by par-
tially distributing the functionality of the receiver service instances over the network [42,44].
Additionally, to demonstrate the complementary nature of our approach, we deployed
the proposed coding based approach on top of in-network combiner (Combine-N-Code).
Figure 4.18: Normalized VoC using the Grep benchmark for both topologies Top-1 and Top-2.
Figure 4.19: Normalized VoC using the Terasort benchmark for both topologies Top-1 and Top-2.
Figure 4.18 shows the normalized VoC for all four scenarios using the Grep benchmark
for both topologies Top-1 and Top-2. Note that the normalized VoC of the proposed
coding based approach is the same as µ(bisection) defined in Section 4.7. The results
show that the proposed coding based approach, compared to vanilla shuffle, can reduce
the volume of communication by 43% for Top-1, and 59% for Top-2. Moreover, the coding
based approach outperforms in-network combiner both for Top-1 and Top-2 by 13% and
38% respectively. Furthermore, deploying Combine-N-Code can reduce the volume of
communication by a staggering 64% for Top-1, and 79% for Top-2. Figure 4.18 also
shows the 95% confidence interval for vanilla shuffle, in-network combiner, proposed
coding based scheme, as well as Combine-N-Code.
Figure 4.19 shows the normalized VoC for all four scenarios using the Sorting bench-
mark for both topologies Top-1 and Top-2. For the sorting benchmark the in-network
combiner did not reduce the volume of communication at all, whereas the proposed coding
based scheme leads to a 37% reduction in volume of communication for Top-1, and a 62%
reduction in volume of communication for Top-2, compared to both vanilla shuffle as well
as in-network combiner. Figure 4.19 also shows the 95% confidence interval for vanilla
shuffle, in-network combiner, proposed coding based scheme, as well as Combine-N-Code.
It is interesting to note that the coding based approach reduces more network traffic
on Top-2 as compared to Top-1. The better performance for Top-2 is due
to the specific placement of the middlebox, resulting in more data being exchanged through
it, and hence giving rise to more coding opportunities. Specifically, for Top-1, 56% and
67% of the data exchange happened through the L2-Aggregate switch (co-located with
the middlebox) for Grep and Sorting respectively, whereas for Top-2 most of the data
exchanged passed through the L2-switch (co-located with the middlebox), giving rise
to more coding opportunities.
The communication overhead associated with side information gathering by the sam-
pler for Grep and sorting was less than 1% (0.65% and 0.4%, respectively).
Figure 4.20: Normalized VoC for different values of λ (maximum clique size) used in Algorithm 1.
Figure 4.20 shows the normalized VoC for different values of λ (maximum clique size)
used in Algorithm 1. Increasing λ increases the coding advantage (decreases VoC). The
maximum coding advantage is achieved at λ = 8, which conforms to the discussion in
Section 4.5.3.
4.9.2.2 Goodput
In this section we compare the proposed approach with vanilla shuffle for GP measured at
the receiver service instance. Figure 4.21 shows that the coding based scheme outperforms
vanilla shuffle at all link rate settings. Moreover, the coding benefit increases with the
increase in link rates (GP for the coding based scheme is 55% higher than vanilla shuffle at
500 Mbps and grows to 76% at 1000 Mbps). Figure 4.21 also shows the 95% confidence
interval for both vanilla shuffle and proposed coding based scheme.
Moreover, we investigate the impact of oversubscription ratio on GP for different
link rates. The oversubscription ratios were implemented by generating constant bit
Figure 4.21: Goodput versus link rates for sorting benchmark for topology Top-1.
rate background TCP traffic using the iperf tool. Figure 4.22 shows the average GP for
oversubscription ratios of 1, 5, 10, and 15, where the link rate was fixed at 500 Mbps.
Figure 4.23 shows a similar plot for link rate of 1000 Mbps. The coding benefit, averaged
over oversubscription ratios, is around 56% for 500 Mbps, and 58% for 1000 Mbps.
4.9.2.3 Disk IO Statistics
In this section we compare the proposed approach with vanilla shuffle for two disk I/O related
parameters, i.e., %d-util and Qsz, measured at the receiver service instance. Figures 4.24
and 4.25 show that the coding based scheme outperforms vanilla shuffle at different link
rate settings. Furthermore, for both Qsz and %d-util, the percentage improvement in
performance between the proposed scheme and vanilla shuffle peaks at more than 39%
at a link rate of 1000 Mbps. This trend can be explained by observing that as the link
rate becomes higher, the disk I/O at the receiver service instances becomes the bottleneck,
which can be compensated by the reduction in the volume of communication offered by the
Figure 4.22: Goodput for different oversubscription ratios using sorting benchmark fortopology Top-1 with link rate at 500 Mbps.
coding based scheme. Figures 4.24 and 4.25 also show the 95% confidence interval for
both vanilla shuffle and proposed coding based scheme.
%d-util and Qsz are recorded for the same duration for both vanilla shuffle as well as
coding based scheme.
4.9.2.4 Bits-per-Joule
Energy proportionality—energy consumption proportional to the load—of a network de-
vice is a desirable feature for a greener big data enterprise. It has been observed that
energy consumption is proportional to the load on network devices; however, in practice
this proportional relationship is not ideal, rather devices consume energy even at
no load (e.g., Figure 2 in [127]). Moreover, the power consumption depends on the link
rate [58, 127,128].
Figure 4.23: Goodput for different oversubscription ratios using sorting benchmark fortopology Top-1 with link rate at 1000 Mbps.
Figure 4.24: %d-util versus link rates using sorting benchmark for topology Top-1.
Figure 4.25: Qsz versus link rates using sorting benchmark for topology Top-1.
Adaptive Link Rate (ALR) [128, 129] and traffic consolidation are two of the pop-
ular techniques, exploiting energy proportionality, to optimize energy consumption in
a network device. In ALR, network devices tune their network interfaces to different
rates and/or shut down some of them [128, 129]. In traffic consolidation based schemes,
traffic from multiple links is consolidated to a smaller number of links while turning
off the unused interfaces [30, 51]. Our proposed scheme can lead to energy savings by
deploying traffic consolidation and ALR—coding can reduce volume of communication
consequently reducing traffic on the interfaces and creating opportunities for turning off
some interfaces and/or tuning them to lower rates.
In this section we compare BpJ for the proposed approach with vanilla shuffle. We
compute BpJ for a fixed link rate, with the links fully utilized. More precisely,
we investigated the possible improvement in energy efficiency by changing Open vSwitch link
rates in the testbed and using the power ratings given in [128] to calculate BpJ. Specifically,
a k% reduction in volume of communication using coding based scheme can correspond
to turning off that port for up to k% of the time as compared to vanilla shuffle. Note that
coding based shuffle has a smaller amount of data to transfer compared to vanilla shuffle, as
Figure 4.26: BpJ versus link rates using sorting benchmark for topology Top-1.
shown in our results in Section 4.9.2.1.
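The duty-cycle reasoning above can be made concrete with a small model, under the stated assumptions: the port draws its rated active power while transmitting and approximately zero when turned off, and a k% reduction in transmitted volume shortens the on-time by the same fraction. `bpj_estimate` is our illustrative helper, not the exact calculation performed with the ratings of [128]:

```python
def bpj_estimate(useful_bits: float, transmitted_bits: float,
                 link_rate_bps: float, active_watts: float) -> float:
    """Useful bits delivered per Joule under a simple port duty-cycle model."""
    on_time_s = transmitted_bits / link_rate_bps  # port active only while sending
    energy_j = active_watts * on_time_s           # ~0 W assumed while turned off
    return useful_bits / energy_j
```

Under this model, a 37% reduction in volume of communication (as measured for the sorting benchmark on Top-1) would yield roughly a 1/(1 − 0.37) ≈ 1.59× BpJ improvement at a fixed link rate and active power.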
Figure 4.26 shows BpJ for different link rates. The improvement in BpJ of coding
based scheme over vanilla shuffle grows significantly with higher link rates (i.e., 2.5, 4.5,
and 187.2 more BpJ at link rates of 10, 100, and 1000 Mbps respectively).
In general, by choosing lower link rates, favouring lower power (energy per unit time)
consumption, the corresponding GP also drops, as much more time is spent in the network.
We observe that choosing a lower link rate to improve power efficiency might
not be an energy efficient solution in certain scenarios, as deduced from the results of Figure
4.26, where BpJ grows with the increase in link rate.
4.10 Summary and Discussion
We introduce the novel concept of spate coding for reducing the volume of commu-
nication, without degrading the rate of information exchange, in big data processing
frameworks. We have introduced, for the first time to our knowledge, a network middle-
box service that employs spate coding to optimize network traffic. We have not only
analyzed the computational complexity of general schemes for minimizing the volume
of communication, but also provided bounds on the maximum advantage and on placement
strategies. We have also performed a proof-of-concept implementation in real world prac-
tical scenarios. Moreover, we performed several experiments to investigate the proposed
scheme in a holistic fashion, including parameters capturing both the network and the
host nodes. The performance indicators, including volume of communication, goodput,
bits per Joule, disk utilization, and queue size, showed the performance of the proposed
system across a spectrum of scenarios and applications. The improved goodput is
indicative of a greener framework plane, whereas improved disk utilization and queue
size reflect a greener server plane and virtualization plane. Similarly, the decreased volume
of communication corresponds to a greener interconnect plane.
Although this chapter discusses the proposed scheme in the context of Hadoop MapRe-
duce, the scheme can be extended to multistage multitask frameworks
by treating each group of tasks communicating with each other individually, or by assigning
a virtualized appliance to each subgroup of associated tasks. Furthermore, in multi-
stage scenarios where the next stage of a job starts after completion of the first stage,
the output of the first stage can serve as a superior set of side information for the next
stage, and can result in better performance as well as less reliance on the sampler. Ex-
tending the current scheme and model to more futuristic scenarios that are not related
to MapReduce is an interesting avenue for future research. However, the three proposed
components, the sampler, coder, and preReducer, are scalable for Hadoop MapReduce with
a large number of mapper/reducer tasks. In particular, the preReducer is a component that
is individual to each physical node—hosted locally and operating independently—and
therefore the number of tasks/nodes does not pose a scalability challenge. The sampler
and the coder components, residing in the network middlebox, are limited in number,
and dealing with a large number of tasks efficiently might need some planning. One of
the possibilities is to decompose the spate coding problem for all tasks, in a divide-and-
conquer fashion, into a number of sub-problems, each dealing with a
practically sized group of tasks. Each subproblem can either be assigned to an individual
middlebox appliance or to a virtualized appliance. The scalability also highly depends on
the hardware used for the middlebox. For instance, an FPGA-based middlebox might be
able to process even a very large number of tasks, eliminating the need for subdividing
the problem. On the other hand, using one of the nodes as a middlebox device might need
sophisticated handling, like subdividing and/or scheduling the subproblems, to keep up
with the enormous workload.
4.11 Connecting the Dots
We want to reemphasize that turning the big data enterprise into a greener enterprise
requires two fundamental steps. The first step is curtailing resource consumption, and
making data centers energy efficient. The second step is using greener energy sources
to power data centers. This chapter provided a coding based scheme focused
on making data centers greener and more energy efficient. The next chapter deals with
greener power procurement for the workhorse of the big data enterprise, i.e., data centers.
Chapter 5
Distributed Power Procurement for
Data Centers
As discussed in Chapter 1, a greener big data enterprise requires both the processing
work-horses as well as the power procurement to be green. Data centers consume around
3% of the world’s total electricity, whereas renewable energy sources account for no more
than 10% of the generation mix. However, with the advent of smart grid, there is a
paradigm shift in the power procurement, e.g., it is expected that around 50% of all the
power procured for the data centers would be from greener sources [4]. A major feature
of this new paradigm is association of the electricity pricing with the geographical and
temporal diversity of the sources. Uncertainties associated with the electricity markets
and the integration of greener sources lead many data center operators to buy energy
in bulk at locked-in prices for extended lengths of time, from "ungreen" sources, which
in certain scenarios cost more. Therefore, there is a growing energy-procurement trend
which enables data centers to realize a better and greener energy mix while lowering energy
costs [131,132].
Moreover, data center operators tend to house multiple facilities across geographical
Part of this work appears in [130].
Chapter 5. Distributed Power Procurement for Data Centers 92
locations and workload redistribution across different facilities is a common practice by
data centers. Among the factors that are driving data centers to geographically diver-
sify locations across multiple regions are globalization, security, disaster recovery, and
minimizing downtime [133]. Distributed facilities at various locations can benefit from
the geographical diversity of the generation mix [134] as well as of electricity
prices [135]. Additional emerging scenarios, such as policy directives and transmission line
bottlenecks, add to the complexity [17].
This chapter focuses on the power procurement problem incorporating emerging policy
and transmission line capacity constraints. We use matroid theory to
model the problem as a combinatorial problem, and propose an optimal and distributed
solution for it. Furthermore, we account for communication and computational complexity
and demonstrate that, although the power procurement problem is NP-hard, simpler
versions of the problem admit polynomial-time solutions.
Note that the power procurement problem modeled in this chapter is general in
nature and is not specific to data centers. In fact, it captures the more general problem
of scheduling power generation to a set of load areas exhibiting policy-related
constraints.
5.1 Introduction
Power procurement from a variety of sources to meet load demands while minimizing
operating costs has historically been a cardinal problem in the industrial world. In its
classical settings, the power procurement involves the selection of generating sources to
supply forecasted load that minimizes cost over a required planning horizon. With the
rise of big data, energy consumption of data center facilities are sky-rocketing resulting
in millions of metric tons of CO2 generation. However, recent environmental awareness
is driving the data center industry to adopt greener sources.
Furthermore, as geographically diverse data centers marry the “smarter” cyber-physical
grid in a deregulated market, we witness a fluctuating technical, political, and financial
landscape in which decisions are constrained by additional issues, including policy and
transmission line bottlenecks. For instance, although it may take one year to build a wind
farm, it can take five years to build the necessary transmission lines needed to carry its
power to cities [136]. Moreover, regulations are pushing data centers to incorporate
green-practice-driven policies, e.g., the U.S. Energy Efficiency Improvement Act of 2014 (HR
540) bill [17], resulting in novel system constraints. Such policies can dictate constraints
on the choices of energy sources for various facilities in the shared distribution system.
In this vein, the objective of this chapter is to explore the interaction of power procurement
problem structure with complexity in the context of emerging constraints. For this
reason we focus on the fundamental aspects of the constrained power procurement problem
in order to more effectively identify the trade-offs and their relationship to complexity.
In particular, we consider a power procurement problem in which we take into account
a subset of constraints from its classical counterpart while exploring the integration of
new requirements. We develop a modified formulation that lends itself to a better un-
derstanding of issues of complexity and tractability. Within this formulation we explore
solution complexity to ask: Can we solve the general problem in polynomial time? If
not, what makes the structure of the problem NP-hard? For what special cases can we
find efficient solutions? We emphasize that we aim to provide insight into cyber-physical
constraints and complexity to aid future opportunities for evolution.
We specifically address the following physical system requirements: 1) policy con-
straints which deal with prioritization of generation source classes for specific facilities,
2) transmission line constraints involving line overloads that arise from usage by multiple
generating sources sharing the same transmission lines, and 3) delay constraints including
startup lags of the generation sources. We call this the constrained power procurement
(CPP) problem. We associate a time-varying cost of operation to each generating source
to reflect the temporal diversity of electricity prices.
An instance of the CPP problem consists of a number of generating sources, a set of
transmission lines, and a set of data center facilities. Each generating point can represent
a collection of sources. Each transmission line has a known capacity limit. Each data
center facility has a forecasted demand and a set of policy-driven generation preferences.
An example follows.
5.1.1 Example Case Study: Power Procurement over a 12-Bus Power System
Consider three data center facilities A1, A2, A3, that we revisit throughout this chapter,
connected to generating sources g1, g2, · · · , g8 over a 12-bus power system, and a set of
two bottleneck transmission lines γ1, γ2, shown in bold in Fig. 5.1. The combined output of any
two or more generating points exceeds the transmission line capacities of γ1
and γ2, and therefore simultaneous use of either line by more than one generating point is prohibited. A
careful study reveals the following facts.
If selected for procurement, the generating sources Gγ1 = {g1, g2, g3, g8} need to use one
common transmission line γ1 and thus cannot be simultaneously procured. Similarly,
Gγ2 = {g4, g5, g6, g7} cannot be simultaneously used for procurement over γ2. Thus
the transmission line constraints impose that no two generating sources from Gγ1 and
Gγ2 be scheduled to use γ1 and γ2, respectively, at the same time. Moreover, due to
policy matters, each data center facility has a preference of generating sources. In this
example, A1 chooses GA1 = {g1, g2, g3}, A2 opts for GA2 = {g4, g5}, and area A3 must
fulfill its demand with GA3 = {g6, g7, g8}. Thus, within the scope of the planning horizon,
the transmission line and policy constraints dictate selection of only one generating point
from each of GA1 , GA2 , and GA3 for data center facilities A1, A2 and A3 respectively.
Figure 5.1: Three data center facilities A1, A2, A3 and a 12-bus power system with eight generating sources g1, g2, . . . , g8. The constrained transmission lines, γ1 and γ2, are shown in bold.
Figure 5.2: An equivalent representation of the imposed constraints for the 12-bus case study shown in Figure 5.1.
Figure 5.3: Startup delays and generation cost associated with each generating point for each time unit t0, t1, t2, and a planning horizon of three time units, for the 12-bus case study shown in Figure 5.1.
In this example, we assume that any generating source within GAi can accommodate
the forecasted load for Ai; if that is not the case, then more than one generating point
from the set can be selected. A three-tier view of the constraints is shown in Fig. 5.2,
where generating sources are connected by lines to show the data center facilities they
can serve and the shared transmission line they need to couple to. Moreover, for each
generating source, the associated startup delays and generation costs for a planning
horizon of three time units t0, t1, t2 are shown in Fig. 5.3.
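The staggering implied by these constraints can be verified by brute force. The sketch below (an illustrative aid, not part of the thesis) encodes the policy sets and bottleneck-line assignments from Fig. 5.2 and enumerates every per-time-slot selection; it confirms that at most two of the three facilities can be powered in any single time unit, which is why the schedule must be spread over the planning horizon.

```python
from itertools import product

# Policy sets G_Ai and the bottleneck line each source must couple to (Fig. 5.2).
policy = {"A1": ["g1", "g2", "g3"], "A2": ["g4", "g5"], "A3": ["g6", "g7", "g8"]}
line = {g: "gamma1" for g in ("g1", "g2", "g3", "g8")}
line.update({g: "gamma2" for g in ("g4", "g5", "g6", "g7")})

best = 0
# For each facility, pick one preferred source or None (facility idle this slot).
for choice in product(*[opts + [None] for opts in policy.values()]):
    picked = [g for g in choice if g is not None]
    used = [line[g] for g in picked]
    if len(used) == len(set(used)):  # each bottleneck line carries one source
        best = max(best, len(picked))

print(best)  # at most 2 facilities can be powered in the same time unit
```

Since both of A2's sources and two of A3's three sources sit on γ2, while A3's remaining source g8 shares γ1 with all of A1's sources, every simultaneous assignment of all three facilities double-loads a bottleneck line.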
5.2 Related Work
The related work can be broadly categorized into three thrusts, namely power procure-
ment for data centers, distributed generation, and generator scheduling.
The joint evolution of big data industry and smart grid invites a new era of power
procurement for data centers. The current efforts in power procurement for data centers
can be subdivided into the following categories. The first category deals with making the best use
of local renewable power sources like wind, solar, battery banks, and backup generators
[137–139]. The second category focuses on geographical distribution of workloads and
geographical or temporal price diversity [134, 135, 140]. The third category focuses on
demand response via workload scheduling, incorporating the dynamic nature of electricity
pricing by adjusting power usage patterns. Some of the demand response strategies also
take into account the use of renewables and local energy sources [141–143]. The fourth
category focuses on improving the efficiency and management of hardware and software
resources [144,145].
We want to point out that the common objective of all four categories is reducing
electricity bills. However, we assert that aside from lowering the electricity cost, the
emerging scenarios require a power procurement practice that incorporates distributed
generation in the presence of policy, transmission line, and generation delay constraints.
Incorporation of these constraints has not been well explored by the research community. In
comparison, in this work we consider power procurement in the presence of several physi-
cal constraints imposed by the power grid, as well as policy constraints from a data center
operator’s perspective, while minimizing the electricity cost. In this context, to the best
of our knowledge, this is the first work that captures the aforementioned constraints while
lowering electricity costs.
The distributed generation thrust can be subdivided into the following directions.
From the perspective of active distribution networks, scheduling of distributed energy
resources with respect to the network topology has been considered for the optimal co-
ordination of energy resources [146]. Reference [147] presents production cost minimiza-
tion and network constraint management for active management of distributed energy
resources. Distributed generation (DG) has been explored in terms of compact integrated
energy systems that use 25% fewer semiconductors in [148]. Control techniques for integration
of DG resources into the electrical power network have been studied in [149].
Furthermore, [150] studies the dynamic allocation of the resources in the dispatch of DG
using an evolutionary game theoretical approach for optimal and feasible solutions in a
microgrid structure. Moreover, the placement of generators in DG for loss reduction in
primary distribution networks has been studied in [151]. References [152, 153] present a distribution
optimal power flow model to integrate local distribution system feeders into a smart
grid. Reference [154] presents a jump and shift method to handle the multi-objective
optimization in the smart energy management systems, where the goal is to minimize
the operating costs by reducing the emission levels while meeting the load demands.
In contrast, the primary objective of our work is to explore how the emerging policy
and transmission line constraints can significantly affect the complexity of the power
procurement problem in the presence of distributed generation, which has not been ex-
plored previously in a holistic fashion. In this vein, we present a novel approach based
on matroid theory to provide a better insight into the problem.
A counterpart of this problem is the generator scheduling problem, where a set of
generators are committed to meet the load demands. Intelligent generation scheduling
(involving unit commitment [155] and economic dispatch [156]) has been widely studied,
but classically generator scheduling has not accounted for policy and transmission line
constraints together, often leaving their inclusion to the next stage of economic dispatch.
However, the growing trend in power systems is one in which these limitations must be
accounted for early on during the scheduling phase to provide a more comprehensive
view [157]. Therefore, research has been done to solve the generator scheduling problem
for real-life large-scale power systems (e.g., see [158]), and also for generator scheduling
in the presence of security constraints [159] and transmission line capacity constraints
[160]. However, the growing emphasis on environmental responsibility calls for inclusion
of policy-based constraints. Generation scheduling taking into account operational policy,
in terms of the number of control actions, has been studied in [161, 162]. Similarly, [163, 164]
explore maximization of social welfare and management of environmental impact issues.
A generalized formulation for artificial-intelligence-based intelligent energy management,
minimizing the operating cost and environmental impact of a microgrid, has been explored
in [165].
In contrast, the emerging green scenarios call for policy initiatives that have not been
addressed before, e.g., the option of explicitly selecting among the energy sources, which
is addressed in our work.
5.3 Contributions
Power procurement to match the demands of geographically distributed data center facilities
while simultaneously meeting constraints involves extensive combinatorics due to
the discrete nature of the problem: a generating source is either selected or not.
An alternative approach could relax the integer constraints imposed by the problem and
solve it via linear programming; however, such a solution may lack optimality and in some
cases rounding might result in an infeasible solution. Thus, in this work we propose an
approach for combinatorial optimization, suitable for distributed computing, that aims
to leverage the natural symmetry and structure of the CPP problem constraints while
providing insight on issues of computational complexity.
Our contributions are three-fold:
1. We develop a framework for CPP in which the problem is broken down into building
blocks effective in relating procurement complexity to physical system constraints;
specifically, we introduce the 1-restricted constrained power procurement (1-RCPP)
and the T -restricted constrained power procurement (T-RCPP) problems.
2. We present distributed polynomial time solutions to the 1-RCPP and T-RCPP
problems. The 1-RCPP problem assumes zero delay and a planning horizon of one
time unit thus dealing only with transmission line and policy constraints, which can
be represented as matroids. The T-RCPP problem extends the 1-RCPP problem
to a planning horizon of T time units and includes delay constraints. Moreover, our
distributed solution proposes a novel way for assignment of IDs to the generating
sources to facilitate the distributed solution.
3. We show how, for less restrictive physical constraints, the CPP problem is NP-hard
to approximate within a ratio of n^{1−ε} for any constant ε > 0.
We want to point out that one of the primary objectives of our work is to explore
how the emerging constraints can significantly affect the complexity of the problem.
Therefore, we deviate from the classical formulation to consider a new approach that allows
us to use matroid theory to gain better insight into the problem. We believe that only
by studying a problem that deviates from the classical formulation can we shed
light on the interaction between constraints and complexity.
5.4 Problem Formulation
We consider a set of k data center facilities A1, · · · , Ak, with forecasted load demand
DAi for facility Ai, that can procure power from a set of n sources G = {g1, · · · , gn},
and a planning horizon T . Each source is associated with a variable uijt ∈ {0, 1}, where
uijt = 1 means that generating unit gi is scheduled for procurement for facility Aj at time
t. A generator gi generates power Wit at time t such that Wi^min ≤ Wit ≤ Wi^max.
There are q buses B = {b1, · · · , bq}; let ∧bi be the set of indices of the generating
sources at bus bi, and Dbi,t be the forecasted demand at bus bi in time period t. We assume
ℓ transmission lines Γ = {γ1, · · · , γℓ} and a transmission capacity Fγi for each transmission
line γi. Let τγi,bj be the line flow distribution factor at transmission line γi for the
bus bj. For the case study of Section 5.1.1 the corresponding set of generating sources
is G = {g1, g2, . . . , g8}, the set of (bottleneck) transmission lines is Γ = {γ1, γ2}, and the
data center facilities are A1, A2, and A3.
Each generating source is associated with a time-dependent generation cost c : G ×
{0, 1, · · · , T − 1} −→ R≥0 that assigns a non-negative cost to each generating point gi
over all time units tj ∈ {0, 1, · · · , T − 1}, such that c(gi, tj) is the generation
cost of gi at time tj. In addition, each generating source gi is associated with a generation delay
(advance notification) d(gi) ∈ {0, 1, · · · } to reflect wait times in startup. If a data center
facility has finalized gi at tj in its procurement plan, the actual time when gi couples to
the transmission line is d(gi) + tj.
Furthermore, we define a vector z of length k, where zj, associated with facility Aj,
is defined as:

zj = 1 if Σ_{gi∈sj} Wit uij(t−d(gi)) = DAj, and zj = 0 otherwise.   (5.1)
An allocation at time ti, denoted by ati ⊆ G, is a set of generating sources powering
the facilities at time ti. A power procurement schedule is defined to be the set of alloca-
tions at times 0, 1, · · · , T − 1 for a planning horizon of T . A feasible power procurement
schedule is defined to be a schedule that satisfies the following three constraints:
1. Policy constraints: the demand of each data center facility must be met while
respecting the policy initiatives of that facility. More specifically, each
facility Ai is associated with a set si ⊆ G of generating points that it opts for to meet
its demand.
2. Transmission line constraints, captured by the dc power flow model:

Σ_{bj} τγi,bj ( Σ_{k∈∧bj} ukt Wkt − Dbj,t ) ≤ |Fγi|,  ∀ γi ∈ Γ, t ≤ T
3. Delay constraints whereby in a feasible allocation and for a planning horizon T ,
a generating point gi can only be allocated its required transmission line at time tj
if d(gi) + tj ≤ T − 1. The delay constraint captures the minimum downtime or advance
notification of a generating point.
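As a small illustration (with hypothetical values, not taken from the thesis), the delay constraint reduces to a one-line check:

```python
def delay_feasible(d_gi: int, t_j: int, T: int) -> bool:
    """A source with startup delay d(gi), allocated at time tj, is feasible
    only if it couples to its line within the horizon: d(gi) + tj <= T - 1."""
    return d_gi + t_j <= T - 1

print(delay_feasible(2, 0, 3))  # True: couples at time 2, the last slot
print(delay_feasible(2, 1, 3))  # False: would couple at time 3, past the horizon
```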
The cost of an allocation ati is defined to be the sum of the generation costs of all the
generating sources in ati: cost(ati) = Σ_{gj∈ati} c(gj, ti). Subsequently, the cost of a
power procurement schedule Q is given by the sum of the costs over all its allocations:
cost(Q) = Σ_{ati∈Q} cost(ati).
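The two cost definitions translate directly into code. The snippet below is an illustrative sketch with hypothetical cost values; `c` maps (generator, time) pairs to generation costs.

```python
def allocation_cost(alloc, t, c):
    # cost(a_t): sum of the generation costs of the sources in the allocation
    return sum(c[(g, t)] for g in alloc)

def schedule_cost(schedule, c):
    # cost(Q): sum of allocation costs over the planning horizon
    return sum(allocation_cost(alloc, t, c) for t, alloc in enumerate(schedule))

# Hypothetical two-slot schedule over generators g1 and g4.
c = {("g1", 0): 7, ("g4", 0): 9, ("g1", 1): 6, ("g4", 1): 8}
Q = [{"g1", "g4"}, {"g1"}]
print(schedule_cost(Q, c))  # 7 + 9 + 6 = 22
```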
Definition 30 (Constrained Power Procurement (CPP) Problem): For an instance of
the constrained power procurement problem with policy, transmission line, and delay con-
straints as defined above, find a least cost feasible power procurement schedule; that is,
no other feasible schedule has a lower cost.
The objective of the CPP problem is to power up the maximum possible number of
facilities while respecting their policy initiatives without violating the policy, transmission
line and delay constraints such that the solution has the least cost amongst all constrained
possibilities. The CPP problem can be summarized as below:
min Σ_{t=0}^{T−1} Σ_{j=1}^{k} Σ_{i=1}^{n} uijt c(gi, t),   max (||z||1)   (5.2)

subject to

Σ_{gi∈sj} Wit uij(t−d(gi)) ≤ DAj,  ∀ Aj, t   (5.3)

Σ_{j=1}^{k} uijt ≤ 1,  ∀ gi, t   (5.4)

Σ_{bj} τγi,bj ( Σ_{ℓ∈∧bj} Σ_{m=1}^{k} uℓm(t−d(gℓ)) Wℓt − Dbj,t ) ≤ |Fγi|,  ∀ γi, t   (5.5)

t + d(gi) ≤ T − 1,  ∀ uijt = 1   (5.6)

zj = 1 if Σ_{gi∈sj} Wit uij(t−d(gi)) = DAj, and zj = 0 otherwise,  ∀ Aj, t   (5.7)
Note that Equations 5.3, 5.5, and 5.6 deal with the policy, transmission line, and
delay constraints respectively.
5.4.1 Why Matroids
In this section we motivate the selection of matroid theory as a tool for the power
procurement problem. We start by presenting a simple instance of the CPP problem
with six generators G = {g1, g2, g3, g4, g5, g6}, where c(g1, 0) = 7, c(g2, 0) = 6, c(g3, 0) =
6, c(g4, 0) = 9, c(g5, 0) = 7, c(g6, 0) = 9, T = 1, and d(gi) = 0 for all i. There are three
areas A1, A2, A3, and three transmission lines γ1, γ2, and γ3. The transmission line γi
is connected to the bus bi, where τγi,bi = 1 and Dbi,t = 0. Furthermore, Wi0 = 1,
DAi = 1, and Fγi = 1 for all i. The policy sets are: s1 = {g3}, s2 = {g2, g4, g6}, and
s3 = {g1, g5}. The generators connected to buses b1, b2, and b3 are {g1, g2, g3}, {g5}, and
{g4, g6}, respectively. This instance belongs to the Mixed Integer Programming (MIP)
family, as given in Figure 5.4.
Equations 5.12 - 5.26 capture Equation 5.1.
We start by describing the complexity of the problems belonging to the MIP family,
and then focus on some heuristic approaches (e.g., relaxation) that might work for
certain scenarios in MIP.
We first note that Integer Programming (IP)/Mixed Integer Programming (MIP) is
not only NP-hard but also has no known polynomial-time approximation algorithm, since
the Boolean Satisfiability problem can be solved using IP [166–168]. It is so computationally
complex that, even for the case where the number of constraints is superlinear
in the number of variables n, the existence of any algorithm with complexity O(2^{(1−s)n}) for
s > 0 contradicts the well-known Strong Exponential Time Hypothesis (SETH), which states
that k-SAT does not have a subexponential-time algorithm [169, 170]. In fact, IP/MIP
are so general that virtually any combinatorial optimization problem can be transformed
into IP/MIP [171]. Furthermore, an NP-hard problem with no known polynomial-time
approximation algorithm, e.g., IP/MIP, is believed to have exponential computational
complexity [166], and is not tractable [172].
The commonly used approaches to solve IP/MIP include branch-and-bound and relaxation.
Solution techniques based on branch-and-bound methods have exponential
complexity [173]. Furthermore, an approach based on relaxation may not produce a
feasible solution by means of trivial rounding. For instance, relaxing the constraint on
the variables uij0 so that they may take any real value between 0 and 1, and then rounding
the solution obtained, seems computationally tractable, but such an approach can
result in an infeasible solution. For example, the relaxation of the binary variables uij0
(the last constraint) in the program given in Figure 5.4 yields the following solution:
u130 = 0, u220 = 0, u310 = 1, u420 = 0.5, u530 = 1, u620 = 0.5. It is easy to see that
both simple rounding up and rounding down result in an
min (7u130 + 6u220 + 6u310 + 9u420 + 7u530 + 9u620), max (||z||1) (5.8)
subject to u310 ≤ 1 (5.9)
u220 + u420 + u620 ≤ 1 (5.10)
u130 + u530 ≤ 1 (5.11)
α1 = u310 − 1 (5.12)
−α1 ≤ M1(1 − z1) (5.13)
α1 ≤ −m1(1 − z1) (5.14)
m1(|α1| + 1) < 0.1 (5.15)
|α1| − M1 < 0 (5.16)
α2 = (u220 + u420 + u620) − 1 (5.17)
−α2 ≤ M2(1 − z2) (5.18)
α2 ≤ −m2(1 − z2) (5.19)
m2(|α2| + 1) < 0.1 (5.20)
|α2| − M2 < 0 (5.21)
α3 = (u130 + u530) − 1 (5.22)
−α3 ≤ M3(1 − z3) (5.23)
α3 ≤ −m3(1 − z3) (5.24)
m3(|α3| + 1) < 0.1 (5.25)
|α3| − M3 < 0 (5.26)
u130 + u220 + u310 ≤ 1 (5.27)
u530 ≤ 1 (5.28)
u420 + u620 ≤ 1 (5.29)
mj, Mj ∈ R+ (5.30)
z1, z2, z3 ∈ {0, 1} ∀ Aj (5.31)
uij0 ∈ {0, 1} ∀ i, j (5.32)
Figure 5.4: An instance of the CPP problem.
infeasible solution: specifically, rounding up violates the policy constraint, and rounding
down fails to serve area A2. Note that sophisticated rounding techniques specific
to certain problem formulations might be developed that can possibly ensure feasibility.
Given that relaxation techniques might be developed specific to a problem
formulation, it is an interesting open question to investigate whether such a technique can be
developed for the CPP problem that allows an effective binary-recovery procedure, given
the specific form of the constraints in the problem formulation.
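The infeasibility of trivial rounding can be verified mechanically. The sketch below (an illustrative aid, not part of the thesis) takes the fractional solution quoted above and tests both roundings against A2's policy constraint (Eq. 5.10) and A2's demand:

```python
import math

# Fractional LP solution obtained by relaxing the program in Figure 5.4.
frac = {"u130": 0.0, "u220": 0.0, "u310": 1.0,
        "u420": 0.5, "u530": 1.0, "u620": 0.5}

def policy_ok(u):
    # Eqs. 5.9-5.11: each facility procures from at most one chosen source.
    return (u["u310"] <= 1 and
            u["u220"] + u["u420"] + u["u620"] <= 1 and
            u["u130"] + u["u530"] <= 1)

def a2_served(u):
    # A2's unit demand is met only if one of its policy-set sources is selected.
    return u["u220"] + u["u420"] + u["u620"] >= 1

up = {k: math.ceil(v) for k, v in frac.items()}
down = {k: math.floor(v) for k, v in frac.items()}
print(policy_ok(up))    # False: rounding up selects both g4 and g6 for A2
print(a2_served(down))  # False: rounding down leaves A2 unserved
```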
Furthermore, the solutions for IP/MIP are usually centralized, whereas the mammoth
size of future power systems and the increasing number of data center facilities call
for a distributed solution [174, 175]. Even the distributed solutions for IP/MIP are in fact
parallel algorithms assuming loosely synchronous models [176, 177], which are suitable only
for co-located high-performance computing (HPC) clusters, for example IBM's CPLEX
on an HPC cluster. These solutions cannot be used for distributed entities
that are geographically far apart and lack the reliable communication mechanism
required by HPC clusters. Note that Information and Communication Technology (ICT)
support in the evolving smart grid is mostly based on asynchronous TCP and unreliable
IP infrastructure, so there is a need for distributed solutions that solve the power procurement
problem asynchronously while taking into account the complexity class of the problem
instance. Furthermore, any efficient distributed solution utilizing a TCP/IP infrastructure
needs to take into account the inherent rate-control limitations of TCP, which arise
when a huge number of messages is exchanged (packets sent) over the network.
The power procurement problem is a combinatorial optimization problem; choosing
the least-cost procurement by a generate-and-test approach requires processing
O(2^{|G|}) possibilities. A major challenge in solving combinatorial optimization problems
is to develop solutions with an acceptably small number of computational steps. An accepted
standard in the realm of combinatorial optimization is smartly crafted problem-specific
solutions that require a number of computational steps bounded by a
polynomial in the size of the problem. Such polynomial-time solutions and techniques
exploit the specific structure of the problem. Each such solution usually has its own
niche where it performs efficiently. On the other hand, solutions and techniques that
try to solve combinatorial optimization problems in general, like IP/MIP, are not very
efficient, as they do not exploit problem-specific characteristics. Moreover, there is no
possibility of establishing bounds on the length of the computations involved in solving
IP/MIP [171]. It is therefore important to classify the problem into different complexity
classes, and then, based on the classification, use the most efficient solution in terms of
computational complexity.
Given the combinatorial nature of the problem, we have used matroid theory to cap-
ture the variety of constraints faced by evolving power procurement scenarios. The struc-
ture of matroids enables one to effectively rule out large groups of generating choices for
procurement with a simple test in contrast to sifting through all possible combinations.
We use matroid theory to develop an efficient solution with polynomial time computa-
tional complexity. Our formulation enables us to classify the problem at hand based on
computational complexity, identifying easy and hard instances of the problem. Moreover,
the theoretical analysis of the computational and message complexity of the proposed
matroid-intersection-based scheme is of particular interest in the context of designing
industrial-scale systems. This analysis can help in deciding the performance
requirements of system components such as the CPU/controller, and the communication
requirements on the interconnecting TCP/IP network.
5.5 Restricted Constrained Power Procurement Problem
In this section we deal with a restricted version of the CPP problem, called the restricted
constrained power procurement (RCPP) problem, where Wi = DAj ∀ gi ∈ sj, and
Wgi < Fγj < 2Wgi ∀ gi ∈ ∧bk with τγj,bk ≠ 0 (i.e., bus bk injects flow into the transmission line
γj).
Later, in Section 5.6 we show that without this restriction the problem becomes NP-
hard. We start by presenting the solution to the 1-RCPP problem which exhibits the
following additional restrictions:
• T = 1, which naturally implies
• d(gi) = 0 ∀i and
• c(gi, tj) = c(gi) ∀i, j.
It is obvious that the least-cost schedule for the 1-RCPP problem is the same as the least-cost
allocation at time ti = 0, since the planning horizon is just one time unit. Thus, we
may use the solution to the 1-RCPP problem as a building block for the solution of the
T-RCPP problem with planning horizon T .
5.5.1 Relation to Matroids
To apply matroid theory, we first capture the constraints of Section 5.4 as matroids
defined in Chapter 2.
5.5.1.1 Policy Constraints
We first define a set PC(G) = {s1, · · · , sk} consisting of distinct (non-overlapping)
policy sets si ⊆ G defined for each facility Ai. For the facilities A1, A2, and A3 in
the case study of Section 5.1.1 the corresponding policy constraint set is: PC(G) =
{{g1, g2, g3}, {g4, g5}, {g6, g7, g8}}. To capture the policy constraints we start by working
in a k dimensional vector space, with k basis vectors given by b1, b2, · · · , bk. We further
define a matrix A with k rows and n columns. Each generating point gi ∈ G is associated
with a unique column vector zi in matrix A. Each zi is a vector in the k dimensional
vector space. To each policy set si ∈ PC(G), we assign a unique base vector bi. More-
over, all the generating points in si are assigned unique column vectors that are positive
(sequential) integer multiples of bi such that they are unique but linearly dependent over
bi. Specifically, if gj, gk, · · · , gy+x ∈ si, then we assign vector zj = bi to gj, zk = 2 ·bi to gk,
and similarly to the remaining generating points using consecutive integer multiples of bi.
Note that this unique vector shall be later used as unique ID for each of the generating
sources in the distributed algorithm presented in Section 5.5.2.
We define L1 = {L1 | L1 ⊆ G and all the column vectors zi associated with the
gi ∈ L1 are linearly independent}; that is, L1 is the family of all subsets of
G whose associated column vectors are linearly independent.
Lemma 31 SM ≜ (G, L1) is a vector matroid over ground set G, with the collection of
independent sets given by L1.
Proof: Follows directly from the definition of vector matroid [38].
For our case study of Section 5.1.1 the matrix A is given below:

A =
[ 1 2 3 0 0 0 0 0 ]
[ 0 0 0 1 2 0 0 0 ]
[ 0 0 0 0 0 1 2 3 ].
In this example the ground set for SM is G = {g1, . . . , g8}. Furthermore, L1 is the set of
all subsets of G whose associated column vectors are linearly independent; an example
of one such set is {g1, g4, g7}.
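Because every column of A is, by construction, a positive integer multiple of exactly one basis vector bi, testing membership in L1 reduces to checking that no two chosen sources share a policy set, i.e., no two columns lie on the same basis direction. A minimal sketch (an illustrative aid, not part of the thesis):

```python
# Basis direction (policy set) that each generator's column of A lies on.
policy_row = {"g1": 0, "g2": 0, "g3": 0,   # multiples of b1
              "g4": 1, "g5": 1,            # multiples of b2
              "g6": 2, "g7": 2, "g8": 2}   # multiples of b3

def in_L1(gens):
    """Columns are linearly independent iff no two share a basis direction."""
    rows = [policy_row[g] for g in gens]
    return len(rows) == len(set(rows))

print(in_L1({"g1", "g4", "g7"}))  # True: one source per policy set
print(in_L1({"g1", "g2", "g7"}))  # False: g1 and g2 are both multiples of b1
```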
5.5.1.2 Transmission Line Constraints
We define a family of subsets of G, TLC(G) = {x1, · · · , xℓ}, xi ⊆ G ∀i, where set xi
contains all generating sources that require line γi but have conflicts such that their
total output exceeds the line capacity Fγi. Let L2 be the family of all subsets of G
that are feasible with respect to the transmission line constraints, that is, that have no
transmission line conflicts. Formally, L2 = {L2 | L2 ⊆ L′2}, where L′2 ⊆ 2^G such that
L′2 = {e1, · · · , ek | e1 ∈ x1, · · · , ek ∈ xk, k ≤ ℓ}.
Lemma 32 TLM ≜ (G, L2) is a partition matroid over ground set G, with the collection
of independent sets given by L2.
Proof: The proof is based on two observations. The first is that each L2 is a
partial transversal (as defined in Section 2.2) of the transmission line constraint set
TLC(G) = {x1, · · · , xℓ}, so L2 is the set of all partial transversals of the transmission
line constraint set TLC(G). The second observation is that the transmission
line constraint set TLC(G) defines a partition of G. Hence, by the definition of a partition
matroid [38], TLM is a partition matroid over the ground set G.
For the case study of Section 5.1.1, the ground set for TLM is G, and L2 is the set of all
partial transversals of TLC(G); one such independent set is {g1, g4}.
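Membership in L2 is equally direct: an independent set picks at most one source from each block of the partition TLC(G). A sketch for the case study (an illustrative aid, not part of the thesis):

```python
# Partition of G induced by the two bottleneck transmission lines.
TLC = {"gamma1": {"g1", "g2", "g3", "g8"},
       "gamma2": {"g4", "g5", "g6", "g7"}}

def in_L2(gens):
    """Partial transversal test: at most one chosen source per line block."""
    return all(len(set(gens) & block) <= 1 for block in TLC.values())

print(in_L2({"g1", "g4"}))  # True: one source per bottleneck line
print(in_L2({"g1", "g3"}))  # False: g1 and g3 both require line gamma1
```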
5.5.2 A Distributed Algorithm for the 1-RCPP Problem
Since we are in the era of distributed generation, a distributed power procurement
scheme for distributed facilities is a natural fit. This section provides a distributed solution to the
power procurement problem where generating sources collaboratively and distributedly
decide the least cost schedule for powering up the geographically distributed facilities.
It has been shown in the previous section that policy constraints can be captured by
a vector matroid SM = (G,L1), and transmission line constraints can be captured by a
partition matroid TLM = (G,L2). Moreover, both the constraints need to be satisfied
for finding a feasible solution for the 1-RCPP problem.
Specifically, the 1-RCPP problem can be solved by finding the maximum cardinality
intersection of L1 and L2 (to maximize facilities powered up under constraints) with least
cost.
Lemma 33 The optimal solution to 1-RCPP is given by the least cost maximal
cardinality common independent set of matroids SM and TLM.
The proof of Lemma 33 is given in Appendix C.1.
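For small instances, Lemma 33 admits a brute-force reference check: enumerate subsets of the ground set, keep those independent in both matroids, and return a least-cost one of maximum cardinality. The sketch below is exponential time and purely illustrative; the toy oracles and costs are hypothetical (here both oracles happen to be partition matroids), not the case-study data, and the distributed Algorithm 1-RCPP below replaces this enumeration with matroid intersection.

```python
from itertools import combinations

def best_common_independent_set(ground, indep1, indep2, cost):
    """Brute-force reference for Lemma 33 (small instances only): among all
    common independent sets of the two matroids, return a maximum-cardinality
    one of least total cost."""
    for r in range(len(ground), 0, -1):
        candidates = [s for s in combinations(ground, r)
                      if indep1(s) and indep2(s)]
        if candidates:
            best = min(candidates, key=lambda s: sum(cost[g] for g in s))
            return set(best), sum(cost[g] for g in best)
    return set(), 0

# Toy instance (hypothetical data): matroid 1 forbids {a, b} together,
# matroid 2 forbids {b, c} together; both are partition matroids.
indep1 = lambda s: not {"a", "b"} <= set(s)
indep2 = lambda s: not {"b", "c"} <= set(s)
cost = {"a": 3, "b": 1, "c": 2}
sel, c = best_common_independent_set(["a", "b", "c"], indep1, indep2, cost)
print(sel, c)  # {'a', 'c'} 5
```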
We first give a brief overview of the proposed distributed algorithm. Each generating
source executes a copy of Algorithm 1-RCPP based on Edmonds' weighted matroid
intersection algorithm [38, 171, 178].
For the sake of simplicity, we assume that each generating source can communicate
with any other generating point (via a possible multi-hop or mesh network). Moreover,
each generating source gi has a unique vector ID idi that consists of two parts. The first
part id1i corresponds to the vector zi assigned to it (used for capturing policy constraints
by the vector matroid representation of Section 5.5.1). The second part id2i indicates
the transmission line it must couple with; for instance, the second part of idi for the
generating point gi is j if gi ∈ xj. For our case study the ID for g1 is id1 = 1001,
where id11 = z1 = 100 and id21 = 1 since g1 ∈ x1. We assume each
generating point knows the IDs of all the other generating points. The communication is
done in rounds. Each round is subdivided into three phases denoted Phase 1, Phase 2,
and Phase 3. In each round, the set of generators selected so far is denoted by Gon.
Using algorithm ChkL, each generator independently computes the sets G1, G2, →E and
←E. Sets G1 and G2 are the sets of generators that can be added to the generators
selected so far without violating the policy and transmission line constraints, respectively.
The sets ←E and →E represent sets of valid replacements for the generators selected so far
with respect to the policy and transmission line constraints, respectively. The algorithm is
initiated by an external call init(). Each generator maintains a vector Gon of length n
indicating the selection of a generator. The updated vector is communicated at the end
of the round. Gon is updated in each round until the optimal solution is reached. Initially
no generator is selected, and each generator has a non-negative cost. During phase 1
of each round, the generator gi will compute two vectors ←E and →E, each of length n,
indicating valid replacements w.r.t. L1 and L2. A valid replacement means that if gj is
selected in the set Gon instead of gi, then the resulting selection still represents
a valid independent set with respect to the corresponding matroid. Computation of the
sets ←E and →E is done locally, with the help of the identification vectors associated
with the generators, using the subroutine ChkL(g, G′, num) given in Figure 5.6, where
num ∈ {1, 2}. Given a set of selected generators G′ and a generator g, the subroutine
ChkL(g, G′, num) checks if {g} ∪ G′ ∈ Lnum. For instance, ChkL(g, G′, 2) checks if
{g} ∪ G′ ∈ L2.
If a generator gi is not selected, then it proceeds to determine two binary vectors G1 and
G2, each of length n, by invoking ChkL(gi, Gon, 1) and ChkL(gi, Gon, 2) respectively. The
vectors G1 and G2 represent the generators that can be selected along with the already
selected generators and still belong to the independent sets L1 and L2, respectively. If gi is
selected in G1, then it proceeds to phase 2 of the round, where it finds the least cost path,
in a distributed manner, among all (gk, gj) pairs of generators such that gk is selected
in G1 and gj is selected in G2. Note that vectors →E and ←E are used as routing tables,
informing gi about the outgoing neighbors (incoming next hops) and incoming neighbors
(outgoing next hops) respectively. Also note that these next hops are not necessarily
the physical next hop neighbors in the actual network; rather, this is just a virtual tag
used by gi. A least cost path is the path in which the sum of the costs of all the generators
traversed by it is the least. Initially the generator gi finds the least cost path from itself
to gm, denoted Ggi⇝gm, with cost cim. Note that cim can also be zero, which happens when
gi is selected in both G1 and G2 (i.e., gi finds a path to itself). Moreover, among all (gi, gj)
pairs of generators such that gj is selected in G2 and is connected through a least cost
path of cost cim, gm is also the least hop count him away from gi. Generator gi then exchanges
messages with all other generators selected in G1 about its discovery of the least cost cim
and hop count him. After receiving information about the least cost and least hop count
from all other generators selected in G1, it finds the minimum cost and least hop count
path among all the (gk, gj)
Algorithm 1−RCPP(G)  Code for gi, 1 ≤ i ≤ n
 1  When init():
 2      Gon(g) = 0 ∀ g
 3      c(gi) = c(gi)
 4  For each round
    Phase 1:
 5      ←E(g) = 0, →E(g) = 0, G1(g) = 0, G2(g) = 0 ∀ g
        Compute routing table:
 6      If Gon(gi) = 1
 7          For each gj such that Gon(gj) = 0
 8              G′ = Gon
 9              G′(gi) = 0
10              →E(gj) = ChkL(gj, G′, 1)
11              ←E(gj) = ChkL(gj, G′, 2)
12      Else
13          For each gj such that Gon(gj) = 1
14              G′ = Gon
15              G′(gj) = 0
16              ←E(gj) = ChkL(gi, G′, 1)
17              →E(gj) = ChkL(gi, G′, 2)
18      For each gj such that Gon[gj] = 0
19          G1(gj) = ChkL(gj, Gon, 1)
20          G2(gj) = ChkL(gj, Gon, 2)
    Phase 2:
21      If G1(gi) = 1
22          For all G2(gj) = 1
23              Find a least cost path from gi to gj using routing tables ←E and →E
24      Let gm be the generating source connected through the least cost path Ggi⇝gm
        with cost cim and hop count him
25      Send(<cim, him>) to every generating point gj with G1(gj) = 1
26      Upon Receive(<cjm, hjm>) from every processor gj with G1(gj) = 1
27      c∗m = min over k with G1(gk) = 1 of ckm. If ckm = c∗m for more than one
        generating point, choose the generating point with least hop count h∗m. Let
        G∗gi′⇝gm′ be the least cost path among all paths Ggk⇝gj such that
        G1(gk) = 1 and G2(gj) = 1
    Phase 3:
28      If c∗m = ∞
29          Then Gon is a maximum-cardinality common independent set.
            Send(<stop>) to all generating sources
30      Else-If gi′ = gi
31          Let the path G∗gi′⇝gm′ = y0, z1, y1, · · · , zl, yl
32          G′on(gk) = 1 such that gk ∈ {g | Gon(g) = 1} \ {z1, · · · , zl} ∪ {y0, · · · , yl}
33          Send(<G′on>) to all generating sources
34  When Receive(<G′on>):
35      Update Gon
36      If Gon[gi] = 1
37          c(gi) = −c(gi)
38      Else
39          c(gi) = c(gi)
40      goto step 4
41  When Receive(<stop>):
42      If Gon[gi] = 1, then start working.

Figure 5.5: Algorithm 1−RCPP
pairs of generators such that gk is selected in G1 and gj is selected in G2. If more than
one generator has the same value of the least cost path cm, then identifications are used
to break the ties. Let this least cost and least hop count path be G∗gi′⇝gm′ = y0, z1, y1, · · · , zl, yl
with cost c∗m and hop count h∗m, where gi′ is the generator which is the source of the
least cost path. Here y0 = gi′ such that gi′ is selected in G1, and yl = gm′ such that gm′ is
selected in G2. The generator gi proceeds to phase 3 if gi = gi′ (i.e., the shortest path
starts from gi), and informs all the generators about the new selection of generators G′on;
every generator that receives this message updates its Gon vector accordingly. If
generator gj is among the updated list of selected generators in Gon, then it sets its cost
to be −c(gj); otherwise it sets its cost to be c(gj). The algorithm then proceeds to the
next round and repeats the same steps until there exists no shortest path, i.e., c∗m is
infinite; then gi′ sends a stop message to everyone, indicating that the final selection has
been made and all the selected generators can start working.
The selection of generating points in the optimal solution needs at most ℓ rounds; this
is because the exact number of rounds needed is given by the maximum possible number
of areas served, which is bounded from above by ℓ.
The formal Algorithm 1-RCPP for generating point gi is shown in Fig. 5.5.
Algorithm b = ChkL(gk, G′, num)
1  b = 1
2  If num = 1
3      For gj with G′[gj] = 1
4          if id1k and id1j are linearly dependent
5              Return b = 0
6  If num = 2
7      For gj with G′[gj] = 1
8          if id2k = id2j
9              Return b = 0

Figure 5.6: Algorithm b = ChkL
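The subroutine can be rendered directly in code. The sketch below follows the pairwise checks of the figure: linear dependence of two ID vectors (num = 1) and equality of line indices (num = 2). The concrete IDs at the bottom are hypothetical examples consistent with the case-study narrative (z1 = 100, g1 ∈ x1), not data quoted from it.

```python
def is_dependent(v, w):
    """Two vectors are linearly dependent iff one is a scalar multiple of
    the other, i.e. all 2x2 cross terms v[i]*w[j] - v[j]*w[i] vanish."""
    return all(v[i] * w[j] == v[j] * w[i]
               for i in range(len(v)) for j in range(i + 1, len(v)))

def chk_l(ids, k, G_prime, num):
    """Return 1 if adding generator g_k to the current selection G_prime
    keeps it independent w.r.t. L_num (num=1: policy vector matroid,
    num=2: transmission-line partition matroid).  ids[i] = (id1_i, id2_i)."""
    for j, selected in enumerate(G_prime):
        if not selected:
            continue
        if num == 1 and is_dependent(ids[k][0], ids[j][0]):
            return 0               # policy vectors clash
        if num == 2 and ids[k][1] == ids[j][1]:
            return 0               # same transmission-line block
    return 1

# Hypothetical IDs: g1 = (z1=(1,0,0), line 1), g2 = (z2=(2,0,0), line 1),
# g4 = (z4=(0,1,0), line 2).  The selection vector marks g1 as chosen.
ids = [((1, 0, 0), 1), ((2, 0, 0), 1), ((0, 1, 0), 2)]
print(chk_l(ids, 1, [1, 0, 0], 1))  # 0: z2 is a multiple of z1
print(chk_l(ids, 2, [1, 0, 0], 2))  # 1: g4 uses a different line
```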
Theorem 34 The distributed Algorithm 1-RCPP gives an optimal solution to the
1-RCPP problem.
The proof of Theorem 34 is given in Appendix C.2.
5.5.3 Complexity Analysis
The complexity of Algorithm 1-RCPP is measured in terms of the number of messages
exchanged M and the local computation time at each generating point Tlocal.
The number of messages sent across the network in Phase 2 is incurred in two steps.
First, in computing the shortest path from all generating sources in G1 to all generating
sources in G2, which in the worst case amounts to all-pairs shortest paths; the message
complexity of the all-pairs shortest paths algorithm given by S. Halder [179] is 2n²
messages. Second, in the worst case, if |G1| = |G2| = n, the number of messages sent
is n². In Phase 3 the source node of the selected shortest path sends n messages, one to
each generating source. Therefore the total message complexity for the three phases of a
round is O(n²). There can be at most ℓ rounds. Therefore the total number of messages
exchanged is M = O(ℓ n²).
The time complexity is incurred at the following steps of each round. The time to fill
the routing tables using subroutine ChkL is O(n²), as each generating point checks its
neighborhood against at most all other generating sources; precisely, checking one entry
in vectors →E and ←E requires O(n) steps. Similarly, computing the entries of vectors
G1 and G2 also requires O(n²) computations each. Steps 24 and 27 require the
identification of generating sources with minimum distance and hops, which necessitates
at most O(n²) computations. Therefore in each round the total local time complexity at
each generating source is at most O(n²), and the total complexity over all rounds is
Tlocal = O(ℓ n²).
5.5.4 T-Restricted Constrained Power Procurement Problem
In Section 5.5 we presented a solution for the 1-RCPP problem. In this section we extend
this solution to the T-RCPP problem. We start by extending matroids SM and TLM
to capture time-varying generation costs c(gi, tj) and generating point delays d(gi). This
extension is based on the concept of time expanded matroids [180].
5.5.4.1 Extended Ground Set
We start by extending the ground set G to the extended ground set EG.
EG ≜ {(gi, tj) | gi ∈ G, tj ∈ {0, · · · , T − 1}, d(gi) + tj ≤ T − 1}.
EG consists of all generating sources and their possible allocation times, represented
as a set of pairs (gi, tj) that satisfy the delay constraints. For the case study given in
Section 5.1.1, the extended ground set is: EG = {(g1, 0), · · · , (g8, 0), (g1, 1), (g2, 1), (g3, 1),
(g6, 1), (g8, 1), (g2, 2), (g6, 2)}.
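The construction of EG is a one-line comprehension over delays and the horizon. In the sketch below the delay values d(gi) are inferred so that the listed case-study EG is reproduced (sources with larger delays get fewer feasible slots); they are illustrative assumptions, not values quoted from the case study.

```python
# Build the extended ground set EG = {(g, t) : t in {0,...,T-1}, d(g)+t <= T-1}.
def extended_ground_set(delays, T):
    """Pairs (generator, start slot) that can complete within the horizon."""
    return [(g, t) for t in range(T)
            for g, d in delays.items() if d + t <= T - 1]

# Hypothetical delays, chosen to reproduce the EG listed above for T = 3.
delays = {"g1": 1, "g2": 0, "g3": 1, "g4": 2,
          "g5": 2, "g6": 0, "g7": 2, "g8": 1}
print(extended_ground_set(delays, T=3))  # reproduces the EG listed above
```

Note how a delay of 2 (e.g. g4) leaves only slot 0 feasible, while a delay of 0 (e.g. g2) allows all three slots.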
5.5.4.2 Policy Constraints
We extend SM, the matroid capturing the policy constraints, as follows. First, with each
(gi, tm) ∈ EG we associate a column vector zi. This is the same column vector zi
as that associated with gi in Section 5.5.1.1. We define EL1 = {EL1 | EL1 ⊆
EG such that all the (gi, tm) ∈ EL1 have linearly independent associated column vectors
for fixed time}. In other words, EL1 is the family of all subsets of EG whose associated
column vectors are linearly independent.
We define: ESM ≜ (EG, EL1).
For the case study given in Section 5.1.1, an example of a set EL1 belonging to EL1 is
EL1 = {(g1, 0), (g4, 0), (g7, 0)}, where the column vectors associated with (g1, 0), (g4, 0),
and (g7, 0) are 100, 010, and 002 respectively.
5.5.4.3 Transmission Line Constraints
We extend TLM, the matroid capturing the transmission line constraints, as follows.
Define EL2 = {EL2 | EL2 ⊆ ⋃_{j=0}^{T−1} EL2tj}, where each EL2tj ⊆ EG is such that
for each (gi, tm) ∈ EL2tj, d(gi) + tm = tj, and for all (gi, tm), (gr, tp) ∈ EL2tj, gi and gr
satisfy the line constraints.
Thus, each EL2tj contains the elements of EG which can use their required
transmission line at time tj without violating the line constraints.
We define: ETLM ≜ (EG, EL2).
For the case study given in Section 5.1.1, an example of a set EL2 belonging to EL2 is
EL2 = {(g3, 0), (g6, 0), (g4, 0)}. Note that (g3, 0), (g6, 0), and (g4, 0) all require the same
transmission line, that is, γ1, but they will actually couple to the transmission line at time
slots 1, 2, and 0 respectively, and therefore do not violate the transmission line constraint.
Lemma 35 Both ESM and ETLM are matroids. Furthermore, the optimal solution to
the T-RCPP problem is given by the least cost maximum cardinality common independent
set of matroids ESM and ETLM.
Proof: The fact that both ESM and ETLM are matroids follows directly
from the concept of time expanded matroids [180]. Then, using arguments similar to those
in the proof of Lemma 33, it can be shown that the intersection of ESM and ETLM
satisfies the policy, transmission line, and generation delay constraints, and that the
optimal solution to the T-RCPP problem is the least cost maximum cardinality common
independent set of ESM and ETLM.
A sketch of the solution for the T-RCPP problem, Algorithm T-RCPP, is shown in
Fig. 5.7. In Algorithm T-RCPP a generating source gi runs multiple copies of Algorithm
1-RCPP, one for each (gi, tj) ∈ EGgi, where EGgi represents the copies of gi at different
time slots. For the case study given in Section 5.1.1, the execution of Algorithm T-RCPP
schedules generating sources g3, g4 and g6, all at t = 0, for power procurement. The
sample execution of Algorithm T-RCPP for the case study given in Example 5.1.1 is
Algorithm T−RCPP()  Code for gi, 1 ≤ i ≤ n
1  Generate EGgi ≜ {(g, tj) | g = gi, tj ∈ {0, · · · , T − 1}, d(gi) + tj ≤ T − 1}
2  For each generating source (gi, tj) ∈ EGgi, assign a vector ID id(i,j) as follows:
3      id1(i,j) = zi
4      id2(i,j) = k such that gi ∈ xk
5      id3(i,j) = j + d(gi)
6  Broadcast id(i,j) for all (gi, tj) ∈ EGgi to all the other generating sources
7  For each (gi, tj) ∈ EGgi, call Algorithm 1-RCPP with the following modification in
   subroutine ChkL given in Fig. 5.6:
8      Step 8 is modified to "If id2k = id2j AND id3k = id3j"

Figure 5.7: Algorithm T−RCPP
shown in Figure 5.8. The Algorithm T-RCPP takes three rounds. The edges indicate
the valid replacements of the selected generating sources w.r.t. EL1 and EL2. Edges are
drawn according to the routing tables →E (incoming neighbors) and ←E (outgoing
neighbors). For a selected generating source gi, ←E(gj) = 1 (shown as an edge (gj, gi))
indicates that gj is a valid replacement of gi w.r.t. EL1, and →E(gj) = 1 (shown as an
edge (gi, gj)) indicates that gj is a valid replacement of gi w.r.t. EL2. The generating
sources that are selected in G1 and G2 are shown with circular and square boundaries
respectively. For a generating point g with Gon(g) = 0, the number on the left indicates
its cost, whereas the number on the right indicates the cost of the least cost path
originating from g (cm). For a generating source with Gon(g) = 1, its cost is shown on
its right. The generating source selected in the current round is shown in black (with
the corresponding cm shown in a gray circle). After the third round the final (optimal)
selection of the generating sources is also shown. The cost of the optimal solution is 12.
In all the rounds the shortest path is zero hops. In the first round, round1, generating
point (g3, 0) is selected, since for it both G1(g3, 0) = 1 and G2(g3, 0) = 1 and the cost of
the path from (g3, 0) to itself is the least among all the paths between any generating
point (gk, tl) with G1(gk, tl) = 1 and any generating point (gb, ty) with G2(gb, ty) = 1.
Similarly, in rounds round2 and round3 the generating sources (g6, 0) and (g4, 0) are
selected. Also note that all of the selected generating points require the same
transmission line but are not in conflict with each other, as they will be coupled in
different time slots due to the generation delays associated with them.
Theorem 36 The Algorithm T-RCPP(EG) provides an optimal solution to the T-Restricted
Constrained Power Procurement (T-RCPP) problem.
Proof: The correctness of the algorithm follows from Lemma 35 and the correctness of
the weighted matroid intersection algorithm of Edmonds [178].
Lemma 37 For Algorithm T-RCPP, the number of messages exchanged is M = O(T ℓ n²)
and the local time complexity is Tlocal = O(T ℓ n²).
Proof: The complexity of Algorithm T-RCPP follows directly from the complexity
of Algorithm 1-RCPP. The only difference is the increased number of generating points
(in the worst case, each generating point is copied once for each time slot), which is
precisely T times the actual number of generating points. Therefore the message and
local time complexity are each multiplied by a factor of T compared to Algorithm 1-RCPP.
5.6 Constrained Power Procurement Problem
In this section we show that the CPP problem is NP-hard by a reduction from the opti-
mization version of the independent set (IS) problem, i.e., the maximum independent set.
Before proceeding to the proof of NP-hardness, we give the conflict graph representation
of the transmission line constraints.
Definition 38 (Conflict Graph) A conflict graph Gc(V c, Ec) has vertex set defined as
V c = {g1, · · · , gn}, and edge set defined as Ec = {(gi, gj) | Wi + Wj > Lk, where gi, gj ∈ xk}.
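Definition 38 translates directly into code: for each line block xk, join two sources by an edge whenever their combined output exceeds that line's capacity. The blocks, capacities, and outputs below are hypothetical illustration data.

```python
from itertools import combinations

def conflict_graph(blocks, capacity, output):
    """Build the conflict graph G^c of Definition 38: vertices are the
    generating sources; an edge joins two sources on the same line block
    x_k whose combined output W_i + W_j exceeds the line capacity L_k."""
    vertices = sorted({g for xs in blocks.values() for g in xs})
    edges = set()
    for k, xs in blocks.items():
        for gi, gj in combinations(sorted(xs), 2):
            if output[gi] + output[gj] > capacity[k]:
                edges.add((gi, gj))
    return vertices, edges

# Hypothetical data: line 1 serves {g1, g2}, line 2 serves {g3, g4}.
blocks = {1: ["g1", "g2"], 2: ["g3", "g4"]}
capacity = {1: 10, 2: 10}
output = {"g1": 6, "g2": 7, "g3": 4, "g4": 5}
v, e = conflict_graph(blocks, capacity, output)
print(e)  # {('g1', 'g2')}: 6 + 7 > 10, while 4 + 5 <= 10
```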
Theorem 39 The Constrained Power Procurement problem is not only NP-hard, but
also NP-hard to approximate within a ratio of n^(1−ε) for any constant ε > 0.
Figure 5.8: A sample execution of Algorithm T-RCPP for the system shown in Example 5.1.1.
Proof: We first define an instance of the maximum independent set problem. An
instance IS of the maximum independent set problem is defined by a graph H(V ′, E ′),
with vertex set V ′ and edge set E ′. The objective is to find a maximum cardinality subset
of V ′ such that no two vertices in this subset are directly connected to each other. We
define the corresponding instance of the CPP problem as follows:
• G = V ′, and gi = v′i ∀i, i.e., each vertex v′i ∈ V ′ corresponds to a generating source gi.
• The policy constraint set is: PC(G) = {s1, · · · , s|V ′|}, where s1 = {g1}, · · · , s|V ′| =
{g|V ′|}.
• The conflict graph Gc(V c, Ec), capturing the transmission line constraints, is defined
as: V c = V ′, and Ec = {(gi, gj) | (v′i, v′j) ∈ E ′}. Thus, any two power sources (gi, gj)
cause an overload on the competing transmission line iff (v′i, v′j) is an edge in E ′.
• The planning horizon is defined by T − 1 = 0, that is, T = 1.
• For the delay constraints:
– d(gi) = 0 ∀i.
• c(gi, tj) = 1 ∀i, j.
Due to the one-to-one correspondence of graphs H(V ′, E ′) and Gc(V c, Ec), a maximum
independent set in H(V ′, E ′) is the same as a maximum independent set in Gc(V c, Ec).
Formally, an independent set in Gc(V c, Ec) corresponds to the generating sources that
can simultaneously be selected for power procurement, since they do not result in
transmission line overload. Furthermore, as each vertex in H(V ′, E ′) corresponds to a
facility area in CPP, finding the maximum cardinality independent set is equivalent to
maximizing the number of facility areas being served, and the cardinality of the maximum
independent set in Gc(V c, Ec) is equal to Σ_{si ∈ PC(G)} f(si). Finding the maximum
independent set is not only NP-hard but also NP-hard to approximate within a ratio of
n^(1−ε) for any constant ε > 0 [181]. Therefore the CPP problem is NP-hard to
approximate within a ratio of n^(1−ε) for any constant ε > 0.
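The reduction itself is a mechanical translation, which the following sketch makes explicit: an independent-set instance H(V′, E′) maps to a CPP instance whose conflict graph is a copy of H, with unit costs, zero delays, singleton policy sets, and horizon T = 1. The dictionary keys are an illustrative encoding invented here, not a format used elsewhere in the chapter.

```python
def is_instance_to_cpp(vertices, edges):
    """Map a maximum-independent-set instance H(V', E') to the
    corresponding CPP instance of Theorem 39."""
    return {
        "generators": list(vertices),            # g_i = v'_i
        "policy": [[v] for v in vertices],       # s_i = {g_i}
        "conflicts": [tuple(e) for e in edges],  # E^c mirrors E'
        "horizon": 1,                            # T = 1
        "delays": {v: 0 for v in vertices},      # d(g_i) = 0
        "costs": {v: 1 for v in vertices},       # c(g_i, t_j) = 1
    }

cpp = is_instance_to_cpp(["v1", "v2", "v3"], [("v1", "v2")])
print(cpp["conflicts"])  # [('v1', 'v2')]
```

Any algorithm serving k facilities in this CPP instance selects k mutually non-conflicting generators, i.e. an independent set of size k in H, which is what carries the hardness over.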
5.7 Performance Evaluation
5.7.1 An overview of relevant solution techniques
In this section we highlight the specific properties that make the proposed solution
technique a preferred choice. There are many ways to interpret the preference among
solutions, for instance the notions of feasibility, optimality, time complexity, and
restrictions on the objective function, if any. We discussed these important
characteristics in Section 5.1, which implies that choosing either a Linear Program or an
Integer Linear Program for solving the RCPP problem, proven in this chapter to be
polynomial time solvable, results in undesirable solution quality in terms of one or more
parameters compared to the presented matroid theory based solution. We therefore
compare our solution technique, in the subsequent section, to a greedy solution that has
desirable solution parameters similar to the proposed matroid theory based solution.
However, we point out that the other techniques do have merits in classical settings,
but they show weakness when we expand the classical problem to include the emerging
constraints.
5.7.2 Experimental Setup
To evaluate the performance of the proposed scheme for practical power systems, we
performed a simulation study using the IEEE 300 bus test system [182] to capture the
transmission line constraints. The IEEE 300 bus test system was developed by the IEEE
Test Systems Task Force in 1993, so it does not incorporate the smart grid scenario with
a large number of distributed generating sources. To incorporate a realistic setting with a smart
grid including a large number of generating points, we extended the number of generating
points in this system, which is a known practice (see for example [183]). Specifically, we
divided the given network into 30 distributed data center facilities. Note that different
load demands, operation of circuit breakers, manual operation of the isolators, maintenance
and failure patterns of different devices (transformers, generators, transmission lines,
buses, relays, etc.), and availability of renewable sources (such as power production from
wind turbines, solar cells, etc.) give rise to a large number of possible realizations of this
system. For each experiment we consider a realization of this system selected from all
possible realizations in an unbiased way. Each point in the graphs presented in this
section is an average taken over 100 experiments.
To supply a facility, at least one generating source must be able to supply its load
within the planning horizon, and the generating point should also be allocated a
transmission network with sufficient capacity to supply the power to the facility. All
the generating sources that share the same bottleneck link over the power transmission
network conflict with each other; this defines the Transmission Line Constraints. The
planning horizon is set to one time unit, and generation delays are set to zero, so the
Delay Constraints impose a schedule duration of one time unit (i.e., only one allocation
at time 0). Regarding Policy Constraints, we say a facility has been served if at least one
generating source is able to supply its load before the scheduling horizon elapses, and
the generating point is also allocated a transmission line to be able to supply the power
to the area.
The objective is to serve the maximum number of facilities such that the total
procurement cost is least, without violating the Transmission Line and Delay Constraints.
Throughout the rest of this section we shall use the word cost to mean the cost of
procurement.
5.7.3 Percentage procurement and Cost
In this set of experiments we use average percentage procurement, and average normalized
procurement cost as performance metrics.
For a given realization, the percentage procurement (α) is defined to be the percentage
of the facilities served without violating the constraints. For a given power network, the
procurement cost per facility (β) is defined to be the sum of the procurement costs of all
the selected generating sources divided by the number of facilities served.
For a given power network, the transmission network density (TND) is defined to
be the total number of transmission lines in the power network divided by the total
number of the facilities. Note that in a realization, there might be more transmission
lines available to serve one facility than another facility. So, TND shows on average, how
many transmission lines connect a facility to the generating sources. For a given power
network, the average procurement choices per facility (PCF ) is defined to be the total
number of generating sources divided by the total number of facilities.
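The four metrics above are simple ratios, summarized in the following sketch (the numeric inputs are hypothetical, chosen only to exercise the formulas):

```python
# Evaluation metrics of Section 5.7.3: percentage procurement (alpha),
# procurement cost per facility (beta), transmission network density (TND),
# and procurement choices per facility (PCF).
def metrics(n_facilities, n_served, selected_costs, n_lines, n_sources):
    alpha = 100.0 * n_served / n_facilities   # % of facilities served
    beta = sum(selected_costs) / n_served     # cost per served facility
    tnd = n_lines / n_facilities              # lines per facility
    pcf = n_sources / n_facilities            # generating sources per facility
    return alpha, beta, tnd, pcf

# Hypothetical realization: 30 facilities, 24 served by 4 selected sources.
a, b, t, p = metrics(n_facilities=30, n_served=24,
                     selected_costs=[5, 7, 6, 6], n_lines=900, n_sources=120)
print(a, b, t, p)  # 80.0 1.0 30.0 4.0
```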
The first set of experiments shows α and β as functions of TND. We consider three
different settings. Figures 5.9, 5.10, and 5.11 show α versus TND using Algorithm T-
RCPP and the greedy heuristic for different values of PCF. Figures 5.12, 5.13, and 5.14
show the corresponding β for the results in Figures 5.9, 5.10, and 5.11 respectively. The
results show that with a higher PCF, Algorithm T-RCPP achieves a higher α, whereas
the greedy heuristic does not show improvement even when PCF is increased. The
results also point out the large gap between the optimal solution (using Algorithm T-RCPP)
and the greedy solution in terms of both α and β. An interesting point to note is that
when PCF is below a certain threshold, even Algorithm T-RCPP, i.e., the optimal
solution, cannot provide 100% procurement. This follows from the fact that a lower PCF
results in more Transmission Line Conflicts. Note that where Algorithm T-RCPP can
Figure 5.9: Average Percentage procurement α for random power networks as a function of Transmission Network Density TND for PCF = 2.
Figure 5.10: Average Percentage procurement α for random power networks as a function of Transmission Network Density TND for PCF = 4.
Figure 5.11: Average Percentage procurement α for random power networks as a function of Transmission Network Density TND for PCF = 8.
provide up to 100% service, the greedy heuristic cannot provide more than 60% service.
It is important to note that the proposed Algorithm always guarantees a lower cost
solution compared to the greedy solution. Also, the difference between the proposed
solution and the greedy solution in terms of β is more significant when there are more
conflicts on the usage of transmission lines due to a smaller TND. Moreover, the
percentage procurement α for both the proposed solution and the greedy solution
increases as the number of available transmission lines per area increases. One of the
primary reasons for this is the fact that fewer transmission lines available to any service
area dictate more and more conflicts over the usage of transmission lines, resulting in
fewer areas being served even optimally. These results also highlight the significance of
the proposed solution for future power networks, where the density of generation sources
is envisioned to increase due to the increased penetration of distributed generation.
The second set of experiments shows the relationship of α and β versus the average
Figure 5.12: Average Procurement Cost per Facility β for random power networks as a function of Transmission Network Density TND for PCF = 2.
Figure 5.13: Average Procurement Cost per Facility β for random power networks as a function of Transmission Network Density TND for PCF = 4.
Figure 5.14: Average Procurement Cost per Facility β for random power networks as a function of Transmission Network Density TND for PCF = 8.
PCF (the average number of sources is gradually increased from 60 to 200). We consider
two different settings for TND. Figures 5.15 and 5.16 show the comparison of α versus
the average PCF, for power networks with TND equal to 30 and 45 respectively.
Figures 5.17 and 5.18 show the β corresponding to the results in Figures 5.15 and 5.16
respectively. The results show that Algorithm T-RCPP is able to provide 100%
procurement even with a small network density, which highlights the importance of the
proposed solution for achieving the best possible procurement in practical settings where
resources are usually scarce. Furthermore, having more resources (TND and PCF)
helps in reducing the overall procurement cost. The results also point out the large gap
between the optimal solution (using Algorithm T-RCPP) and the greedy heuristic in
terms of both α and β.
Figure 5.15: Average Percentage procurement α for random power networks as a function of Procurement Choices per Facility PCF for TND = 30.
Figure 5.16: Average Percentage procurement α for random power networks as a function of Procurement Choices per Facility PCF for TND = 45.
Figure 5.17: Average Procurement Cost per Facility β for random power networks as a function of Procurement Choices per Facility PCF for TND = 30.
Figure 5.18: Average Procurement Cost per Facility β for random power networks as a function of Procurement Choices per Facility PCF for TND = 45.
Note that the procurement cost for both the proposed solution and the greedy solution
decreases with the increase in PCF due to the availability of better and less
costly choices for powering up each facility. Also note that although the greedy solution
attempts to pick the least costly choice for each facility locally, it still cannot
compete with the globally optimal matroid based solution. The reason is that
choosing the least cost generating source for each facility might result in conflicts over
the usage of transmission lines and ultimately lead to poorer percentage procurement and
more expensive alternates to avoid transmission line overloads than the optimal
solution can guarantee. This large difference between the proposed solution and the
greedy solution in terms of both cost and percentage procurement reinforces the
significance of the proposed matroid based solution. Note that due to the mismatch in
the growth of the power network and distributed generation sources [136], the future of
power procurement might face more transmission line bottlenecks. Under these
developing constraints the proposed Algorithms shall not only help in increasing
percentage procurement but also in reducing the effective system cost by finding lower
cost alternatives while still meeting the policy-driven constraints.
5.8 Summary
In this chapter, we study how green initiatives and the cyber-physical evolution of power
systems can affect the complexity of optimizing power procurement for distributed data
center facilities, giving rise to opportunities to consider new formulations. In particular,
we present a general framework for modeling the constrained power procurement (CPP)
problem, where the distributed data center facilities exhibit policy constraints, the power
grid exhibits transmission line constraints, and delay constraints apply over a given
planning horizon to meet the forecasted demand of distributed facilities. We have
presented polynomial time distributed solutions to the 1-RCPP and T-RCPP problems
and shown the NP-hardness of the CPP problem.
Chapter 6
Concluding Remarks and Path
Forward
Data centers represent a backbone for big data processing, yet they have the potential to become processing bottlenecks and energy black holes. Unless data centers are carefully designed to be an efficient enterprise, we risk reversing the motivation of their inception. Therefore, it is imperative that big data paradigms holistically incorporate the challenges of greener optimization. A vital step towards this goal is to rethink the foundations of big data optimization, focusing on the existing infrastructure weaknesses that hinder the full realization of greener big data's potential. In Chapter 3 we identified the planes crucial for making the big data enterprise greener. Moreover, we presented an extensive survey of the ongoing efforts in each of the planes and highlighted the pressing challenges associated with each plane. Chapter 4 described a scheme to optimize big data processing in favor of greener server, virtualization, interconnect, and framework planes. Additionally, Chapter 5 presented a solution for optimizing the power procurement plane.
We argue that a true green big data ecosystem requires collaboration not only from
the core of big data enterprise—data centers—but also from other parts of the ecosystem
including, but not limited to, end users, standard development bodies, regulatory bodies,
equipment manufacturers and suppliers, network operators, and content providers. Accordingly, we present some open problems and challenges in this chapter, and conclude by proposing a novel, futuristic green orchestrator focusing on a comprehensive view of the ecosystem.
6.1 Open Issues and Future Work
Traffic profiling, measurement, and analysis is a fundamental step toward realizing an
energy-efficient network infrastructure. Since the traffic profile is strongly coupled to the types of applications and services, application-specific, traffic-forecasting-based decisions can provide better and more efficient energy management.
Furthermore, the interconnect plane associated with cloud-based data centers can neither be restricted to the conventional tree topology nor confined to high-bisection topologies like VL2, Fat-Tree, and Jellyfish [60, 83, 184]. Rather, the interconnect plane of a cloud-based service can be a combination of many different topologies at distant sites, perhaps connected via the Internet. Therefore, restructuring existing solutions and techniques to suit the dynamics of the more general interconnects associated with cloud paradigms is crucial for greener services.
Dynamics of data center networks are shifting towards adopting energy-efficient interconnects and protocols like IEEE 802.3az [185], but in certain scenarios this can degrade
the quality of the application. A key challenge in this regard is to develop solutions that
can achieve a good trade-off between energy efficiency and performance metrics.
Furthermore, the impact of an energy-efficient server plane on Quality-of-Experience (QoE) and Service-Level Agreements (SLAs) is also fundamental, and can help bridge the gap between theory and business practices.
It is necessary to investigate application-specific designs that can improve IT equipment utilization and efficiency in a transparent and seamless fashion. Optimization
solutions that exploit the correlation among large data sets can help to reduce the use of
IT resources required by the frameworks. Furthermore, there are several ways in which
software-defined networking can help in bringing about the necessary changes to make
big data processing energy efficient.
There is also a need to address tool-specific energy inefficiency in big data processing. A plausible direction is the development of a DPM controller specific to a tool, e.g., DPM for Hadoop.
Revolutionary models advocating nano data centers are gaining attention. Nano data centers are peer-to-peer distributed data centers that use ISP-controlled home gateways as both computing and storage resources. Nano data centers have been shown to be an energy-efficient alternative to conventional data centers. Exploring the interplay of the frameworks with nano data centers would be very beneficial.
The business models for many of today’s multi-tenant cloud data centers are based
on selling computing resources to remote clients, e.g., Amazon EC2. Data centers in
such scenarios do not have advance statistics about the jobs they run. Hence there is little to no control over effectively scheduling jobs or deploying DPM-based techniques to improve power efficiency. Furthermore, DVFS-based solutions to improve power efficiency are not very viable in today's multi-tenant cloud-based data centers, since these might violate the SLAs where a tenant is paying for a computing resource with a specific operating frequency. Therefore, there is a need to study the impact of power management solutions on SLAs to appropriately update agreements while still managing power efficiently.
Moreover, there is a need to empirically study the effectiveness of various energy efficiency metrics to better understand the energy usage dynamics of multi-tenant enterprises. The business models of big data enterprises need to monitor and report appropriate energy efficiency metrics, thus encouraging the responsible use of resources and services [6].
A major challenge for integrating renewables into the energy portfolios of data centers is the unpredictability of weather-dependent renewable sources (e.g., solar energy, wind energy). Specifically, large-scale solar power deployment is still far from reality because a large number of solar panels is required to produce even a tiny fraction of the energy required by a data center. For example, according to Emerson Network Power, it takes seven acres of solar panels to generate 1 MW [186].
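The scale of the problem is easy to quantify from the Emerson figure: at seven acres per megawatt, even a modest facility needs a large footprint. The 20 MW demand below is an assumed, illustrative figure, not one taken from the text:

```python
ACRES_PER_MW = 7.0  # Emerson Network Power estimate cited above

def solar_acreage(demand_mw):
    """Acres of solar panels needed to cover a given average demand (MW)."""
    return demand_mw * ACRES_PER_MW

# A hypothetical mid-size 20 MW facility would need 140 acres of panels.
acres_20mw = solar_acreage(20)
```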
It is also of fundamental importance to empirically study the impact of integrating
greener energy on QoS and SLAs, and develop appropriate schemes.
With the increasing popularity of cloud-related services, it is vital to explore the applicability of existing techniques for energy profile management to multi-tenant data centers. Moreover, there is a need to devise new demand response strategies for multi-tenant data centers that incorporate looser controls over customer-driven workloads. On the utility side, there is a need to explore mechanisms for utility-based incentives offered to multi-tenant cloud data centers. Similarly, the realization of revolutionary nano data centers calls for special business paradigms and stimulus-based rewards for all stakeholders [187].
Furthermore, there is a need to include more comprehensive system constraints in the presence of data uncertainty, reliability, and security issues. Studying the impact of storage systems and renewable power generators, while incorporating a detailed analysis of the parameters specific to the nature of the generating source, is another thrust. Moreover, there is a need to develop algorithms for real-time and online scheduling environments that take into account the effects of communication system performance as well as swiftly changing market prices.
6.2 Green Orchestrator
The significance of efforts in the individual planes of a big data enterprise is indisputable; however, we assert that an orchestrator that takes a cross-plane approach to energy efficiency in a holistic fashion is very important. Among other reasons, such a cross-plane orchestrator is needed to make sure that the solution for one plane is complemented by solutions for other planes towards the common goal of establishing a greener system. For instance, a solution technique that optimizes the virtualization plane with respect to energy efficiency might not be well supported by the insufficient capacity of the interconnect plane. More specifically, a scheme that relies on VM migrations for optimizing energy efficiency needs more bandwidth and might stress an interconnect plane that is already under pressure. Similarly, APIs developed to facilitate energy efficiency, but without fully respecting the underlying hardware or system, can result in redundant resource utilization. Likewise, the generation mix for the energy portfolio of a batched job can be much more flexible in incorporating greener and renewable sources by mitigating the uncertainties associated with renewables, e.g., by scheduling tasks in the batch, whereas such flexibility might not be viable for a real-time workload.
We recognize that system requirements are highly specific to workloads and applications, and therefore urge that developing a cross-plane integrated solution for such a huge system be done on a per-job basis and in accordance with the type of framework. In this section we propose a novel per-job cross-plane orchestrator that can be implemented to improve energy efficiency. The proposed cross-plane Green Orchestrator consists of the following components.
• Green Lessor: This component is responsible for establishing a per-job SLA based on a collaborative lease among the stakeholders. The stakeholders include the data center, tenant, and energy provider (both utility and local generation). The lease takes into account available power, pricing models, power consumption statistics
Figure 6.1: Green Orchestrator for big data processing.
of jobs, network state, and server state, while incorporating the predictability of the available local energy harvest. The green lessor takes input from the network state predictor, server state predictor, and post-execution analyzer.
• Pre-Execution Analyzer: This component performs pre-execution analysis of a job. Based on the job statistics and history logs from the post-execution analyzer, the pre-execution analyzer identifies the power statistics and requirements while incorporating the lease SLAs.
• Network State Predictor: This component predicts the network state. Specifically, the network state predictor takes into account inbound and outbound queues at the switches and routers, available bandwidth, and throughput. The network state predictor interacts with the network dynamic power manager to keep the predictions updated.
• Server State Predictor: This component predicts the state of the server and compute nodes in terms of the CPU and memory resources available for a job. The server state predictor interacts with the server dynamic power manager to keep the predictions updated.
• Network Traffic Optimizer: This component focuses on optimizing the network by performing traffic engineering. Specifically, it eliminates redundant traffic and reshapes the flows. The network traffic optimizer interacts with the network and server dynamic power managers to optimize the traffic.
This component is very useful in distributed big data applications where inter-process communication is not scheduled for optimal energy consumption. For example, within a Hadoop job, the shuffle phase does not take energy conservation into account when initiating communication between map and reduce processes. The shuffle process can be made greener if the amortized cost of energy is taken into account while scheduling inter-process communication. Specifically, the flows in the shuffle phase can be scheduled collaboratively with map and reduce processes to minimize the number of under-utilized or idle servers. DPM techniques can then be used to put a set of servers into a sleep or low-energy mode.
• VMizer: This component is in charge of VM placement, provisioning, and scheduling. VMizer works in coordination with the network state predictor and server state predictor. The overall objective of this component is the intelligent placement of VMs while amalgamating them onto a subset of the cluster. This improves system utilization by allowing a greater number of VMs per node (scaled according to the CPU and storage capacity) and putting the rest of the nodes in sleep mode. For example, concentrating mappers and reducers in Hadoop, or spouts and bolts in Storm, such that each active node in the cluster is fully utilized, rather than having a large number of partially utilized yet active nodes. This not only reduces the number of active nodes in the cluster, but also concentrates the traffic in a limited portion of the cluster, giving an opportunity to put unused network devices into sleep mode.
• Pizer: This component is in charge of process placement, provisioning, and scheduling. Pizer facilitates the intelligent placement of processes (e.g., mappers and reducers in Hadoop, spouts and bolts in Storm) on a subset of the cluster with the aim of improving resource utilization. For example, in Hadoop, pizer places map and reduce processes that need to communicate with each other in close proximity. Pizer places the processes on VMs while respecting locality and bandwidth limitations to improve turnaround time.
• Server-Dynamic Power Manager: Server-DPM performs server level dynamic
power management and works in coordination with server state predictor.
• Network-Dynamic Power Manager: Network-DPM performs network level dy-
namic power management and works in coordination with network state predictor.
• Post-Execution Analyzer: This component analyzes a job’s energy profile upon
completion.
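The component interactions described above can be sketched as a minimal per-job admission loop. All class names, methods, and numbers below are hypothetical placeholders, since the orchestrator is specified here only at the conceptual level:

```python
# Minimal sketch of the proposed orchestrator's per-job decision flow:
# predictors report capacity, and the green lessor admits a job only if
# both server and network planes can accommodate it. Illustrative only.
class NetworkStatePredictor:
    def predict(self):
        # Assumed, static prediction; a real predictor would model queues,
        # bandwidth, and throughput as described in the text.
        return {"available_bandwidth_gbps": 40}

class ServerStatePredictor:
    def predict(self):
        # Assumed, static prediction of idle CPU resources.
        return {"idle_cpus": 64}

class GreenLessor:
    """Combines predictor outputs into a per-job admission/SLA decision."""
    def __init__(self, net, srv):
        self.net, self.srv = net, srv

    def lease(self, job_cpu_need, job_bw_need_gbps):
        net, srv = self.net.predict(), self.srv.predict()
        ok = (srv["idle_cpus"] >= job_cpu_need
              and net["available_bandwidth_gbps"] >= job_bw_need_gbps)
        return {"admitted": ok}

lessor = GreenLessor(NetworkStatePredictor(), ServerStatePredictor())
decision = lessor.lease(job_cpu_need=32, job_bw_need_gbps=10)
```

A real implementation would of course close the loop with the DPM components and the post-execution analyzer rather than return a static decision.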
The synergy of different components of the proposed Green Orchestrator is shown in
Figure 6.1. Several solutions discussed in Chapter 3 can be used to implement some of
the proposed components. However, some components need more groundbreaking work, specifically the green lessor, network traffic optimizer, and pizer.
We believe that the discussion of the different components of the proposed orchestrator can help fill in the gaps in efforts addressing appropriate traffic engineering, business models, energy portfolio management, and process-level energy planning. Furthermore, we expect this work to spur development efforts that realize and demonstrate the suitability and practicality of the proposed orchestrator in reviving the green vision of the big data ecosystem.
Appendices
Appendix A
Energy Savvy Planes for Big Data
Enterprise: Literature Survey and
Discussions
This appendix surveys several key approaches to transform the 5Ps of the big data enterprise, identified in Chapter 3, into greener planes.
A.1 Server Plane
This section presents some of the key approaches to make the server plane energy savvy.
The notion of energy proportionality, i.e., energy consumption proportional to the
workload [127], is at the heart of achieving energy efficiency for underutilized servers.
Dynamic Power Management (DPM) has been advocated to be very effective for
directly addressing power inefficiency caused by underutilized servers. DPM adjusts the
power consumption of a device by taking into account the workload prediction [188–190].
One of the best-known techniques for DPM is Dynamic Voltage and Frequency Scaling (DVFS) [191–195]. The basic idea behind DVFS is to save energy by using a low supply voltage and a low clock frequency when the workload is not computationally intensive, or there is no need to operate a processor at maximum performance.
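The energy saving promised by DVFS follows from the classic CMOS dynamic-power relation P = C·V²·f; halving both supply voltage and clock frequency cuts dynamic power by a factor of eight (at the cost of longer execution time). A minimal numeric sketch with normalized, illustrative units:

```python
def dynamic_power(c_eff, voltage, frequency):
    """Classic CMOS dynamic-power model P = C * V^2 * f underlying DVFS.
    Units are normalized; this is an illustration, not a processor model."""
    return c_eff * voltage**2 * frequency

full = dynamic_power(1.0, 1.0, 1.0)
# Halving both supply voltage and clock frequency during light load
# reduces dynamic power to (1/2)^2 * (1/2) = 1/8 of the full value.
scaled = dynamic_power(1.0, 0.5, 0.5)
```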
Server consolidation is another approach to DPM for underutilized servers. Server consolidation-based techniques achieve energy efficiency by consolidating jobs onto a limited number of highly utilized servers, and then transitioning the rest of the servers to low-power or off states [196–200].
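Server consolidation is, at its core, a bin-packing problem. The sketch below uses first-fit decreasing, a common heuristic rather than any specific scheme from [196–200], to show how lightly loaded jobs can be packed onto a few servers so that the remainder can sleep:

```python
def consolidate_jobs(loads, server_capacity=1.0):
    """First-fit-decreasing consolidation: pack job loads (fractions of one
    server's capacity) onto as few servers as possible so idle servers can
    transition to a low-power state. Illustrative heuristic only."""
    servers = []  # residual capacity of each powered-on server
    for load in sorted(loads, reverse=True):
        for i, free in enumerate(servers):
            if load <= free:
                servers[i] = free - load
                break
        else:
            # No existing server fits this job: power on another one.
            servers.append(server_capacity - load)
    return len(servers)

# Ten jobs at 25% load each collapse from ten lightly loaded servers
# onto three highly utilized ones; the other seven can sleep.
powered_on = consolidate_jobs([0.25] * 10)
```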
With the expanding big data enterprise, the heterogeneity of computing and storage equipment is increasing and will likely impact power management solutions. Specifically, energy usage characteristics are highly dependent on the underlying platform, and any DPM solution that ignores this heterogeneity will fall short. Therefore, a practically viable yet energy-efficient solution must respect heterogeneity. Some scheduling techniques exploit the heterogeneity of IT resources in data center environments by maximizing the use of energy-efficient components and minimizing the use of energy-inefficient components by means of workload scheduling [201, 202].
Job scheduling is another approach to achieving higher server utilization by better
workload management with a focus on energy conservation [203–205].
Another approach to turning the server plane into a greener plane is to explore the effects of hardware characteristics on energy efficiency. In this regard, an intriguing study was conducted in [206], where the authors compared the energy characteristics of several small clusters of embedded processors, mobile processors, desktop processors, and server processors running big data applications. The interesting finding is that a data center cluster made up of high-end mobile processors and NAND-flash-based solid-state drives is the most energy-efficient system for data-intensive applications.
A.2 Virtualization Plane
Virtualization is a primary approach adopted for improving resource utilization at servers.
Despite its apparent usefulness for energy efficiency and broad penetration in data centers, the actual deployment of virtualized servers is not yet prevalent; for example, a 2010 survey reported that only 30% of production servers actually deploy virtualization techniques [207].
Server virtualization under dynamic workloads results in a huge increase in network traffic. This increase is a serious hurdle to the wide deployment of virtualization environments due to the limited available bandwidth for east-west traffic (traffic between different servers within a data center). One way to tackle this issue is new data center architectures that provide a higher rate of information exchange for east-west traffic [60, 83, 184, 208, 209]. Another direction is to perform live migration of VMs between physical servers that might be running different hypervisors. Some tools like [210] allow offline migration of VMs between different hypervisors.
A.3 Interconnect Plane
Network links in data centers are poorly utilized. According to one study, network links operate at 5%–25% of their capacity on average [30], and a significant number of network links remain idle 70% of the time [211], leaving substantial room for improving energy efficiency. With the increase in network-intensive distributed workloads for big data processing, the need to rethink energy efficiency for data center networks has never been more pressing.
In the context of energy efficiency, the conventional three-tier tree topology is known to be energy inefficient, partly due to the use of enterprise-level power-hungry equipment [212]. For large data centers, the switches at the top layer of a conventional three-tier topology are reported to consume huge amounts of energy [83]. The energy inefficiency of conventional data center topologies has led to the proposal of new energy-efficient topologies like VL2 [60], Fat-Tree [83], and Jellyfish [184] that use low-cost, energy-efficient commodity network equipment. The basic idea in these proposals is to use a densely connected interconnect with a combination of commodity off-the-shelf switches, flat addressing, and efficient load balancing. In particular, Fat-Tree reported an energy efficiency improvement of 56.6% compared to conventional data center networks [83].
Based on the Fat-Tree architecture, the authors in [51] proposed a network-wide power management scheme (DPM for network components). The core idea is to consolidate traffic flows onto a subset of links and switches. As a consequence of this consolidation, the unused network components can be switched off, resulting in significant power savings.
Another aspect of bringing energy efficiency to data center networks is the provision of load-proportional energy consumption in links and switches. In such a network device, the energy consumption depends on the amount of data flowing through it [26–30]. For example, the authors in [26, 28, 29] have used adaptive link rates combined with load-proportional energy-consuming network devices to reduce power consumption. Another approach is to combine both traffic consolidation and adaptive link rates for switches to improve power savings [213]. An extension of DVFS focused on scaling the voltage and frequency of the switch core and interconnect links has also been shown to address the energy inefficiency caused by low utilization of network links [214].
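An adaptive-link-rate policy can be sketched as stepping a link down to the lowest standard Ethernet rate that still carries its offered load. The power figures below are assumed, order-of-magnitude values, not measurements from the cited studies:

```python
# Illustrative adaptive-link-rate policy. Power per rate is assumed;
# real values depend on the PHY and are reported in the cited literature.
RATE_POWER_W = {10: 0.1, 100: 0.4, 1000: 1.0, 10000: 5.0}  # Mbps -> watts

def pick_rate(load_mbps):
    """Lowest supported rate that still accommodates the offered load."""
    for rate in sorted(RATE_POWER_W):
        if load_mbps <= rate:
            return rate
    return max(RATE_POWER_W)  # saturated: run at the top rate

def link_power(load_mbps):
    return RATE_POWER_W[pick_rate(load_mbps)]

# A 1 Gbps link loaded at only 5% (50 Mbps) can drop to the 100 Mbps rate,
# consuming 0.4 W instead of 1.0 W under this toy model.
low_load_power = link_power(50)
```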
Another avenue for addressing the energy challenges of the interconnect plane is to test the use of unorthodox and novel technologies to replace the conventional wired interconnect fabric and routing operations. For example, it has been shown that the use of wireless networks can improve energy efficiency in data centers, but this idea is still in the early stages of development [215–217].
Furthermore, software-defined networking (SDN) based traffic engineering with respect to virtualization techniques achieves better network resource utilization through dynamic resource allocation [218]. SDN, specifically OpenFlow, has also been used to demonstrate the integrated adaptation of computing and network resources with the aim of minimizing the carbon emissions of a data center [219].
A.4 Frameworks Plane
The main driving force for the adoption of cloud data centers is rooted in improved
application performance and higher operational efficiency for big data frameworks and
services. The core of major big data frameworks is to divide huge data sets into smaller
subsets, process them at separate nodes, communicate the processed information over
the network connecting different nodes (also referred to as the communication phase),
and then combine the results in a time-efficient manner.
The energy efficiency of big data frameworks can be improved by eliminating the idle time of IT equipment. An efficient way to reduce idle time is to shorten the communication phase of distributed applications, since a compute node keeps consuming energy while waiting to receive data spread across several nodes. In fact, it has been shown that bringing down job completion times improves energy efficiency. It is interesting to note that for some common big data frameworks, about 50% of the job completion time can be consumed in the communication phase (the phase in which data is transferred across servers over the data center network) [57]. The communication phase can be optimized, hence decreasing job completion time, by ameliorating network effects and taming workloads that are disruptive (in terms of communication intensity) by managing flows [57, 68, 69], and by using demand-driven path alterations [70–72]. Task, resource, and computation scheduling are some other ways to bring down job completion time [65–67].
Hadoop's energy efficiency has been a subject of considerable interest, as it is the most widely used framework for big data analytics. An interesting effort to make Hadoop greener is the integration of solar energy into a Hadoop cluster. This integration is achieved through intelligent job scheduling that takes into account the solar energy forecast [220]. Some other approaches to optimizing Hadoop include high-level shifting of VMs, mitigation of congestion incidents using combiners, and low-level continuous runtime network optimization in coordination with data-intensive application orchestrators [42, 57, 221]. Similarly, information-theoretic similarities in chunks of files spread across different nodes can be used to reduce energy consumption by reducing network traffic for Hadoop.
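One standard building block for exploiting such similarity is fingerprint-based redundancy elimination: chunk the data, hash each chunk, and skip chunks the receiver already holds. The sketch below uses naive fixed-size chunking for illustration, not the specific information-theoretic scheme alluded to above:

```python
import hashlib

def chunk_fingerprints(data, chunk_size=4):
    """Fixed-size chunking with SHA-256 fingerprints, a standard
    redundancy-elimination building block (illustrative parameters)."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def bytes_to_send(data, receiver_has, chunk_size=4):
    """Count only the bytes of chunks whose fingerprints the receiver
    does not already hold; duplicates never cross the network."""
    sent = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        if hashlib.sha256(chunk).hexdigest() not in receiver_has:
            sent += len(chunk)
    return sent

data = b"AAAABBBBAAAACCCC"
receiver = set(chunk_fingerprints(b"AAAA"))   # receiver already stores "AAAA"
saved = len(data) - bytes_to_send(data, receiver)  # both "AAAA" chunks skipped
```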
A.5 Power Procurement Plane
This section surveys efforts to optimize the power procurement plane, including power sourcing, appropriate business models, and greener portfolio management. It has been pointed out that the popularity of multi-tenant data centers is going to be an important contributing factor in industry-wide inefficient usage of energy. Lack of proper pricing models and conflicting priorities are two major reasons for the inefficient use of energy. More specifically, the priorities at the top of the list for multi-tenant data center providers generally include high levels of availability, reliability, privacy, security, and affordability [6]. The inherent conflict of these priorities with energy efficiency goals is obvious, and partly explains the paucity of energy initiatives in current business models. Furthermore, customers are not monetarily motivated to use energy-efficient infrastructures and frameworks due to the lack of appropriate lease agreements that charge customers in proportion to their energy usage. To address this concern, there has been growing interest among data center service providers in an environmental chargeback model that links the tenants to the indirect costs associated with energy usage [222]. Practical implementations of such chargeback models are being adopted by Microsoft data centers, where customers are charged for power usage [223].
The reforms in chargeback models are becoming more complex with the rise of cloud infrastructure, where the main theme is to abstract services from the IT infrastructure. Adding to the complexity is the virtualization of shared resources that can be acquired and released on demand. Such chargeback models should incorporate flexible pricing for virtualized environments. In this context, Cloud Cruiser is an initiative to integrate chargeback models with cloud infrastructure and decision analytics, e.g., Cloud Cruiser for Amazon Web Services (AWS) [224]. There is a need to extend such efforts to integrate energy prospects, in particular to develop special per-kWh chargeback models that suit the needs of multi-tenant shared enterprises [225]. A possible solution is the inclusion of a "post-paid" component in the chargeback models, reflecting energy-related costs for IT resources at the actual time of use.
Another strategy that can help businesses gear up for energy-aware practices is measuring and reporting energy consumption metrics. In this context, the Green Grid Association (TGG) has proposed measuring a data center's energy efficiency using Power Usage Effectiveness (PUE) [226]. PUE is defined as the total energy used by a facility divided by the energy used by its IT equipment. However, it has been noted that PUE often does not take into account all components of the infrastructure, but rather focuses on one particular segment of a data center, and therefore does not adequately reflect its efficiency [227]. In this vein, there have been efforts to redefine metrics to capture energy inefficiency in a better and more concrete fashion, such as Power to Performance Effectiveness (PPE), Data Center Productivity (DCeP), Data Center Compute Efficiency (DCcE), and Data Center Infrastructure Efficiency (DCiE), to name a few [6, 227].
Under coincident peak pricing (CPP) programs, a customer's electricity bill is not just scaled according to usage (the amount of electricity consumed) but also reflects peak demand and coincident peak demand [142]. The peak demand charge is based on the hour of the month in which the consumer's electricity consumption was highest. It has been reported that the peak demand charge is the major component of the electricity bills for all six Google data centers in the U.S. [141]. Similarly, the coincident peak demand charge is based on the hour of the month when the utility received the highest total demand from all of its customers. The peak demand charge is several times higher than the usage charge, and the coincident peak demand charge is in turn several times higher than the peak demand charge. Consumers are typically warned of an upcoming coincident peak hour somewhere between 5 minutes and 24 hours ahead of time [142].
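The structure of such a bill can be sketched with a toy calculation in which the peak and coincident-peak terms dwarf the usage term. All rates and loads below are invented for illustration, not taken from [141, 142]:

```python
def monthly_bill(hourly_kwh, usage_rate, peak_rate, cp_rate, cp_hour):
    """Toy coincident-peak bill: a usage charge on total consumption, a
    charge on the customer's own peak hour, and a much larger charge on
    consumption during the utility's coincident-peak hour."""
    usage = sum(hourly_kwh) * usage_rate
    peak = max(hourly_kwh) * peak_rate
    coincident = hourly_kwh[cp_hour] * cp_rate
    return usage + peak + coincident

load = [100, 100, 300, 100]  # kWh over four sample hours; hour 2 is the peak
bill = monthly_bill(load, usage_rate=0.05, peak_rate=1.0,
                    cp_rate=5.0, cp_hour=2)
# usage = 30, peak charge = 300, coincident-peak charge = 1500:
# the demand charges dominate, which is why shaving the peak pays off.
```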
Demand response is defined as altering the electricity usage pattern to reflect the
change in the price of electricity at a specific time. Demand response plans also include
incentives to encourage lower electricity consumption at times of high demand. Demand
response plans can help in shifting loads from coincident peak hours as well as reducing
the peak load charges.
Due to the unpredictability associated with the smooth integration of renewable energy into the power grid, demand response by consumers is crucial both for suppliers (helping them increase the share of renewables) and consumers (smart use of electricity for lower electricity bills). Since the booming big data enterprise has made data centers one of the most rapidly expanding electricity-consuming sectors in developed countries [228], an intelligent demand response from data centers is of great significance for green energy portfolios.
In the context of demand response, the authors in [142] presented schemes to scale down the peak load as well as the electricity expenditures of data centers through a combination of peak load shifting and the use of local power generation. Similarly, the authors in [141] presented a scheduling scheme for partial execution of data center workloads to reduce peak demand charges.
Along with appropriate demand response strategies, greening data centers requires special incentives offered by power suppliers. Such incentive programs are offered by Toronto Hydro-Electric System Ltd. [229] and PowerStream [230]. Under these programs, data center operators receive up to $800 for every kilowatt of peak reduction, or $0.10 per kWh of energy savings, for greening their data centers [231].
Integration of renewables is yet another factor that can improve the energy profiles of
data centers. Experimental studies in [134,232,233] have shown that integrating renew-
able energy sources into data center power supplies can help to reduce energy-related costs
while reducing carbon footprints. Some of the real-world examples of wind-powered data
centers include Other World Computing [234], Baryonyx Corp. [235], and Green House
Data [236]. Similarly, Microsoft’s Virtual Earth mapping servers in Colorado are housed
in 100% wind-powered Verari container units [237].
Appendix B
Proof: A Coding Based
Optimization for Big Data
Processing
We start by defining an instance of the index coding problem [97].
Definition 40 (Index Coding) An instance of index coding problem is defined by a
relay R which contains a set of k packets X = x1, · · · , xk that are to be delivered
to a set of m clients c1, · · · , cm over a broadcast channel CHL. Each client ci has
access to some side information HICi ⊆ X, and requires a set packets WICi ⊆ X from
the relay R. The relay R can transmit packets in X or their combinations (encoding).
The objective is to find a scheme that requires the minimum number of transmissions
from the relay R, and satisfies the requests of all the clients. Let OPT (IC) denote the
optimal solution of index coding problem, and |OPT (IC)| denote the minimum number
of transmissions from the relay.
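To make Definition 40 concrete, the following sketch works through a hypothetical three-packet instance (the packet values and helper names are illustrative, not from the thesis): each client wants one packet and holds the other two as side information, so one XOR broadcast replaces three uncoded transmissions, i.e., |OPT(IC)| = 1.

```python
# A minimal index coding sketch (hypothetical instance): relay holds
# X = {x1, x2, x3}; three clients each want one packet and hold the other two
# as side information, so a single XOR transmission satisfies everyone.

def encode(packets):
    """Relay broadcasts the bitwise XOR of all packets (one transmission)."""
    out = 0
    for p in packets:
        out ^= p
    return out

def decode(coded, side_info):
    """A client XORs out its side information to recover its wanted packet."""
    for p in side_info:
        coded ^= p
    return coded

X = [0b1010, 0b0110, 0b1111]           # x1, x2, x3 as small bit patterns
broadcast = encode(X)                   # |OPT(IC)| = 1 here vs. 3 uncoded sends

# client c1 wants x1 and knows {x2, x3}; c3 wants x3 and knows {x1, x2}
assert decode(broadcast, [X[1], X[2]]) == X[0]
assert decode(broadcast, [X[0], X[1]]) == X[2]
```

Without the side information, the relay would need k = 3 separate transmissions, which is the uncoded (routing) baseline used throughout the proofs below.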
B.1 Proof of Theorem 14
Given an instance of the index coding problem as in Definition 40, we define an instance of
the spate coding problem as follows:
• V = {AS, node1, S},
• E = {(AS, S), (S, node1)},
• X1 = AS,
• X = node1; note that node1 hosts all of c_1, ..., c_m,
• X2 = S,
• P = {x_1, x_2, ..., x_k},
• W_i = W_i^IC, for i = 1, ..., m,
• H_i = H_i^IC, for i = 1, ..., m.
Let |P(AS)| denote the number of link-level packet transmissions "made" by the AS
employing spate coding in the optimal scheme S. We show that |OPT(IC)| = |P(AS)|.
We start by analyzing the transmissions in the network N defined by the graph with
the vertex set V and the edge set E. Note that for all the packets transmitted by the AS
destined to any of the clients c_1, ..., c_m hosted on node1, OPT(S) requires the AS to choose
the optimal solution that minimizes the number of packet transmissions on the edge
(AS, S). The AS transmits the minimum number of packets by employing spate coding. It is
easy to see that the relay R can use the same scheme as the AS to minimize its number
of transmissions, which shall be exactly the same as the number of packet transmissions by
the AS on the edge (AS, S).
B.2 Proof of Theorem 15
The proof follows from Theorem 14 and the existence of an instance of the index coding
problem for each instance of the network coding problem [108].
B.3 Proof of Theorem 21
We prove that Problem ES is NP-hard by reducing the index coding problem to it. It
has already been shown that the index coding problem is NP-hard, and NP-hard to
approximate [238].
Given an instance of index coding problem as in Definition 40, we define an instance
of the Problem ES as follows:
We construct a typical data center network N following a tree topology consisting of
two top-of-rack switches S1 and S2, and an L2-aggregation switch AS, as shown in
Figure B.1. There are a total of T nodes node1, ..., nodeT in the network N, where T =
k + m. Nodes node1, ..., nodek are connected to S1, and nodek+1, ..., nodeT are connected to S2.
Each node node_i ∈ {node1, ..., nodek} hosts only a sender service instance. Moreover,
the service instance running on node_i ∈ {node1, ..., nodek} emits the packet x_i. Let the
set U represent the collection of all the packets emitted by the nodes node1, ..., nodek.
Each node node_j ∈ {nodek+1, ..., nodeT} hosts two service instances, a sender service
instance S_o and a receiver service instance S_r. The service instance S_o running on
node_{k+j} emits the set of packets given by H_j^IC. The service instance S_r running on
node_{k+j} is responsible for processing the set of packets W_j^IC ⊆ U, where j = 1, ..., m.
As shown in Figure B.1, any node node_i ∈ {node1, ..., nodek} can communicate with any
node node_j ∈ {nodek+1, ..., nodeT} via the AS. Nodes connected to the same top-of-rack switch can
communicate with each other without going through the AS. For all the packets passing via the AS,
it can either transmit their combinations (encoding) or send them uncoded (routing).
Figure B.1: An instance of the Problem ES.
Let |P(AS)| denote the number of link-level packet transmissions "made" by the AS,
as shown in Figure B.1, in the optimal scheme S. We show that |OPT(IC)| = |P(AS)|.
We start by analyzing the link-level transmissions in the network N. First, note that
the nodes in the set {node1, ..., nodek} do not receive any packet during the
communication phase of the distributed data center application, as they do not host any
service instance that intends to receive a packet. Furthermore, all the packets in the set
U have to pass through the AS, since for each packet in U there is a corresponding
receiver service instance hosted on one of the nodes nodek+1, ..., nodeT. Second, note
that for all the packets going through the AS destined to a service instance running on any
of the nodes nodek+1, ..., nodeT, OPT(S) requires the AS to choose the optimal scheme
that minimizes the number of link-level packet transmissions on the link CHL. The AS can
either encode or route the packets as part of its optimal scheme by utilizing the side
information (i.e., the output of the sender service instances running on nodes nodek+1, ..., nodeT).
It is easy to see that the relay R can use the same scheme as the AS to minimize
its number of transmissions, which shall be exactly the same as the number of link-level packet
transmissions by the AS on the link CHL in Figure B.1.
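The reduction above can be sketched as a small data-structure construction; `build_es_instance` and its field names below are illustrative only, mirroring the text: k sender nodes under S1 each emitting one packet, and m receiver nodes under S2 whose sender instances emit the side information and whose receiver instances demand the wanted packets.

```python
# Sketch of the B.3 reduction (illustrative names, not from the thesis):
# given an index coding instance (k packets, m clients with wants/side-info),
# build the network N of Figure B.1: switches S1, S2, aggregation switch AS,
# k sender nodes under S1 and m sender/receiver nodes under S2.

def build_es_instance(k, wants, side_info):
    """wants[j], side_info[j]: 0-based packet-index sets for client j."""
    m = len(wants)
    T = k + m
    nodes = [f"node{i + 1}" for i in range(T)]
    # topology: node1..nodek -> S1, node(k+1)..nodeT -> S2, S1/S2 -> AS
    edges = ([(n, "S1") for n in nodes[:k]]
             + [(n, "S2") for n in nodes[k:]]
             + [("S1", "AS"), ("S2", "AS")])
    senders = {nodes[i]: {"emits": {i}} for i in range(k)}       # node_i emits x_i
    receivers = {nodes[k + j]: {"emits": set(side_info[j]),      # S_o emits H_j^IC
                                "wants": set(wants[j])}          # S_r wants W_j^IC
                 for j in range(m)}
    return {"nodes": nodes, "edges": edges,
            "senders": senders, "receivers": receivers}

# index coding instance: k = 3 packets, m = 2 clients
inst = build_es_instance(3, wants=[{0}, {1}], side_info=[{1, 2}, {0, 2}])
assert len(inst["nodes"]) == 5                  # T = k + m
assert inst["receivers"]["node4"]["wants"] == {0}
```

Every packet wanted by a receiver on S2 is emitted under S1, so it must cross the AS, which is exactly why the AS's optimal scheme on CHL matches the relay's optimal index code.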
B.4 Proof of Theorem 24
Let ζ denote the number of sender service instances in the network (e.g., the number of
mappers in Hadoop MapReduce). Without loss of generality, we assume that each sender
service instance is required to communicate exactly one packet to its corresponding
receiver service instance (if it needs to transmit more than one packet, we replace it with
multiple sender service instances, each of which needs to communicate just one packet). We
note that d is the maximum number of hops between any sender service instance and its
corresponding receiver service instance. Hence, in a non-coding based solution a sender
service instance would require at most d link-level packet transmissions to communicate its packet
to the corresponding receiver service instance. As there are ζ sender service instances, a
non-coding based solution would require at most d·ζ link-level packet transmissions. The
coding based solution, on the other hand, requires at least ζ link-level packet transmissions
to the coding server, one from each of the sender service instances, before it can even
proceed with coding. Hence, µ ≥ ζ/(d·ζ) = 1/d.
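The counting argument reduces to simple arithmetic; the helper below (an illustrative sketch, not from the thesis) restates the bound µ ≥ ζ/(d·ζ) = 1/d numerically.

```python
# Numeric check of the Theorem 24 bound (illustrative numbers): with ζ sender
# service instances, each at most d hops from its receiver, an uncoded solution
# costs at most d·ζ link-level transmissions, while any coding solution costs
# at least ζ (one send per sender to the coding server), so µ ≥ 1/d.

def coding_gain_lower_bound(zeta, d):
    uncoded_max = d * zeta          # at most d hops per packet
    coded_min = zeta                # one transmission per sender to the coder
    return coded_min / uncoded_max

assert coding_gain_lower_bound(zeta=100, d=4) == 1 / 4
assert coding_gain_lower_bound(zeta=7, d=6) == 1 / 6
```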
B.5 Proof of Theorem 26
Directly follows from the fact that a coding based solution cannot require more
transmissions than a non-coding based solution. The tightness of the bound follows from the
observation that in all instances where the receiver service instances do not possess any
side information, a coding based solution cannot perform better than a non-coding based
solution, and then µ = 1.
B.6 Proof of Theorem 28
Without loss of generality we ignore all the sender and receiver service instances that
can communicate with each other without passing through the network’s bisection, as
such instances do not contribute to µ(bisection). We further assume, without loss of
generality, that each sender service instance is required to communicate exactly one packet
to its corresponding receiver service instance (if it needs to transmit more than one packet,
we replace it with multiple sender service instances, each of which needs to communicate
just one packet). Next, we note that a non-coding based solution would result in two link-level
packet transmissions crossing the network's bisection for each of the ζ sender service
instances, one from the sender towards the bisection switch and the other from the bisection switch
towards the receiver, resulting in a total of 2ζ link-level packet transmissions crossing the
network's bisection. The coding based solution first requires at least ζ link-level packet
transmissions crossing the network's bisection to the coding server, one from each of the
sender service instances. Hence, µ ≥ ζ/(2ζ) = 1/2.
Regarding the tightness of the bound, µ(bisection) → 1/2 for tree topologies where
β sender service instances are located on one side of the bisection switch (root), and β
receiver service instances are located on the other side of the bisection switch. Moreover,
each receiver service instance possesses the demands of all the other receiver service
instances as its side information. It is easy to see that in such a scenario a non-coding based
solution requires 2β link-level packet transmissions crossing the network's bisection. A
coding based solution requires β + 1 link-level packet transmissions crossing the network's
bisection, one from each of the β sender service instances to the coding server, and one
encoded transmission from the coding server to the receiver service instances. The encoded
link-level packet transmission consists of the bitwise XOR of all the demands of the receiver
service instances. Hence, in such a scenario, µ(bisection) = (β + 1)/(2β), and µ(bisection) → 1/2
for β ≫ 1.
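The tightness example can be checked numerically; `mu_bisection` below is an illustrative helper, not from the thesis, evaluating µ(bisection) = (β + 1)/(2β).

```python
# The B.6 tightness example, numerically: β senders and β receivers on opposite
# sides of the bisection switch, every receiver holding all other receivers'
# demands as side information. Uncoded cost: 2β crossings; coded cost: β + 1.

def mu_bisection(beta):
    return (beta + 1) / (2 * beta)

assert mu_bisection(1) == 1.0                    # no gain with a single pair
assert abs(mu_bisection(10**6) - 0.5) < 1e-5     # µ(bisection) -> 1/2 as β grows
```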
B.7 Proof of Theorem 29
Theorem 28 proves that if the coding is performed at the network bisection then
µ(bisection) ≥ 1/2. We prove this theorem by presenting an example where µ < 1/2 when the
middlebox performing the coding is not placed at the network bisection; specifically, in the
example presented in Section 4.4.2, µ = 1/4 when the coding is not performed at the
bisection.
Appendix C

Proof: Distributed Power Procurement for Data Centers
C.1 Proof of Lemma 33
Proof: Consider a set G_f consisting of f generating points, such that G_f ∈ L1 ∩ L2.
We show that G_f does not violate the transmission line constraints, and provides service
to exactly f load areas s_i, ∀i. First, note that since G_f ∈ L2, the f generating points
do not have transmission line conflicts. Second, note that as G_f ∈ L1, the associated
column vectors of these f generating points are linearly independent, implying that they
belong to f distinct policy sets. Thus, G_f provides service to at least f different load
areas, since each policy set represents the preferences of a distinct load area.
Therefore, a set G_f ∈ L1 ∩ L2, such that G_f has maximum cardinality among all sets
belonging to L1 ∩ L2, shall provide service to the maximum possible number of load areas
within the constraints. Furthermore, if no other set of maximum cardinality belonging to
L1 ∩ L2 has lower cost than G_f, then G_f provides an optimal solution to the 1-RCPP
problem. It has been shown that both TLM and SM are matroids; thus, a set belonging
to L1 ∩ L2 and having least cost and maximum cardinality among all the sets in L1 ∩ L2 can
be found in polynomial time using the weighted matroid intersection algorithm [171,178].
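As a toy illustration of Lemma 33 (hypothetical generating points and costs, with the SM matroid simplified to a partition matroid over policy sets for the sake of the example), the brute-force enumeration below finds the maximum-cardinality, least-cost set in L1 ∩ L2; the actual 1-RCPP algorithm obtains the same answer in polynomial time via weighted matroid intersection rather than enumeration.

```python
# Illustrative brute-force check of Lemma 33 (toy instance, hypothetical data):
# L1 is modeled as a partition matroid (at most one point per load-area policy
# set) and L2 as a partition matroid over transmission lines (at most one point
# per line). We enumerate common independent sets and pick maximum cardinality,
# then least cost.

from itertools import combinations

# generating point -> (policy set / load area, transmission line, cost)
points = {
    "g1": ("area1", "lineA", 5.0),
    "g2": ("area1", "lineB", 3.0),
    "g3": ("area2", "lineA", 4.0),
    "g4": ("area2", "lineC", 6.0),
    "g5": ("area3", "lineB", 2.0),
}

def independent(subset, key):
    """Partition-matroid independence: at most one point per class."""
    classes = [points[g][key] for g in subset]
    return len(classes) == len(set(classes))

def brute_force_rcpp():
    best = (0, float("inf"), frozenset())        # (cardinality, cost, set)
    names = sorted(points)
    for r in range(len(names) + 1):
        for sub in combinations(names, r):
            if independent(sub, 0) and independent(sub, 1):   # sub in L1 ∩ L2
                cost = sum(points[g][2] for g in sub)
                if (len(sub), -cost) > (best[0], -best[1]):   # more areas, then cheaper
                    best = (len(sub), cost, frozenset(sub))
    return best

card, cost, chosen = brute_force_rcpp()
assert card == 3                                 # three load areas can be served
assert chosen == frozenset({"g1", "g4", "g5"})   # distinct areas, distinct lines
assert cost == 13.0
```

Here {g1, g4, g5} is the unique size-3 common independent set: one point per load area with no two points sharing a transmission line, exactly the structure Lemma 33 exploits.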
C.2 Proof of Theorem 34
Proof: The correctness of the distributed Algorithm 1-RCPP follows from Lemma 33
and the correctness of the weighted matroid intersection algorithm by Edmonds [178]. Specifically,
the matroid intersection algorithm of [178] consists of four steps with corresponding
counterparts in Algorithm 1-RCPP.
1. In [178], Edmonds constructs a directed graph where edges represent valid replacements
of the set of generating points selected in Gon from the set of generating points
not yet selected. This corresponds to Steps 10, 11, 16, 17 in Algorithm 1-RCPP; in
our algorithm the entries of the routing tables →E and ←E indicate the valid
replacements.
2. Similar to [178], Algorithm 1-RCPP finds a vector G1 such that each generating point
in G1 corresponds to a valid addition of generating points to the already selected
generating points (Gon) with respect to L1 of the SM matroid. Similarly, we find a
vector G2 that corresponds to the generating points that are a valid addition with
respect to L2 of the TLM matroid. Steps 19 and 20 do the same.
3. Analogous to [178], Algorithm 1-RCPP finds a least-cost path; this corresponds to
Phase 2 of our algorithm.
4. Step 32 of Algorithm 1-RCPP computes an updated vector Gon by the necessary
replacements and additions, similar to [178].
Therefore the correctness of Algorithm 1-RCPP follows directly from the correctness of
Edmonds' matroid intersection algorithm.
Bibliography
[1] SINTEF. (22 May 2013) Big data, for better or worse: 90% of world’s data
generated over last two years. ScienceDaily. [Online]. Available: www.sciencedaily.
com/releases/2013/05/130522085217.htm
[2] B. Marr, “Why only one of the 5 Vs of big data really matters,” IBM Big Data
Analytics & Hub, 2015.
[3] The four V’s of big data. IBM Big Data Analytics & Hub. [Online]. Available:
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
[4] G. Cook and J. Van Horn, “How dirty is your data? a look at the energy choices
that power cloud computing,” Greenpeace International, The Netherlands, Tech.
Rep., April 2011.
[5] A. S. Masanet, Eric and J. Koomey, “Characteristics of low-carbon data centers,”
Nature Climate Change, vol. 3, no. 7, pp. 627–630, 2013.
[6] J. Whitney and P. Delforge, “Data center efficiency assessment,” NRDC and An-
thesis, USA, Tech. Rep. IP:14-08-A, August 2014.
[7] J. G. Koomey, “Worldwide electricity used in data centers,” Environmental Re-
search Letters, vol. 3, no. 3, p. 034008, 2008.
[8] (2014) Industry outlook: Data center energy efficiency. The Data
Center Journal. [Online]. Available: http://www.datacenterjournal.com/it/
industry-outlook-data-center-energy-efficiency/
[9] C. Preimesberger. (2013) Intel’s low-power atom chip making way into data
centers. eWEEK. [Online]. Available: http://www.datacenterjournal.com/it/
industry-outlook-data-center-energy-efficiency/
[10] J. Koomey, “Growth in data center electricity use 2005 to 2010,” Oakland, CA:
Analytical Press, 2011.
[11] J. Glanz. (2012) Power, pollution and the internet. The New York
Times. [Online]. Available: http://www.nytimes.com/2012/09/23/technology/
data-centers-waste-vast-amounts-of-energy-belying-industry-image.html
[12] G. Ghatikar, V. Ganti, N. Matson, and M. Piette, “Demand response opportunities
and enabling technologies for data centers: Findings from field studies,” Lawrence
Berkeley National Laboratory, Tech. Rep. LBNL-5763E, 2014.
[13] R. Brown, E. Masanet, B. Nordman, W. Tschudi, A. Shehabi, J. Stanley,
J. Koomey, D. Sartor, and P. Chan, “Report to congress on server and data center
energy efficiency: Public law 109-431,” Lawrence Berkeley National Laboratory,
Tech. Rep. LBNL-363E, 2008.
[14] G. Sauls. Measurement of data centre power con-
sumption. Falcon Electronics Pty Ltd. [Online]. Avail-
able: https://learningnetwork.cisco.com/servlet/JiveServlet/downloadBody/
3736-102-1-10478/measurement20of20data20centre20power20consumption.pdf
[15] M. Stansberry, “The global data center industry survey,” UptimeInstitute, USA,
Tech. Rep., October 2013.
[16] P. Kogge, “The tops in flops,” Spectrum, IEEE, vol. 48, no. 2, pp. 48–54, February
2011.
[17] [Online]. Available: https://www.congress.gov/bill/113th-congress/house-bill/540
[18] “Clicking clean: How companies are creating the green internet,” Greenpeace In-
ternational, Tech. Rep., April 2014.
[19] “Clicking clean: A guide to building the green internet,” Greenpeace International,
Tech. Rep., May 2015.
[20] Z. Miners. Greenpeace fingers Youtube, Netflix as threat to greener internet.
PCWorld. [Online]. Available: http://www.pcworld.idg.com.au/article/574853/
greenpeace-fingers-youtube-netflix-threat-greener-internet/
[21] Apache Hadoop. [Online]. Available: http://hadoop.apache.org/
[22] Amazon web services. [Online]. Available: http://aws.amazon.com/
[23] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clus-
ters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[24] J. Leverich and C. Kozyrakis, “On the energy (in) efficiency of hadoop clusters,”
ACM SIGOPS Operating Systems Review, vol. 44, no. 1, pp. 61–65, 2010.
[25] N. Zhu, L. Rao, X. Liu, J. Liu, and H. Guan, “Taming power peaks in mapreduce
clusters,” in ACM SIGCOMM Computer Communication Review, vol. 41, no. 4.
ACM, 2011, pp. 416–417.
[26] D. Abts, M. Marty, P. Wells, P. Klausler, and H. Liu, “Energy proportional data-
center networks,” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3.
ACM, 2010, pp. 338–347.
[27] M. Gupta and S. Singh, “Using low-power modes for energy conservation in ethernet
lans.” in INFOCOM, vol. 7, 2007, pp. 2451–55.
[28] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy, and D. Wetherall, “Reducing
network energy consumption via sleeping and rate-adaptation.” in NSDI, vol. 8,
2008, pp. 323–336.
[29] C. Gunaratne, K. Christensen, B. Nordman, and S. Suen, “Reducing the energy
consumption of ethernet with adaptive link rate (alr),” IEEE Transactions on Com-
puters, vol. 57, no. 4, 2008.
[30] A. Carrega, S. Singh, R. Bruschi, and R. Bolla, “Traffic merging for energy-efficient
datacenter networks,” in International Symposium on Performance Evaluation of
Computer and Telecommunication Systems. IEEE, 2012, pp. 1–5.
[31] R. Shields, “Cultural topology: The seven bridges of konigsburg, 1736,” Theory,
Culture & Society, vol. 29, no. 4-5, pp. 43–57, 2012.
[32] L. Euler, The seven bridges of Konigsberg. Wm. Benton, 1956.
[33] S. C. Carlson. Graph theory. Encyclopedia Britannica. [Online]. Available:
http://www.britannica.com/topic/graph-theory
[34] T. H. Cormen, Introduction to algorithms. MIT press, 2009.
[35] A. Paz and S. Moran, Non-deterministic polynomial optimization problems and
their approximation. Springer, 1977.
[36] H. Whitney, “On the abstract properties of linear dependence,” American Journal
of Mathematics, vol. 57, no. 3, pp. 509–533, 1935.
[37] J. Oxley, Matroid theory. Oxford University Press, USA, 2006.
[38] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Springer
Verlag, 2003.
[39] [Online]. Available: http://www.prnewswire.com/news-releases/altiors-altrastar---hadoop-storage-accelerator-and-optimizer-now-certified-on-cdh4-clouderas-distribution-including-apache-hadoop-version-4-183906141.html
[40] [Online]. Available: https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[41] T. White, Hadoop: The definitive guide. USA: O'Reilly Media, Inc., 2012.
[42] P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea, “Camdoop: Exploiting in-
network aggregation for big data applications,” in USENIX NSDI, vol. 12, 2012,
pp. 29–42.
[43] L. Mai, L. Rupprecht, A. Alim, P. Costa, M. Migliavacca, P. Pietzuch, and A. L.
Wolf, “Netagg: Using middleboxes for application-specific on-path aggregation in
data centres,” in Proceedings of the 10th ACM International on Conference on
emerging Networking Experiments and Technologies. ACM, 2014, pp. 249–262.
[44] Y. Yu, P. Kumar, and M. Isard, “Distributed aggregation for data parallel com-
puting,” in SOSP, vol. 9, 2009, pp. 11–14.
[45] Z. Asad and M. A. R. Chaudhry, “A two-way street: Green big data processing for
a greener smart grid,” IEEE Systems Journal, vol. PP, no. 99, pp. 1–11, 2016.
[46] Cisco, “Power management in the Cisco unified computing system: An integrated
approach,” White Paper, C11-627731-01, Cisco, USA, February 2011.
[47] L. Barroso, J. Clidaras, and U. Holzle, The datacenter as a computer: an intro-
duction to the design of warehouse-scale machines. USA: Morgan & Claypool
Publishers, 2013.
[48] D. Chernicoff, The shortcut guide to data center energy efficiency. Realtimepublishers.com, 2009.
[49] R. Talaber, T. Brey, and L. Lamers, “Using virtualization to improve data center
efficiency,” White Paper, 19, The Green Grid, USA.
[50] Y. Shang, D. Li, and M. Xu, “Energy-aware routing in data center network,” in
ACM SIGCOMM workshop on Green networking, 2010, pp. 1–8.
[51] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee,
and N. McKeown, “Elastictree: Saving energy in data center networks,” in NSDI,
vol. 10, 2010, pp. 249–264.
[52] A. Greenberg, J. Hamilton, D. Maltz, and P. Patel, “The cost of a cloud: research
problems in data center networks,” ACM SIGCOMM computer communication
review, vol. 39, no. 1, pp. 68–73, 2008.
[53] Z. Asad, M. A. R. Chaudhry, and D. Malone, “Greener data exchange in the cloud:
A coding-based optimization for big data processing,” IEEE Journal on Selected
Areas in Communications, vol. 34, no. 5, pp. 1360–1377, May 2016.
[54] ——, “Codhoop: A system for optimizing big data processing,” in 2015 Annual
IEEE Systems Conference (SysCon) Proceedings, April 2015, pp. 295–300.
[55] ——, “A coding based optimization for hadoop,” in Extended Abstract and poster
ACM/IEEE International Conference for High Performance Computing, Network-
ing, Storage and Analysis (SC), 2015.
[56] “Cisco global cloud index: Forecast and methodology, 2012–2017,” 2013, White
Paper. [Online]. Available: http://www.cisco.com/c/en/us/solutions/collateral/
service-provider/global-cloud-index-gci/Cloud Index White Paper.pdf/
[57] M. Chowdhury, M. Zaharia, J. Ma, M. Jordan, and I. Stoica, “Managing data trans-
fers in computer clusters with orchestra,” SIGCOMM-Computer Communication
Review, vol. 41, no. 4, pp. 98–109, 2011.
[58] C. Gunaratne, K. Christensen, and B. Nordman, “Managing energy consumption
costs in desktop PCs and LAN switches with proxying, split TCP connections,
and scaling of link speed,” International Journal of Network Management, vol. 15,
no. 5, pp. 297–310, 2005.
[59] [Online]. Available: https://polastre.com/2010/03/establishing-performance-metrics-for-the-data-centre/
[60] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz,
P. Patel, and S. Sengupta, “Vl2: a scalable and flexible data center network,” in
ACM SIGCOMM Computer Communication Review, vol. 39, no. 4. ACM, 2009,
pp. 51–62.
[61] A. Greenberg, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, “Towards a next
generation data center architecture: scalability and commoditization,” in Workshop
on Programmable routers for extensible services of tomorrow. ACM, 2008.
[62] Apache storm. [Online]. Available: http://storm-project.net/
[63] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-
parallel programs from sequential building blocks,” in ACM SIGOPS Operating
Systems Review, 2007, pp. 59–72.
[64] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Cur-
rey, “Dryadlinq: A system for general-purpose distributed data-parallel computing
using a high-level language,” in USENIX OSDI, 2008, pp. 1–14.
[65] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica,
“Dominant resource fairness: fair allocation of multiple resource types,” in USENIX
NSDI, 2011, pp. 323–336.
[66] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg,
“Quincy: fair scheduling for distributed computing clusters,” in ACM SIGOPS.
ACM, 2009, pp. 261–276.
[67] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica,
“Delay scheduling: a simple technique for achieving locality and fairness in cluster
scheduling,” in European conference on Computer systems. ACM, 2010, pp. 265–
278.
[68] A. Das, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and C. Yu, “Transparent and
flexible network management for big data processing in the cloud,” in USENIX
Workshop on Hot Topics in Cloud Computings. USENIX, 2013.
[69] A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha, “Sharing the data center
network,” in USENIX NSDI, pp. 309–322.
[70] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, “Hedera:
Dynamic flow scheduling for data center networks,” in NSDI, vol. 10, 2010, pp.
19–19.
[71] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fain-
man, G. Papen, and A. Vahdat, “Helios: a hybrid electrical/optical switch archi-
tecture for modular data centers,” ACM SIGCOMM Computer Communication
Review, vol. 41, no. 4, pp. 339–350, 2011.
[72] G. Wang, T. E. Ng, and A. Shaikh, “Programming your network at run-time for
big data applications,” in HotSDN. ACM, 2012, pp. 103–108.
[73] M. Chaudhry, Z. Asad, A. Sprintson, and M. Langberg, “On the Complementary
Index Coding Problem,” in IEEE ISIT 2011.
[74] ——, “Finding sparse solutions for the index coding problem,” in IEEE GLOBE-
COM 2011.
[75] A. Curtis, K. Wonho, and P. Yalagandula, “Mahout: Low-overhead datacenter
traffic management using end-host-based elephant detection,” in IEEE INFOCOM
2011.
[76] D. Perino, M. Varvello, and K. Puttaswamy, “ICN-RE: redundancy elimination
for information-centric networking,” in ACM SIGCOMM ICN workshop, 2012, pp.
91–96.
[77] Cisco WAAS 4.4.1 context-aware DRE, the adaptive cache architecture. Cisco.
[Online]. Available: http://www.cisco.com/c/en/us/products/collateral/routers/
wide-area-application-services-waas-software/white_paper_c11-676350.html
[78] Steelhead. Riverbed. [Online]. Available: http://www.riverbed.com/products/
wan-optimization/
[79] M. V. Neves, C. A. De Rose, K. Katrinis, and H. Franke, “Pythia: Faster big data
in motion through predictive software-defined network optimization at runtime,”
in IEEE IPDPS, 2014, pp. 82–90.
[80] A. Das, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and C. Yu, “Transparent and
flexible network management for big data processing in the cloud,” in USENIX
HotCloud, 2013.
[81] H. Xu and B. Li, “Tinyflow: Breaking elephants down into mice in data center
networks,” in Local & Metropolitan Area Networks (LANMAN), 2014 IEEE 20th
International Workshop on. IEEE, 2014, pp. 1–6.
[82] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu,
“Bcube: a high performance, server-centric network architecture for modular data
centers,” ACM SIGCOMM Computer Communication Review, vol. 39, no. 4, 2009.
[83] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center net-
work architecture,” in ACM SIGCOMM Computer Communication Review, vol. 38,
no. 4. ACM, 2008, pp. 63–74.
[84] F. A. Silva, A. Boukerche, T. R. Silva, L. B. Ruiz, E. Cerqueira, and A. A. Loureiro,
“Vehicular networks: A new challenge for content-delivery-based applications,”
ACM Computing Surveys (CSUR), vol. 49, no. 1, p. 11, 2016.
[85] M. Li, Z. Yang, and W. Lou, “Codeon: Cooperative popular content distribution for
vehicular networks using symbol level network coding,” IEEE Journal on Selected
Areas in Communications, vol. 29, no. 1, pp. 223–235, January 2011.
[86] B. Hassanabadi and S. Valaee, “Reliable periodic safety message broadcasting in
vanets using network coding,” IEEE Transactions on Wireless Communications,
vol. 13, no. 3, pp. 1284–1297, March 2014.
[87] A. Khlass, T. Bonald, and S. E. Elayoubi, “Performance evaluation of intra-site
coordination schemes in cellular networks,” Performance Evaluation, vol. 98, pp.
1–18, 2016.
[88] S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, and J. Crowcroft, “Xors in the
air: practical wireless network coding,” in ACM SIGCOMM computer communi-
cation review, vol. 36, no. 4, 2006, pp. 243–254.
[89] I.-H. Hou, Y.-E. Tsai, T. F. Abdelzaher, and I. Gupta, “Adapcode: Adaptive
network coding for code updates in wireless sensor networks,” in INFOCOM 2008.
The 27th Conference on Computer Communications. IEEE. IEEE, 2008.
[90] J. Li and B. Li, “Cooperative repair with minimum-storage regenerating codes
for distributed storage,” in IEEE INFOCOM 2014-IEEE Conference on Computer
Communications. IEEE, 2014, pp. 316–324.
[91] Y. Wu, “Existence and construction of capacity-achieving network codes for dis-
tributed storage,” in IEEE International Symposium on Information Theory, 2009,
pp. 1150–1154.
[92] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. Dimakis, R. Vadali, S. Chen,
and D. Borthakur, “Xoring elephants: Novel erasure codes for big data,” Proc.
VLDB Endow., vol. 6, no. 5, Mar. 2013.
[93] R. Ahlswede, N. Cai, S. Li, and R. Yeung, “Network information flow,” IEEE
Transactions on Information Theory, vol. 46, no. 4, 2000.
[94] Y. Birk and T. Kol, “Informed-Source Coding-on-Demand (ISCOD) over Broadcast
Channels,” in Proceedings of INFOCOM’98, 1998.
[95] Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol, “Index Coding with Side In-
formation,” In Proceedings of 47th Annual IEEE Symposium on Foundations of
Computer Science(FOCS), pp. 197–206, 2006.
[96] S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, and J. Crowcroft, “XORs in
the Air: Practical Wireless Network Coding,” in Proceedings of SIGCOMM ’06,
2006, pp. 243–254.
[97] M. Chaudhry and A. Sprintson, “Efficient algorithms for index coding,” in IEEE
INFOCOM Workshops 2008.
[98] S. Sahoo, D. Nikovski, T. Muso, and K. Tsuru, “Electricity theft detection using
smart meter data,” in IEEE ISGT, 2015, pp. 1–5.
[99] Irish social science data archive. ISSDA. [Online]. Available: http://www.ucd.ie/
issda/
[100] Apache storm concepts. [Online]. Available: http://storm.apache.org/about/
simple-api.html
[101] Apache Cassandra. [Online]. Available: http://cassandra.apache.org/
[102] R. Lapuh, Y. Zhao, W. Tawbi, J. M. Regan, and D. Head, “System, device, and
method for improving communication network reliability using trunk splitting,”
Feb. 6, 2007, US Patent 7,173,934.
[103] [Online]. Available: http://www.cisco.com/c/en/us/td/docs/solutions/enterprise/data center/dc infra2 5/dcinfra 4.pdf
[104] “Doctors will routinely use your dna to keep you well,” IBM Research, Tech. Rep.,
2013.
[105] [Online]. Available: https://storm.incubator.apache.org/documentation/concepts.html
[106] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network Information Flow,”
IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, 2000.
[107] H. Maleki, V. R. Cadambe, and S. Jafar, “Index coding: An interference alignment
perspective,” IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5402–
5432, 2014.
[108] M. Effros, S. El Rouayheb, and M. Langberg, “An equivalence between network
coding and index coding,” IEEE Transactions on Information Theory, vol. 61, no. 3, 2015.
[109] S. El Rouayheb, A. Sprintson, and C. Georghiades, “On the index coding problem
and its relation to network coding and matroid theory,” IEEE Transactions on
Information Theory, vol. 56, no. 7, pp. 3187–3195, 2010.
[110] R. Koetter and M. Medard, “An algebraic approach to network coding,”
IEEE/ACM Transactions on Networking (TON), vol. 11, no. 5, pp. 782–795, 2003.
[111] B. Hassanabadi, L. Zhang, and S. Valaee, “Index coded repetition-based MAC in
vehicular ad-hoc networks,” in Proceedings of the 6th IEEE Conference on Consumer
Communications and Networking Conference. IEEE, 2009, pp. 1180–1185.
[112] S. Sorour and S. Valaee, “Adaptive network coded retransmission scheme for wire-
less multicast,” in Proceedings of the 2009 IEEE International Symposium on
Information Theory. IEEE, 2009, pp. 2577–2581.
[113] N. Chiba and T. Nishizeki, “Arboricity and subgraph listing algorithms,” SIAM
Journal on Computing, vol. 14, no. 1, pp. 210–223, 1985.
[114] Sequence file reader. [Online]. Available: https://hadoop.apache.org/docs/r1.0.4/
api/org/apache/hadoop/io/package-summary.html
[115] Hortonworks Data Platform System Administration Guides. USA: Hortonworks,
2015.
[116] J. Norris. (2013) Package org.apache.hadoop.examples.terasort. Apache Hadoop.
[Online]. Available: https://hadoop.apache.org/docs/current/api/org/apache/
hadoop/examples/terasort/package-summary.html
[117] Cisco, “Big data in the enterprise - network design considerations,” White Paper,
USA.
[118] W. Yu, Y. Wang, and X. Que, “Design and evaluation of network-levitated merge
for hadoop acceleration,” IEEE Transactions on Parallel and Distributed Systems,
vol. 25, no. 3, pp. 602–611, 2014.
[119] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. Vijaykumar, “Shuffle-
watcher: Shuffle-aware scheduling in multi-tenant mapreduce clusters,” in Annual
Technical Conference. USENIX, 2014, pp. 1–12.
[120] F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, “Mapreduce with commu-
nication overlap (marco),” Journal of Parallel and Distributed Computing, vol. 73,
no. 5, pp. 608–620, 2013.
[121] redhat. [Online]. Available: http://www.redhat.com/
[122] J. W. Eaton, D. Bateman, and S. Hauberg, GNU Octave version 3.0.1 manual:
a high-level interactive language for numerical computations. CreateSpace
Independent Publishing Platform, 2009, ISBN 1441413006. [Online]. Available:
http://www.gnu.org/software/octave/doc/interpreter
[123] CentOS. [Online]. Available: https://www.centos.org/
[124] Citrix. [Online]. Available: http://www.citrix.com/products/xenserver/
[125] XenServer. [Online]. Available: http://xenserver.org/partners/
developing-products-for-xenserver.html
[126] Open vSwitch. [Online]. Available: http://openvswitch.org/
[127] P. Mahadevan, P. Sharma, S. Banerjee, and P. Ranganathan, “A power bench-
marking framework for network devices,” in Networking 2009. Springer, 2009.
[128] B. Zhang, K. Sabhanatarajan, A. Gordon-Ross, and A. George, “Real-time perfor-
mance analysis of adaptive link rate,” in Local Computer Networks. IEEE, 2008,
pp. 282–288.
[129] C. Gunaratne and K. Christensen, “Ethernet adaptive link rate: System design and performance evaluation,” in Proceedings of the 31st IEEE Conference on Local Computer Networks. IEEE, 2006, pp. 28–35.
[130] Z. Asad, M. A. R. Chaudhry, and D. Kundur, “On the use of matroid theory for
distributed cyber-physical-constrained generator scheduling in smart grid,” IEEE
Transactions on Industrial Electronics, vol. 62, no. 1, pp. 299–309, Jan 2015.
[131] L. Bishop, “Energy procurement: The new data center power model,” The Data Center Journal, 2014.
[132] M.-A. Wolf, “Environmentally sound data centres: Policy measures, metrics, and
methodologies,” Tech. Rep. 04.07.2014, April, 2014.
[133] N. Darukhanawalla, “Geographically dispersed data centers,” Tech. Rep., Sep. 2009.
[134] Z. Liu, M. Lin, A. Wierman, S. H. Low, and L. L. Andrew, “Geographical load
balancing with renewables,” ACM SIGMETRICS Performance Evaluation Review,
vol. 39, no. 3, pp. 62–66, 2011.
[135] A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, and B. Maggs, “Cutting the
electric bill for internet-scale systems,” in ACM SIGCOMM computer communica-
tion review, vol. 39, no. 4. ACM, 2009, pp. 123–134.
[136] State Energy Conservation Office. [Online]. Available: http://www.seco.cpa.state.tx.us/
[137] R. Bianchini, “Leveraging renewable energy in data centers: present and future,”
in Proceedings of the 21st international symposium on High-Performance Parallel
and Distributed Computing. ACM, 2012, pp. 135–136.
[138] K.-K. Nguyen, M. Cheriet, M. Lemay, M. Savoie, and B. Ho, “Powering a data
center network via renewable energy: A green testbed,” IEEE Internet Computing,
vol. 17, no. 1, pp. 40–49, 2013.
[139] B. Aksanli, T. Rosing, and E. Pettis, “Distributed battery control for peak
power shaving in datacenters,” in 2013 International Green Computing Confer-
ence (IGCC). IEEE, 2013, pp. 1–8.
[140] M. Ghamkhari, H. Mohsenian-Rad, and A. Wierman, “Optimal risk-aware power
procurement for data centers in day-ahead and real-time electricity markets,” in
IEEE Conference on Computer Communications Workshops. IEEE, 2014, pp.
610–615.
[141] H. Xu and B. Li, “Reducing electricity demand charge for data centers with partial
execution,” in International conference on Future energy systems. ACM, 2014,
pp. 51–61.
[142] Z. Liu, A. Wierman, Y. Chen, B. Razon, and N. Chen, “Data center demand
response: Avoiding the coincident peak via workload shifting and local generation,”
SIGMETRICS Performance Evaluation Review, vol. 41, no. 1, pp. 341–342, 2013.
[143] Z. Liu, I. Liu, S. Low, and A. Wierman, “Pricing data center demand response,” in
ACM SIGMETRICS Performance Evaluation Review, vol. 42, no. 1. ACM, 2014,
pp. 111–123.
[144] S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, and P. Mar-
wedel, “Reducing energy consumption by dynamic copying of instructions onto
onchip memory,” in International Symposium on System Synthesis. IEEE, 2002,
pp. 213–218.
[145] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath, “Dynamic cluster recon-
figuration for power and performance,” in Compilers and operating systems for low
power. Springer, 2003, pp. 75–93.
[146] F. Pilo, G. Pisano, and G. Soma, “Optimal coordination of energy resources with a
two-stage online active management,” IEEE Trans. Ind. Electron., vol. 58, no. 10,
pp. 4526–4537, 2011.
[147] A. Timbus, M. Larsson, and C. Yuen, “Active management of distributed energy
resources using standardized communications and modern information technolo-
gies,” IEEE Trans. Ind. Electron., vol. 56, no. 10, pp. 4029–4037, 2009.
[148] P. C. Loh, L. Zhang, and F. Gao, “Compact integrated energy systems for dis-
tributed generation,” IEEE Trans. Ind. Electron., vol. 60, no. 4, pp. 1492–1502,
April 2013.
[149] E. Pouresmaeil, C. Miguel-Espinar, M. Massot-Campos, D. Montesinos-Miracle,
and O. Gomis-Bellmunt, “A control technique for integration of dg units to the
electrical networks,” IEEE Trans. Ind. Electron., vol. 60, no. 7, pp. 2881–2893,
July 2013.
[150] A. Pantoja and N. Quijano, “A population dynamics approach for the dispatch of
distributed generators,” IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4559–4567,
2011.
[151] D. Q. Hung and N. Mithulananthan, “Multiple distributed generator placement
in primary distribution networks for loss reduction,” IEEE Trans. Ind. Electron.,
vol. 60, no. 4, pp. 1700–1708, April 2013.
[152] S. Paudyal, C. Canizares, and K. Bhattacharya, “Optimal operation of distribution
feeders in smart grids,” IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4495–4503,
2011.
[153] S. Bruno, S. Lamonaca, G. Rotondo, U. Stecchi, and M. La Scala, “Unbalanced
three-phase optimal power flow for smart grids,” IEEE Trans. Ind. Electron.,
vol. 58, no. 10, pp. 4504–4513, 2011.
[154] S. Chen and H. Gooi, “Jump and shift method for multi-objective optimization,”
IEEE Trans. Ind. Electron., vol. 58, no. 10, pp. 4538–4548, 2011.
[155] A. Bhardwaj, N. S. Tung, and V. Kamboj, “Unit commitment in power system: A
review,” Int. J. Elect. Power Engg., vol. 6, no. 1, pp. 51–57, 2012.
[156] H. Happ, “Optimal power dispatch: A comprehensive survey,” IEEE Trans. on
Power App. and Sys., vol. 96, no. 3, pp. 841–854, 1977.
[157] D. Kirschen and G. Strbac, Fundamentals of Power System Economics. John Wiley & Sons Ltd, 2004.
[158] J. Lopez, J. Meza, I. Guillen, and R. Gomez, “A heuristic algorithm to solve the
unit commitment problem for real-life large-scale power systems,” Int. J. Elect.
Power Energy Syst., vol. 49, pp. 287–295, 2013.
[159] S. Ng and J. Zhong, “Security-constrained dispatch with controllable loads for
integrating stochastic wind energy,” in IEEE PES ISGT 2012.
[160] K. Zare and S. Hashemi, “A solution to transmission-constrained unit commitment
using hunting search algorithm,” in IEEE EEEIC, 2012, pp. 941–946.
[161] R. Bacher, “Power system models, objectives and constraints in optimal power
flow calculations,” in Optimization in Planning and Operation of Electric Power
Systems. Springer, 1993, pp. 217–263.
[162] F. Capitanescu, W. Rosehart, and L. Wehenkel, “Optimal power flow computations
with constraints limiting the number of control actions,” in IEEE PowerTech, 2009,
pp. 1–8.
[163] E. N. Azadani, S. H. Hosseinian, P. H. Divshali, and B. Vahidi, “Stability con-
strained optimal power flow in deregulated power systems,” Elec. Power Compo-
nents and Systems, vol. 39, no. 8, pp. 713–732, 2011.
[164] J. J. Shaw, “A direct method for security-constrained unit commitment,” IEEE
Trans. Power Sys., vol. 10, no. 3, pp. 1329–1342, 1995.
[165] A. Chaouachi, R. Kamel, R. Andoulsi, and K. Nagasaka, “Multiobjective intelligent
energy management for a microgrid,” IEEE Trans. Ind. Electron., vol. 60, no. 4,
pp. 1688–1699, April 2013.
[166] R. M. Karp, Reducibility among Combinatorial Problems. Springer, 1972.
[167] R. Kannan and C. Monma, “On the computational complexity of integer
programming problems,” in Optimization and Operations Research, ser. Lecture
Notes in Economics and Mathematical Systems, R. Henn, B. Korte, and W. Oettli,
Eds. Springer Berlin Heidelberg, 1978, vol. 157, pp. 161–172. [Online]. Available:
http://dx.doi.org/10.1007/978-3-642-95322-4_17
[168] S. S. Skiena, The Algorithm Design Manual, 2nd ed. Springer Publishing Company,
Incorporated, 2008.
[169] R. Impagliazzo and R. Paturi, “Complexity of k-SAT,” in Proceedings of the Fourteenth Annual IEEE Conference on Computational Complexity. IEEE, 1999, pp. 237–240.
[170] R. Impagliazzo, S. Lovett, R. Paturi, and S. Schneider, “0-1 integer linear pro-
gramming with a linear number of constraints,” arXiv preprint arXiv:1401.5512,
2014.
[171] E. L. Lawler, “Matroid intersection algorithms,” Mathematical Programming, vol. 9, 1975.
[172] C. H. Papadimitriou, Computational Complexity. Reading, Massachusetts: Addison-Wesley, 1994.
[173] B. Krishnamoorthy, “Bounds on the size of branch-and-bound proofs for integer
knapsacks,” Operations Research Letters, vol. 36, no. 1, pp. 19–25, 2008.
[174] S. Bu, F. R. Yu, P. X. Liu, and P. Zhang, “Distributed scheduling in smart grid
communications with dynamic power demands and intermittent renewable energy
resources,” in IEEE Int. Conf. on Comm. Workshops. IEEE, 2011, pp. 1–5.
[175] C. Gong, X. Wang, W. Xu, and A. Tajer, “Distributed real-time energy scheduling
in smart grid: Stochastic model and fast optimization,” IEEE Trans. Smart Grid,
vol. 4, no. 3, pp. 1476–1489, 2013.
[176] R. E. Bixby, W. Cook, A. Cox, and E. K. Lee, “Parallel mixed integer program-
ming,” Rice University Center for Research on Parallel Computation Research
Monograph CRPC-TR95554, 1995.
[177] J. T. Linderoth and M. W. P. Savelsbergh, “A computational study of search
strategies for mixed integer programming,” INFORMS J. on Computing, vol. 11,
pp. 173–187, 1997.
[178] J. Edmonds, “Matroid intersection,” Annals of Discrete Mathematics, vol. 4, pp. 39–49, 1979.
[179] S. Haldar, “An All Pairs Paths Distributed Algorithm using 2n² Messages,” in
Graph-Theoretic Concepts in Computer Science. Springer, 1994, pp. 350–363.
[180] H. Hamacher, “A Time Expanded Matroid Algorithm for Finding Optimal Dy-
namic Matroid Intersections,” Mathematical Methods of Operations Research,
vol. 29, no. 5, pp. 203–215, 1985.
[181] D. Zuckerman, “Linear degree extractors and the inapproximability of max clique
and chromatic number,” in Proc. of ACM Symposium on Theory of Computing,
2006.
[182] IEEE Power Systems Test Case Archive. [Online]. Available: http://www.ee.
washington.edu/research/pstca/index.html/
[183] A. Diniz, “Test cases for unit commitment and hydrothermal scheduling problems,”
in Proc. IEEE Power Eng. Soc. Gen. Meeting, 2010.
[184] A. Singla, C. Hong, L. Popa, and P. Godfrey, “Jellyfish: Networking data centers
randomly,” in NSDI. USENIX, 2012, pp. 225–238.
[185] IEEE P802.3az Energy Efficient Ethernet Task Force. [Online]. Available: www.ieee802.org/3/az
[186] Solar-powered data centers. Data Center Knowledge. [Online]. Available:
http://www.datacenterknowledge.com/solar-powered-data-centers/
[187] V. Valancius, N. Laoutaris, L. Massoulie, C. Diot, and P. Rodriguez, “Greening
the internet with nano data centers,” in International conference on Emerging
networking experiments and technologies. ACM, 2009, pp. 37–48.
[188] B. Khargharia, S. Hariri, F. Szidarovszky, M. Houri, H. El-Rewini, S. Khan, I. Ah-
mad, and M. Yousif, “Autonomic power & performance management for large-scale
data centers,” in IEEE IPDPS. IEEE, 2007.
[189] L. Mastroleon, N. Bambos, C. Kozyrakis, and D. Economou, “Automatic power management schemes for internet servers and data centers,” in IEEE GLOBECOM, vol. 2. IEEE, 2005.
[190] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, “No power
struggles: Coordinated multi-level power management for the data center,” in ACM
SIGARCH Computer Architecture News, vol. 36, no. 1. ACM, 2008, pp. 48–59.
[191] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, “Optimal power allocation in server farms,” in ACM SIGMETRICS Performance Evaluation Review, vol. 37, no. 1. ACM, 2009, pp. 157–168.
[192] C. Hsu and W. Feng, “A power-aware run-time system for high-performance com-
puting,” in ACM/IEEE conference on Supercomputing, 2005, p. 1.
[193] Q. Deng, D. Meisner, A. Bhattacharjee, T. Wenisch, and R. Bianchini, “MultiScale: Memory system DVFS with multiple memory controllers,” in International Symposium on Low Power Electronics and Design. ACM, 2012, pp. 297–302.
[194] L. Rao, X. Liu, L. Xie, and W. Liu, “Minimizing electricity cost: optimization
of distributed internet data centers in a multi-electricity-market environment,” in
Proceedings of IEEE INFOCOM. IEEE, 2010, pp. 1–9.
[195] D. Meisner, B. Gold, and T. Wenisch, “PowerNap: Eliminating server idle power,”
ACM SIGARCH Computer Architecture News, vol. 37, no. 1, pp. 205–216, 2009.
[196] L. Liu, H. Wang, X. Liu, X. Jin, W. He, Q. Wang, and Y. Chen, “GreenCloud: A new
architecture for green data center,” in International conference industry session on
Autonomic computing and communications. ACM, 2009, pp. 29–38.
[197] H. Chen, M. Kesavan, K. Schwan, A. Gavrilovska, P. Kumar, and Y. Joshi,
“Spatially-aware optimization of energy consumption in consolidated data center
systems,” in ASME Pacific Rim Technical Conference and Exhibition on Pack-
aging and Integration of Electronic and Photonic Systems. American Society of
Mechanical Engineers, 2011, pp. 461–470.
[198] V. Anagnostopoulou, S. Biswas, A. Savage, R. Bianchini, T. Yang, and F. Chong,
“Energy conservation in datacenters through cluster memory management and
barely-alive memory servers,” in Workshop on Energy Efficient Design, 2009.
[199] A. Verma, G. Dasgupta, T. Nayak, P. De, and R. Kothari, “Server workload anal-
ysis for power minimization using consolidation,” in USENIX Annual technical
conference. USENIX Association, 2009, pp. 28–28.
[200] G. Da Costa, M. D. De Assuncao, J. Gelas, Y. Georgiou, L. Lefevre, A. Orgerie,
J. Pierson, O. Richard, and A. Sayah, “Multi-facet approach to reduce energy
consumption in clouds and grids: the green-net framework,” in International con-
ference on energy-efficient computing and networking. ACM, 2010, pp. 95–104.
[201] R. Nathuji, C. Isci, and E. Gorbatov, “Exploiting platform heterogeneity for power
efficient data centers,” in International Conference on Autonomic Computing.
IEEE, 2007, pp. 5–5.
[202] M. Zapater, J. Ayala, and J. Moya, “Leveraging heterogeneity for energy minimiza-
tion in data centers,” in IEEE/ACM International Symposium on Cluster, Cloud
and Grid Computing. IEEE, 2012.
[203] K. Le, R. Bianchini, M. Martonosi, and T. Nguyen, “Cost- and energy-aware load
distribution across data centers,” HotPower, pp. 1–5, 2009.
[204] J. Berral, I. Goiri, R. Nou, F. Julia, J. Guitart, R. Gavalda, and J. Torres, “Towards
energy-aware scheduling in data centers using machine learning,” in International
Conference on energy-Efficient Computing and Networking. ACM, 2010, pp. 215–
224.
[205] J. Kolodziej, S. Khan, and F. Xhafa, “Genetic algorithms for energy-aware schedul-
ing in computational grids,” in International Conference on P2P, Parallel, Grid,
Cloud and Internet Computing. IEEE, 2011, pp. 17–24.
[206] L. Keys, S. Rivoire, and J. Davis, “The search for energy-efficient building blocks for
the data center,” in International Conference on Computer Architecture. Springer,
2012, pp. 172–182.
[207] Prism Microsystems, “2010 state of virtualization security survey,” Tech. Rep., USA, April 2010.
[208] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: A scalable and fault-tolerant network structure for data centers,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 4, pp. 75–86, 2008.
[209] R. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan,
V. Subramanya, and A. Vahdat, “PortLand: A scalable fault-tolerant layer 2 data
center network fabric,” in ACM SIGCOMM Computer Communication Review,
vol. 39, no. 4. ACM, 2009, pp. 39–50.
[210] virt-p2v and virt-v2v. [Online]. Available: http://libguestfs.org/virt-v2v/
[211] K. Bilal, S. Malik, O. Khalid, A. Hameed, E. Alvarez, V. Wijaysekara, R. Irfan,
S. Shrestha, D. Dwivedy, M. Ali et al., “A taxonomy and survey on green data
center networks,” Future Generation Computer Systems, vol. 36, pp. 189–208, 2013.
[212] K. Bilal, S. Khan, L. Zhang, H. Li, K. Hayat, S. Madani, N. Min-Allah, L. Wang,
D. Chen, M. Iqbal, C. Xu, and A. Zomaya, “Quantitative comparisons of the state-
of-the-art data center architectures,” Concurrency and Computation: Practice and
Experience, vol. 25, no. 12, pp. 1771–1783, 2013.
[213] X. Wang, Y. Yao, X. Wang, K. Lu, and Q. Cao, “CARPO: Correlation-aware power
optimization in data center networks,” in IEEE INFOCOM. IEEE, 2012.
[214] L. Shang, L. Peh, and N. Jha, “Dynamic voltage scaling with links for power
optimization of interconnection networks,” in International Symposium on High-
Performance Computer Architecture. IEEE, 2003.
[215] D. Halperin, S. Kandula, J. Padhye, P. Bahl, and D. Wetherall, “Augmenting data
center networks with multi-gigabit wireless links,” in ACM SIGCOMM Computer
Communication Review, vol. 41, no. 4. ACM, 2011, pp. 38–49.
[216] J. Shin, E. Sirer, H. Weatherspoon, and D. Kirovski, “On the feasibility of com-
pletely wireless datacenters,” IEEE/ACM Transactions on Networking, vol. 21,
no. 5, pp. 1666–1679, 2013.
[217] Y. Cui, H. Wang, X. Cheng, and B. Chen, “Wireless data center networking,”
IEEE Wireless Communications, vol. 18, no. 6, 2011.
[218] M. Gharbaoui, B. Martini, D. Adami, G. Antichi, S. Giordano, and P. Castoldi,
“On virtualization-aware traffic engineering in OpenFlow data center networks,” in
NOMS. IEEE, 2014.
[219] M. Jarschel and R. Pries, “An OpenFlow-based energy-efficient data center approach,” in Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 2012, pp. 87–88.
[220] I. Goiri, K. Le, T. Nguyen, J. Guitart, J. Torres, and R. Bianchini, “GreenHadoop:
leveraging green energy in data-processing frameworks,” in European conference on
Computer Systems. ACM, 2012, pp. 57–70.
[221] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for Hadoop MapReduce,” in 2012 IEEE International Conference on Cluster Computing Workshops.
[222] E. Curry, S. Hasan, M. White, and H. Melvin, “An environmental chargeback
for data center and cloud computing consumers,” in International Conference on
Energy Efficient Data Centers. Springer, 2012, pp. 117–128.
[223] A. Souarez, “Charging customers for power usage in Microsoft data centers,” MSDN, 2008.
[224] Cloud Cruiser for Amazon Web Services. [Online]. Available: http://www.cloudcruiser.com/partners/amazon/
[225] Cisco, “Managing the real cost of on-demand enterprise cloud services with charge-
back models,” White Paper, C11-617390-00, Cisco, USA, August, 2010.
[226] V. Avelar, D. Azevedo, and A. French (Editors), “PUE: A comprehensive examination of the metric,” White Paper, 49, The Green Grid, USA, 2014.
[227] D. Azevedo, J. Cooley, M. Patterson, and M. Blackburn. (2011) Data center efficiency metrics: mPUE, partial PUE, ERE, DCcE. The Green Grid. [Online]. Available: http://www.thegreengrid.org/~/media/TechForumPresentations2011/Data Center Efficiency Metrics 2011.pdf
[228] P. Delforge. (2015) America’s data centers consuming and wasting growing
amounts of energy. The Natural Resources Defense Council. [Online]. Available:
http://www.nrdc.org/energy/data-center-efficiency-assessment.asp
[229] Toronto Hydro Corporation. [Online]. Available: http://www.torontohydro.com/
[230] PowerStream. [Online]. Available: http://www.powerstream.ca/
[231] Data Centre Incentive Program. [Online]. Available: http://www.cita.utoronto.ca/~dubinski/DCIP/DCIP SellSheet FINALcb.pdf
[232] M. Ghamkhari and H. Mohsenian-Rad, “Optimal integration of renewable energy
resources in data centers with behind-the-meter renewable generator,” in ICC.
IEEE, 2012.
[233] S. Ghamkhari, “Energy management and profit maximization of green data cen-
ters,” Master’s thesis, Electrical and Computer Engineering, Texas Tech University,
2012.
[234] Other World Computing. [Online]. Available: http://owc.net/about.php
[235] Baryonyx Corp. [Online]. Available: http://www.baryonyxcorp.com/
[236] Green House Data. [Online]. Available: http://www.greenhousedata.com/
[237] Wind-powered data centers. Data Center Knowledge. [Online]. Available:
http://www.datacenterknowledge.com/wind-powered-data-centers/
[238] M. Langberg and A. Sprintson, “On the hardness of approximating the network coding capacity,” in IEEE International Symposium on Information Theory (ISIT). IEEE, 2008, pp. 315–319.